How to Run Our Spider (3) — Python Scrapy Tutorial, version 1.5.1 and above
To put our spider to work, go to the project's top-level directory and run:
scrapy crawl quotes
This command runs the spider named quotes that we just added, which sends some requests to the quotes.toscrape.com domain. You should get output similar to this:
... (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
...
Now, check the files in the current directory. You should notice that two new files have been created: quotes-1.html and quotes-2.html, containing the content of the respective URLs, just as our parse method instructed.
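Those filenames come from the parse callback, which in the official tutorial derives a page number from the response URL and writes response.body to a matching file. A minimal sketch of just the naming logic, runnable without Scrapy (the helper name filename_for is hypothetical; in the real spider this logic sits inline in parse()):

```python
# Sketch of the filename logic in the tutorial's parse callback.
# `filename_for` is a hypothetical helper; the real spider computes
# this inline in parse() and then writes response.body to the file.
def filename_for(url):
    # "http://quotes.toscrape.com/page/1/".split("/") gives
    # ['http:', '', 'quotes.toscrape.com', 'page', '1', '']
    page = url.split("/")[-2]  # second-to-last segment is the page number
    return f"quotes-{page}.html"

print(filename_for("http://quotes.toscrape.com/page/1/"))  # quotes-1.html
print(filename_for("http://quotes.toscrape.com/page/2/"))  # quotes-2.html
```

This is why crawling page/1/ and page/2/ produced quotes-1.html and quotes-2.html in the log above.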
Note
If you are wondering why we haven't parsed the HTML yet, hold on: we will cover that very soon.