Scrapy shell for crawlers and spiders: running and using the shell in detail (26), Python Scrapy tutorial for version 1.51 and above

Posted on: September 6, 2020; December 8, 2022
Categories: Python, scrapy
Tags: Dec, Domain, fetch, GMT, org, python, reddit, Request, Scrapy, scrapy shell, scrapy tutorial, shell, shell example, url, shortcuts, crawler, spider

The Scrapy shell is just a regular Python console (or an IPython console, if you have IPython installed) that provides some extra shortcut functions for convenience.

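Available shortcuts

shelp() – prints help with the list of available objects and shortcuts.

fetch(url[, redirect=True]) – fetches a new response from the given URL and updates all related objects accordingly. HTTP 3xx redirects are followed by default; pass redirect=False if you do not want them to be followed.

fetch(request) – fetches a new response from the given request and updates all related objects accordingly.

view(response) – opens the given response in your local web browser for inspection. This adds a <base> tag to the response body so that external links (such as images and style sheets) display correctly. Note, however, that this creates a temporary file on your computer, and that file is not removed automatically.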
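To make the view(response) behavior more concrete, here is a minimal standard-library sketch of the same idea: insert a <base> tag so relative links resolve against the original site, then write the page to a temporary file that is not deleted automatically. This is not Scrapy's actual implementation; the helper name save_with_base, the sample HTML, and the example.com URL are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def save_with_base(html: str, url: str) -> Path:
    """Hypothetical helper: write html to a temp file with a <base> tag for url."""
    base_tag = f'<base href="{url}">'
    if "<head>" in html:
        # Put the tag right after <head> so relative links resolve against url.
        html = html.replace("<head>", "<head>" + base_tag, 1)
    else:
        # No <head> found: prepend the tag as a fallback.
        html = base_tag + html
    # delete=False keeps the file on disk after closing, which mirrors the
    # "temporary file is not removed automatically" caveat described above.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as fh:
        fh.write(html)
    return Path(fh.name)


page = "<html><head><title>Demo</title></head><body><img src='/logo.png'></body></html>"
path = save_with_base(page, "https://example.com/")
saved = path.read_text(encoding="utf-8")
print(saved)

# Clean up by hand, since nothing else will remove the file.
path.unlink()
```

In a real session you would simply call view(response); the sketch only shows why the <base> tag matters and why leftover files can accumulate in your temp directory.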

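Available Scrapy objects

The Scrapy shell automatically creates some convenient objects from the downloaded page, such as the Response object and the Selector objects (for both HTML and XML content).

Those objects are:

crawler – the current Crawler object.

spider – the Spider that is known to handle the URL, or a Spider object if no spider was found for the current URL.

request – a Request object for the last fetched page. You can modify this request using replace(), or fetch a new request (without leaving the shell) using the fetch shortcut.

response – a Response object containing the last fetched page.

settings – the current Scrapy settings.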

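Example of a shell session

Here is an example of a typical shell session, in which we first scrape the https://scrapy.org page and then go on to scrape the https://reddit.com page. Finally, we change the (Reddit) request method to POST and re-fetch it, which produces an error. We end the session by typing Ctrl-D (on Unix systems) or Ctrl-Z (on Windows).

Keep in mind that the data extracted here may not be the same when you try it, because these pages are not static and may have changed by the time you test this. The only purpose of this example is to get you familiar with how the Scrapy shell works.

First, we launch the shell: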

scrapy shell 'https://scrapy.org' --nolog

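Then the shell fetches the URL (using the Scrapy downloader) and prints the list of available objects and useful shortcuts (you will notice that these lines all start with the [s] prefix):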

[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f07395dd690>
[s]   item       {}
[s]   request    <GET https://scrapy.org>
[s]   response   <200 https://scrapy.org/>
[s]   settings   <scrapy.settings.Settings object at 0x7f07395dd710>
[s]   spider     <DefaultSpider 'default' at 0x7f0735891690>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

>>>

After that, we can start playing with the objects:

>>> response.xpath('//title/text()').extract_first()
'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

>>> fetch("https://reddit.com")

>>> response.xpath('//title/text()').extract()
['reddit: the front page of the internet']

>>> request = request.replace(method="POST")

>>> fetch(request)

>>> response.status
404

>>> from pprint import pprint

>>> pprint(response.headers)
{'Accept-Ranges': ['bytes'],
 'Cache-Control': ['max-age=0, must-revalidate'],
 'Content-Type': ['text/html; charset=UTF-8'],
 'Date': ['Thu, 08 Dec 2016 16:21:19 GMT'],
 'Server': ['snooserv'],
 'Set-Cookie': ['loid=KqNLou0V9SKMX4qb4n; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
                'loidcreated=2016-12-08T16%3A21%3A19.445Z; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
                'loid=vi0ZVe4NkxNWdlH7r7; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
                'loidcreated=2016-12-08T16%3A21%3A19.459Z; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure'],
 'Vary': ['accept-encoding'],
 'Via': ['1.1 varnish'],
 'X-Cache': ['MISS'],
 'X-Cache-Hits': ['0'],
 'X-Content-Type-Options': ['nosniff'],
 'X-Frame-Options': ['SAMEORIGIN'],
 'X-Moose': ['majestic'],
 'X-Served-By': ['cache-cdg8730-CDG'],
 'X-Timer': ['S1481214079.394283,VS0,VE159'],
 'X-Ua-Compatible': ['IE=edge'],
 'X-Xss-Protection': ['1; mode=block']}
>>>
