Running a Spider with crawl Arguments (6): Latest Python Scrapy Tutorial for Version 1.51 and Later

Posted on: August 28, 2020; December 7, 2022
Categories: Python, scrapy
Tags: crawl, def, HTTP, http_pass, http_user, humor, None, python, quotes, Scrapy, Scrapy tutorial, self, Spider, spider arguments, start, start_urls, tag, tag=humor, url, user_agent, yield, arguments, basic concepts, crawler, spider, configuration file

You can pass command-line arguments to your spider by using the -a option when you run it:

scrapy crawl quotes -o quotes-humor.json -a tag=humor

These arguments are passed to the Spider's __init__ method and, by default, become attributes of the spider.

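In this example, the value supplied for the tag argument is available as self.tag. You can use this to make your spider fetch only the quotes that carry a specific tag, building the URL from the argument: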

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

If you pass the tag=humor argument to this spider, you will notice that it only visits URLs under the humor tag, such as http://quotes.toscrape.com/tag/humor.
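The URL-building step can be checked on its own, without running a crawl. The following is a minimal sketch that mirrors the logic of start_requests above; the helper name build_start_url and the constant BASE_URL are made up for this illustration and are not part of Scrapy.

```python
BASE_URL = 'http://quotes.toscrape.com/'


def build_start_url(tag=None):
    """Return the listing URL, narrowed to a single tag when one is given."""
    # Hypothetical helper: same branching as start_requests in the spider above.
    if tag is not None:
        return BASE_URL + 'tag/' + tag
    return BASE_URL


print(build_start_url())         # no -a tag=... given
print(build_start_url('humor'))  # as with -a tag=humor
```

Keeping this logic in a small function makes it easy to test how the spider reacts to a missing or unexpected tag value.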

The rest of this article covers handling spider arguments in more detail.

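Spiders can receive arguments that modify their behavior. Common uses for spider arguments are defining the start URLs or restricting the crawl to certain sections of a site, but they can be used to configure any feature of the spider.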

Spider arguments are passed through the crawl command using the -a option. For example:

scrapy crawl myspider -a category=electronics

Spiders can access the arguments in their __init__ method:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
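        # ...

The default __init__ method accepts any spider arguments and copies them to the spider as attributes. The example above can therefore also be written as follows: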

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/categories/%s' % self.category)
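One consequence of this copying behavior is that an argument which was never passed simply does not exist as an attribute, so self.category fails when -a category=... is omitted. The sketch below uses a simplified stand-in class (SketchSpider is hypothetical and is not Scrapy's actual implementation) to show the effect and why the getattr pattern from the quotes spider is the safer way to read an optional argument.

```python
class SketchSpider:
    """Simplified stand-in for a spider base class (not Scrapy's real code)."""

    def __init__(self, **kwargs):
        # Copy every keyword argument onto the instance as an attribute,
        # which is the behavior described in the text above.
        self.__dict__.update(kwargs)


with_arg = SketchSpider(category='electronics')
without_arg = SketchSpider()

print(with_arg.category)                       # the argument became an attribute
print(hasattr(without_arg, 'category'))        # no argument, so no attribute
print(getattr(without_arg, 'category', None))  # safe read with a default
```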

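Keep in mind that spider arguments are only strings. The spider does not parse them in any way. If you want to set the start_urls attribute from the command line, you have to parse the value into a list yourself, using something like ast.literal_eval or json.loads, and then set it as an attribute. Otherwise you end up iterating over the start_urls string (a very common Python pitfall), and each character is treated as a separate URL.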
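A minimal sketch of that parsing step, assuming the argument is supplied as a JSON list. The helper name parse_start_urls is made up for this illustration; inside a spider's __init__ you would assign its result to self.start_urls.

```python
import json


def parse_start_urls(raw):
    """Turn a command-line string into a list of URL strings."""
    urls = json.loads(raw)
    if isinstance(urls, str):
        # A single JSON string is accepted and wrapped in a list.
        urls = [urls]
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        raise ValueError('start_urls must be a JSON list of strings')
    return urls


raw = '["http://www.example.com/a", "http://www.example.com/b"]'
parsed = parse_start_urls(raw)
print(parsed)

# The pitfall: iterating an unparsed string yields single characters.
chars = list('http://www.example.com/a')
print(chars[:4])
```

json.loads is used here because it only accepts data, not arbitrary Python expressions; ast.literal_eval would work in a similar way for Python-style list literals.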

A valid use case is setting the HTTP authentication credentials used by HttpAuthMiddleware, or the user agent used by UserAgentMiddleware:

scrapy crawl myspider -a http_user=myuser -a http_pass=mypassword -a user_agent=mybot

Spider arguments can also be passed through the Scrapyd schedule.json API. See the Scrapyd documentation.
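As a rough sketch of what such a call might look like, the code below builds (but does not send) a POST request for schedule.json using only the standard library. It assumes a Scrapyd instance on localhost at its usual port 6800 and a project named myproject; both are placeholders, and build_schedule_request is a hypothetical helper. Form fields other than project and spider are meant to reach the spider as arguments; check the Scrapyd documentation for the fields it reserves for itself.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a local Scrapyd instance listening on its default port.
SCRAPYD_URL = 'http://localhost:6800/schedule.json'


def build_schedule_request(project, spider, **spider_args):
    """Build a POST request that schedules a spider run with arguments."""
    fields = {'project': project, 'spider': spider}
    # Extra form fields play the role of -a name=value on the command line.
    fields.update(spider_args)
    body = urlencode(fields).encode('ascii')
    return Request(SCRAPYD_URL, data=body)


request = build_schedule_request('myproject', 'myspider', category='electronics')
print(request.get_method(), request.full_url)
print(request.data.decode('ascii'))
```

To actually schedule the job, the request would be passed to urllib.request.urlopen while Scrapyd is running.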
