Scrapy Splash Crawler ReactorNotRestartable

2022-04-18 00:00:00 python twisted scrapy scrapy-splash

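Problem description

I have developed a Scrapy Splash scraper on Windows 10 using Visual Studio Code.

When I run my scraper like this, without the runner.py file, it works and writes the scraped content to "out.json": scrapy crawl mytest -o out.json

However, when I run the scraper in debug mode in Visual Studio Code using the runner.py file, it fails on the execute line (full code below):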

Exception has occurred: ReactorNotRestartable
exception: no description
  File "C:\scrapy\hw_spiders\spiders\runner.py", line 8, in <module>
    execute(

I have already checked:

  • Scrapy - Reactor not Restartable
  • Scrapy raises ReactorNotRestartable when CrawlerProcess is ran twice
  • ReactorNotRestartable error in while loop with scrapy

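From these posts, it seems to be a problem if I start a second scraper (e.g. calling crawl multiple times instead of starting it once). However, I don't see where I would be doing that.

I also read there about potential problems with while loops and the Twisted reactor, but I don't see those in my code either.

So I'm now lost as to where I need to fix my code.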

runner.py

#https://newbedev.com/debugging-scrapy-project-in-visual-studio-code
import os
from scrapy.cmdline import execute

os.chdir(os.path.dirname(os.path.realpath(__file__)))

try:
    execute(
        [
            'scrapy',
            'crawl',
            'mytest',
            '-o',
            'out.json',
        ]
    )
except SystemExit:
    pass

launch.json

{
    "version": "0.1.0",
    "configurations": [
        {
            "name": "Python: Launch Scrapy Spider",
            "type": "python",
            "request": "launch",
            "module": "scrapy",
            "args": [
                "runspider",
                "${file}"
            ],
            "console": "integratedTerminal"
        }
    ]
}

settings.json

{
    "python.analysis.extraPaths": [
        "./hw_spiders"
    ]
}   

middlewares.py

from scrapy import signals
from itemadapter import is_item, ItemAdapter

class MySpiderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def process_spider_output(self, response, result, spider):
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        pass

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class MyDownloaderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

pipelines.py

from itemadapter import ItemAdapter


class MyPipeline:
    def process_item(self, item, spider):
        return item

settings.py

BOT_NAME = 'hw_spiders'
SPIDER_MODULES = ['hw_spiders.spiders']
NEWSPIDER_MODULE = 'hw_spiders.spiders'
ROBOTSTXT_OBEY = True

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    # 'hw_spiders.middlewares.MySpiderMiddleware': 543,
}

DOWNLOADER_MIDDLEWARES = {
    # 'hw_spiders.middlewares.MyDownloaderMiddleware': 543,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 500,
} 

SPLASH_URL = 'http://localhost:8050/' 
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
ROBOTSTXT_OBEY = False

mytest.py

import json
import re
import os

import scrapy
import time
from scrapy_splash import SplashRequest
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

from ..myitems import RentalItem

class MyTest_Spider(scrapy.Spider):
    name = 'mytest'
    start_urls = ['<hidden>']

    def start_requests(self):
        yield SplashRequest(
            self.start_urls[0], self.parse
        )

    def parse(self, response):
        object_links = response.css('div.wrapper div.inner33 > a::attr(href)').getall()

        for link in object_links:
            yield scrapy.Request(link, self.parse_object)

        next_page = response.css('div.nav-links a.next.page-numbers::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)


    def parse_object(self, response):
        item = RentalItem()

        item['url'] = response.url

        object_features = response.css('table.info tr')
        for feature in object_features:
            try:
                feature_title = feature.css('th::text').get().strip()
                feature_info = feature.css('td::text').get().strip()
            except:
                continue
        item['thumbnails'] = response.css("ul#objects li a img::attr(src)").getall()

Update 1

So I have now removed runner.py from my project and only have .vscode\launch.json:

When I open the file mytest.py in Visual Studio Code and press F5 to debug, I see the following output:

Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.

Try the new cross-platform PowerShell https://aka.ms/pscore6

PS C:\scrapy\hw_spiders>  & 'C:\Users\Adam\AppData\Local\Programs\Python\Python38-32\python.exe' 'c:\Users\Adam\.vscode\extensions\ms-python.python-2021.11.1422169775\pythonFiles\lib\python\debugpy\launcher' '51812' '--' '-m' 'scrapy' 'runspider' 'c:\scrapy\hw_spiders\spiders\mytest.py'
2021-11-19 14:19:02 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: hw_spiders)
2021-11-19 14:19:02 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:43:08) [MSC v.1926 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 3.0, Platform Windows-10-10.0.19041-SP0
2021-11-19 14:19:02 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
Usage
=====
  scrapy runspider [options] <spider_file>

runspider: error: Unable to load 'c:\scrapy\hw_spiders\spiders\mytest.py': attempted relative import with no known parent package

This must be the line from ..myitems import RentalItem, but I don't know why it fails.
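
The error message itself can be reproduced without Scrapy. The sketch below uses only the standard library and assumes, as the message suggests, that the spider file is being loaded as a standalone top-level module rather than as part of its package. The package name demo_project and the helper functions are hypothetical and exist only for this illustration; the file names mytest.py and myitems.py mirror the ones in the question.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_project(root: Path) -> Path:
    """Create demo_project/myitems.py and demo_project/spiders/mytest.py."""
    package = root / "demo_project"  # hypothetical package name
    spiders = package / "spiders"
    spiders.mkdir(parents=True)
    (package / "__init__.py").write_text("")
    (spiders / "__init__.py").write_text("")
    (package / "myitems.py").write_text("class RentalItem(dict):\n    pass\n")
    # Same kind of relative import as in the question's spider.
    (spiders / "mytest.py").write_text("from ..myitems import RentalItem\n")
    return spiders / "mytest.py"


def import_with_path_entry(path_entry: Path, module_name: str):
    """Import module_name with path_entry temporarily first on sys.path."""
    sys.path.insert(0, str(path_entry))
    importlib.invalidate_caches()
    try:
        return importlib.import_module(module_name)
    finally:
        sys.path.remove(str(path_entry))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    spider_file = build_demo_project(root)

    # Case 1: the file is imported by its bare name, so Python sees a
    # top-level module "mytest" with no parent package for ".." to refer to.
    try:
        import_with_path_entry(spider_file.parent, spider_file.stem)
        standalone_error = None
    except ImportError as exc:
        standalone_error = str(exc)

    # Case 2: the same file is imported through its package, so "..myitems"
    # resolves to demo_project.myitems.
    module = import_with_path_entry(root, "demo_project.spiders.mytest")
    package_import_ok = module.RentalItem.__name__ == "RentalItem"

print("standalone import error:", standalone_error)
print("import as package member works:", package_import_ok)
```

In this sketch only the second case succeeds, which is consistent with scrapy crawl mytest working from the project directory while pointing a loader directly at the file does not.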


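Solution

You should either create a runner.py file and run that file with the default Python launch configuration, or have no runner.py file and use the scrapy launch.json (as in your question), but not both.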
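
As background for the error name: Twisted raises ReactorNotRestartable when a reactor that has already been run and stopped is run again in the same process. The toy model below is plain Python, not Twisted's implementation; the class OneShotReactor and the function crawl are hypothetical stand-ins. It only illustrates the general rule, not the exact call path in the question's setup.

```python
class ReactorNotRestartable(RuntimeError):
    """Toy stand-in for twisted.internet.error.ReactorNotRestartable."""


class OneShotReactor:
    """Hypothetical model of an event loop that may be run only once."""

    def __init__(self):
        self._has_run = False

    def run(self):
        if self._has_run:
            raise ReactorNotRestartable("reactor was already run once")
        self._has_run = True


# One shared instance per process, mirroring how Twisted exposes one reactor.
reactor = OneShotReactor()


def crawl(spider_name: str) -> str:
    """Pretend to run a crawl, which needs to start the shared reactor."""
    reactor.run()
    return f"{spider_name} finished"


# The first launch in the process is fine.
first_result = crawl("mytest")

# A second launch in the same process hits the already-used reactor.
try:
    second_result = crawl("mytest")
except ReactorNotRestartable as exc:
    second_result = f"failed: {exc}"

print(first_result)
print(second_result)
```

Using a single entry point, either runner.py or the scrapy launch configuration, keeps the process down to one launch.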

It looks like the article in your question simply copied all the answers from this Stack Overflow question and combined them without context.
