Following JavaScript pagination with Scrapy and Splash

2022-02-22 00:00:00 python scrapy scrapy-splash

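Question

I am using Scrapy and Splash to extract data, and I am looking for a way to follow JavaScript-powered pagination. The URL does not change: it stays the same no matter which page you are on.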

<li class="btn-next"><a href="javascript:ctrl.set_pageReload(2)">Next</a></li>

I have tried clicking the element from a Lua script in Splash, but it does not work:

script = """function main(splash)
  local url = splash.args.url
  assert(splash:go(url))
  assert(splash:wait(1))
  assert(splash:runjs('document.getElementsByClassName("btn-next")[0].click()'))
  assert(splash:wait(0.75))
  -- return result as a JSON object
  return {html = splash:html()}
end"""



def parse(self, response):
    section = response.css('li.li-result')
    for item in section:
        yield{
            'manufacturer' :  item.css('span.brand::text').extract_first(),
            'model' : item.css('span.sub-title::text').extract_first(),
            'engine_size' :  item.css('span.nowrap::text').extract_first(),
            'model_type' : item.css('span span.nowrap::text').extract_first(),
            'old_price' : item.css('li.li-result p.old-prix span::text').extract_first(),    
            'price' : item.css('li.li-result p.prix::text').extract_first(),
            'consumption' : item.css('li.li-result div.desc::text').extract_first(),
            'date' : item.css('p.btn-publication::text').extract_first(),
            'fuel_type' : item.css('div.bc-info div.upper::text').extract_first(),
            'mileage' : item.css('li.li-result div.bc-info ul div::text')[1].extract(),
            'year' : item.css('li.li-result div.bc-info ul div::text')[2].extract(),
            'transmission_type' : item.css('li.li-result div.bc-info ul div::text')[3].extract(),
            'add_number' : item.css('li.li-result div.bc-info ul div::text')[4].extract(),

        }

    next_page = response.css('li.btn-next').extract_first() #pagination    
    if next_page != 0:
        print(response)
        yield(SplashRequest(next_page, self.parse,
            endpoint='execute',
            cache_args=['lua_source'],
            args={'lua_source': script},  

        ))

Is this possible? Thanks for any help.

Answer

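First, the Lua script has two problems:

.btn-next is a non-interactive li element, so clicking it does nothing.
Reference: https://developer.mozilla.org/en-US/docs/Web/API/Element/click_event
The pagination is slow, so a 0.75 second wait is too short.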

To fix this:

Click its a child element instead (see the alternatives below).
Wait 1.5 seconds or longer.

function main(splash)
  local url = splash.args.url
  assert(splash:go(url))
  assert(splash:wait(1))
  -- assert(splash:runjs('document.getElementsByClassName("btn-next")[0].click()'))          -- Change this
  assert(splash:runjs('document.getElementsByClassName("btn-next")[0].children[0].click()')) -- to this
  -- assert(splash:wait(0.75)) -- Change this
  assert(splash:wait(1.5))     -- to this
  -- return result as a JSON object
  return {html = splash:html()}                
end

(With this change, Splash can navigate to page 2 using JavaScript, but more work is needed to scrape the subsequent pages with Scrapy-Splash.)

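Next, the parse method has two problems:

next_page is the HTML string of the li element, so it cannot be passed to SplashRequest as the url argument.
next_page can be None, but it is never 0.

To fix these, see the solution below.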
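The second problem can be illustrated without Scrapy at all. The following is a minimal sketch in plain Python; the two helper functions are hypothetical names introduced only for this illustration, and li_html stands in for what extract_first() returns when the li element is present.

```python
def should_follow_original(next_page):
    # Condition used in the question: compares the extracted value with 0.
    return next_page != 0


def should_follow_fixed(next_page):
    # extract_first() returns None when nothing matches, so test for None.
    return next_page is not None


# Stand-in for the HTML string returned when the "Next" item exists.
li_html = '<li class="btn-next"><a href="javascript:ctrl.set_pageReload(2)">Next</a></li>'

print(should_follow_original(None))   # True: the check never stops the crawl
print(should_follow_fixed(None))      # False: no "Next" item, so stop
print(should_follow_fixed(li_html))   # True: a "Next" item was found
```

Since None != 0 is always true in Python, the original condition lets the spider issue another request even when no "Next" item was found.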

Solution

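Pass response.url and response.text to the next SplashRequest, then call splash:set_content() in the Lua script to restore the page state.
- Pass dont_filter=True to skip the duplicate check on the URL.
- Wait 2 seconds, although that is sometimes still too short (see the alternatives below).
Check that next_page is not None.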

from scrapy_splash import SplashRequest

script = """function main(splash)
  local url = splash.args.url
  local content = splash.args.content
  assert(splash:set_content(content, "text/html; charset=utf-8", url))
  assert(splash:runjs('document.getElementsByClassName("btn-next")[0].children[0].click()'))
  assert(splash:wait(2))
  return {html = splash:html()}
end"""

def parse(self, response, **kwargs):
    section = response.css('li.li-result')
    for item in section:
        yield {
            'manufacturer': item.css('span.brand::text').extract_first(),
            'model': item.css('span.sub-title::text').extract_first(),
            'engine_size': item.css('span.nowrap::text').extract_first(),
            'model_type': item.css('span span.nowrap::text').extract_first(),
            'old_price': item.css('li.li-result p.old-prix span::text').extract_first(),
            'price': item.css('li.li-result p.prix::text').extract_first(),
            'consumption': item.css('li.li-result div.desc::text').extract_first(),
            'date': item.css('p.btn-publication::text').extract_first(),
            'fuel_type': item.css('div.bc-info div.upper::text').extract_first(),
            'mileage': item.css('li.li-result div.bc-info ul div::text')[1].extract(),
            'year': item.css('li.li-result div.bc-info ul div::text')[2].extract(),
            'transmission_type': item.css('li.li-result div.bc-info ul div::text')[3].extract(),
            'add_number': item.css('li.li-result div.bc-info ul div::text')[4].extract(),
        }

    next_page = response.css('li.btn-next').extract_first()
    # print(next_page)
    if next_page is not None:
        yield SplashRequest(
            response.url,
            self.parse,
            endpoint='execute',
            args={
                'lua_source': script,
                'content': response.text,
            },
            cache_args=['lua_source'],
            dont_filter=True,
        )

(To verify the solution quickly, comment out the for item in section: block and uncomment print(next_page).)
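The spider code above assumes that scrapy-splash is already enabled in the project. For completeness, here is a sketch of the settings.py entries as I recall them from the scrapy-splash README; the SPLASH_URL value is an assumption (a Splash instance running locally on port 8050) and should be adjusted to your setup.

```python
# settings.py (sketch; values follow the scrapy-splash README as I recall it)

# Assumption: Splash is running locally on its default port.
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Deduplicates repeated Splash arguments such as the lua_source
# listed in cache_args.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Makes the duplicate filter aware of Splash arguments.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

Even with the Splash-aware duplicate filter, dont_filter=True is still passed in the request above, as the solution recommends.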

Some alternatives

Instead of clicking the button, you can call the function directly:

-- assert(splash:runjs('document.getElementsByClassName("btn-next")[0].children[0].click()'))
assert(splash:runjs('ctrl.set_pageReload(ctrl.context.cur_page + 1)'))
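The target page number is also visible in the link's href, so it can be read on the Python side from the HTML that parse already extracts. The following is a minimal sketch that assumes the href always has the form javascript:ctrl.set_pageReload(N); next_page_number is a hypothetical helper, not part of Scrapy or Splash.

```python
import re

# Matches the page number inside "set_pageReload(<digits>)".
PAGE_CALL = re.compile(r"set_pageReload\((\d+)\)")


def next_page_number(li_html):
    """Return the page number from the "Next" link, or None if absent."""
    if li_html is None:
        return None
    match = PAGE_CALL.search(li_html)
    return int(match.group(1)) if match else None


li_html = '<li class="btn-next"><a href="javascript:ctrl.set_pageReload(2)">Next</a></li>'
print(next_page_number(li_html))  # 2
print(next_page_number(None))     # None
```

The extracted number could then be passed to the Lua script as an extra entry in args and read through splash.args, instead of relying on ctrl.context.cur_page.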

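Instead of hard-coding a wait time that may be too short, you can periodically check a variable that was set on the page, then wait for the jQuery ready callback (optionally after a hard-coded initial wait):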

assert(splash:runjs('window.notReloaded = 1'))
-- assert(splash:wait(2)) -- Optional initial wait time
local exit = false
while (exit == false)
do
  local result, err = splash:wait_for_resume([[
    function main(splash) {
      window.notReloaded ? splash.error() : splash.resume();
    }
  ]])
  if result then
    exit = true
  else
    splash:wait(0.2) -- Adjust resolution as desired
  end
end
assert(splash:wait_for_resume([[
  function main(splash) {
    $(() => splash.resume());
  }
]]))
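The Lua loop above polls with no upper bound, so it would spin forever if the page never reloads. The same poll-at-an-interval idea with a timeout can be sketched in plain Python for illustration. This is not Splash API: wait_until and its parameters are hypothetical names, and the counter below merely simulates a page that becomes ready on the third check.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.2):
    """Poll predicate() until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)  # polling resolution, like splash:wait(0.2)


# Simulated condition that becomes true on the third check.
checks = {"count": 0}


def page_reloaded():
    checks["count"] += 1
    return checks["count"] >= 3


print(wait_until(page_reloaded, timeout=1.0, interval=0.01))  # True
```

In the Lua script, a comparable bound could be added by counting loop iterations and breaking out once a limit is reached.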
