Python Scrapy - populating start_urls from a MySQL table
I am trying to populate start_urls with a SELECT from a MySQL table in spider.py. When I run "scrapy runspider spider.py" I get no output, just that it finished with no errors.
I have tested the SELECT query in a separate Python script, and start_urls gets populated with the entries from the MySQL table.
spider.py
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
import MySQLdb

class ProductsSpider(BaseSpider):
    name = "Products"
    allowed_domains = ["test.com"]
    start_urls = []

    def parse(self, response):
        print self.start_urls

    def populate_start_urls(self, url):
        conn = MySQLdb.connect(
            user='user',
            passwd='password',
            db='scrapy',
            host='localhost',
            charset="utf8",
            use_unicode=True
        )
        cursor = conn.cursor()
        cursor.execute('SELECT url FROM links;')
        rows = cursor.fetchall()
        for row in rows:
            start_urls.append(row[0])
        conn.close()
Recommended Answer
A better approach is to override the start_requests method. This can query your database, much like populate_start_urls, and return a sequence of Request objects.
You would just need to rename your populate_start_urls method to start_requests and modify the following lines:
for row in rows:
    yield self.make_requests_from_url(row[0])
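The pattern above can be sketched end to end. This is a minimal, self-contained illustration only: sqlite3 stands in for MySQLdb so it runs without a database server, and a stub Request class stands in for scrapy.Request, which an actual spider would yield (typically as scrapy.Request(url, callback=self.parse)). The table name links and the URLs are taken from the question; everything else is assumed for the sketch.

```python
import sqlite3

# Stand-in for scrapy.Request so the sketch runs without Scrapy installed.
class Request:
    def __init__(self, url):
        self.url = url

class ProductsSpider:
    """Sketch: start_requests pulls its URLs from a database table."""

    def __init__(self, conn):
        # With MySQLdb this connection would come from MySQLdb.connect(...).
        self.conn = conn

    def start_requests(self):
        # Same query as populate_start_urls, but yielding one Request
        # per row instead of appending to start_urls.
        cursor = self.conn.cursor()
        cursor.execute('SELECT url FROM links')
        for (url,) in cursor.fetchall():
            yield Request(url)

# Demo with an in-memory table standing in for the MySQL 'links' table.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE links (url TEXT)')
conn.executemany('INSERT INTO links VALUES (?)',
                 [('http://test.com/a',), ('http://test.com/b',)])

spider = ProductsSpider(conn)
urls = [req.url for req in spider.start_requests()]
print(urls)  # ['http://test.com/a', 'http://test.com/b']
```

Because start_requests is a generator, Scrapy consumes the requests lazily as the crawl proceeds, so there is no need to build start_urls up front at all.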