A Guide to Using Proxy IPs for Web Crawling in Python

2023-04-17 00:00:00 Crawler Proxy Guide

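Routing crawler traffic through proxy IPs reduces the risk of being banned or rate-limited by the target site, which makes a crawler more efficient and more stable. This guide walks through using proxy IPs for web crawling in Python.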

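Getting proxy IPs

You can buy proxy IPs online or use the addresses published by free proxy listing sites. The example below uses 'http://www.xicidaili.com/nn/' and collects the first 10 free proxy IPs from it.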

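import requests
from bs4 import BeautifulSoup

url = 'http://www.xicidaili.com/nn/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table', attrs={'id': 'ip_list'})
# If the page layout changed and the table is missing, end up with an empty list.
rows = table.find_all('tr')[1:] if table else []  # [1:] skips the header row
proxies = []
for tr in rows:
    tds = tr.find_all('td')
    ip = tds[1].text.strip()
    port = tds[2].text.strip()
    # requests selects a proxy by the scheme of the target URL, so register
    # the address for both http and https; otherwise https requests bypass it.
    address = 'http://{}:{}'.format(ip, port)
    proxy = {'http': address, 'https': address}
    proxies.append(proxy)
    if len(proxies) == 10:
        break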

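Using a proxy IP when requesting a page

With the 10 proxy IPs collected above, loop over them and try each in turn. The example below fetches the content of the "pidancode.com" page.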

import requests

url = 'https://pidancode.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

for proxy in proxies:
    try:
        response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
        if response.status_code == 200:
            print(response.text)
            break  # stop at the first proxy that returns the page
    except requests.RequestException:
        # Connection error, proxy error or timeout: move on to the next proxy.
        continue

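Checking whether a proxy IP works

Not every proxy IP you collect will actually work. Set a timeout on the request and treat a proxy as unusable if the request times out or fails. The check can be done while looping over the list: the first proxy that passes is the one used for the request.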

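import requests

url = 'https://pidancode.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

for proxy in proxies:
    try:
        # timeout=10 bounds how long a slow or dead proxy can hold up the loop.
        response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
        if response.status_code == 200:
            # The proxy passed the check, so use its response.
            print(response.text)
            break
    except requests.RequestException:
        # Timeout or failed request: treat this proxy as unusable.
        continue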
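
The check described above can also be run once, up front, to filter the whole list before crawling. What follows is only a minimal sketch under assumptions: the names check_with_requests and filter_working are hypothetical helpers, not part of requests, and running the checks in a thread pool is a design choice added here so that slow proxies do not have to be tested one after another.

```python
from concurrent.futures import ThreadPoolExecutor


def check_with_requests(proxy, test_url='https://pidancode.com/', timeout=5):
    """Return True if test_url answers with HTTP 200 through the given proxy."""
    import requests  # imported here so filter_working itself has no hard dependency

    try:
        response = requests.get(test_url, proxies=proxy, timeout=timeout)
        return response.status_code == 200
    except requests.RequestException:
        # Timeout, connection error or proxy error: the proxy is unusable.
        return False


def filter_working(proxies, check=check_with_requests, max_workers=10):
    """Keep only the proxies for which check(proxy) is true, preserving order."""
    if not proxies:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, proxies))
    return [proxy for proxy, ok in zip(proxies, results) if ok]
```

Calling filter_working(proxies) returns the subset that answered in time. A proxy that passes now can still fail a minute later, so the try/except around the real request remains necessary.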

Wrapping it in functions

Wrap the code above in functions so it is easy to reuse.

import requests
from bs4 import BeautifulSoup

def get_proxies():
    """Scrape up to 10 proxies from the listing page as requests-style dicts."""
    url = 'http://www.xicidaili.com/nn/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    response = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table', attrs={'id': 'ip_list'})
    # If the page layout changed and the table is missing, return an empty list.
    rows = table.find_all('tr')[1:] if table else []  # [1:] skips the header row
    proxies = []
    for tr in rows:
        tds = tr.find_all('td')
        ip = tds[1].text.strip()
        port = tds[2].text.strip()
        # Register the address for both schemes; with only an 'http' key,
        # requests would send https URLs directly, bypassing the proxy.
        address = 'http://{}:{}'.format(ip, port)
        proxies.append({'http': address, 'https': address})
        if len(proxies) == 10:
            break
    return proxies

def crawl_with_proxy(url, headers, proxies):
    """Try each proxy in turn; return the page text, or None if all fail."""
    for proxy in proxies:
        try:
            response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
            if response.status_code == 200:
                return response.text
        except requests.RequestException:
            # Timeout or failed request: try the next proxy.
            continue
    return None

url = 'https://pidancode.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
proxies = get_proxies()
html = crawl_with_proxy(url, headers, proxies)
print(html)

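Things to keep in mind

When crawling through proxy IPs, keep the following in mind:

Do not hit a site repeatedly from the same IP. Cycle through several IPs, or pick one at random for each request.
Do not request too many pages of the same site from a single IP. Vary the sites or pages each IP is used for.
Do not use unstable or very slow IPs. They drag down crawl throughput.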
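
The first point, cycling through IPs or picking one at random, can be sketched as follows. This is an illustration only: pick_random and rotate are hypothetical helper names, and the proxy list is assumed to hold dicts in the format requests accepts for its proxies argument.

```python
import itertools
import random


def pick_random(proxies, rng=random):
    """Return one proxy chosen at random, for use with a single request."""
    return rng.choice(proxies)


def rotate(proxies):
    """Return an endless iterator that walks the proxies in round-robin order."""
    return itertools.cycle(proxies)
```

With rotation = rotate(proxies), each call to next(rotation) yields the proxy for the next request, so consecutive requests leave from different addresses. pick_random(proxies) does the same without keeping any state between requests.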

The code above is only an example; adapt it to your own situation. If needed, add features such as an IP pool or periodic refreshing of the IP list.
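
One possible shape for an IP pool with periodic refresh is sketched below. Everything here is an assumption made for illustration: the ProxyPool class, its method names, and the 600-second default are hypothetical, and the pool refreshes lazily on access rather than on a background timer.

```python
import random
import time


class ProxyPool:
    """Hold a list of proxies and re-fetch it once it is older than max_age seconds."""

    def __init__(self, fetch, max_age=600, clock=time.monotonic):
        self._fetch = fetch      # callable returning a list of proxy dicts
        self._max_age = max_age  # seconds before the list counts as stale
        self._clock = clock      # injectable so the pool can be tested without waiting
        self._proxies = []
        self._fetched_at = None

    def _refresh_if_stale(self):
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._max_age:
            self._proxies = list(self._fetch())
            self._fetched_at = now

    def get(self):
        """Return a random proxy, or None if the pool is currently empty."""
        self._refresh_if_stale()
        return random.choice(self._proxies) if self._proxies else None

    def discard(self, proxy):
        """Drop a proxy that just failed so it is not handed out again."""
        if proxy in self._proxies:
            self._proxies.remove(proxy)
```

Passing the article's own function, as in pool = ProxyPool(get_proxies), makes pool.get() return a random proxy and re-scrape the list once it is older than max_age. Calling pool.discard(proxy) after a failed request keeps a dead address from being reused before the next refresh.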
