Using a Proxy Server to Fetch Web Pages with the urllib Module in Python 3

2022-03-11 00:00:00 module, scraping, proxy server

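This code demonstrates how to use a proxy with Python 3's urllib module, including a proxy that requires login authentication. Python 3 replaces Python 2's urllib2 with urllib (the relevant functionality now lives mainly in urllib.request and urllib.error). If your earlier code used urllib2, use urllib instead in Python 3; some of the usage has changed as well.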
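
As a rough illustration of the rename described above, the sketch below maps a few common urllib2 names to their Python 3 counterparts. The dictionary name `PY3_EQUIVALENTS` is made up for this example, and the list is not exhaustive.

```python
# Python 2 (shown for comparison only; it does not run on Python 3):
#   import urllib2
#   urllib2.urlopen(url)
from urllib import error, request

# Hypothetical lookup table: Python 2 name -> Python 3 object
PY3_EQUIVALENTS = {
    "urllib2.urlopen": request.urlopen,
    "urllib2.Request": request.Request,
    "urllib2.build_opener": request.build_opener,
    "urllib2.install_opener": request.install_opener,
    "urllib2.ProxyHandler": request.ProxyHandler,
    "urllib2.URLError": error.URLError,
    "urllib2.HTTPError": error.HTTPError,
}

# Print each old name next to the module and name of its replacement
for old_name, new_obj in PY3_EQUIVALENTS.items():
    print(old_name, "->", new_obj.__module__ + "." + new_obj.__name__)
```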

"""
Author: 皮蛋编程 (https://www.pidancode.com)
Created: 2022/3/17
Modified: 2022/3/17
Description: Using a proxy server to fetch web pages with the urllib module in Python 3
"""

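from urllib import request

# Address and port of the proxy server; replace them with your own proxy's address and port:
proxy_info = {'host': '106.110.111.70',
              'port': 23564
              }
proxy_url = "http://%(host)s:%(port)d" % proxy_info
# Create a handler for the proxy. The URL fetched below is https, so the proxy
# must be registered for the "https" scheme too; otherwise it is bypassed:
proxy_support = request.ProxyHandler({"http": proxy_url, "https": proxy_url})
# Build an opener that uses the proxy handler:
opener = request.build_opener(proxy_support)
# Install the opener as the global default for urllib.request:
request.install_opener(opener)
# Send the HTTP request and read up to 200000 bytes of the page:
html_page = request.urlopen("https://www.pidancode.com/").read(200000)
# read() returns bytes, so this prints a bytes literal:
print(html_page)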
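
If you would rather not change the global default with install_opener, the opener can also be called directly. The following is a minimal sketch under that assumption; the helper names `build_proxy_opener` and `fetch`, the 10 second timeout, and the UTF-8 fallback are choices made for this example, not part of the original code.

```python
from urllib import request


def build_proxy_opener(host, port):
    """Return an opener that sends http and https requests through one proxy."""
    proxy_url = "http://%s:%d" % (host, port)
    handler = request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return request.build_opener(handler)


def fetch(opener, url, max_bytes=200000, timeout=10):
    """Read up to max_bytes from url; return decoded text, or None on failure."""
    try:
        with opener.open(url, timeout=timeout) as response:
            # Use the charset from the Content-Type header, else assume UTF-8
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read(max_bytes).decode(charset, errors="replace")
    except OSError as exc:  # URLError, HTTPError and timeouts are OSError subclasses
        print("request failed:", exc)
        return None
```

A call would look like `fetch(build_proxy_opener('106.110.111.70', 23564), "https://www.pidancode.com/")`. Only requests made through this opener use the proxy; plain `request.urlopen` calls elsewhere in the program are left unaffected.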

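If the proxy requires authentication, put the username and password in the proxy URL as shown below; replace the values with those of your own proxy server: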

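from urllib import request

proxy_info = {'host': '106.110.111.70',
              'port': 23564,
              'user': 'pidancode.com',
              'pass': 'pidancode'
              }
# Embed the credentials in the proxy URL as user:pass@host:port
proxy_url = "http://%(user)s:%(pass)s@%(host)s:%(port)d" % proxy_info
# Register the proxy for both schemes, since the URL fetched below is https
proxy_support = request.ProxyHandler({"http": proxy_url, "https": proxy_url})
opener = request.build_opener(proxy_support)
request.install_opener(opener)
html_page = request.urlopen("https://www.pidancode.com/").read(200000)
print(html_page)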
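
One caveat with credentials in the URL: characters such as `@`, `:` or `/` in the username or password would make the proxy URL ambiguous. A possible workaround, sketched below, is to percent-encode both parts with urllib.parse.quote; to my understanding urllib decodes them again before sending the Proxy-Authorization header, but treat that as an assumption and verify it against your proxy. The helper name `make_proxy_url` and the sample credentials are invented for this example.

```python
from urllib import request
from urllib.parse import quote


def make_proxy_url(host, port, user, password):
    """Build a proxy URL with percent-encoded credentials."""
    # safe="" forces encoding of "/", "@" and ":" as well
    user_enc = quote(user, safe="")
    pass_enc = quote(password, safe="")
    return "http://%s:%s@%s:%d" % (user_enc, pass_enc, host, port)


# Sample credentials containing characters that need escaping
proxy_url = make_proxy_url("106.110.111.70", 23564, "user@example.com", "p:ss/w@rd")
proxy_support = request.ProxyHandler({"http": proxy_url, "https": proxy_url})
opener = request.build_opener(proxy_support)
print(proxy_url)
```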
