Original content; please cite the source when reposting: https://www.cnblogs.com/Lucy151213/p/10968868.html
At the start of each month, the Sunshine Procurement platform publishes that month's prices. This project simulates a user logging into the platform, saves the needed data to a csv file and a database, and mails the file to designated recipients. As a Python beginner I hit a lot of pitfalls, so I'm recording them here.
Environment | Python 2.7 |
IDE | PyCharm |
Runtime | CentOS 7 |
How it runs | A cron job runs the script at 1 a.m. on the 1st of every month |
Features | Logs into the system automatically using the account, the password, and a decoded captcha; parses the needed data and saves it to a csv file and a MySQL database; after crawling finishes, sends the csv file to the designated recipients. Automatically reconnects when a request is dropped. |
Setting up the development environment:
There are plenty of tutorials online, so I won't repeat them. After installing Python you need a few required libraries:
bs4 (html page parsing)
csv (writing csv files)
smtplib (sending e-mail)
mysql.connector (connecting to the database)
Some of the downloads are shared on my network drive, including leptonica-1.72.tar.gz, Tesseract3.04.00.tar.gz, and the language packs:
Link: https://pan.baidu.com/s/1J4SZDgmn6DpuQ1EHxE6zkw
Extraction code: crbl
Image recognition:
There are many tutorials online for this too; below is a set of steps I consolidated that installs the image recognition libraries cleanly on CentOS 7.
- Because we build and install from source, the corresponding build tools must be installed first:
yum install gcc gcc-c++ make
yum install autoconf automake libtool
- Install the image-format support libraries; without them, later tesseract commands will fail:
yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel
- Install leptonica, a library that tesseract requires. Download it, then unpack and build it on the server:
Download: http://www.leptonica.org/
# run these in the leptonica directory
./configure
make
make install
- Download the matching Tesseract release
Download: https://link.jianshu.com/?t=https://github.com/tesseract-ocr/tesseract/wiki/Downloads
# run these in the tesseract-3.04.00 directory
./autogen.sh
./configure
make
make install
ldconfig
- Download the language packs
Download: https://github.com/tesseract-ocr/tessdata
Put the downloaded files in the tessdata directory.
- Environment configuration
Copy tessdata: cp -R tessdata /usr/local/share
Set the environment variable:
Open the profile: vi /etc/profile
Add this line: export TESSDATA_PREFIX=/usr/local/share/tessdata
Apply it: source /etc/profile
- Test
tesseract -v prints tesseract's version information; if it runs without an error, the installation succeeded.
Put an image image.png in the current directory, then run: tesseract image.png 123
A file 123.txt is generated in the current directory containing the recognized text.
- Install the pytesseract package
This package is what lets Python code call tesseract.
Command: pip install pytesseract
Test code:
import pytesseract
from PIL import Image

im1 = Image.open('image.png')
print(pytesseract.image_to_string(im1))
Code:
The data I want to scrape looks like this:
First find out how many pages there are in total, then visit each page in a loop and save its data to the csv file and the database. If visiting some page throws an exception, record that broken page number, log in again, and resume crawling from the broken page.
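The resume-from-the-broken-page flow can be sketched as plain control flow. This is only an illustration of the loop shape, not the article's actual code: fetch_page and relogin are hypothetical stand-ins for the real opener calls.

```python
# Minimal sketch of the resume-on-failure loop: the page counter only
# advances on success, so after a re-login the broken page is retried.
# fetch_page and relogin are hypothetical stand-ins.

def crawl_all(total_pages, fetch_page, relogin, max_logins=50):
    """Fetch pages 1..total_pages, re-logging in and resuming
    from the broken page whenever a fetch raises."""
    results = {}
    page = 1
    logins = 0
    while page <= total_pages and logins < max_logins:
        try:
            results[page] = fetch_page(page)
            page += 1      # advance only on success
        except Exception:
            logins += 1    # `page` is unchanged, so the broken
            relogin()      # page is retried after re-login
    return results
```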
I wrote a gl.py to hold the global variables:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import time

timeStr = time.strftime('%Y%m%d', time.localtime(time.time()))
monthStr = time.strftime('%m', time.localtime(time.time()))
yearStr = time.strftime('%Y', time.localtime(time.time()))
LOG_FILE = "log/" + timeStr + '.log'
csvFileName = "csv/" + timeStr + ".csv"
fileName = timeStr + ".csv"
fmt = '%(asctime)s - %(filename)s:%(lineno)s - %(message)s'
loginUrl = "http://yourpath/Login.aspx"
productUrl = 'http://yourpath/aaa.aspx'
username = 'aaaa'
passWord = "aaa"
preCodeurl = "yourpath"
host = "yourip"
user = "aaa"
passwd = "aaa"
db = "mysql"
charset = "utf8"
postData = {
    '__VIEWSTATE': '',
    '__EVENTTARGET': '',
    '__EVENTARGUMENT': '',
    'btnLogin': "登录",
    'txtUserId': 'aaaa',
    'txtUserPwd': 'aaa',
    'txtCode': '',
    'hfip': 'yourip'
}
tdd = {
    '__VIEWSTATE': '',
    '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$AspNetPager1',
    'ctl00$ContentPlaceHolder1$AspNetPager1_input': '1',
    'ctl00$ContentPlaceHolder1$AspNetPager1_pagesize': '50',
    'ctl00$ContentPlaceHolder1$txtYear': '',
    'ctl00$ContentPlaceHolder1$txtMonth': '',
    '__EVENTARGUMENT': '',
}
vs = {
    '__VIEWSTATE': ''
}
The main script sets up logging, the csv writer, the database connection, and the cookie jar:
handler = logging.handlers.RotatingFileHandler(gl.LOG_FILE, maxBytes=1024 * 1024, backupCount=5)
formatter = logging.Formatter(gl.fmt)
handler.setFormatter(formatter)
logger = logging.getLogger('tst')
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
csvFile = codecs.open(gl.csvFileName, 'w+', 'utf_8_sig')
writer = csv.writer(csvFile)
conn = mysql.connector.connect(host=gl.host, user=gl.user, passwd=gl.passwd, db=gl.db, charset=gl.charset)
cursor = conn.cursor()

cookiejar = cookielib.MozillaCookieJar()
cookieSupport = urllib2.HTTPCookieProcessor(cookiejar)
httpsHandler = urllib2.HTTPSHandler(debuglevel=0)
opener = urllib2.build_opener(cookieSupport, httpsHandler)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')]
urllib2.install_opener(opener)
The login routine:
First decode the captcha into digits, then submit (username + password + captcha) to the login endpoint. This may fail, because the captcha is sometimes decoded incorrectly. On failure, fetch a new captcha, decode it, and log in again, until login succeeds.
def get_logined_Data(opener, logger, views):
    print "get_logined_Data"
    indexCount = 1
    retData = None
    while indexCount <= 15:
        print "begin login ", str(indexCount), " time"
        logger.info("begin login " + str(indexCount) + " time")
        vrifycodeUrl = gl.preCodeurl + str(random.random())
        text = get_image(vrifycodeUrl)  # helper that fetches the captcha URL and returns the decoded digits
        postData = gl.postData
        postData["txtCode"] = text
        postData["__VIEWSTATE"] = views

        data = urllib.urlencode(postData)
        try:
            headers22 = {
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
                'Accept-Encoding': 'gzip, deflate, br',
                'Accept-Language': 'zh-CN,zh;q=0.9',
                'Connection': 'keep-alive',
                'Content-Type': 'application/x-www-form-urlencoded',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
            }
            request = urllib2.Request(gl.loginUrl, data, headers22)
            opener.open(request)
        except Exception as e:
            print "catch Exception when login"
            print e

        request = urllib2.Request(gl.productUrl)
        response = opener.open(request)
        dataPage = response.read().decode('utf-8')

        bsObj = BeautifulSoup(dataPage, 'html.parser')
        tabcontent = bsObj.find(id="tabcontent")  # the page only contains the tabcontent element after a successful login, so use it to detect success
        if (tabcontent is not None):
            print "login successfully"
            logger.info("login successfully")
            retData = bsObj
            break
        else:
            print "enter failed,try again"
            logger.info("enter failed,try again")
            time.sleep(3)
            indexCount += 1
    return retData
Inspecting the page shows that every data request must carry the '__VIEWSTATE' parameter. Its value lives in the page itself, so '__VIEWSTATE' has to be extracted from each response and sent along as a parameter when requesting the next page.
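As a sketch, the extraction can also be done with only the standard library (the crawler itself uses BeautifulSoup's find(id="__VIEWSTATE"); the parser class name here is my own):

```python
# Sketch: pull the __VIEWSTATE hidden field out of an ASP.NET page
# using only the stdlib html parser (the article uses BeautifulSoup).
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

class ViewStateParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.viewstate = None

    def handle_starttag(self, tag, attrs):
        # ASP.NET emits <input type="hidden" id="__VIEWSTATE" value="...">
        a = dict(attrs)
        if tag == "input" and a.get("id") == "__VIEWSTATE":
            self.viewstate = a.get("value")

def extract_viewstate(html):
    p = ViewStateParser()
    p.feed(html)
    return p.viewstate
```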
Decoding the captcha:
Fetch the captcha from its URL and save it locally. Because the captcha is in color, it first has to be converted to grayscale, and only then passed to image recognition to get the digits. The captcha is always 4 digits, but recognition sometimes returns letters, so those are mapped back to digits by hand; after this mapping the recognition rate is acceptable.
# Get the digits for a captcha; only a 4-digit result is considered valid
def get_image(codeurl):
    print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())) + " begin get code num")
    index = 1
    while index <= 15:
        file = urllib2.urlopen(codeurl).read()
        im = cStringIO.StringIO(file)
        img = Image.open(im)
        imgName = "vrifycode/" + gl.timeStr + "_" + str(index) + ".png"
        print 'begin get vrifycode'
        text = convert_image(img, imgName)
        print "vrifycode", index, ":", text
        # logger.info('vrifycode' + str(index) + ":" + text)

        if (len(text) != 4 or text.isdigit() == False):  # a captcha that is not 4 digits is certainly wrong
            print 'vrifycode:', index, ' is wrong'
            index += 1
            time.sleep(2)
            continue
        return text

# Convert the captcha image to digits
def convert_image(image, impName):
    print "enter convert_image"
    image = image.convert('L')  # grayscale
    image2 = Image.new('L', image.size, 255)
    for x in range(image.size[0]):
        for y in range(image.size[1]):
            pix = image.getpixel((x, y))
            if pix < 90:  # pixels with gray level below 90 become black
                image2.putpixel((x, y), 0)
    print "begin save"
    image2.save(impName)  # save the binarized image to inspect the result
    print "begin convert"
    text = pytesseract.image_to_string(image2)
    print "end convert"
    snum = ""
    for j in text:  # simple letter-to-digit correction
        if (j == 'Z'):
            snum += "2"
        elif (j == 'T'):
            snum += "7"
        elif (j == 'b'):
            snum += "5"
        elif (j == 's'):
            snum += "8"
        elif (j == 'S'):
            snum += "8"
        elif (j == 'O'):
            snum += "0"
        elif (j == 'o'):
            snum += "0"
        else:
            snum += j
    return snum
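The elif chain at the end can also be written as a lookup table. A sketch with the same substitutions (correct_captcha is my name for it):

```python
# Same letter-to-digit corrections as the elif chain above,
# expressed as a lookup table; unknown characters pass through.
CORRECTIONS = {
    'Z': '2', 'T': '7', 'b': '5',
    's': '8', 'S': '8', 'O': '0', 'o': '0',
}

def correct_captcha(text):
    return "".join(CORRECTIONS.get(ch, ch) for ch in text)
```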
Data conversion:
Convert the html data into a list of rows, used both when writing the csv file and when inserting into the database.
def paras_data(nameList, logger):
    data = []
    mainlist = nameList
    rows = mainlist.findAll("tr", {"class": {"row", "alter"}})
    try:
        if (len(rows) != 0):
            for name in rows:
                tds = name.findAll("td")
                if tds == None:
                    print "get tds is null"
                    logger.info("get tds is null")
                else:
                    item = []
                    for index in range(len(tds)):
                        s_span = (tds[index]).find("span")
                        if (s_span is not None):
                            tmp = s_span["title"]
                        else:
                            tmp = (tds[index]).get_text()
                        item.append(tmp.encode('utf-8'))
                    item.append(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))  # time this row was scraped
                    data.append(tuple(item))

    except Exception as e:
        print "catch exception when save csv", e
        logger.info("catch exception when save csv" + e.message)
    return data
Saving the csv file:
def save_to_csv(data ,writer):
for d in data:
if d is not None:
writer.writerow(d)
Saving to the database:
def save_to_mysql(data, conn, cursor):
    try:
        cursor.executemany(
            "INSERT INTO `aaa`(aaa,bbb) VALUES (%s,%s)",
            data)
        conn.commit()
    except Exception as e:
        print "catch exception when save to mysql", e
    else:
        pass
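executemany inserts a whole page of rows in one batch. The same pattern can be tried out with the standard library's sqlite3 module; the table and columns below are invented for the demo, and note sqlite3 uses ? placeholders where mysql.connector uses %s:

```python
# Demo of the batch-insert pattern with stdlib sqlite3.
# Table/columns are made up; sqlite3 uses '?' placeholders.
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE prices (name TEXT, price TEXT)")

data = [("apple", "1.20"), ("pear", "0.80")]
cursor.executemany("INSERT INTO prices (name, price) VALUES (?, ?)", data)
conn.commit()

rows = cursor.execute("SELECT name, price FROM prices ORDER BY name").fetchall()
```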
Fetching a given page:
def get_appointed_page(snum, opener, vs, logger):
    tdd = get_tdd()
    tdd["__VIEWSTATE"] = vs['__VIEWSTATE']
    tdd["__EVENTARGUMENT"] = snum
    tdd = urllib.urlencode(tdd)
    op = opener.open(gl.productUrl, tdd)
    if (op.getcode() != 200):
        print("the " + str(snum) + " page, state not 200, try connect again")
        return None
    data = op.read().decode('utf-8', 'ignore')
    bsObj = BeautifulSoup(data, "lxml")
    nameList = bsObj.find("table", {"class": "mainlist"})
    if len(nameList) == 0:
        return None
    viewState = bsObj.find(id="__VIEWSTATE")
    if viewState is None:
        logger.info("the other page,no viewstate,try connect again")
        print("the other page,no viewstate,try connect again")
        return None
    vs['__VIEWSTATE'] = viewState["value"]
    return nameList
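The pager request above is an ordinary form-encoded POST whose __EVENTTARGET names the AspNetPager control and whose __EVENTARGUMENT carries the page number. A sketch of building just that payload (field names copied from gl.py; the viewstate value and the helper name are made up):

```python
# Sketch: build the form body for an ASP.NET pager postback.
# Field names come from gl.py; the viewstate value is fake.
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

def build_pager_payload(viewstate, page_num):
    fields = {
        '__VIEWSTATE': viewstate,
        '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$AspNetPager1',
        '__EVENTARGUMENT': str(page_num),
    }
    return urlencode(fields)

body = build_pager_payload('fakeVS==', 3)
```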
The main loop:
while flag == True and logintime < 50:
    try:
        print "global login the ", str(logintime), " times"
        logger.info("global login the " + str(logintime) + " times")
        bsObj = get_logined_Data(opener, logger, views)
        if bsObj is None:
            print "try login 15 times,but failed,exit"
            logger.info("try login 15 times,but failed,exit")
            exit()
        else:
            print "global login the ", str(logintime), " times successfully!"
            logger.info("global login the " + str(logintime) + " times successfully!")
            viewState_Source = bsObj.find(id="__VIEWSTATE")
            if totalNum == -1:
                totalNum = get_totalNum(bsObj)
                print "totalNum:", str(totalNum)
                logger.info("totalnum:" + str(totalNum))
            vs = gl.vs
            if viewState_Source != None:
                vs['__VIEWSTATE'] = viewState_Source["value"]

            # fetch page snum
            while snum <= totalNum:
                print "begin get the ", str(snum), " page"
                logger.info("begin get the " + str(snum) + " page")
                nameList = get_appointed_page(snum, opener, vs, logger)
                if nameList is None:
                    print "get the nameList failed,connect again"
                    logger.info("get the nameList failed,connect again")
                    raise Exception
                else:
                    print "get the ", str(snum), " successfully"
                    logger.info("get the " + str(snum) + " successfully")

                    mydata = paras_data(nameList, logger)
                    # save to the CSV file
                    save_to_csv(mydata, writer)
                    # save to the database
                    save_to_mysql(mydata, conn, cursor)

                    snum += 1
                    time.sleep(3)

            flag = False
    except Exception as e:
        logintime += 1
        print "catch exception", e
        logger.error("catch exception" + e.message)
Setting up the cron job:
cd /var/spool/cron/
crontab -e  # edit the cron jobs
Enter: 1 1 1 * * /yourpath/normal_script.sh>>/yourpath/cronlog.log 2>&1
(This cron entry runs normal_script.sh at 01:01 on the 1st of every month, appending its output to cronlog.log.)
Directory layout:
Source download: helloworld.zip