How to run `selenium-chromedriver` in multiple threads

Problem description

I am using selenium and chrome-driver to scrape data from some pages and then run some additional tasks with that information (for example, typing comments on certain pages).

My program has a button. Every time it is pressed, it calls thread_(self) (below), starting a new thread. The target function self.main contains the code that runs all the selenium work on a chrome-driver.

def thread_(self):
    th = threading.Thread(target=self.main)
    th.start()

My problem is what happens after the user presses the button the first time: the th thread opens browser A and starts doing some work. While browser A is busy, the user presses the button again, opening browser B that runs the same self.main. I want each opened browser to run simultaneously. The problem I faced is that when I run that thread function, the first browser stops and the second browser opens.

I know my code can create threads without limit, and I know this will affect PC performance, but I am okay with that. I want to speed up the work done by self.main!


Solution

Threading for selenium speed up

Consider the following functions to exemplify how threads with selenium give some speed-up compared to a single-driver approach. The code below scrapes the html title from a page opened by selenium, using BeautifulSoup. The list of pages is links.

import time
from bs4 import BeautifulSoup
from selenium import webdriver
import threading

def create_driver():
   """returns a new chrome webdriver"""
   chromeOptions = webdriver.ChromeOptions()
   chromeOptions.add_argument("--headless") # make it not visible, just comment if you like seeing opened browsers
   return webdriver.Chrome(options=chromeOptions)  

def get_title(url, driver=None):
    """Get the url's html title using BeautifulSoup.
    If driver is None, creates a new chrome-driver and quit()s it afterwards;
    otherwise uses the driver provided and does not quit() it."""
    def print_title(driver):
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "lxml")
        item = soup.find('title')
        print(item.string.strip())

    if driver:
        print_title(driver)
    else:
        driver = create_driver()
        print_title(driver)
        driver.quit()

links = ["https://www.amazon.com", "https://www.google.com", "https://www.youtube.com/", "https://www.facebook.com/", "https://www.wikipedia.org/", 
"https://us.yahoo.com/?p=us", "https://www.instagram.com/", "https://www.globo.com/", "https://outlook.live.com/owa/"]

Now calling get_title on the links above.

Sequential approach

A single chrome driver, passing all links sequentially. Takes 22.3 s on my machine (note: Windows).

start_time = time.time()
driver = create_driver()

for link in links:  # could be 'like' clicks
    get_title(link, driver)

driver.quit()
print("sequential took ", (time.time() - start_time), " seconds")

Multi-threading approach

Using a thread for each link. Results in 10.5 s, more than 2x faster.

start_time = time.time()
threads = []
for link in links:  # each thread could be like a new 'click'
    th = threading.Thread(target=get_title, args=(link,))
    th.start()  # could `time.sleep` between 'clicks' to see what's up without the headless option
    threads.append(th)
for th in threads:
    th.join()  # main thread waits for the threads to finish
print("multiple threads took ", (time.time() - start_time), " seconds")

This and this are some other working examples. The second uses a fixed number of threads in a ThreadPool, and suggests that storing the chrome-driver instance initialized on each thread is faster than creating and starting one every time.
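That per-thread driver idea can be sketched with `multiprocessing.pool.ThreadPool` and `threading.local`. To keep the sketch runnable without Chrome installed, a counting stand-in (`make_fake_driver`, illustration-only) replaces `create_driver()`; swap the real factory in for actual scraping.

```python
import threading
from multiprocessing.pool import ThreadPool

created = []                 # records which worker threads built a "driver"
create_lock = threading.Lock()

def make_fake_driver():
    """Stand-in for create_driver(); use webdriver.Chrome(...) for real work."""
    with create_lock:
        created.append(threading.current_thread().name)
    return object()          # placeholder for a webdriver instance

tls = threading.local()      # one independent slot per pool thread

def get_driver():
    # Lazily create and cache exactly one driver per worker thread
    if not hasattr(tls, "driver"):
        tls.driver = make_fake_driver()
    return tls.driver

def task(url):
    driver = get_driver()    # reuses the thread's cached driver
    # driver.get(url); scrape here in the real version
    return url

pool = ThreadPool(3)         # fixed number of worker threads
results = pool.map(task, ["page%d" % i for i in range(12)])
pool.close()
pool.join()

print(len(results), len(created))  # 12 tasks, but at most 3 drivers created
```

In a real version each cached driver should also be quit() once the pool is done, otherwise the browsers stay open.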

Still, I was not sure this was the optimal approach for selenium to get considerable speed-ups, since threading on non-I/O-bound code ends up executing sequentially (one thread after another). Due to the Python GIL (Global Interpreter Lock), a Python process cannot run threads in parallel (i.e., utilize multiple CPU cores).
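The GIL effect is easy to observe on pure-Python CPU work: two CPU-bound calls in two threads take roughly as long as running them back to back (a small illustrative benchmark, not from the original tests).

```python
import threading
import time

def cpu_work(n=2_000_000):
    # Pure-Python loop: holds the GIL the whole time
    total = 0
    for i in range(n):
        total += i
    return total

# Sequential: two calls back to back
t0 = time.time()
cpu_work()
cpu_work()
seq = time.time() - t0

# Threaded: two threads, but the GIL serializes the bytecode
t0 = time.time()
threads = [threading.Thread(target=cpu_work) for _ in range(2)]
for th in threads:
    th.start()
for th in threads:
    th.join()
par = time.time() - t0

print(f"sequential {seq:.2f}s vs threaded {par:.2f}s")  # roughly the same
```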

Using the multiprocessing package

To try to overcome the Python GIL limitation using the multiprocessing package and its Process class, I wrote the following code and ran multiple tests. I even added random page hyperlink clicks to the get_title function above. Additional code is here.

import multiprocessing  # note: on Windows this block must run under `if __name__ == "__main__":`

start_time = time.time()

processes = []
for link in links:  # each process is a new 'click'
    ps = multiprocessing.Process(target=get_title, args=(link,))
    ps.start()  # could sleep between 'clicks' with `time.sleep(1)`
    processes.append(ps)
for ps in processes:
    ps.join()  # main waits for the processes to finish

print("multiple processes took ", (time.time() - start_time), " seconds")

Contrary to what I expected, multiprocessing.Process-based parallelism for selenium was on average around 8% slower than threading.Thread. But clearly both were, on average, more than twice as fast as the sequential approach. I just found out that selenium chrome-driver commands use HTTP requests (like POST and GET), so the work is I/O bound; the GIL is released during those calls, which is what makes the threads genuinely run in parallel.
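That I/O-bound behavior can be mimicked with time.sleep, which, like a blocking HTTP call, releases the GIL while waiting (an illustrative sketch, not part of the original tests).

```python
import threading
import time

def fake_request():
    time.sleep(0.2)  # stands in for a blocking HTTP call; releases the GIL while waiting

# Sequential: the five waits add up
t0 = time.time()
for _ in range(5):
    fake_request()
seq = time.time() - t0

# Threaded: the five waits overlap because the GIL is released
t0 = time.time()
threads = [threading.Thread(target=fake_request) for _ in range(5)]
for th in threads:
    th.start()
for th in threads:
    th.join()
par = time.time() - t0

print(f"sequential {seq:.2f}s vs threaded {par:.2f}s")
```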

This is not a definitive answer, as my tests were only a tiny example. Also, I'm using Windows, where multiprocessing has many limitations in this case: each new Process is not a fork like on Linux, which means, among other downsides, that a lot of memory is wasted.

Taking all that into account: it seems that, depending on the use case, threads may be as good as or better than the heavier approach of processes (especially for Windows users).
