Chrome crashes after several hours while multiprocessing with Selenium through Python

2022-01-12 00:00:00 python selenium selenium-chromedriver multiprocessing google-chrome

Question

This is the error traceback after several hours of scraping:

The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.

This is my Selenium Python setup:

    # scrape.py
    from selenium import webdriver
    from selenium.common.exceptions import *
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.chrome.options import Options

    def run_scrape(link):
        # Configure a headless Chrome session for this one link.
        chrome_options = Options()
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument("--headless")
        chrome_options.add_argument('--disable-dev-shm-usage')
        chrome_options.add_argument("--lang=en")
        chrome_options.add_argument("--start-maximized")
        chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
        chrome_options.add_experimental_option('useAutomationExtension', False)
        chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36")
        chrome_options.binary_location = "/usr/bin/google-chrome"
        browser = webdriver.Chrome(executable_path=r'/usr/local/bin/chromedriver', options=chrome_options)
        browser.get(link)
        try:
            pass  # scrape process
        except Exception:
            pass  # other stuffs
        browser.quit()

    # multiprocess.py
    import time
    from multiprocessing import Pool
    from scrape import *

    if __name__ == '__main__':
        start_time = time.time()
        # links = list of links to be scraped
        pool = Pool(20)
        results = pool.map(run_scrape, links)
        pool.close()
        print("Total Time Processed: " + "--- %s seconds ---" % (time.time() - start_time))

Chrome, ChromeDriver Setup, Selenium Version

ChromeDriver 79.0.3945.36 (3582db32b33893869b8c1339e8f4d9ed1816f143-refs/branch-heads/3945@{#614})
Google Chrome 79.0.3945.79
Selenium Version: 4.0.0a3

I'm wondering why Chrome is closing while the other processes keep working.

Solution

I took your code, modified it slightly to suit my test environment, and here are the execution results:

Code Block:

multiprocess.py:

    import time
    from multiprocessing import Pool
    from multiprocessingPool.scrape import run_scrape

    if __name__ == '__main__':
        start_time = time.time()
        links = ["https://selenium.dev/downloads/", "https://selenium.dev/documentation/en/"]
        pool = Pool(2)
        results = pool.map(run_scrape, links)
        pool.close()
        print("Total Time Processed: " + "--- %s seconds ---" % (time.time() - start_time))

scrape.py:

    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException, TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.options import Options

    def run_scrape(link):
        # Each call builds its own options and its own browser instance.
        chrome_options = Options()
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument("--headless")
        chrome_options.add_argument('--disable-dev-shm-usage')
        chrome_options.add_argument("--lang=en")
        chrome_options.add_argument("--start-maximized")
        chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
        chrome_options.add_experimental_option('useAutomationExtension', False)
        chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36")
        chrome_options.binary_location = r'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe'
        browser = webdriver.Chrome(executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe', options=chrome_options)
        browser.get(link)
        try:
            print(browser.title)
        except (NoSuchElementException, TimeoutException):
            print("Error")
        browser.quit()

Console Output:

    Downloads
    The Selenium Browser Automation Project :: Documentation for Selenium
    Total Time Processed: --- 10.248600006103516 seconds ---

It is pretty much evident that your program is logically flawless.

Since you mention that this error surfaces after several hours of scraping, I suspect it is due to the fact that WebDriver is not thread-safe. That said, if you can serialize access to the underlying driver instance, you can share a reference across more than one thread, although this is not advisable. You can, however, always instantiate one WebDriver instance per thread.

Ideally, the issue of thread safety lies not in your code but in the actual browser bindings. They all assume that only one command is issued at a time (e.g. as by a real user). On the other hand, you can always instantiate one WebDriver instance for each thread, each of which launches its own browsing tabs/windows. Up to this point, your program seems fine.
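As a rough illustration of the "one WebDriver instance per thread" idea, here is a minimal sketch. The names `scrape_all`, `driver_factory` and `workers` are hypothetical and not part of Selenium; the only thing assumed of the driver object is that it offers `get()`, `title` and `quit()`, as a Selenium WebDriver does.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def scrape_all(links, driver_factory, workers=2):
    """Fetch page titles on a thread pool, one driver per worker thread.

    driver_factory is a hypothetical zero-argument callable that returns a
    new driver; it is called at most once per worker thread.
    """
    local = threading.local()       # per-thread storage for the driver
    created = []                    # every driver created, for cleanup
    created_lock = threading.Lock()

    def worker(link):
        driver = getattr(local, "driver", None)
        if driver is None:
            # First task on this thread: create the thread's own driver.
            driver = driver_factory()
            local.driver = driver
            with created_lock:
                created.append(driver)
        driver.get(link)
        return driver.title

    try:
        with ThreadPoolExecutor(max_workers=workers) as executor:
            return list(executor.map(worker, links))
    finally:
        # Quit every driver, even if one of the tasks raised.
        for driver in created:
            driver.quit()
```

With Selenium, the factory could be something like `lambda: webdriver.Chrome(options=chrome_options)`. No driver is ever touched by two threads, so no locking is needed around the WebDriver calls themselves.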

Different threads can be run on the same WebDriver, but then the results of the tests would not be what you expect. The reason is that when you use multithreading to run different tests on different tabs/windows, a little thread-safety coding is required; otherwise actions such as click() or send_keys() go to whichever opened tab/window currently has focus, regardless of the thread you expect to be running. In effect, all the tests run simultaneously on the one tab/window that has focus instead of on their intended tabs/windows.
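If a single driver really must be shared, the "thread-safety coding" mentioned above could look roughly like the sketch below. It is not a recommended setup. `SharedDriver` and `run_in_window` are hypothetical names; the only Selenium call assumed is `driver.switch_to.window(handle)`. A lock serializes access, and the intended window is focused before each action, so the action cannot land on another thread's tab.

```python
import threading


class SharedDriver:
    """Hypothetical wrapper that serializes access to one shared WebDriver."""

    def __init__(self, driver):
        self._driver = driver
        self._lock = threading.Lock()

    def run_in_window(self, handle, action):
        """Focus the window `handle`, then run action(driver) under the lock."""
        with self._lock:
            # Focus the intended tab/window before acting on it.
            self._driver.switch_to.window(handle)
            # Only one thread at a time gets past the lock to this point.
            return action(self._driver)
```

A thread would then call something like `shared.run_in_window(handle, lambda d: d.title)`. Because every action waits for the lock, the threads no longer run in parallel against the browser, which is one reason a separate instance per thread is usually preferable.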
