Wait until a page is loaded with Selenium WebDriver for Python

2022-01-30 00:00:00 python selenium execute-script

Problem description

I want to scrape all the data from a page that uses infinite scroll. The following Python code works.

for i in range(100):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)

This means that every time I scroll to the bottom, I wait 5 seconds, which is generally enough for the page to finish loading the newly generated content. But this is not time efficient, because the page may finish loading the new content in less than 5 seconds. How can I detect whether the page has finished loading the new content each time I scroll down? If I could detect that, I could scroll down again as soon as the page is ready, which would save time.


Solution

By default, the webdriver waits for a page to load when you call the .get() method.

As @user227215 said, you may be looking for a specific element, in which case you should use WebDriverWait to wait for that element to be located in the page:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

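browser = webdriver.Firefox()
browser.get("url")  # placeholder: replace with the address of your page
delay = 3  # seconds
try:
    # Block until the element is present in the DOM, or raise after `delay` seconds
    myElem = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'IdOfMyElement')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")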

I have used it for checking alerts. You can use any other type of locator to find the element.

Edit 1:

I should mention that the webdriver waits for a page to load by default, but it does not wait for content inside frames or for ajax requests. That means when you call .get('url'), the browser waits until the page is completely loaded and then moves on to the next command in your code. When the page issues an ajax request, however, the webdriver does not wait, and it is your responsibility to wait an appropriate amount of time for the page, or a part of it, to load. That is what the expected_conditions module is for.
