How do I use Selenium in Databricks, access and move downloaded files to mounted storage, and keep the Chrome and ChromeDriver versions in sync?

2022-04-11 00:00:00 python selenium databricks azure-databricks pyspark

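Question

I have seen several posts about using Selenium in Databricks, with %sh used to install ChromeDriver and Chrome. That works fine for me, but I ran into a lot of trouble once I needed to download a file. The file downloads, but I cannot find it in the Databricks filesystem. Even when I change the download path to a mounted folder on Azure Blob Storage while instantiating Chrome, the file is not placed there after the download. A second problem is keeping the Chrome browser and ChromeDriver versions in sync automatically, without changing the version number by hand.

The following links show people with the same problem but no clear answer: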

https://forums.databricks.com/questions/19376/if-my-notebook-downloads-a-file-from-a-website-by.html

https://forums.databricks.com/questions/45388/selenium-in-databricks-with-add-experimental-optio.html

Is there a way to identify where the file gets downloaded in Azure Databricks when I do web automation using Selenium Python?

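Others are struggling just to get Selenium running at all: https://forums.databricks.com/questions/14814/selenium-in-databricks.html

Not-in-PATH error: https://webcache.googleusercontent.com/search?q=cache:NrvVKo4LLdIJ:https://stackoverflow.com/questions/57904372/cannot-get-selenium-webdriver-to-work-in-azure-databricks+&cd=5&hl=en&ct=clnk&gl=us

Is there a clear guide to using Selenium on Databricks and managing the downloaded files? And how can the Chrome browser and ChromeDriver versions be kept in sync automatically?

Answer

Below is a guide to installing Selenium, Chrome, and ChromeDriver. It also moves a file to mounted storage after it has been downloaded through Selenium. Each step should go in its own notebook cell.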

Install Selenium.

%pip install selenium

Do the imports.

import pickle as pkl
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

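Download the latest ChromeDriver to /tmp/ on the root filesystem of the driver node. The curl command fetches the latest ChromeDriver version number and stores it in the version variable. The $ needs no escaping in a %sh cell; escaping only matters if the same lines are written through an unquoted heredoc.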

%sh
version=`curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE`
wget -N https://chromedriver.storage.googleapis.com/${version}/chromedriver_linux64.zip -O /tmp/chromedriver_linux64.zip

Unzip the file into a new folder under /tmp/ on the root filesystem. I tried using a non-root path, but it did not work.

%sh
unzip /tmp/chromedriver_linux64.zip -d /tmp/chromedriver/

Download and install the latest Chrome.

%sh
sudo curl -sS -o - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
sudo echo "deb https://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
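
Once both binaries are installed, it can help to confirm that they actually agree before launching a browser. The following is a minimal sketch, not part of the original answer: the helper names (major_version, versions_in_sync, read_version) are hypothetical, the version strings are made up for illustration, and it assumes both tools print a four-part version number when called with --version.

```python
import re
import subprocess


def major_version(version_output):
    """Extract the major version from text such as 'Google Chrome 100.0.4896.75'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"No version number found in: {version_output!r}")
    return int(match.group(1))


def versions_in_sync(chrome_output, driver_output):
    """Return True when browser and driver report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


def read_version(command):
    """Run a '--version' command and return its output (the binary must be installed)."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


# Illustrative (made-up) version strings; no binaries are needed for this call.
print(versions_in_sync("Google Chrome 100.0.4896.75",
                       "ChromeDriver 100.0.4896.60 (abc123)"))  # True
```

On the cluster, the same check could be run against the real binaries with versions_in_sync(read_version(["google-chrome", "--version"]), read_version(["/tmp/chromedriver/chromedriver", "--version"])).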

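The three steps above (downloading ChromeDriver, unzipping it, and installing Chrome) can be combined into a single command. You can also use the commands below to create a shell script and use it as an init script that configures the cluster. This is especially useful for job clusters that run on ephemeral clusters, because an init script applies to all worker nodes rather than only the driver node. The script also installs Selenium, so you can skip the Selenium install step. Just paste the cell into a new notebook, run it, and then point your cluster's init script at dbfs:/init/init_selenium.sh. From then on, whenever a cluster or ephemeral cluster starts, Chrome, ChromeDriver, and Selenium are installed on all worker nodes before the job starts running.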

%sh
# Write the init script to dbfs:/init/init_selenium.sh.
# The quoted 'EOF' delimiter stops the notebook shell from expanding
# the backticks and ${version} while the file is being written.
mkdir -p /dbfs/init
cat > /dbfs/init/init_selenium.sh <<'EOF'
#!/bin/sh
echo Install Chrome and Chrome driver
version=`curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE`
wget -N https://chromedriver.storage.googleapis.com/${version}/chromedriver_linux64.zip -O /tmp/chromedriver_linux64.zip
unzip /tmp/chromedriver_linux64.zip -d /tmp/chromedriver/
sudo curl -sS -o - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
sudo echo "deb https://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
pip install selenium
EOF
# Print the script to confirm what was written.
cat /dbfs/init/init_selenium.sh
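
The init script can also be written from a Python cell, which sidesteps shell quoting entirely because a plain Python string leaves ${version} untouched. This is only a sketch under assumptions: write_init_script and INIT_SCRIPT are hypothetical names, and the target path assumes the /dbfs mount is writable from the driver.

```python
import os

# Same commands as the shell version; a normal (non f-) string keeps ${version} literal.
INIT_SCRIPT = """#!/bin/sh
echo Install Chrome and Chrome driver
version=`curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE`
wget -N https://chromedriver.storage.googleapis.com/${version}/chromedriver_linux64.zip -O /tmp/chromedriver_linux64.zip
unzip /tmp/chromedriver_linux64.zip -d /tmp/chromedriver/
sudo curl -sS -o - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
sudo echo "deb https://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
pip install selenium
"""


def write_init_script(path, contents=INIT_SCRIPT):
    """Write the init script to `path`, creating the parent folder if needed."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # newline="\n" keeps Unix line endings regardless of platform.
    with open(path, "w", newline="\n") as handle:
        handle.write(contents)
    return path


# In a Databricks notebook (assumed path, matching the shell version):
# write_init_script("/dbfs/init/init_selenium.sh")
```

Reading the file back afterwards, as the shell version does with cat, is a cheap way to confirm that ${version} survived unexpanded.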

Configure your storage account. The example uses Azure Blob Storage with ADLS Gen2.

service_principal_id = "YOUR_SP_ID"
service_principle_key = "YOUR_SP_KEY"
tenant_id = "YOUR_TENANT_ID"
directory = "https://login.microsoftonline.com/" + tenant_id + "/oauth2/token"
configs = {"fs.azure.account.auth.type": "OAuth",
       "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
       "fs.azure.account.oauth2.client.id":  service_principal_id,
       "fs.azure.account.oauth2.client.secret": service_principle_key,
       "fs.azure.account.oauth2.client.endpoint": directory,
       "fs.azure.createRemoteFileSystemDuringInitialization": "true"}

Configure your mount point and mount the storage.

mount_point = "/mnt/container-data/"
mount_point_main = "/dbfs/mnt/container-data/"
container = "container-data"
storage_account = "adlsgen2"
storage = "abfss://"+ container +"@"+ storage_account + ".dfs.core.windows.net"
utils_folder = mount_point + "utils/selenium/"
raw_folder = mount_point + "raw/"

if not any(m.mountPoint.rstrip("/") == mount_point.rstrip("/") for m in dbutils.fs.mounts()):
  dbutils.fs.mount(
    source = storage,
    mount_point = mount_point,
    extra_configs = configs)
  print(mount_point + " has been mounted.")
else:
  print(mount_point + " was already mounted.")
print(f"Utils folder: {utils_folder}")
print(f"Raw folder: {raw_folder}")

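Create a method that instantiates the Chrome browser. I needed to load a cookie file that lives in the utils folder, mnt/container-data/utils/selenium. Make sure you pass the same arguments (no-sandbox, headless, disable-dev-shm-usage).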

def init_chrome_browser(download_path, chrome_driver_path, cookies_path, url):
    """
    Instantiates a Chrome browser.

    Parameters
    ----------
    download_path : str
        The download path to place files downloaded from this browser session.
    chrome_driver_path : str
        The path of the ChromeDriver executable binary.
    cookies_path : str
        The path of the cookie file to load in (.pkl file).
    url : str
        The URL address of the page to initially load.

    Returns
    -------
    Browser
        Returns the instantiated browser object.
    """
    
    options = Options()
    prefs = {'download.default_directory' : download_path}
    options.add_experimental_option('prefs', prefs)
    options.add_argument('--no-sandbox')
    options.add_argument('--headless')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--start-maximized')
    options.add_argument('window-size=2560,1440')
    print(f"{datetime.now()}    Launching Chrome...")
    browser = webdriver.Chrome(service=Service(chrome_driver_path), options=options)
    print(f"{datetime.now()}    Chrome launched.")
    browser.get(url)
    print(f"{datetime.now()}    Loading cookies...")
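    # Use a context manager so the cookie file is closed after loading.
    with open(cookies_path, "rb") as cookie_file:
        cookies = pkl.load(cookie_file)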
    for cookie in cookies:
        browser.add_cookie(cookie)
    browser.get(url)
    print(f"{datetime.now()}    Cookies loaded.")
    print(f"{datetime.now()}    Browser ready to use.")
    return browser
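
The answer does not show how cookies.pkl was produced. One plausible way, sketched here as an assumption, is to pickle the list of cookie dictionaries that Selenium returns from browser.get_cookies() in a session that is already logged in. The helper names save_cookies and load_cookies are hypothetical.

```python
import pickle as pkl


def save_cookies(cookies, cookies_path):
    """Pickle a list of cookie dicts, e.g. the result of browser.get_cookies()."""
    with open(cookies_path, "wb") as handle:
        pkl.dump(cookies, handle)


def load_cookies(cookies_path):
    """Load the pickled cookie list back, ready for browser.add_cookie()."""
    with open(cookies_path, "rb") as handle:
        return pkl.load(handle)


# In a logged-in Selenium session (assumed path, matching the utils folder above):
# save_cookies(browser.get_cookies(), "/dbfs/mnt/container-data/utils/selenium/cookies.pkl")
```

Cookies expire, so the file would need to be regenerated from time to time.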

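Instantiate the browser. Set the download location to /tmp/downloads on the root filesystem. Make sure the cookie path is prefixed with /dbfs, so that the full cookie path looks like /dbfs/mnt/...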

browser = init_chrome_browser(
    download_path="/tmp/downloads",
    chrome_driver_path="/tmp/chromedriver/chromedriver",
    cookies_path="/dbfs"+ utils_folder + "cookies.pkl",
    url="YOUR_URL"
)

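Do your navigation and any downloads you need.
Optional: check your download location. In this example I downloaded a CSV file, so the code walks the folders under /tmp until it finds a file of that format.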

import os
import os.path
for root, directories, filenames in os.walk('/tmp'):
    print(root)
    if any(".csv" in s for s in filenames):
        print(filenames)
        break
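
A download may still be in progress when the next cell runs. Chrome keeps unfinished downloads under a .crdownload suffix, so a small polling helper can wait until a finished file appears. This is a sketch rather than part of the original answer: wait_for_download and its parameters are hypothetical, and the timeout values are arbitrary.

```python
import os
import time


def wait_for_download(directory, suffix=".csv", timeout=60.0, poll_interval=0.5):
    """Poll `directory` until a finished file ending in `suffix` exists.

    Raises TimeoutError if no such file shows up before `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(directory) if os.path.isdir(directory) else []
        # Chrome marks partially downloaded files with a .crdownload suffix.
        in_progress = any(name.endswith(".crdownload") for name in names)
        finished = sorted(name for name in names if name.endswith(suffix))
        if finished and not in_progress:
            return os.path.join(directory, finished[0])
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"No finished {suffix} file in {directory} after {timeout}s"
            )
        time.sleep(poll_interval)


# In the notebook, after triggering the download:
# csv_path = wait_for_download("/tmp/downloads", suffix=".csv", timeout=120)
```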

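Copy the file from /tmp on the root filesystem to your mounted storage (/mnt/container-data/raw/). You can also rename the file as part of this operation. With dbutils, the root filesystem can only be accessed through the file: prefix.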

dbutils.fs.cp("file:/tmp/downloads/file1.csv", f"{raw_folder}file2.csv")
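
If several files need to be moved, the two path conventions (file: prefix for the source, mount path for the destination) can be wrapped in a small helper. This is a hedged sketch: build_copy_paths is a hypothetical name, and it only builds the two strings, leaving the actual copy to dbutils.fs.cp in the notebook.

```python
import posixpath


def build_copy_paths(local_path, target_folder, new_name=None):
    """Return (source, destination) strings suitable for dbutils.fs.cp."""
    # The file: prefix makes dbutils read from the local root filesystem.
    source = "file:" + local_path
    # Keep the original file name unless a new one is given.
    file_name = new_name or posixpath.basename(local_path)
    destination = posixpath.join(target_folder, file_name)
    return source, destination


source, destination = build_copy_paths(
    "/tmp/downloads/file1.csv", "/mnt/container-data/raw/", "file2.csv"
)
print(source)       # file:/tmp/downloads/file1.csv
print(destination)  # /mnt/container-data/raw/file2.csv
```

In the notebook the result would then be passed on as dbutils.fs.cp(source, destination).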
