Python: how to download multiple files in parallel using multiprocessing.Pool

2022-04-10 00:00:00 python python-multiprocessing

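Problem description

I am trying to download and extract zip files using multiprocessing.Pool. Every time I run the script, only 3 zips are downloaded and the remaining files do not appear in the directory (CPU usage also reaches 100%). Can someone help me solve this problem or suggest a better approach? The snippet I tried follows. I am completely new to multiprocessing. My goal is to download multiple files in parallel without maxing out the CPU.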

import StringIO
import os
import sys
import zipfile
from multiprocessing import Pool, cpu_count

import requests

filePath = os.path.dirname(os.path.abspath(__file__))
print("filePath is %s " % filePath)
sys.path.append(filePath)
url = ["http://mlg.ucd.ie/files/datasets/multiview_data_20130124.zip",
       "http://mlg.ucd.ie/files/datasets/movielists_20130821.zip",
       "http://mlg.ucd.ie/files/datasets/bbcsport.zip",
       "http://mlg.ucd.ie/files/datasets/movielists_20130821.zip",
       "http://mlg.ucd.ie/files/datasets/3sources.zip"]


def download_zips(url):
    file_name = url.split("/")[-1]
    response = requests.get(url)
    sourceZip = zipfile.ZipFile(StringIO.StringIO(response.content))
    print("\n Downloaded {} ".format(file_name))
    sourceZip.extractall(filePath)
    print("extracted {} \n".format(file_name))
    sourceZip.close()


if __name__ == "__main__":
    print("There are {} CPUs on this machine ".format(cpu_count()))
    pool = Pool(cpu_count())
    results = pool.map(download_zips, url)
    pool.close()
    pool.join()

Output below:

filePath is C:\Users\Documents\GitHub\Python-Examples-Internet\multi_processing 
There are 4 CPUs on this machine 
filePath is C:\Users\Documents\GitHub\Python-Examples-Internet\multi_processing 
filePath is C:\Users\Documents\GitHub\Python-Examples-Internet\multi_processing 
filePath is C:\Users\Documents\GitHub\Python-Examples-Internet\multi_processing 
filePath is C:\Users\Documents\GitHub\Python-Examples-Internet\multi_processing 

 Downloaded bbcsport.zip 
extracted bbcsport.zip 


 Downloaded 3sources.zip 
extracted 3sources.zip 


 Downloaded multiview_data_20130124.zip 

 Downloaded movielists_20130821.zip 

 Downloaded movielists_20130821.zip 
extracted multiview_data_20130124.zip 

extracted movielists_20130821.zip 

extracted movielists_20130821.zip 

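Solution

I made a few small adjustments to your function and it works fine. Please note:

  1. The file ".../movielists_20130821.zip" appears twice in your list, so you are downloading the same content twice (maybe a typo?).
  2. The files ".../multiview_data_20130124.zip", ".../movielists_20130821.zip" and ".../3sources.zip" each create a new directory when extracted. The file ".../bbcsport.zip", however, places its files directly in the root folder, i.e. your current working directory (see the image below). Maybe you missed checking this?
  3. I added a try/except block to the download function. Why? Multiprocessing works by creating new (child) processes to run the work. If a child process raises an exception, it is not raised at the point of failure in the parent; Pool.map only re-raises it once the results are collected, which discards the other return values. So any error in a child process should be logged/handled there.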
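
Point 3 can be illustrated with a minimal sketch that needs no network access. The names `risky_task` and `safe_task` are hypothetical stand-ins, not part of the original script: `risky_task` plays the role of a download that sometimes fails, and `safe_task` wraps it so that the worker returns the error to the parent instead of raising it.

```python
from multiprocessing import Pool


def risky_task(n):
    """Hypothetical stand-in for a download: fails for one particular input."""
    if n == 3:
        raise ValueError("simulated failure for input 3")
    return n * n


def safe_task(n):
    """Run risky_task in the worker and report the outcome instead of raising."""
    try:
        return n, risky_task(n), None
    except Exception as exc:
        # Handle the error where it happens and send a description back.
        return n, None, repr(exc)


if __name__ == "__main__":
    with Pool(2) as pool:
        # Each result is (input, value, error); exactly one of the last two is None.
        for n, value, error in pool.map(safe_task, range(5)):
            status = "failed: {}".format(error) if error else "ok: {}".format(value)
            print(n, status)
```

Returning the error as data is only one possible design; it lets the parent see every outcome, including the failed one. The adjusted script from the answer follows.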

import os
import zipfile
from functools import partial
from io import BytesIO
from multiprocessing import Pool, cpu_count

import requests


def download_zip(url, filePath):
    # Errors are handled inside the worker so that a failed download is reported here.
    try:
        file_name = url.split("/")[-1]
        response = requests.get(url)
        response.raise_for_status()  # stop on HTTP errors instead of unzipping an error page
        # Unpack the archive from memory; the with block closes it afterwards.
        with zipfile.ZipFile(BytesIO(response.content)) as sourceZip:
            print(" Downloaded {} ".format(file_name))
            sourceZip.extractall(filePath)
            print(" extracted {}".format(file_name))
    except Exception as e:
        print(e)


if __name__ == "__main__":
    filePath = os.path.dirname(os.path.abspath(__file__))
    print("filePath is %s " % filePath)
    # sys.path.append(filePath) # why do you need this?
    urls = ["http://mlg.ucd.ie/files/datasets/multiview_data_20130124.zip",
            "http://mlg.ucd.ie/files/datasets/movielists_20130821.zip",
            "http://mlg.ucd.ie/files/datasets/bbcsport.zip",
            "http://mlg.ucd.ie/files/datasets/movielists_20130821.zip",
            "http://mlg.ucd.ie/files/datasets/3sources.zip"]

    print("There are {} CPUs on this machine ".format(cpu_count()))
    pool = Pool(cpu_count())
    # Bind the target directory so that pool.map only has to supply the URL.
    download_func = partial(download_zip, filePath=filePath)
    results = pool.map(download_func, urls)
    pool.close()
    pool.join()
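
Since the stated goal is to download in parallel without maxing out the CPU, a thread-based variant may be worth trying. This is a sketch of a different technique from the answer above, not the answer's own method: it swaps the process pool for `multiprocessing.pool.ThreadPool`, assumes the work is dominated by waiting on the network, and uses the standard library's `urlopen` in place of `requests`. The names `download_and_extract` and `run_in_threads`, the 60-second timeout and the default of 4 workers are all illustrative assumptions.

```python
import io
import zipfile
from multiprocessing.pool import ThreadPool
from urllib.request import urlopen


def download_and_extract(url, target_dir="."):
    """Download one zip archive and extract it; return (url, error or None)."""
    try:
        with urlopen(url, timeout=60) as response:  # timeout value is an assumption
            payload = response.read()
        with zipfile.ZipFile(io.BytesIO(payload)) as archive:
            archive.extractall(target_dir)
        return url, None
    except Exception as exc:
        # Report the failure to the caller instead of losing it in the worker.
        return url, repr(exc)


def run_in_threads(worker, urls, max_workers=4):
    """Apply worker to each distinct URL using a small pool of threads."""
    unique_urls = list(dict.fromkeys(urls))  # drop duplicates, keep first-seen order
    with ThreadPool(max_workers) as pool:
        return pool.map(worker, unique_urls)
```

Calling `run_in_threads(download_and_extract, urls)` with the list from the question would fetch each distinct archive once and return one `(url, error)` pair per URL. Whether this lowers CPU usage in practice depends on how much of the time is spent decompressing rather than downloading.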
