Multithreaded file copy is far slower than a single thread on a multicore CPU

2022-01-20 00:00:00 python copy file multithreading queue

Problem description

I am trying to write a multithreaded program in Python to speed up the copying of fewer than 1000 .csv files. The multithreaded code runs even slower than the sequential approach. I timed the code with profile.py. I must be doing something wrong, but I'm not sure what.

Environment:

  • Quad-core CPU.
  • 2 hard drives: one holds the source files, the other is the destination.
  • 1000 csv files, ranging in size from a few KB to 10 MB.

Approach:

I put all the file paths in a Queue and create 4-8 worker threads that pull file paths from the queue and copy the designated file. In no case is the multithreaded code faster:

  • Sequential copy takes 150-160 seconds
  • Threaded copy takes over 230 seconds
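The timings above were taken under profile, a pure-Python profiler that adds overhead of its own to every call. As a cross-check, here is a minimal sketch that measures plain wall-clock time for both strategies. It is written for Python 3 and uses concurrent.futures instead of a hand-rolled queue. The helper names (copy_sequential, copy_threaded, timed, make_sample_files) and the sample file sizes are hypothetical, introduced only for illustration.

```python
import shutil
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_sequential(files, dest):
    """Copy the files one after another."""
    for f in files:
        shutil.copy(f, dest)


def copy_threaded(files, dest, workers=4):
    """Copy the files with a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every copy and re-raises any worker exception.
        list(pool.map(lambda f: shutil.copy(f, dest), files))


def timed(copy_func, files, dest):
    """Return the wall-clock seconds that copy_func takes."""
    start = time.perf_counter()
    copy_func(files, dest)
    return time.perf_counter() - start


def make_sample_files(directory, count=20, size=64 * 1024):
    """Create small placeholder .csv files (hypothetical test data)."""
    paths = []
    for i in range(count):
        path = Path(directory) / f"sample_{i}.csv"
        path.write_bytes(b"x" * size)
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as src, \
         tempfile.TemporaryDirectory() as dest_a, \
         tempfile.TemporaryDirectory() as dest_b:
        files = make_sample_files(src)
        print("sequential:", timed(copy_sequential, files, dest_a))
        print("threaded:  ", timed(copy_threaded, files, dest_b))
```

The temporary directories only demonstrate the harness. Small files in a temp directory are likely served from the operating system's cache, so a meaningful comparison would point the functions at the real source and destination drives.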

I assume this is an I/O-bound task, so multithreading should speed up the operation.
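One rough way to check the I/O-bound assumption is to compare wall-clock time with CPU time for a copy: if the process spends most of the elapsed time waiting rather than computing, CPU time will be a small fraction of wall time. The sketch below assumes Python 3; measure_copy is a hypothetical helper name, not part of the code in this question.

```python
import shutil
import tempfile
import time
from pathlib import Path


def measure_copy(src_file, dest_dir):
    """Copy one file and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system CPU time of this process
    shutil.copy(src_file, dest_dir)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dest:
        sample = Path(src) / "sample.csv"         # placeholder file
        sample.write_bytes(b"x" * (1024 * 1024))  # 1 MB of dummy data
        wall, cpu = measure_copy(sample, dest)
        print(f"wall: {wall:.4f}s  cpu: {cpu:.4f}s")
```

A low CPU share only shows that the process is waiting on the disk. It does not by itself show that issuing more requests in parallel would make the disk deliver data any faster.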

Code:

    import glob       # file wildcard listing, e.g. glob.glob('*.py')
    import os
    import profile    # times the run
    import queue
    import shutil
    import threading

    fileQueue = queue.Queue()  # global, shared by all worker threads
    srcPath = r'C:\temp'       # raw strings, so '\t' is not read as a tab
    destPath = r'D:\temp'
    tcnt = 0                   # files copied so far by the workers
    ttotal = 0
    countLock = threading.Lock()

    def CopyWorker():
        global tcnt
        while True:
            fileName = fileQueue.get()
            try:
                shutil.copy(fileName, destPath)
                with countLock:
                    tcnt += 1
                    print('copied:', tcnt, 'of', ttotal)
            finally:
                # Mark the item done only after the copy has finished;
                # otherwise fileQueue.join() can return too early.
                fileQueue.task_done()

    def threadWorkerCopy(fileNameList):
        global ttotal
        print('threadWorkerCopy:', len(fileNameList))
        ttotal = len(fileNameList)
        for i in range(4):
            t = threading.Thread(target=CopyWorker)
            t.daemon = True
            t.start()
        for fileName in fileNameList:
            fileQueue.put(fileName)
        fileQueue.join()  # block until every queued file is marked done

    def sequentialCopy(fileNameList):
        # around 160.446 seconds, 152 seconds
        print('sequentialCopy:', len(fileNameList))
        cnt = 0
        ctotal = len(fileNameList)
        for fileName in fileNameList:
            shutil.copy(fileName, destPath)
            cnt += 1
            print('copied:', cnt, 'of', ctotal)

    def main():
        print('this is main method')
        fileList = glob.glob(os.path.join(srcPath, '*.csv'))
        #sequentialCopy(fileList)
        threadWorkerCopy(fileList)

    if __name__ == '__main__':
        profile.run('main()')


Solution

Of course it's slower. The hard drives have to seek between the files constantly. Your belief that multithreading would make this task faster is completely unjustified. The limiting factor is how fast you can read data from or write data to the disk, and every seek from one file to another costs time that could have been spent transferring data.
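The argument can be put as a rough cost model: total time is transfer time plus the time lost to seeks. The sketch below is illustrative only. Every number in it (total size, throughput, seek cost, and how often the head switches files) is an assumption chosen for round arithmetic, not a measurement of the drives in the question, and estimated_copy_seconds is a hypothetical helper.

```python
def estimated_copy_seconds(total_bytes, bytes_per_second, seek_count, seek_seconds):
    """Transfer time plus the time lost to head seeks."""
    return total_bytes / bytes_per_second + seek_count * seek_seconds


# Illustrative assumptions only, not measurements.
TOTAL_BYTES = 2_000_000_000     # assumed total size of all files
BYTES_PER_SECOND = 100_000_000  # assumed sustained disk throughput
SEEK_SECONDS = 0.010            # assumed cost of one seek
FILE_COUNT = 1000
CHUNK_BYTES = 1_000_000         # assumed data moved between file switches

# One file at a time: roughly one seek per file.
sequential = estimated_copy_seconds(
    TOTAL_BYTES, BYTES_PER_SECOND, FILE_COUNT, SEEK_SECONDS)

# Several files in flight: assume the head jumps to another file
# after every chunk, so the seek count grows with the data volume.
interleaved = estimated_copy_seconds(
    TOTAL_BYTES, BYTES_PER_SECOND, TOTAL_BYTES // CHUNK_BYTES, SEEK_SECONDS)

print(f"sequential:  {sequential:.1f} s")
print(f"interleaved: {interleaved:.1f} s")
```

Under these assumptions the transfer time is identical in both cases and only the seek term grows, which is the effect the answer describes. Real drives, caches, and operating system schedulers will shift the numbers.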
