What is the difference between Python's multiprocessing and concurrent.futures?
Problem description
A simple way of implementing multiprocessing in Python is
from multiprocessing import Pool

def calculate(number):
    return number

if __name__ == '__main__':
    pool = Pool()
    result = pool.map(calculate, range(4))
An alternative implementation based on futures is
from concurrent.futures import ProcessPoolExecutor

def calculate(number):
    return number

with ProcessPoolExecutor() as executor:
    result = executor.map(calculate, range(4))
Both alternatives do essentially the same thing, but one striking difference is that we don't have to guard the code with the usual if __name__ == '__main__' clause. Is this because the implementation of futures takes care of this, or is there a different reason?
More broadly, what are the differences between multiprocessing and concurrent.futures? When is one preferred over the other?
My initial assumption that the guard if __name__ == '__main__' is only necessary for multiprocessing was wrong. Apparently, one needs this guard for both implementations on Windows, while it is not necessary on Unix systems.
Solution
You actually should use the if __name__ == "__main__" guard with ProcessPoolExecutor, too: it's using multiprocessing.Process to populate its Pool under the covers, just like multiprocessing.Pool does, so all the same caveats regarding picklability (especially on Windows), etc. apply.
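For completeness, here is a sketch of the second snippet from the question with the guard added (reusing the same toy calculate function); this is the form that works on Windows and with the spawn start method in general:

from concurrent.futures import ProcessPoolExecutor

def calculate(number):
    return number

if __name__ == '__main__':
    # The guard prevents child processes from re-executing the pool setup
    # when the module is re-imported under the spawn start method.
    with ProcessPoolExecutor() as executor:
        result = list(executor.map(calculate, range(4)))
    print(result)  # [0, 1, 2, 3]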
I believe that ProcessPoolExecutor is meant to eventually replace multiprocessing.Pool, according to this statement made by Jesse Noller (a Python core contributor), when asked why Python has both APIs:
Brian and I need to work on the consolidation we intend(ed) to occur as people got comfortable with the APIs. My eventual goal is to remove anything but the basic multiprocessing.Process/Queue stuff out of MP and into concurrent.* and support threading backends for it.
For now, ProcessPoolExecutor is mostly doing the exact same thing as multiprocessing.Pool with a simpler (and more limited) API. If you can get away with using ProcessPoolExecutor, use that, because I think it's more likely to get enhancements in the long-term. Note that you can use all the helpers from multiprocessing with ProcessPoolExecutor, like Lock, Queue, Manager, etc., so needing those isn't a reason to use multiprocessing.Pool.
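As an illustration of that point, here is a minimal sketch (the append_square helper and variable names are illustrative, not from the answer) passing a Manager proxy into ProcessPoolExecutor workers; Manager proxies are picklable, so they can be sent to workers as ordinary arguments:

from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager

def append_square(shared_list, number):
    # Runs in a worker process; the proxy forwards calls to the manager.
    shared_list.append(number * number)

if __name__ == '__main__':
    with Manager() as manager:
        shared = manager.list()  # shared, process-safe list proxy
        with ProcessPoolExecutor() as executor:
            futures = [executor.submit(append_square, shared, n) for n in range(4)]
            for future in futures:
                future.result()  # re-raise any worker exception
        print(sorted(shared))  # [0, 1, 4, 9]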
There are some notable differences in their APIs and behavior though:
If a Process in a ProcessPoolExecutor terminates abruptly, a BrokenProcessPool exception is raised, aborting any calls waiting for the pool to do work, and preventing new work from being submitted. If the same thing happens to a multiprocessing.Pool, it will silently replace the process that terminated, but the work that was being done in that process will never be completed, which will likely cause the calling code to hang forever waiting for the work to finish.
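A small sketch of that failure mode (the crashing worker is contrived, using os._exit to simulate an abrupt termination):

import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def crash(number):
    os._exit(1)  # kill the worker without raising a normal exception

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        try:
            list(executor.map(crash, range(4)))
        except BrokenProcessPool:
            print("pool is broken; pending calls abort, new submissions are refused")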
If you are running Python 3.6 or lower, support for initializer/initargs is missing from ProcessPoolExecutor. Support for this was only added in 3.7.
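On 3.7+, the feature looks like this (a sketch; init_worker and the module-level _offset are illustrative names):

from concurrent.futures import ProcessPoolExecutor

def init_worker(offset):
    # Runs once in each worker process before it accepts tasks,
    # e.g. for opening connections or loading read-only data.
    global _offset
    _offset = offset

def calculate(number):
    return number + _offset

if __name__ == '__main__':
    with ProcessPoolExecutor(initializer=init_worker, initargs=(100,)) as executor:
        print(list(executor.map(calculate, range(4))))  # [100, 101, 102, 103]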
There is no maxtasksperchild support in ProcessPoolExecutor.
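For reference, this is the multiprocessing.Pool feature being referred to (a sketch; os.getpid is used only to show that workers get recycled):

import os
from multiprocessing import Pool

def calculate(number):
    return number, os.getpid()

if __name__ == '__main__':
    # Each worker process exits after one task and is replaced by a
    # fresh one, which bounds leaks from misbehaving task code.
    with Pool(processes=2, maxtasksperchild=1) as pool:
        print(pool.map(calculate, range(4)))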
concurrent.futures doesn't exist in Python 2.7, unless you manually install the backport.
If you're running below Python 3.5, according to this question, multiprocessing.Pool.map outperforms ProcessPoolExecutor.map. Note that the performance difference is very small per work item, so you'll probably only notice a large performance difference if you're using map on a very large iterable. The reason for the performance difference is that multiprocessing.Pool will batch the iterable passed to map into chunks, and then pass the chunks to the worker processes, which reduces the overhead of IPC between the parent and children. ProcessPoolExecutor always (or by default, starting in 3.5) passes one item from the iterable at a time to the children, which can lead to much slower performance with large iterables, due to the increased IPC overhead. The good news is this issue is fixed in Python 3.5, as the chunksize keyword argument has been added to ProcessPoolExecutor.map, which can be used to specify a larger chunk size when you know you're dealing with large iterables. See this bug for more info.
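A sketch of the chunksize workaround on 3.5+ (the iterable size and chunk size are arbitrary illustrative values):

from concurrent.futures import ProcessPoolExecutor

def calculate(number):
    return number * 2

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        # Ship items in batches of 1000 per IPC round trip, similar to the
        # automatic batching done by multiprocessing.Pool.map.
        results = list(executor.map(calculate, range(100000), chunksize=1000))
    print(results[:5])  # [0, 2, 4, 6, 8]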