Python multiprocessing Process or Pool for what I am doing?
Question
I'm new to multiprocessing in Python and trying to figure out if I should use Pool or Process for calling two functions asynchronously. The two functions I have make curl calls and parse the information into 2 separate lists. Depending on the internet connection, each function could take about 4 seconds. I realize the bottleneck is the ISP connection and multiprocessing won't speed it up much, but it would be nice to have them both kick off asynchronously. Plus, this is a great learning experience for me to get into Python's multiprocessing because I will be using it more later.
I have read Python multiprocessing.Pool: when to use apply, apply_async or map? and it was useful, but I still had my own questions.
So one way I could do it is:
from multiprocessing import Process

def foo():
    pass

def bar():
    pass

p1 = Process(target=foo, args=())
p2 = Process(target=bar, args=())
p1.start()
p2.start()
p1.join()
p2.join()
Questions I have for this implementation: 1) Since join blocks until the calling process is completed... does this mean the p1 process has to finish before the p2 process is kicked off? I always understood .join() to be the same as pool.apply() and pool.apply_async().get(), where the parent process cannot launch another process (task) until the current running one is completed.
The other way would be:
from multiprocessing import Pool

def foo():
    pass

def bar():
    pass

pool = Pool(processes=2)
p1 = pool.apply_async(foo)
p2 = pool.apply_async(bar)
Questions I have for this implementation: 1) Do I need a pool.close() and pool.join()? 2) Would pool.map() make them all complete before I could get results? And if so, are they still run asynchronously? 3) How would pool.apply_async() differ from doing each process with pool.apply()? 4) How would this differ from the previous implementation with Process?
Solution
The two scenarios you listed accomplish the same thing but in slightly different ways.
The first scenario starts two separate processes (call them P1 and P2), starts P1 running foo and P2 running bar, and then waits until both processes have finished their respective tasks.
The second scenario starts two processes (call them Q1 and Q2), first starts foo on either Q1 or Q2, and then starts bar on either Q1 or Q2. Then the code waits until both function calls have returned.
So the net result is actually the same, but in the first case you're guaranteed to run foo and bar on different processes.
As for the specific questions you had about concurrency, the .join() method on a Process does indeed block until the process has finished, but because you called .start() on both P1 and P2 (in your first scenario) before joining, both processes will run asynchronously. The interpreter will, however, wait until P1 finishes before attempting to wait for P2 to finish.
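To see that joining P1 first doesn't serialize the work, here's a sketch (with placeholder functions standing in for the real curl calls) where both processes push a result onto a shared queue; both results arrive even though the parent joins P1 before P2.

```python
from multiprocessing import Process, Queue

def foo(q):
    q.put("foo done")  # placeholder for the real curl-and-parse work

def bar(q):
    q.put("bar done")  # placeholder for the real curl-and-parse work

if __name__ == "__main__":
    q = Queue()
    p1 = Process(target=foo, args=(q,))
    p2 = Process(target=bar, args=(q,))
    p1.start()   # both processes are running from this point on
    p2.start()
    p1.join()    # blocks the parent only -- p2 keeps running meanwhile
    p2.join()
    print(sorted([q.get(), q.get()]))
```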
For your questions about the pool scenario, you should technically use pool.close(), but it kind of depends on what you might need the pool for afterwards (if it just goes out of scope then you don't necessarily need to close it). pool.map() is a completely different kind of animal, because it distributes a bunch of arguments (asynchronously) to the same function across the pool processes, and then waits until all function calls have completed before returning the list of results.
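A short pool.map sketch to contrast with apply_async, again with a placeholder in place of the real fetching: one function is fanned out over an iterable of arguments, the calls run concurrently across the workers, and map only returns once every result is in, in the original input order.

```python
from multiprocessing import Pool

def fetch(url):
    # placeholder: pretend to fetch and parse one URL
    return f"parsed:{url}"

if __name__ == "__main__":
    urls = ["http://a.example", "http://b.example", "http://c.example"]
    with Pool(processes=2) as pool:
        # blocks until all three calls are done; results keep input order
        results = pool.map(fetch, urls)
    print(results)
```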