How are a parent process's global variables copied to child processes?
Problem description
Ubuntu 20.04
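My understanding of how Python child processes access global variables is this:
- A global variable (say, b) is available to every child process in a copy-on-write capacity.
- If a child process modifies that variable, a copy of b is created first and then that copy is modified. The change is not visible to the parent process (I will ask a question about this part later).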
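To make the second point concrete, here is a minimal sketch. It assumes the fork start method (POSIX only) and uses a hypothetical global named counter as a small stand-in for b. The child increments its own copy and reports the value over a pipe, while the parent's value stays unchanged.

```python
import multiprocessing as mp

counter = [0]  # hypothetical global; a small stand-in for the large array b


def child(conn):
    # This write lands in the child's private copy of the page.
    counter[0] += 1
    conn.send(counter[0])
    conn.close()


def run_demo():
    # Assumption: "fork" is available (Linux). Other start methods
    # re-import the module instead of inheriting the parent's memory.
    ctx = mp.get_context("fork")
    parent_conn, child_conn = ctx.Pipe()
    p = ctx.Process(target=child, args=(child_conn,))
    p.start()
    child_value = parent_conn.recv()
    p.join()
    return counter[0], child_value


if __name__ == "__main__":
    parent_value, child_value = run_demo()
    print("parent sees:", parent_value, "child saw:", child_value)
```

The parent keeps seeing 0 even though the child observed 1, which is the "change is not visible to the parent" behavior described above.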
I ran a few experiments to understand when the object gets copied, but I could not conclude much.
Experiments:
import numpy as np
import multiprocessing as mp
import psutil
b=np.arange(200000000).reshape(-1,100).astype(np.float64)
Then I used the function below to see how memory consumption changes:
def f2():
    print(psutil.virtual_memory().used/(1024*1024*1024))
    global b
    print(psutil.virtual_memory().used/(1024*1024*1024))
    b = b + 1 ### I changed this statement to study the different memory behaviors. I am posting the results for different statements in place of b = b + 1.
    print(psutil.virtual_memory().used/(1024*1024*1024))

p2 = mp.Process(target=f2)
p2.start()
p2.join()
Result format:
statement used in place of b = b + 1
print 1
print 2
print 3
Comments and questions
Results:
b = b+1
6.571144104003906
6.57244873046875
8.082862854003906
Only a copy-on-write view was provided so no memory consumption till it hit b = b+1. At which point a copy of b was created and hence the memory usage spike
b[:, 1] = b[:, 1] + 1
6.6118621826171875
6.613414764404297
8.108139038085938
Only a copy-on-write view was provided so no memory consumption till it hit b[:, 1] = b[:, 1] + 1. It seems that even if some part of the memory is to be updated (here just one column) the entire object would be copied. Seems fair (so far)
b[0, :] = b[0, :] + 1
6.580562591552734
6.581851959228516
6.582511901855469
NO MEMORY CHANGE! When I tried to modify a column it copied the entire b. But when I try to modify a row, it does not create a copy? Can you please explain what happened here?
b[0:100000, :] = b[0:100000, :] + 1
6.572498321533203
6.5740814208984375
6.656215667724609
Slight memory spike. Assuming a partial copy since I modified just the first 1/20th of the rows. But that would mean that while modifying a column as well some partial copy should have been created, unlike the full copy that we saw in case 2 above. No? Can you please explain what happened here as well?
b[0:500000, :] = b[0:500000, :] + 1
6.593017578125
6.594577789306641
6.970676422119141
The assumption of partial copy was right I think. A moderate memory spike to reflect the change in 1/4th of the total rows
b[0:1000000, :] = b[0:1000000, :] + 1
6.570674896240234
6.5723876953125
7.318485260009766
In-line with partial copy hypothesis
b[0:2000000, :] = b[0:2000000, :] + 1
6.594249725341797
6.596080780029297
8.087333679199219
A full copy since now we are modifying the entire array. This is equal to b = b + 1 only. Just that we have now referred using a slice of all the rows
b[0:2000000, 1] = b[0:2000000, 1] + 1
6.564876556396484
6.566963195800781
8.069766998291016
Again full copy. It seems in the case of row slices a partial copy is getting created and in the case of a column slice, a full copy is getting created which, is weird to me. Can you please help me understand what the exact copy semantics of global variables of a child process are?
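As you can see, I cannot find a way to justify the results I am seeing in the experimental setup I described. Can you please help me understand how the parent process's global variables are copied when a child process modifies them fully or partially?
I have also read that:
Children get a copy-on-write view of the parent memory space. As long as you load the dataset before firing the processes and you don't pass a reference to that memory space in the multiprocessing call (that is, workers should use the global variable directly), then there is no copy.
Question 1: What does "As long as you load the dataset before firing the processes and you don't pass a reference to that memory space in the multiprocessing call (that is, workers should use the global variable directly), then there is no copy" mean?
As answered by Mr. Tim Roberts below, it means:
If you pass the dataset as a parameter, then Python has to make a copy to transfer it over. The parameter-passing mechanism doesn't use copy-on-write, partly because the reference counting would get confused. When you create it as a global before things start, there is a solid reference, so the multiprocessing code can make copy-on-write happen.
However, I am not able to verify this behavior. Here are a few tests I ran to verify it:
import numpy as np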
import multiprocessing as mp
import psutil
b=np.arange(200000000).reshape(-1,100).astype(np.float64)
Then I used the function below to see how memory consumption changes:
def f2(b): ### Please notice that the array is passed as an argument and not picked as the global variable of parent process
    print(psutil.virtual_memory().used/(1024*1024*1024))
    b = b + 1 ### I changed this statement to study the different memory behaviors. I am posting the results for different statements in place of b = b + 1.
    print(psutil.virtual_memory().used/(1024*1024*1024))

print(psutil.virtual_memory().used/(1024*1024*1024))
p2 = mp.Process(target=f2,args=(b,)) ### Please notice that the array is passed as an argument and not picked as the global variable of parent process
p2.start()
p2.join()
Result format: same as above
Results:
b = b+1
6.692680358886719
6.69635009765625
8.189273834228516
The second print is arising from within the function hence, by then the copy should have been made and we should see the second print to be around 8.18
b = b
6.699306488037109
6.701808929443359
6.702671051025391
The second and third print should have been around 8.18. The results suggest that no copy is created even though the array b is passed to the function as an argument
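One possible reading of this result, offered as an assumption rather than a confirmed explanation: with the fork start method (the default on Linux for the Python versions that ship with Ubuntu 20.04), the args tuple reaches the child through the forked address space and is not pickled, so passing b as an argument does not by itself force a copy. Start methods such as spawn do pickle the arguments. The sketch below checks this with a hypothetical Probe class that records whether it was ever pickled.

```python
import multiprocessing as mp


class Probe:
    """Hypothetical payload that records whether pickling was attempted."""

    pickled = False

    def __reduce__(self):
        # Called by pickle; flips the flag in the process doing the pickling.
        Probe.pickled = True
        return (Probe, ())


def worker(obj):
    # The child only needs to receive the argument.
    pass


def was_pickled_under_fork():
    Probe.pickled = False
    ctx = mp.get_context("fork")  # assumption: POSIX system with fork
    p = ctx.Process(target=worker, args=(Probe(),))
    p.start()
    p.join()
    return Probe.pickled


if __name__ == "__main__":
    print("argument pickled under fork:", was_pickled_under_fork())
```

If the flag stays False, the argument was inherited rather than serialized, which would be consistent with the flat memory readings for b = b.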
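Answer
Copy-on-write works one virtual memory page at a time. As long as your changes stay within a single 4096-byte page, you only pay for that one page. When you modify a column, your changes are spread across many pages. We Python programmers are not used to worrying about the layout in physical memory, but that is the issue here.
Question 1: If you pass your dataset as a parameter, then Python has to make a copy to transfer it over. The parameter-passing mechanism doesn't use copy-on-write, partly because the reference counting would get confused. When you create it as a global before things start, there is a solid reference, so the multiprocessing code can make copy-on-write happen.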
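The page argument can be checked with plain arithmetic. The sketch below is a rough model, not a measurement. It assumes a C-contiguous array of 2,000,000 x 100 float64 values (the shape used in the question), a 4096-byte page, and a buffer that starts on a page boundary. The helper pages_touched is hypothetical and simply counts the distinct pages a write would dirty.

```python
PAGE_SIZE = 4096  # value quoted above; the real page size is platform-dependent


def pages_touched(n_cols, itemsize, rows, col_start, col_stop, page_size=PAGE_SIZE):
    """Count distinct pages written when columns [col_start, col_stop) of the
    given rows are modified in a C-contiguous 2-D array.

    Assumptions: `rows` is increasing and the buffer starts on a page boundary.
    """
    row_bytes = n_cols * itemsize
    touched = 0
    last_seen = -1  # highest page index already counted
    for r in rows:
        first = (r * row_bytes + col_start * itemsize) // page_size
        last = (r * row_bytes + col_stop * itemsize - 1) // page_size
        first = max(first, last_seen + 1)  # skip pages counted for earlier rows
        if last >= first:
            touched += last - first + 1
            last_seen = last
    return touched


N_ROWS, N_COLS, ITEMSIZE = 2_000_000, 100, 8  # float64 array from the question

one_row = pages_touched(N_COLS, ITEMSIZE, range(1), 0, N_COLS)          # b[0, :]
one_col = pages_touched(N_COLS, ITEMSIZE, range(N_ROWS), 1, 2)          # b[:, 1]
row_block = pages_touched(N_COLS, ITEMSIZE, range(100_000), 0, N_COLS)  # b[0:100000, :]

for name, n in [("one row", one_row), ("one column", one_col), ("100000 rows", row_block)]:
    print(f"{name}: {n} pages, about {n * PAGE_SIZE / 2**30:.3f} GiB")
```

Under these assumptions a single row fits in 1 page, the first 100,000 rows cover 19,532 pages (about 0.075 GiB), and a single column touches all 390,625 pages (about 1.49 GiB) because each 800-byte row lands in a page that holds only about five rows. Those figures are in line with the jumps reported in the question.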