Large numpy arrays in shared memory for multiprocessing: Is something wrong with this approach?
Problem Description
Multiprocessing is a wonderful tool, but it is not so straightforward to use large memory chunks with it. You can load chunks in each process and dump the results to disk, but sometimes you need to store the results in memory, and on top of that, use the fancy numpy functionality.
I have read/googled a lot and came up with some answers:
Use numpy array in shared memory for multiprocessing
Share Large, Read-Only Numpy Array Between Multiprocessing Processes
Python multiprocessing global numpy arrays
How do I pass large numpy arrays between python subprocesses without saving to disk?
Etc., etc., etc.
They all have drawbacks: not-so-mainstream libraries (sharedmem); globally storing variables; not-so-easy-to-read code; pipes; etc., etc.
My goal was to seamlessly use numpy in my workers without worrying about conversions and stuff.
After much trial and error I came up with this. It works on my Ubuntu 16, Python 3.6, 16 GB, 8-core machine. I took a lot of "shortcuts" compared to previous approaches: no global shared state, no raw memory pointers that need to be converted to numpy inside the workers, large numpy arrays passed as process arguments, etc.
Pastebin link above, but I will put a few snippets here.
Some imports:
import numpy as np
import multiprocessing as mp
import multiprocessing.sharedctypes
import ctypes
Allocate some shared memory and wrap it into a numpy array:
def create_np_shared_array(shape, dtype, ctype):
    # Total number of elements; RawArray takes an element count, not a byte count
    size = int(np.prod(shape))
    shared_mem_chunk = mp.sharedctypes.RawArray(ctype, size)
    numpy_array_view = np.frombuffer(shared_mem_chunk, dtype).reshape(shape)
    return numpy_array_view
Create the shared arrays and put something in them:
src = np.random.rand(*SHAPE).astype(np.float32)
src_shared = create_np_shared_array(SHAPE, np.float32, ctypes.c_float)
dst_shared = create_np_shared_array(SHAPE, np.float32, ctypes.c_float)
src_shared[:] = src[:]  # Some numpy ops accept an 'out' array where to store the results
Spawn the process:
p = mp.Process(target=lengthly_operation, args=(src_shared, dst_shared, k, k + STEP))
p.start()
p.join()
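(The body of lengthly_operation lives in the pastebin; as a self-contained stand-in, a worker with the same signature could look like this sketch, where the slice bounds and the elementwise operation are placeholders for the real work:)

def lengthly_operation(src, dst, start, stop):
    # src and dst are numpy views over shared memory, inherited at fork;
    # writes into dst land in the shared buffer and are visible to the parent.
    dst[start:stop] = np.sqrt(src[start:stop])  # placeholder for the real work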
Here are some results (see the pastebin code for the full reference):
Serial version: allocate mem 2.3741257190704346 exec: 17.092209577560425 total: 19.46633529663086 Succes: True
Parallel with trivial np: allocate mem 2.4535582065582275 spawn process: 0.00015354156494140625 exec: 3.4581971168518066 total: 5.911908864974976 Succes: False
Parallel with shared mem np: allocate mem 4.535916328430176 (pure alloc:4.014216661453247 copy: 0.5216996669769287) spawn process: 0.00015664100646972656 exec: 3.6783478260040283 total: 8.214420795440674 Succes: True
I also ran cProfile (why the 2 extra seconds when allocating shared memory?) and realized that there are some calls into tempfile.py and to {method 'write' of '_io.BufferedWriter' objects}.
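(For reference, a profile like that can be reproduced with something along these lines; my own sketch, not code from the pastebin:)

import cProfile
cProfile.run('create_np_shared_array(SHAPE, np.float32, ctypes.c_float)', sort='cumtime')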
Questions
- Am I doing something wrong?
- Are the (large) arrays pickled back and forth, so that I don't get any speedup? Note that the second run (using regular np arrays) fails the correctness test.
- Is there a way to further improve the timings, the code clarity, etc.? (with respect to the multiprocessing paradigm)
Remarks
- I can't use a process Pool, because the memory has to be inherited at the fork and not sent as an argument (the usual workaround is sketched below).
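(For completeness, a sketch of that workaround, paying exactly the global-state price this question tries to avoid: stash the shared arrays in module-level globals through a Pool initializer, so workers inherit them once at fork instead of receiving them per task. This assumes the default 'fork' start method on Linux; _init_worker and _worker are hypothetical names.)

_src = _dst = None

def _init_worker(src, dst):
    # Runs once per worker; under fork the arrays arrive via inheritance, not pickling.
    global _src, _dst
    _src, _dst = src, dst

def _worker(bounds):
    start, stop = bounds
    _dst[start:stop] = np.sqrt(_src[start:stop])  # placeholder operation

with mp.Pool(8, initializer=_init_worker, initargs=(src_shared, dst_shared)) as pool:
    pool.map(_worker, [(k, k + STEP) for k in range(0, SHAPE[0], STEP)])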
Solution
Allocation of the shared array is slow, because apparently it's written to disk first, so that it can be shared through an mmap. For reference see heap.py and sharedctypes.py. This is why tempfile.py shows up in the profiler. I think the advantage of this approach is that the shared memory is cleaned up in case of a crash, which cannot be guaranteed with POSIX shared memory.
There is no pickling happening with your code, thanks to fork; as you said, the memory is inherited. The reason the second run doesn't work is that the child processes are not allowed to write into the parent's memory. Instead, private pages are allocated on the fly, only to be discarded when the child process ends.
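(A minimal sketch, not from the original answer, that makes the copy-on-write behavior visible: a child's write to an ordinary inherited array stays in its private pages, while a write through the shared-memory view reaches the parent. Assumes the fork start method and the create_np_shared_array helper from above.)

plain = np.zeros(4, dtype=np.float32)  # ordinary array: inherited copy-on-write at fork
shared = create_np_shared_array((4,), np.float32, ctypes.c_float)

def child():
    plain[:] = 1   # goes to the child's private COW pages
    shared[:] = 1  # goes to the shared mmap

p = mp.Process(target=child)
p.start(); p.join()
print(plain)   # [0. 0. 0. 0.] -- the parent never sees the child's private pages
print(shared)  # [1. 1. 1. 1.] -- the shared write is visible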
I only have one suggestion: you don't have to specify a ctype yourself; the right type can be figured out from the numpy dtype through np.ctypeslib._typecodes. Or just use c_byte for everything and use the dtype's itemsize to figure out the size of the buffer; it will be cast by numpy anyway.
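(A sketch of the second variant; the dtype-only signature is my own simplification of the helper above:)

def create_np_shared_array(shape, dtype):
    # Allocate raw bytes and let numpy reinterpret them, so no ctype is needed.
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    buf = mp.sharedctypes.RawArray(ctypes.c_byte, nbytes)
    return np.frombuffer(buf, dtype).reshape(shape)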