Multiprocessing: sharing a big read-only object between processes?
Problem description
Do child processes spawned via multiprocessing share objects created earlier in the program?
I have the following setup:
import glob
import marshal
from multiprocessing import Pool

def do_some_processing(filename):
    for line in open(filename):
        if line.split(',')[0] in big_lookup_object:
            pass  # something here

if __name__ == '__main__':
    big_lookup_object = marshal.load(open('file.bin', 'rb'))
    pool = Pool(processes=4)
    print(pool.map(do_some_processing, glob.glob('*.data')))
I'm loading some big object into memory, then creating a pool of workers that need to make use of that big object. The big object is accessed read-only; I don't need to pass modifications of it between processes.
My question is: is the big object loaded into shared memory, as it would be if I spawned a process in Unix/C, or does each process load its own copy of the big object?
Update: to clarify further – big_lookup_object is a shared lookup object. I don't need to split it up and process it separately; I need to keep a single copy of it. The work that I do need to split is reading lots of other large files and looking up the items in those large files against the lookup object.
Further update: a database is a fine solution, memcached might be a better solution, and a file on disk (shelve or dbm) might be even better. In this question I'm particularly interested in an in-memory solution. For the final solution I'll be using Hadoop, but I wanted to see whether I can have a local in-memory version as well.
Solution
Do child processes spawned via multiprocessing share objects created earlier in the program?
No for Python before 3.8; yes in Python 3.8 and later (which added multiprocessing.shared_memory).
Processes have independent memory space.
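As a rough illustration of the Python 3.8+ case, here is a minimal sketch using multiprocessing.shared_memory; the payload, the attach_and_count helper, and the pool size are assumptions made for the example, not part of the original question.

from multiprocessing import Pool
from multiprocessing import shared_memory

def attach_and_count(shm_name):
    # Each worker attaches to the existing block by name instead of copying it.
    shm = shared_memory.SharedMemory(name=shm_name)
    try:
        return bytes(shm.buf).count(b',')
    finally:
        shm.close()

if __name__ == '__main__':
    payload = b'a,b,c,d'                      # stand-in for the big read-only object
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    try:
        with Pool(processes=4) as pool:
            print(pool.map(attach_and_count, [shm.name] * 4))
    finally:
        shm.close()
        shm.unlink()                          # free the segment once all workers are done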
Solution 1
To make best use of a large structure with lots of workers, do this.
Write each worker as a "filter" – it reads intermediate results from stdin, does work, and writes intermediate results to stdout.
Connect all the workers as a pipeline:
process1 <source | process2 | process3 | ... | processn >result
Each process reads, does work, and writes.
This is remarkably efficient since all processes run concurrently. The writes and reads pass directly through shared buffers between the processes.
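A minimal sketch of what one such filter worker could look like; the comma-separated input format and the pass-through behaviour are assumptions borrowed from the question, not prescribed by the answer.

import sys

def main():
    for line in sys.stdin:
        key = line.split(',')[0]
        # ... look up / process `key` here ...
        sys.stdout.write(line)   # pass the (possibly transformed) record downstream

if __name__ == '__main__':
    main()

Each stage in the process1 <source | process2 | ... pipeline would be a script of this shape.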
Solution 2
In some cases you have a more complex structure – often a fan-out structure. In this case you have a parent with multiple children.
The parent opens the source data. The parent forks a number of children.
The parent reads the source and farms parts of it out to each concurrently running child.
When the parent reaches the end, it closes the pipe. The child gets end-of-file and finishes normally.
The child parts are pleasant to write because each child simply reads sys.stdin.
The parent has a little bit of fancy footwork in spawning all the children and retaining the pipes properly, but it's not too bad.
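A hedged sketch of that fan-out parent; worker.py stands for a hypothetical child script that simply reads sys.stdin, and source.data for an assumed input file.

import subprocess
import sys

N = 4
children = [
    subprocess.Popen([sys.executable, 'worker.py'],
                     stdin=subprocess.PIPE, text=True)
    for _ in range(N)
]

with open('source.data') as source:
    for i, line in enumerate(source):
        children[i % N].stdin.write(line)   # deal records out round-robin

for child in children:
    child.stdin.close()   # each child sees end-of-file and finishes normally
    child.wait()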
Fan-in is the opposite structure. A number of independently running processes need to interleave their inputs into a common process. The collector is not as easy to write, since it has to read from many sources.
Reading from many named pipes is often done using the select module to see which pipes have pending input.
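A sketch of what that collector loop might look like on a Unix-like OS; the named-pipe paths are hypothetical and the merge step is left as a comment.

import select

paths = ['/tmp/pipe1', '/tmp/pipe2', '/tmp/pipe3']   # hypothetical named pipes
pipes = [open(p) for p in paths]

while pipes:
    readable, _, _ = select.select(pipes, [], [])
    for pipe in readable:
        line = pipe.readline()
        if not line:               # the writer closed its end: drop this pipe
            pipe.close()
            pipes.remove(pipe)
        else:
            pass                   # interleave/collect the record here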
Solution 3
Shared lookup is the definition of a database.
Solution 3A – load a database. Let the workers process the data in the database.
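A minimal sketch of 3A using the standard-library sqlite3 module; the table layout, file name, and helper names are assumptions, and each worker would open its own connection.

import sqlite3

def build_db(items, path='lookup.db'):
    con = sqlite3.connect(path)
    con.execute('CREATE TABLE IF NOT EXISTS lookup (key TEXT PRIMARY KEY, value TEXT)')
    con.executemany('INSERT OR REPLACE INTO lookup VALUES (?, ?)', items)
    con.commit()
    con.close()

def worker_lookup(key, path='lookup.db'):
    con = sqlite3.connect(path)   # one connection per worker process
    row = con.execute('SELECT value FROM lookup WHERE key = ?', (key,)).fetchone()
    con.close()
    return row[0] if row else None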
Solution 3B – create a very simple server, using werkzeug (or similar) to provide WSGI applications that respond to HTTP GET, so the workers can query the server.
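A minimal sketch of 3B; the single endpoint, the 'key' query parameter, and the port are assumptions for illustration, not a prescribed layout.

from werkzeug.wrappers import Request, Response
from werkzeug.serving import run_simple

big_lookup_object = {'example': 'value'}   # stand-in for the real lookup object

@Request.application
def application(request):
    # Workers issue HTTP GET /?key=... and get the looked-up value back.
    value = big_lookup_object.get(request.args.get('key', ''))
    if value is None:
        return Response('not found', status=404)
    return Response(str(value))

if __name__ == '__main__':
    run_simple('127.0.0.1', 5000, application)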
Solution 4
Shared filesystem object. The Unix OS offers shared memory objects. These are just files that are mapped to memory, so that swapping I/O is done instead of more conventional buffered reads.
You can do this from a Python context in several ways:
Write a startup program that (1) breaks your original gigantic object into smaller objects, and (2) starts workers, each with a smaller object. The smaller objects could be pickled Python objects to save a little bit of file-reading time.
Write a startup program that (1) reads your original gigantic object and writes a page-structured, byte-coded file, using seek operations to ensure that individual sections are easy to find with simple seeks. This is what a database engine does – break the data into pages and make each page easy to locate via a seek.
Spawn workers with access to this large page-structured file. Each worker can seek to the relevant parts and do its work there (see the mmap sketch after this list).
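A hedged sketch of that page-structured approach using mmap, so the OS shares the mapped pages between worker processes; PAGE_SIZE and 'pages.bin' are assumptions for illustration.

import mmap

PAGE_SIZE = 4096   # assumed fixed page size written by the startup program

def read_page(path, page_number):
    with open(path, 'rb') as f:
        # ACCESS_READ maps the file read-only; the mapped pages are shared by
        # the OS, so concurrent workers do not each hold a private copy in RAM.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = page_number * PAGE_SIZE
            return mm[offset:offset + PAGE_SIZE]

if __name__ == '__main__':
    print(read_page('pages.bin', 3)[:32])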