multiprocessing.Pool with a global variable

2022-01-12 00:00:00 python multiprocessing

Question

I am using the Pool class from Python's multiprocessing library to write a program that will run on an HPC cluster.

Here is an abstraction of what I am trying to do:

from multiprocessing import Pool

def myFunction(x):
    # myObject is a global variable in this case
    return myFunction2(x, myObject)

def myFunction2(x, myObject):
    myObject.modify() # here I am calling some method that changes myObject
    return myObject.f(x)

poolVar = Pool()
argsArray = [ARGS ARRAY GOES HERE]
output = poolVar.map(myFunction, argsArray)

The function f(x) is contained in a *.so file, i.e., it calls a C function.

The problem I am having is that the value of the output variable is different each time I run my program, even though myObject.f() is a deterministic function. (If I use only one process, the output variable is the same on every run.)

I have tried creating the object inside the function rather than storing it as a global variable:

def myFunction(x):
    myObject = createObject()
    return myFunction2(x, myObject)

However, in my program the object creation is expensive, so it is much cheaper to create myObject once and then modify it each time I call myFunction2(). I would therefore like to avoid creating the object on every call.

Do you have any tips? I am very new to parallel programming, so I could be going about this all wrong. I decided to use the Pool class because I wanted to start with something simple, but I am willing to try a better way of doing it.


Answer

"I am using the Pool class from Python's multiprocessing library to do some shared-memory processing on an HPC cluster."

Processes are not threads! You cannot simply replace Thread with Process and expect everything to work the same way. Processes do not share memory, which means that global variables are copied into each process, so their values in the original process do not change.

If you want to use shared memory between processes, you must use multiprocessing's own data types, such as Value and Array, or use a Manager to create shared lists and similar objects.
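As a rough illustration, here is a small sketch using Value and Array with plain Process objects. The function name work and the variable names are assumptions chosen for the example. Note that these objects have to be handed to a child when it is created (as Process arguments or through a Pool initializer); they cannot be sent as ordinary pool.map arguments.

```python
from multiprocessing import Array, Process, Value


def work(total, squares, i):
    """Write one slot of the shared array and add i to the shared total."""
    squares[i] = i * i
    # "+=" on a Value is a read-modify-write, so hold its lock while updating
    with total.get_lock():
        total.value += i


if __name__ == "__main__":
    total = Value("i", 0)    # shared C int, initialised to 0
    squares = Array("i", 4)  # shared array of four C ints, zero-filled
    procs = [Process(target=work, args=(total, squares, i)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(total.value, list(squares))  # 6 [0, 1, 4, 9]
```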

In particular, you might be interested in the Manager.register method, which allows the Manager to create shared custom objects (although they must be picklable).
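In the standard library, register is a classmethod on multiprocessing.managers.BaseManager. The sketch below uses a hypothetical Counter class as a stand-in for the expensive object; CounterManager and bump are likewise names invented for the example. The real object lives in the manager's server process, and other processes talk to it through a proxy.

```python
from multiprocessing import Process
from multiprocessing.managers import BaseManager


class Counter:
    """Hypothetical stand-in for the object to be shared."""

    def __init__(self):
        self._n = 0

    def increment(self):
        self._n += 1
        return self._n


class CounterManager(BaseManager):
    """Manager subclass that knows how to create Counter objects."""


# Adds a CounterManager.Counter() factory that returns a proxy
CounterManager.register("Counter", Counter)


def bump(counter_proxy):
    # Runs in a child process; the call is forwarded to the manager process
    counter_proxy.increment()


if __name__ == "__main__":
    with CounterManager() as manager:
        shared = manager.Counter()
        child = Process(target=bump, args=(shared,))
        child.start()
        child.join()
        print(shared.increment())  # 2: the child's increment is visible here
```

Every method call on the proxy is a round trip to the manager process, which is the communication cost to weigh against simply building the object in each worker.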

However, I'm not sure whether this will improve performance, since any communication between processes requires pickling, and pickling usually takes more time than simply instantiating the object.

Note that you can initialize the worker processes by passing the initializer and initargs arguments when creating the Pool.

For example, in its simplest form, to create a global variable in each worker process:

def initializer():
    global data
    data = createObject()

Used as:

pool = Pool(4, initializer, ())

The worker functions can then use the data global variable safely, because each worker process has its own copy.

Style note: never use the name of a built-in for your variables or modules. In your case, object is a built-in. Shadowing built-ins leads to unexpected errors that can be obscure and hard to track down.
