Do Cython containers not release their memory?

Problem description

When I run the following code, I expect that once foo() has finished, the memory it used (essentially the memory needed to build m) will be released. That is not what happens: to release this memory I have to restart the IPython console.

%%cython
# distutils: language = c++

import numpy as np
from libcpp.map cimport map as cpp_map

cdef foo():
    cdef:
        cpp_map[int,int]    m
        int i
    for i in range(50000000):
        m[i] = i

foo()

It would be great if someone could explain why this happens and how to release this memory without restarting the shell. Thanks in advance.


Solution

The effects you are seeing are more or less implementation details of your memory allocator (possibly glibc's default allocator). glibc's memory allocator works as follows:

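  • Requests for small amounts of memory are served from arenas, which grow (or increase in number) as needed.
  • Requests for large amounts of memory are taken directly from the OS, and are returned directly to the OS as soon as they are freed.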

One can tweak when the memory from those arenas is released using mallopt, but normally an internal heuristic decides when, or whether, the memory is returned to the OS, which I must confess is something of a black art to me.
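As a rough illustration of the mallopt tweak mentioned above, here is a minimal Python sketch that calls mallopt through ctypes. It assumes a glibc-based system; the constant M_TRIM_THRESHOLD = -1 is taken from glibc's malloc.h, and the helper name set_trim_threshold is made up for this example. The trim threshold only governs free memory at the top of the heap, so it is not guaranteed to help with a fragmented arena.

```python
import ctypes
import ctypes.util

# Value of M_TRIM_THRESHOLD in glibc's <malloc.h>; an assumption on other libcs.
M_TRIM_THRESHOLD = -1


def set_trim_threshold(num_bytes):
    """Ask glibc to trim the heap top once more than num_bytes are free there.

    Returns True/False for success/failure, or None if mallopt is unavailable
    (for example on a non-glibc platform).
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return None
    try:
        libc = ctypes.CDLL(libc_name)
        mallopt = libc.mallopt
    except (OSError, AttributeError):
        return None
    mallopt.argtypes = [ctypes.c_int, ctypes.c_int]
    mallopt.restype = ctypes.c_int
    # mallopt returns 1 on success and 0 on error.
    return mallopt(M_TRIM_THRESHOLD, num_bytes) == 1


if __name__ == "__main__":
    print(set_trim_threshold(128 * 1024))
```

The call has to happen in the same process that later runs foo, since the setting applies to that process's allocator only.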

The problem with std::map (the situation is similar for std::unordered_map) is that it does not consist of one big chunk of memory, which would be returned to the OS immediately, but of many small nodes (libstdc++ implements map as a red-black tree). All of these nodes come from the arenas, and the heuristic decides not to return their memory to the OS.
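To make the "many small nodes" point concrete, here is a small back-of-the-envelope sketch. The node size of 40 bytes is the figure quoted later in this answer; the 128 KiB value is glibc's default M_MMAP_THRESHOLD and is an assumption about your configuration, and served_from_arena is a made-up helper name.

```python
N = 50_000_000               # entries inserted by foo() in the question
NODE_BYTES = 40              # size of one map[int,int] node, as quoted in this answer
MMAP_THRESHOLD = 128 * 1024  # assumed: glibc's default M_MMAP_THRESHOLD in bytes


def served_from_arena(request_bytes, threshold=MMAP_THRESHOLD):
    """True if a request is small enough to be served from an arena.

    Requests at or above the threshold are obtained from the OS directly.
    """
    return request_bytes < threshold


total_bytes = N * NODE_BYTES

if __name__ == "__main__":
    print("single node served from arena:", served_from_arena(NODE_BYTES))
    print("one contiguous block served from arena:", served_from_arena(total_bytes))
    print("payload:", total_bytes / 1e9, "GB spread over", N, "allocations")
```

The same amount of data in one contiguous buffer would sit far above the threshold, which is why a large array tends to be returned to the OS right away while the map's nodes are not.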

Since we are using glibc's allocator, we can use the non-standard function malloc_trim to release the memory manually:

%%cython

cdef extern from "malloc.h" nogil:
     int malloc_trim(size_t pad)

def return_memory_to_OS():
    malloc_trim(0)

Now just call return_memory_to_OS() after every use of foo.
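If compiling a Cython cell just for this is inconvenient, the same call can probably be made from plain Python with ctypes. This is a sketch that assumes a glibc-based system; trim_heap is a hypothetical helper name, not part of any library.

```python
import ctypes
import ctypes.util


def trim_heap(pad=0):
    """Call glibc's malloc_trim(pad).

    Returns True if some memory was released to the OS, False if not,
    and None if malloc_trim is unavailable (non-glibc platforms).
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return None
    try:
        libc = ctypes.CDLL(libc_name)
        malloc_trim = libc.malloc_trim
    except (OSError, AttributeError):
        return None
    malloc_trim.argtypes = [ctypes.c_size_t]
    malloc_trim.restype = ctypes.c_int
    # malloc_trim returns 1 if memory was actually released, 0 otherwise.
    return malloc_trim(pad) == 1


if __name__ == "__main__":
    print(trim_heap())
```

Like the Cython version, this only works when the extension module and the Python process share the same glibc allocator, which is the usual case on Linux.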

The solution above is quick and dirty, and it is not portable. What you really want is a custom allocator that releases memory back to the OS as soon as it is no longer used. That is a lot of work, but luckily we already have such an allocator at hand: CPython's pymalloc, which has returned memory to the OS since Python 2.5 (even if that sometimes causes trouble). However, pymalloc has one big deficiency: it is not thread-safe, so it can only be used in code that holds the GIL!

Using pymalloc as the allocator not only returns memory to the OS, it also reduces memory consumption: pymalloc is 8-byte aligned while glibc's allocator is 32-byte aligned, so a node of map[int,int], which is 40 bytes, costs only 40.5 bytes with pymalloc (overhead included), while glibc needs no less than 64 bytes.

My implementation of the custom allocator follows Nicolai M. Josuttis' example and implements only the functionality that is really needed:

%%cython -c=-std=c++11 --cplus

cdef extern from *:
    """
    #include <cstddef>   // std::size_t
    #include <new>       // std::bad_alloc
    #include <Python.h>  // pymalloc

    template <class T>
    class pymalloc_allocator {
     public:
       // type definitions
       typedef T        value_type;
       typedef T*       pointer;
       typedef std::size_t    size_type;

       template <class U>
       pymalloc_allocator(const pymalloc_allocator<U>&) throw() {}
       pymalloc_allocator() throw() = default;
       pymalloc_allocator(const pymalloc_allocator&) throw() = default;
       ~pymalloc_allocator() throw() = default;

       // rebind allocator to type U
       template <class U>
       struct rebind {
           typedef pymalloc_allocator<U> other;
       };

       // allocate storage for num objects; must be called with the GIL held
       pointer allocate (size_type num, const void* = 0) {
           pointer ret = static_cast<pointer>(PyMem_Malloc(num*sizeof(value_type)));
           if (ret == 0) {
               throw std::bad_alloc();  // PyMem_Malloc returns NULL on failure
           }
           return ret;
       }

       void deallocate (pointer p, size_type num) {
           PyMem_Free(p);
       }

       // missing: destroy, construct, max_size, address
   };

   // missing:
   //  bool operator== , bool operator!= 

    #include <utility>
    // the value_type of std::map<int, int> is std::pair<const int, int>
    typedef pymalloc_allocator<std::pair<const int, int>> PairIntIntAlloc;

    //further helper (not in functional.pxd):
    #include <functional>
    typedef std::less<int> Less;
    """
    cdef cppclass PairIntIntAlloc:
        pass
    cdef cppclass Less:
        pass


from libcpp.map cimport map as cpp_map

def foo():
    cdef:
        cpp_map[int,int, Less, PairIntIntAlloc] m
        int i
    for i in range(50000000):
        m[i] = i

Now the lion's share of the used memory is returned to the OS once foo is done, on any operating system and with any memory allocator!
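One way to check this claim on your own machine is to compare the resident set size before and after the call. The sketch below reads VmRSS from /proc/self/status, so it is Linux-only; current_rss_kib is a made-up helper name for this example.

```python
def current_rss_kib(status_path="/proc/self/status"):
    """Return the current resident set size in KiB, or None if it cannot be read."""
    try:
        with open(status_path) as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    # The line looks like "VmRSS:   123456 kB".
                    return int(line.split()[1])
    except OSError:
        return None
    return None


if __name__ == "__main__":
    print("RSS now:", current_rss_kib(), "KiB")
    # Typical use in the IPython session, with foo defined as above:
    #     before = current_rss_kib()
    #     foo()
    #     after = current_rss_kib()
    #     print(after - before)
```

With the default allocator the difference would be expected to stay large after foo returns, while with the pymalloc-backed map it should drop back close to zero.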

If memory consumption is an issue, one could switch to unordered_map, which needs somewhat less memory. However, at the moment unordered_map.pxd does not offer access to all template parameters, so one has to wrap it manually:

%%cython -c=-std=c++11 --cplus

cdef extern from *:
    """
    ....

    //further helper (not in functional.pxd):
    #include <functional>
    ...
    typedef std::hash<int> Hash;
    typedef std::equal_to<int> Equal_to;
    """
    ...
    cdef cppclass Hash:
        pass
    cdef cppclass Equal_to:
        pass

cdef extern from "<unordered_map>" namespace "std" nogil:
    cdef cppclass unordered_map[T, U, HASH=*, PRED=*, ALLOC=*]:
        U& operator[](T&)

N = 5*10**8

def foo_unordered_pymalloc():
    cdef:
        unordered_map[int, int, Hash, Equal_to, PairIntIntAlloc] m
        int i
    for i in range(N):
        m[i] = i


Here are some benchmarks. They are obviously not complete, but they probably show the direction pretty well (measured with N=3e7 instead of N=5e8):

Variant                            Time       Peak memory

map_default                        40.1 s     1416 MB
map_default+return_memory          41.8 s     -
map_pymalloc                       12.8 s     1200 MB

unordered_default                   9.8 s     1190 MB
unordered_default+return_memory    10.9 s     -
unordered_pymalloc                  5.5 s      730 MB

The timings were done with the %timeit magic, and peak memory usage was measured with /usr/bin/time -fpeak_used_memory:%M python script_xxx.py.
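The peak figure that /usr/bin/time reports with %M can also be read from inside the process. A minimal sketch using the standard resource module (Unix only; peak_rss_kib is a made-up helper name):

```python
import resource
import sys


def peak_rss_kib():
    """Return the peak resident set size of this process in KiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in kilobytes, macOS reports it in bytes.
    if sys.platform == "darwin":
        peak //= 1024
    return peak


if __name__ == "__main__":
    print("peak RSS:", peak_rss_kib(), "KiB")
```

Calling it at the end of script_xxx.py should give roughly the same number as the external measurement, since both come from the kernel's resource accounting.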

I'm somewhat surprised that pymalloc outperforms the glibc allocator by so much, and that memory allocation appears to be the bottleneck for the ordinary map! Maybe this is the price glibc must pay for supporting multi-threading.

unordered_map is faster and probably needs less memory (although, because of rehashing, the claim about memory could be wrong).
