如何点积(1,10^{13})(10^13,1)实数稀疏矩阵

2022-04-13 00:00:00 numpy scipy sparse-matrix dot-product

问题描述

基本上就是标题所包含的内容。 这两个矩阵几乎都是零。第一个是1 x 9999999999999,第二个是9999999999999 x 1 当我尝试做点积时,我得到了这样的结果。

Unable to allocate 72.8 TiB for an array with shape (10000000000000,) and data type int64

Full traceback </br>

    MemoryError: Unable to allocate 72.8 TiB for an array with shape (10000000000000,) and data type int64

In [31]: imputed.dot(s)
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-31-670cfc69d4cf> in <module>
----> 1 imputed.dot(s)

~/.local/lib/python3.8/site-packages/scipy/sparse/base.py in dot(self, other)
    357 
    358         """
--> 359         return self * other
    360 
    361     def power(self, n, dtype=None):

~/.local/lib/python3.8/site-packages/scipy/sparse/base.py in __mul__(self, other)
    478             if self.shape[1] != other.shape[0]:
    479                 raise ValueError('dimension mismatch')
--> 480             return self._mul_sparse_matrix(other)
    481 
    482         # If it's a list or whatever, treat it like a matrix

~/.local/lib/python3.8/site-packages/scipy/sparse/compressed.py in _mul_sparse_matrix(self, other)
    499 
    500         major_axis = self._swap((M, N))[0]
--> 501         other = self.__class__(other)  # convert to this format
    502 
    503         idx_dtype = get_index_dtype((self.indptr, self.indices,

~/.local/lib/python3.8/site-packages/scipy/sparse/compressed.py in __init__(self, arg1, shape, dtype, copy)
     32                 arg1 = arg1.copy()
     33             else:
---> 34                 arg1 = arg1.asformat(self.format)
     35             self._set_self(arg1)
     36 

~/.local/lib/python3.8/site-packages/scipy/sparse/base.py in asformat(self, format, copy)
    320             # Forward the copy kwarg, if it's accepted.
    321             try:
--> 322                 return convert_method(copy=copy)
    323             except TypeError:
    324                 return convert_method()

~/.local/lib/python3.8/site-packages/scipy/sparse/csc.py in tocsr(self, copy)
    135         idx_dtype = get_index_dtype((self.indptr, self.indices),
    136                                     maxval=max(self.nnz, N))
--> 137         indptr = np.empty(M + 1, dtype=idx_dtype)
    138         indices = np.empty(self.nnz, dtype=idx_dtype)
    139         data = np.empty(self.nnz, dtype=upcast(self.dtype))

MemoryError: Unable to allocate 72.8 TiB for an array with shape (10000000000000,) and data type int64

似乎scipy正在尝试创建一个temp数组。 我使用的是Scipy提供的.ot方法。 我也对不重要的解决方案持开放态度。 谢谢!


解决方案

In [105]: from scipy import sparse

如果我制作一个(100,1)CSR矩阵:

In [106]: A = sparse.random(100,1,format='csr')
In [107]: A
Out[107]: 
<100x1 sparse matrix of type '<class 'numpy.float64'>'
    with 1 stored elements in Compressed Sparse Row format>

数据和指数为:

In [109]: A.data
Out[109]: array([0.19060481])
In [110]: A.indices
Out[110]: array([0], dtype=int32)
In [112]: A.indptr
Out[112]: 
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

因此,即使只有1个非零项,一个数组也很大(101)。

另一方面,同一数组的csc格式的存储空间要小得多。但形状为(1,100)的csc将看起来像CSR。

In [113]: Ac = A.tocsc()
In [114]: Ac.indptr
Out[114]: array([0, 1], dtype=int32)
In [115]: Ac.indices
Out[115]: array([88], dtype=int32)

数学,特别是矩阵乘积是用csr/csc格式完成的。因此,可能很难避免这种80 TB的内存使用。


查看回溯,我发现它正在尝试将other转换为与self匹配的格式。

所以A.dot(B)A是(1,N)csr,小的形状。B是(N,1)csc,也是小形状。但B.tocsr()需要较大的(N+1,)形状indptr


让我们尝试dot的替代方案

前2个矩阵:

In [122]: A = sparse.random(1,100, .2,format='csr')
In [123]: B = sparse.random(100,1, .2,format='csc')
In [124]: A
Out[124]: 
<1x100 sparse matrix of type '<class 'numpy.float64'>'
    with 20 stored elements in Compressed Sparse Row format>
In [125]: B
Out[125]: 
<100x1 sparse matrix of type '<class 'numpy.float64'>'
    with 20 stored elements in Compressed Sparse Column format>
In [126]: A@B
Out[126]: 
<1x1 sparse matrix of type '<class 'numpy.float64'>'
    with 1 stored elements in Compressed Sparse Row format>
In [127]: _.A
Out[127]: array([[1.33661021]])

它们的非零元素索引。只有匹配的才重要。

In [128]: A.indices, B.indices
Out[128]: 
(array([16, 20, 23, 28, 30, 37, 39, 40, 43, 49, 54, 59, 61, 63, 67, 70, 74,
        91, 94, 99], dtype=int32),
 array([ 5,  8, 15, 25, 34, 35, 40, 46, 47, 51, 53, 60, 68, 70, 75, 81, 87,
        90, 91, 94], dtype=int32))

等式矩阵:

In [129]: mask = A.indices[:,None]==B.indices

In [132]: np.nonzero(mask.any(axis=0))
Out[132]: (array([ 6, 13, 18, 19]),)
In [133]: np.nonzero(mask.any(axis=1))
Out[133]: (array([ 7, 15, 17, 18]),)

匹配索引:

In [139]: A.indices[Out[133]]
Out[139]: array([40, 70, 91, 94], dtype=int32)
In [140]: B.indices[Out[132]]
Out[140]: array([40, 70, 91, 94], dtype=int32)

相应data值的总和与[127]

匹配
In [141]: (A.data[Out[133]]*B.data[Out[132]]).sum()
Out[141]: 1.3366102138511582

相关文章