scikit-learn joblib bug: multiprocessing pool self.value out of range for 'i' format code, only with large numpy arrays

2022-01-10 00:00:00 python numpy scikit-learn multiprocessing anaconda

Problem description

My code runs fine with smaller test samples, such as 10000 rows of data in X_train, y_train. When I call it with millions of rows, I get the error shown below. Is the bug in a package, or can I do something differently? I am using Python 2.7.7 from Anaconda 2.0.1, and I put pool.py from Anaconda's multiprocessing package and parallel.py from scikit-learn's externals package on my Dropbox for you.

The test script is:

import numpy as np
import sklearn
from sklearn.linear_model import SGDClassifier
from sklearn import grid_search
import multiprocessing as mp


def main():
    print("Started.")
    print("numpy:", np.__version__)
    print("sklearn:", sklearn.__version__)

    n_samples = 1000000
    n_features = 1000
    X_train = np.random.randn(n_samples, n_features)
    y_train = np.random.randint(0, 2, size=n_samples)
    print("input data size: %.3fMB" % (X_train.nbytes / 1e6))

    model = SGDClassifier(penalty='elasticnet', n_iter=10, shuffle=True)
    param_grid = [{
        'alpha': 10.0 ** -np.arange(1, 7),
        'l1_ratio': [.05, .15, .5, .7, .9, .95, .99, 1],
    }]
    gs = grid_search.GridSearchCV(model, param_grid, n_jobs=8, verbose=100)
    gs.fit(X_train, y_train)
    print(gs.grid_scores_)


if __name__ == '__main__':
    mp.freeze_support()
    main()

This results in the output:

Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Started.
('numpy:', '1.8.1')
('sklearn:', '0.15.0b1')
input data size: 8000.000MB
Fitting 3 folds for each of 48 candidates, totalling 144 fits
Memmaping (shape=(1000000L, 1000L), dtype=float64) to new file c:\users\laszlos\appdata\local\temp\4\joblib_memmaping_pool_6172_78765976\6172-284752304-75223296-0.pkl
Failed to save <type 'numpy.ndarray'> to .npy file:
Traceback (most recent call last):
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 240, in save
    obj, filename = self._write_array(obj, filename)
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 203, in _write_array
    self.np.save(filename, array)
  File "C:\Anaconda\lib\site-packages\numpy\lib\npyio.py", line 453, in save
    format.write_array(fid, arr)
  File "C:\Anaconda\lib\site-packages\numpy\lib\format.py", line 406, in write_array
    array.tofile(fp)
ValueError: 1000000000 requested and 268435456 written
Memmaping (shape=(1000000L, 1000L), dtype=float64) to old file c:\users\laszlos\appdata\local\temp\4\joblib_memmaping_pool_6172_78765976\6172-284752304-75223296-0.pkl
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Traceback (most recent call last):
  File "S:\laszlo\gridsearch_largearray.py", line 33, in <module>
    main()
  File "S:\laszlo\gridsearch_largearray.py", line 28, in main
    gs.fit(X_train, y_train)
  File "C:\Anaconda\lib\site-packages\sklearn\grid_search.py", line 597, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "C:\Anaconda\lib\site-packages\sklearn\grid_search.py", line 379, in _fit
    for parameters in parameter_iterable
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\parallel.py", line 651, in __call__
    self.retrieve()
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\parallel.py", line 503, in retrieve
    self._output.append(job.get())
  File "C:\Anaconda\lib\multiprocessing\pool.py", line 558, in get
    raise self._value
struct.error: integer out of range for 'i' format code
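The numbers in this output are consistent with a 32-bit size limit being hit. The following is only a sketch of that arithmetic, not a diagnosis of where the limit sits in joblib or multiprocessing. It assumes the byte count of the array ends up in a signed 32-bit field, which is what the 'i' format code of Python's struct module represents.

```python
import struct

n_samples, n_features = 1000000, 1000
itemsize = 8  # bytes per float64 element

total_bytes = n_samples * n_features * itemsize
int32_max = 2 ** 31 - 1  # largest value the struct 'i' code can hold

print("array size in bytes:", total_bytes)
print("exceeds signed 32-bit range:", total_bytes > int32_max)

# The failed save wrote 268435456 of the 1000000000 requested elements;
# at 8 bytes each that is exactly 2**31 bytes.
written_bytes = 268435456 * itemsize
print("bytes written before failure == 2**31:", written_bytes == 2 ** 31)

# Packing the full byte count with the 'i' code raises struct.error
# (the exact message text differs between Python versions).
try:
    struct.pack("i", total_bytes)
    pack_failed = False
except struct.error as exc:
    pack_failed = True
    print("struct.error:", exc)
```

With 10000 rows the same array is only 80 MB, far below the limit, which would explain why the small test samples run fine.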

ogrisel's answer does work with manual memory mapping under scikit-learn 0.15.0b1. Don't forget to run only one script at a time, otherwise you can still run out of memory and end up with too many threads. (My run takes ~60 GB on data of size ~12.5 GB in CSV, with 8 threads.)

Solution

As a workaround, you can try to memory map your data explicitly and manually, as explained in the joblib documentation.

Edit #1: Here is the important part:

from sklearn.externals import joblib
joblib.dump(X_train, some_filename)
X_train = joblib.load(some_filename, mmap_mode='r+')

Then pass this memmap'ed data to GridSearchCV under scikit-learn 0.15+.
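For readers who want to try the dump-and-reload step in isolation, here is a minimal self-contained sketch. It assumes the standalone joblib package (`import joblib`) rather than `sklearn.externals.joblib`. The helper name `dump_and_memmap`, the file name, the temporary folder and the small array are all hypothetical stand-ins for the real data.

```python
import os
import tempfile

import joblib  # standalone package; the answer above used sklearn.externals.joblib
import numpy as np


def dump_and_memmap(array, folder, name="X_train.pkl", mmap_mode="r+"):
    """Write `array` to disk once, then reopen it as a memory map."""
    path = os.path.join(folder, name)
    joblib.dump(array, path)  # no compression, so the file can be memory mapped
    return joblib.load(path, mmap_mode=mmap_mode)


# Small stand-in for the real training data (hypothetical sizes).
X_small = np.random.randn(1000, 20)
tmp_folder = tempfile.mkdtemp()  # hypothetical location for the dump
X_mm = dump_and_memmap(X_small, tmp_folder)

print(type(X_mm).__name__, X_mm.shape)
```

The returned object behaves like a regular array, so it can be passed wherever X_train was used before. The dump must stay uncompressed, because a compressed file cannot be memory mapped.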

Edit #2: Furthermore, if you use the 32-bit version of Anaconda, each Python process is limited to 2 GB, which also caps the memory available to it.

I just found a bug in numpy.save under Python 3.4, but even once that is fixed, the subsequent call to mmap fails with:

OSError: [WinError 8] Not enough storage is available to process this command

So please use a 64-bit version of Python (with Anaconda, as AFAIK there are currently no other 64-bit packages for numpy / scipy / scikit-learn==0.15.0b1).
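A quick way to check which kind of interpreter is actually running is to look at the pointer size. This is a small sketch using only the standard library; the helper name `interpreter_bits` is made up for illustration.

```python
import struct
import sys


def interpreter_bits():
    """Return the pointer width of the running interpreter, in bits."""
    return struct.calcsize("P") * 8


bits = interpreter_bits()
print("Python is running as a %d-bit process" % bits)
if bits < 64:
    print("Warning: a 32-bit process cannot address an 8 GB array")
```

Checking `sys.maxsize > 2 ** 32` gives the same answer and may be handier in a one-liner.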

Edit #3: I found another issue that might be causing excessive memory usage under Windows: joblib.Parallel currently memory maps input data with mmap_mode='c' by default. This copy-on-write setting seems to cause Windows to exhaust the paging file and sometimes triggers "[error 1455] the paging file is too small for this operation to complete" errors. Setting mmap_mode='r' or mmap_mode='r+' does not trigger that problem. I will run tests to see if I can change the default mode in the next version of joblib.
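When calling joblib.Parallel directly, the mapping mode can be set explicitly instead of relying on the default. The sketch below assumes a recent standalone joblib, where `mmap_mode` and `max_nbytes` are arguments of `Parallel`. The worker function `column_sum` and the array sizes are hypothetical, and `max_nbytes` is lowered only so that this small array crosses the memory-mapping threshold.

```python
import numpy as np
from joblib import Parallel, delayed


def column_sum(X, j):
    """Read-only work on one column; never writes to X."""
    return float(X[:, j].sum())


X = np.random.randn(20000, 10)  # about 1.6 MB of float64 data

# mmap_mode='r' requests read-only maps in the workers instead of
# copy-on-write ('c'); arrays larger than max_nbytes are memory mapped.
results = Parallel(n_jobs=2, max_nbytes="1M", mmap_mode="r")(
    delayed(column_sum)(X, j) for j in range(X.shape[1])
)

print(results[:3])
```

Read-only mode only suits workers that never modify their input, as here; a worker that writes into the array in place would need 'r+' instead.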
