scikit-learn joblib bug: multiprocessing pool self.value out of range for 'i' format code, only with large numpy arrays
Problem description
My code runs fine with smaller test samples, such as 10000 rows of data in X_train, y_train. When I call it on millions of rows, I get the error shown below. Is the bug in a package, or can I do something differently? I am using Python 2.7.7 from Anaconda 2.0.1, and I have put pool.py from Anaconda's multiprocessing package and parallel.py from scikit-learn's externals package on my Dropbox for you.
The test script is:
import numpy as np
import sklearn
from sklearn.linear_model import SGDClassifier
from sklearn import grid_search
import multiprocessing as mp


def main():
    print("Started.")

    print("numpy:", np.__version__)
    print("sklearn:", sklearn.__version__)

    n_samples = 1000000
    n_features = 1000

    X_train = np.random.randn(n_samples, n_features)
    y_train = np.random.randint(0, 2, size=n_samples)

    print("input data size: %.3fMB" % (X_train.nbytes / 1e6))

    model = SGDClassifier(penalty='elasticnet', n_iter=10, shuffle=True)
    param_grid = [{
        'alpha' : 10.0 ** -np.arange(1,7),
        'l1_ratio': [.05, .15, .5, .7, .9, .95, .99, 1],
    }]
    gs = grid_search.GridSearchCV(model, param_grid, n_jobs=8, verbose=100)
    gs.fit(X_train, y_train)
    print(gs.grid_scores_)


if __name__=='__main__':
    mp.freeze_support()
    main()
This results in the output:
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Started.
('numpy:', '1.8.1')
('sklearn:', '0.15.0b1')
input data size: 8000.000MB
Fitting 3 folds for each of 48 candidates, totalling 144 fits
Memmaping (shape=(1000000L, 1000L), dtype=float64) to new file c:\users\laszlos\appdata\local\temp\4\joblib_memmaping_pool_6172_78765976\6172-284752304-75223296-0.pkl
Failed to save <type 'numpy.ndarray'> to .npy file:
Traceback (most recent call last):
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 240, in save
    obj, filename = self._write_array(obj, filename)
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 203, in _write_array
    self.np.save(filename, array)
  File "C:\Anaconda\lib\site-packages\numpy\lib\npyio.py", line 453, in save
    format.write_array(fid, arr)
  File "C:\Anaconda\lib\site-packages\numpy\lib\format.py", line 406, in write_array
    array.tofile(fp)
ValueError: 1000000000 requested and 268435456 written

Memmaping (shape=(1000000L, 1000L), dtype=float64) to old file c:\users\laszlos\appdata\local\temp\4\joblib_memmaping_pool_6172_78765976\6172-284752304-75223296-0.pkl
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Vendor: Continuum Analytics, Inc.
Package: mkl
Message: trial mode expires in 28 days
Traceback (most recent call last):
  File "S:\laszlo\gridsearch_largearray.py", line 33, in <module>
    main()
  File "S:\laszlo\gridsearch_largearray.py", line 28, in main
    gs.fit(X_train, y_train)
  File "C:\Anaconda\lib\site-packages\sklearn\grid_search.py", line 597, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "C:\Anaconda\lib\site-packages\sklearn\grid_search.py", line 379, in _fit
    for parameters in parameter_iterable
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\parallel.py", line 651, in __call__
    self.retrieve()
  File "C:\Anaconda\lib\site-packages\sklearn\externals\joblib\parallel.py", line 503, in retrieve
    self._output.append(job.get())
  File "C:\Anaconda\lib\multiprocessing\pool.py", line 558, in get
    raise self._value
struct.error: integer out of range for 'i' format code
ogrisel's answer does work with manual memory mapping under scikit-learn-0.15.0b1. Don't forget to run only one script at a time, otherwise you can still run out of memory and have too many threads. (My run takes ~60 GB on data of size ~12.5 GB in CSV, with 8 threads.)
Solution
As a workaround, you can try to memory map your data explicitly and manually, as explained in the joblib documentation.
Edit #1: here is the important part:
from sklearn.externals import joblib
joblib.dump(X_train, some_filename)
X_train = joblib.load(some_filename, mmap_mode='r+')
Then pass this memmap'ed data to GridSearchCV under scikit-learn 0.15+.
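For readers who want to try the dump/load round trip in isolation, here is a minimal sketch on a small array. The helper name to_memmap, the file name X_train.joblib and the temporary folder are assumptions made for illustration, not part of the original answer. The import falls back to sklearn.externals.joblib because old scikit-learn releases bundled joblib there, while recent ones rely on the standalone joblib package.

```python
import os
import tempfile

import numpy as np

try:
    import joblib  # standalone package, used by recent scikit-learn
except ImportError:
    from sklearn.externals import joblib  # bundled copy in old scikit-learn


def to_memmap(array, folder, name="X_train.joblib", mmap_mode="r+"):
    """Dump `array` to disk and reload it as a memory-mapped array.

    `name` is a hypothetical file name chosen for this sketch.
    """
    path = os.path.join(folder, name)
    joblib.dump(array, path)  # uncompressed dump, required for mmap
    return joblib.load(path, mmap_mode=mmap_mode)


tmp_folder = tempfile.mkdtemp()
X_small = np.random.randn(1000, 20)  # small stand-in for the real data
X_mm = to_memmap(X_small, tmp_folder)

print(type(X_mm).__name__, X_mm.shape)
```

The returned object behaves like a regular numpy array, so it can be passed to fit() in place of the in-memory X_train. Whether this avoids the failure for a given data size still depends on the platform and versions discussed in this answer.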
Edit #2: Furthermore, if you use the 32-bit version of Anaconda, you will be limited to 2 GB per Python process, which can also limit the memory.
I just found a bug in numpy.save under Python 3.4, but even when it is fixed the subsequent call to mmap will fail with:
OSError: [WinError 8] Not enough storage is available to process this command
So please use a 64-bit version of Python (with Anaconda, as AFAIK there are currently no other 64-bit packages for numpy / scipy / scikit-learn==0.15.0b1).
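A quick way to check which build is actually running is to ask the interpreter for its pointer size. The sketch below also shows that the byte count of the 8000 MB array from the question cannot be packed with the 32-bit signed 'i' format code of the struct module. The helper names pointer_bits and fits_in_struct_i are made up for this illustration, and the snippet does not claim to pinpoint where the struct.error in the traceback is raised.

```python
import struct
import sys


def pointer_bits():
    """Return the pointer width of the running interpreter (32 or 64)."""
    return struct.calcsize("P") * 8


def fits_in_struct_i(n):
    """Return True if n can be packed with the 32-bit signed 'i' format code."""
    try:
        struct.pack("i", n)
        return True
    except struct.error:
        return False


# Byte count of the float64 array from the question: 1e6 rows x 1000 columns.
n_bytes = 1000000 * 1000 * 8

print("interpreter: %d-bit" % pointer_bits())
print("sys.maxsize > 2**32: %s" % (sys.maxsize > 2**32))
print("8 GB byte count fits in 'i': %s" % fits_in_struct_i(n_bytes))
```

On a 64-bit build the first line reports 64. The last line reports False on any build, because 'i' only covers values up to 2**31 - 1.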
Edit #3: I found another issue that might be causing excessive memory usage under Windows: currently joblib.Parallel memory maps input data with mmap_mode='c' by default. This copy-on-write setting seems to cause Windows to exhaust the paging file and sometimes triggers "[error 1455] the paging file is too small for this operation to complete" errors. Setting mmap_mode='r' or mmap_mode='r+' does not trigger that problem. I will run tests to see if I can change the default mode in the next version of joblib.
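To make the difference between the three modes concrete, here is a small sketch using plain numpy memory maps, which accept the same mode strings. The function name demo_mmap_modes and the file name data.npy are assumptions for this illustration. It only shows the semantics of each mode ('r' read-only, 'c' copy-on-write, 'r+' read-write) and does not reproduce the Windows paging-file problem.

```python
import os
import tempfile

import numpy as np


def demo_mmap_modes():
    """Show how writes behave under mmap_mode 'r', 'c' and 'r+'."""
    tmp_dir = tempfile.mkdtemp()
    path = os.path.join(tmp_dir, "data.npy")  # hypothetical file name
    np.save(path, np.zeros(4))

    results = {}

    # 'r': read-only mapping, any write attempt raises ValueError.
    ro = np.load(path, mmap_mode="r")
    try:
        ro[0] = 1.0
        results["r_writable"] = True
    except ValueError:
        results["r_writable"] = False
    del ro

    # 'c': copy-on-write, writes stay in memory and never reach the file.
    cow = np.load(path, mmap_mode="c")
    cow[0] = 1.0
    del cow
    results["c_persisted"] = bool(np.load(path)[0] == 1.0)

    # 'r+': read-write mapping, writes are flushed back to the file.
    rw = np.load(path, mmap_mode="r+")
    rw[1] = 2.0
    rw.flush()
    del rw
    results["rplus_persisted"] = bool(np.load(path)[1] == 2.0)

    return results


print(demo_mmap_modes())
```

With 'c', modified pages have to be backed by private memory, which is consistent with the paging-file pressure described above. joblib.Parallel takes an mmap_mode argument for callers who use it directly.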