Efficient file reading in Python with need to split on '\n'

2022-01-12 00:00:00 python multiprocessing

Problem description

I've traditionally been reading in files with either of these two approaches:

file = open(fullpath, "r")
allrecords = file.read()
delimited = allrecords.split('\n')
for record in delimited[1:]:
    record_split = record.split(',')

with open(os.path.join(txtdatapath,pathfilename), "r") as data:
  datalines = (line.rstrip('\n') for line in data)
  for record in datalines:
    split_line = record.split(',')
    if len(split_line) > 1:

But it seems that when I process these files in a multiprocessing pool, I get a MemoryError. How can I best read in files line by line when the text file I'm reading needs to be split on '\n'?

Here is the multiprocessing code:

pool = Pool()
fixed_args = (targetdirectorytxt, value_dict)
varg = ((filename,) + fixed_args for filename in readinfiles)
op_list = pool.map_async(PPD_star, list(varg), chunksize=1)     
while not op_list.ready():
  print("Number of files left to process: {}".format(op_list._number_left))
  time.sleep(60)
op_list = op_list.get()
pool.close()
pool.join()

Here is the error log:

Exception in thread Thread-3:
Traceback (most recent call last):
  File "C:\Python27\lib\threading.py", line 810, in __bootstrap_inner
    self.run()
  File "C:\Python27\lib\threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "C:\Python27\lib\multiprocessing\pool.py", line 380, in _handle_results
    task = get()
MemoryError

I'm trying to install pathos, as Mike has kindly suggested, but I'm running into issues. Here is my install command:

pip install https://github.com/uqfoundation/pathos/zipball/master --allow-external pathos --pre

But here are the error messages that I get:

Downloading/unpacking https://github.com/uqfoundation/pathos/zipball/master
  Running setup.py (path:c:\users\xxx\appdata\local\temp\2\pip-1e4saj-build\setup.py) egg_info for package from https://github.com/uqfoundation/pathos/zipball/master

Downloading/unpacking ppft>=1.6.4.5 (from pathos==0.2a1.dev0)
  Running setup.py (path:c:\users\xxx\appdata\local\temp\2\pip_build_jptyuser\ppft\setup.py) egg_info for package ppft

    warning: no files found matching 'python-restlib.spec'
Requirement already satisfied (use --upgrade to upgrade): dill>=0.2.2 in c:\python27\lib\site-packages\dill-0.2.2-py2.7.egg (from pathos==0.2a1.dev0)
Requirement already satisfied (use --upgrade to upgrade): pox>=0.2.1 in c:\python27\lib\site-packages\pox-0.2.1-py2.7.egg (from pathos==0.2a1.dev0)
Downloading/unpacking pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)
  Could not find any downloads that satisfy the requirement pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)
  Some externally hosted files were ignored (use --allow-external pyre to allow).
Cleaning up...
No distributions at all found for pyre==0.8.2.0-pathos (from pathos==0.2a1.dev0)

Storing debug log for failure in C:\Users\xxx\pip\pip.log

I'm installing on Windows 7 64-bit. In the end I managed to install with easy_install.

But now I have a failure, because I cannot open that many files:

Finished reading in Exposures...
Reading Samples from:  C:XXXXXXXXX
Traceback (most recent call last):
  File "events.py", line 568, in <module>
    mdrcv_dict = ReadDamages(damage_dir, value_dict)
  File "events.py", line 185, in ReadDamages
    res = thpool.amap(mppool.map, [rstrip]*len(readinfiles), files)
  File "C:\Python27\lib\site-packages\pathos-0.2a1.dev0-py2.7.egg\pathos\multiprocessing.py", line 230, in amap
    return _pool.map_async(star(f), zip(*args)) # chunksize
  File "events.py", line 184, in <genexpr>
    files = (open(name, 'r') for name in readinfiles[0:])
IOError: [Errno 24] Too many open files: 'C:\xx.csv'

Currently, using the multiprocessing library, I pass parameters and dictionaries into my function, open a mapped file, and then output a dictionary. Here is an example of how I currently do it. What would be the smart way to do this with pathos?

def PP_star(args_flat):
    return PP(*args_flat)

def PP(pathfilename, txtdatapath, my_dict):
    return com_dict

fixed_args = (targetdirectorytxt, my_dict)
varg = ((filename,) + fixed_args for filename in readinfiles)
op_list = pool.map_async(PP_star, list(varg), chunksize=1)

How do I use pathos.multiprocessing?


Solution

Let's say we have file1.txt:

hello35
1234123
1234123
hello32
2492wow
1234125
1251234
1234123
1234123
2342bye
1234125
1251234
1234123
1234123
1234125
1251234
1234123

file2.txt:

1234125
1251234
1234123
hello35
2492wow
1234125
1251234
1234123
1234123
hello32
1234125
1251234
1234123
1234123
1234123
1234123
2342bye

And so on, through file5.txt:

1234123
1234123
1234125
1251234
1234123
1234123
1234123
1234125
1251234
1234125
1251234
1234123
1234123
hello35
hello32
2492wow
2342bye

I'd suggest using a hierarchical parallel map to read your files quickly. A fork of multiprocessing (called pathos.multiprocessing) can do this.

>>> import pathos
>>> thpool = pathos.multiprocessing.ThreadingPool()
>>> mppool = pathos.multiprocessing.ProcessingPool()
>>> 
>>> def rstrip(line):
...     return line.rstrip()
... 
>>> # get your list of files
>>> fnames = ['file1.txt', 'file2.txt', 'file3.txt', 'file4.txt', 'file5.txt']
>>> # open the files
>>> files = (open(name, 'r') for name in fnames)
>>> # read each file in asynchronous parallel
>>> # while reading and stripping each line in parallel
>>> res = thpool.amap(mppool.map, [rstrip]*len(fnames), files)
>>> # get the result when it's done
>>> res.ready()
True
>>> data = res.get()
>>> # if not using a files iterator -- close each file by uncommenting the next line
>>> # files = [file.close() for file in files]
>>> data[0]
['hello35', '1234123', '1234123', 'hello32', '2492wow', '1234125', '1251234', '1234123', '1234123', '2342bye', '1234125', '1251234', '1234123', '1234123', '1234125', '1251234', '1234123']
>>> data[1]
['1234125', '1251234', '1234123', 'hello35', '2492wow', '1234125', '1251234', '1234123', '1234123', 'hello32', '1234125', '1251234', '1234123', '1234123', '1234123', '1234123', '2342bye']
>>> data[-1]
['1234123', '1234123', '1234125', '1251234', '1234123', '1234123', '1234123', '1234125', '1251234', '1234125', '1251234', '1234123', '1234123', 'hello35', 'hello32', '2492wow', '2342bye']
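If opening every file up front runs into the operating system's limit on open handles, one variation is to pass file names to the workers and let each worker open, read, and close its own file. The sketch below is not the pathos session above: it uses the standard library's multiprocessing.pool.ThreadPool so that it runs without extra installs. The function read_records and the sample file contents are hypothetical, made up for illustration. multiprocessing.Pool exposes the same map method if process-based workers are wanted.

```python
import os
import tempfile
from functools import partial
from multiprocessing.pool import ThreadPool


def read_records(path, delimiter=","):
    """Read one file line by line and return its split records.

    The file is opened inside the worker and closed by the `with`
    block, so only as many files are open at once as there are workers.
    """
    records = []
    with open(path, "r") as handle:
        for line in handle:              # iterates lazily, one line at a time
            fields = line.rstrip("\r\n").split(delimiter)
            if len(fields) > 1:          # skip blank or single-field lines
                records.append(fields)
    return records


# Build two small sample files (hypothetical data) in a temporary directory.
tmpdir = tempfile.mkdtemp()
paths = []
for i, text in enumerate(["a,b\nc,d\n", "e,f\n\ng,h\n"]):
    path = os.path.join(tmpdir, "sample{}.csv".format(i))
    with open(path, "w") as out:
        out.write(text)
    paths.append(path)

# Map over file *names*, not open file objects.
pool = ThreadPool(2)
try:
    data = pool.map(partial(read_records, delimiter=","), paths)
finally:
    pool.close()
    pool.join()

print(data[0])
print(data[1])
```

Because each worker holds a file open only while reading it, the number of simultaneously open files is bounded by the pool size rather than by the length of the file list. functools.partial is one way to bind fixed arguments without a separate "star" wrapper function.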

However, if you want to check how many files you have left to finish, you might want to use an "iterated" map (imap) instead of an "asynchronous" map (amap). See this post for details: Python multiprocessing - tracking the process of pool.map operation
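As a minimal sketch of that idea, the standard library's pools provide imap_unordered, which yields each result as soon as it is ready, so the caller can count completed files instead of polling a private attribute such as _number_left. The worker count_lines and the in-memory inputs are hypothetical stand-ins for real file processing. ThreadPool is used here only to keep the example self-contained; multiprocessing.Pool offers the same method.

```python
from multiprocessing.pool import ThreadPool


def count_lines(item):
    """Hypothetical worker: return (name, number of lines) for one input."""
    name, text = item
    return name, len(text.splitlines())


# Stand-ins for files: (name, contents) pairs.
inputs = [
    ("file1.txt", "a\nb\nc\n"),
    ("file2.txt", "d\n"),
    ("file3.txt", "e\nf\n"),
]

results = {}
progress = []  # number of files still pending after each completion
pool = ThreadPool(2)
try:
    # imap_unordered yields results as workers finish, in completion order.
    completed = pool.imap_unordered(count_lines, inputs)
    for done, (name, n_lines) in enumerate(completed, 1):
        results[name] = n_lines
        remaining = len(inputs) - done
        progress.append(remaining)
        print("Number of files left to process: {}".format(remaining))
finally:
    pool.close()
    pool.join()
```

Results arrive in completion order rather than input order, which is why they are collected into a dictionary keyed by file name.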

Get pathos here: https://github.com/uqfoundation
