Removing duplicate rows from a large file in Python
Problem description
I have a csv file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way.
Each row contains 15 fields and several hundred characters, and all fields are needed to determine uniqueness. Instead of comparing entire rows to find duplicates, I'm comparing hash(row-as-a-string) in an attempt to save memory. I set a filter that partitions the data into a roughly equal number of rows (e.g. days of the week), and each partition is small enough that a lookup table of hash values for that partition will fit in memory. I pass through the file once for each partition, checking for unique rows and writing them out to a second file (pseudo code):
import csv

# Only 3 of the 15 fields are shown here; every field feeds the hash.
fieldnames = ['DayOfWeek', 'a', 'b']
outs = csv.DictWriter(open('c:\\dedupedFile.csv', 'wb'), fieldnames)
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
outs.writerow(dict(zip(fieldnames, fieldnames)))  # header row (no writeheader() in 2.5)

for day in days:
    htable = {}  # hashes already seen in this partition
    ins = csv.DictReader(open('c:\\bigfile.csv', 'rb'), fieldnames)
    for line in ins:
        # hash the concatenated fields instead of storing the whole row
        hvalue = hash(reduce(lambda x, y: x + y, line.itervalues()))
        if line['DayOfWeek'] == day and hvalue not in htable:
            htable[hvalue] = None
            outs.writerow(line)
One way I was thinking of speeding this up is by finding a better filter to reduce the number of passes necessary. Assuming the length of the rows is uniformly distributed, maybe instead of
for day in days:
and
if line['DayOfWeek']==day:
we have
for i in range(n):
and
if len(reduce(lambda x, y: x + y, line.itervalues())) % n == i:
where 'n' is as small as memory will allow. But this is still using the same method.
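Sketched out under the question's own assumptions (the file names reuse the ones above; n = 4 is an arbitrary choice for illustration), one pass per residue class might look like this:

import csv

fieldnames = ['DayOfWeek', 'a', 'b']
n = 4  # number of partitions; pick the smallest n that keeps htable in memory
outs = csv.DictWriter(open('c:\\dedupedFile.csv', 'wb'), fieldnames)
outs.writerow(dict(zip(fieldnames, fieldnames)))

for i in range(n):
    htable = {}
    ins = csv.DictReader(open('c:\\bigfile.csv', 'rb'), fieldnames)
    for line in ins:
        joined = reduce(lambda x, y: x + y, line.itervalues())
        if len(joined) % n == i:  # this row belongs to partition i
            hvalue = hash(joined)
            if hvalue not in htable:
                htable[hvalue] = None
                outs.writerow(line)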
Wayne Werner provided a good practical solution below; I was curious whether there is a better/faster/simpler way to do this from an algorithmic perspective.
P.S. I'm limited to Python 2.5.
Solution
If you want a really simple way to do this, just create a sqlite database:
import csv
import sqlite3

conn = sqlite3.connect('single.db')
cur = conn.cursor()
# The 15-column primary key makes sqlite reject duplicate rows for us.
cur.execute("""create table test(
    f1 text,
    f2 text,
    f3 text,
    f4 text,
    f5 text,
    f6 text,
    f7 text,
    f8 text,
    f9 text,
    f10 text,
    f11 text,
    f12 text,
    f13 text,
    f14 text,
    f15 text,
    primary key(f1, f2, f3, f4, f5, f6, f7,
                f8, f9, f10, f11, f12, f13, f14, f15))
""")
conn.commit()

# simplified/pseudo code
reader = csv.reader(open('c:\\bigfile.csv', 'rb'))
for row in reader:
    # assuming row returns a list-type object with 15 fields
    try:
        cur.execute('''insert into test values(?, ?, ?, ?, ?, ?, ?,
                       ?, ?, ?, ?, ?, ?, ?, ?)''', row)
    except sqlite3.IntegrityError:
        pass  # duplicate row - the primary key already exists
conn.commit()  # one commit at the end is much faster than one per row

cur.execute('select * from test')
writer = csv.writer(open('c:\\dedupedFile.csv', 'wb'))
for row in cur:
    writer.writerow(row)  # write each unique row to the csv file
Then you wouldn't have to worry about any of the comparison logic yourself - just let sqlite take care of it for you. It probably won't be much faster than hashing the strings, but it's probably a lot easier. Of course you could modify the types stored in the database if you wanted, or not, as the case may be. And since you're already converting the data to a string, you could just have one field instead. Plenty of options here.
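As a minimal sketch of that single-field variant (the table name, column name, and the comma-joined key are illustrative choices, not from the answer), store each row-as-a-string in one text column and let the primary key constraint do the deduplication:

import csv
import sqlite3

conn = sqlite3.connect('single.db')
cur = conn.cursor()
# One text column holding the whole row; the primary key enforces uniqueness.
cur.execute("create table rows(line text primary key)")

reader = csv.reader(open('c:\\bigfile.csv', 'rb'))
writer = csv.writer(open('c:\\dedupedFile.csv', 'wb'))
for row in reader:
    key = ','.join(row)  # the row-as-a-string the question already builds
    try:
        cur.execute("insert into rows values(?)", (key,))
        writer.writerow(row)  # first occurrence - keep it
    except sqlite3.IntegrityError:
        pass  # duplicate - skip it
conn.commit()

A side benefit of this variant is that the output keeps the original row order, since each unique row is written out the first time it is seen.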