尝试比较两个 csv 文件并将差异写入输出
问题描述
我正在开发一个脚本,它获取 2 个 csv 文件之间的差异,并制作一个新的 csv 文件作为输出,但只有当两个输入文件之间的相同 2 行(指行号)包含不同的数据时,例如第 3 行在文件 1 中有mike"、basketball player",在文件 2 中第 3 行有mike"、baseball player".输出 csv 会抓取这些打印它们并将它们写入 csv.它有效,但存在一些问题(我知道这个问题之前也被问过好几次,但其他人对我的做法不同,而且由于我对编程还很陌生,所以我不太了解他们的代码).
I'm developing a script which takes the difference between 2 csv files and makes a new csv file as output with the differences BUT only if the same 2 rows (refers to row number) between the two input files contain different data e.g. row 3 has "mike", "basketball player" in file 1 and row 3 in file 2 has "mike", "baseball player". The output csv would grab these print them and write them to a csv. It works but there are some issues (I know that this question has also been asked several times before but others have done it differently to me and since I'm fairly new to programming I don't quite understand their codes).
新 csv 文件中的输出在每个单元格中都有输出的每个字母(见下图),我相信这与第 37 行的分隔符/引号字符/引号有关.我希望它们在自己的单元格中没有任何句号、多个空格、逗号或|".
The output in the new csv file has each letter of the output in each cell (see image below) and I believe its something to do with the delimiter/quotechar/quoting line 37. I want them in their own cells without any fullstops, multiple spaces, commas or "|".
另一个问题是它需要很长时间才能运行.我正在处理多达 50,000 行的数据集,运行可能需要一个多小时.为什么会这样?有什么建议可以加快速度?也许在 for 循环之外放一些东西?我之前确实尝试过 difflib 方法,但我只能打印整个input_file1",但不能将该文件与另一个文件进行比较.
Another issue is that it takes a long time to run. I'm working with datasets of up to 50,000 rows and it can take over an hour to run. Why is this and what advice would be useful to speed it up? Put something outside of the for loop maybe? I did try the difflib method earlier on but I was only able to print the entire "input_file1" but not compare that file with another.
# aim of script is to compare csv files and output difference as a new csv
# import necessary libraries
import csv
# File1 = open(raw_input("path:"),"r") #filename, mode
# File2 = open(raw_input("path:"),"r") #filename, mode
# selects the 2 input files to be compared
input_file1 = "G:/savestuffhereqwerty/electorate_meshblocks/teststuff/Book1.csv"
input_file2 = "G:/savestuffhereqwerty/electorate_meshblocks/teststuff/Book2.csv"
# creates the blank output csv file
output_path = "G:/savestuffhereqwerty/electorate_meshblocks/outputs/output2.csv"
a = open(input_file1, "r")
output_file = open(output_path,"w")
output_file.close()
count = 0
with open(input_file1) as fp1:
for row_number1, row_value1 in enumerate(fp1):
if row_number1 == count:
print "got to 1st point"
value1 = row_value1
with open(input_file2) as fp2:
for row_number2, row_value2 in enumerate(fp2):
if row_number2 == count:
print "got to 2nd point"
value2 = row_value2
if value1 == value2:
print value1, value2
else:
print value1, value2
with open(output_path, 'wb') as f:
writer = csv.writer(f, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
# testing to see if the code writes text to the csv
writer.writerow(["test1"])
writer.writerow(["test2", "test3", "test4"])
writer.writerows([value1, value2])
print "code reached writing stage"
count += 1
print count
print "done"
# replace(",",".")
解决方案
既然要逐行比较这两个文件,你应该不要循环遍历第二个文件 for 第一个文件中的每一行.您可以简单地 zip
两个 csv 阅读器并过滤行:
Since you want to compare the two files line-by-line, you should not loop through the second file for every line in the first file. You can simply zip
two csv readers and filter the rows:
input_file1 = "foo"
input_file2 = "bar"
output_path = "baz"
with open(input_file1) as fin1:
with open(input_file2) as fin2:
read1 = csv.reader(fin1)
read2 = csv.reader(fin2)
diff_rows = (row1 for row1, row2 in zip(read1, read2) if row1 != row2)
with open(output_path, 'w') as fout:
writer = csv.writer(fout)
writer.writerows(diff_rows)
此解决方案假定两个文件具有相同的行数.
This solution assumes that the two files have the same number of lines.
相关文章