How to check whether two files are identical in Python

2022-01-25 00:00:00 python file compare md5

Problem description

Is making a system call to "md5sum file1" and "md5sum file2" and comparing the two resulting hashes enough in this case?


Solution

Well, that will tell you whether they're definitely different or probably the same. It's possible for two files to have the same hash without actually containing the same data; it's just very unlikely.
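As a rough illustration of the hash comparison, here is a minimal sketch that computes the digests inside Python with the standard hashlib module instead of shelling out to md5sum. The names file_digest and probably_same, and the chunk size, are made up for this example; they are not from the original answer.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks.

    The default of "md5" mirrors the md5sum call in the question;
    pass "sha256" (or another hashlib name) for a stronger hash.
    """
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def probably_same(path_a, path_b, algorithm="md5"):
    """False means definitely different; True means probably the same."""
    return file_digest(path_a, algorithm) == file_digest(path_b, algorithm)
```

A False result is conclusive, while a True result only means the digests match.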

In your situation, what is the impact if you get a false positive (i.e. you think they're the same, but they're not)? MD5 is probably good enough not to worry about collisions if they would only occur accidentally... but if you've got security (or money) at stake and someone could plant a "bad" file with the same hash as a "good" file, you shouldn't rely on it.

Personally, I'd probably just read both files and compare them byte by byte. For a one-off comparison, both hashing and this approach have to read the whole of both files when they're equal; as Daniel points out in the comments, a byte-by-byte comparison lets you exit early as soon as you see a difference. Comparing the file sizes first is another quick optimization :)
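One way the byte-by-byte approach might look, as a sketch only: it checks the sizes first and then compares the files chunk by chunk, stopping at the first difference. The function name files_identical and the chunk size are assumptions for this example, and it assumes both paths are regular files on disk.

```python
import os


def files_identical(path_a, path_b, chunk_size=65536):
    """Compare two regular files by content, exiting at the first difference."""
    # Cheap check first: files of different sizes cannot be identical.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # early exit on the first mismatching chunk
            if not chunk_a:
                return True  # both files ended without a mismatch
```

The standard library also provides filecmp.cmp(path_a, path_b, shallow=False), which performs a content comparison along the same lines.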

The general advantage of hashing shows up when you store the hash of the existing file somewhere, so that next time you only need to read the new file.
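A small sketch of that stored-hash idea, under the assumption that the digest of the existing file was computed earlier and saved somewhere as a hex string. SHA-256 is used here rather than MD5 as an illustrative choice, and the helper names sha256_of and matches_stored are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_stored(new_path, stored_hex):
    """Check a new file against a previously saved digest.

    Only the new file is read; the existing file is not touched again.
    """
    return sha256_of(new_path) == stored_hex.strip().lower()
```

Here the saved digest stands in for the existing file, so each later check costs a single read of the new file.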
