HDFS file checksum

2022-01-14 00:00:00 hadoop mapreduce checksum java hdfs

I am trying to check the consistency of a file after copying it to HDFS, using the Hadoop API DFSClient.getFileChecksum().

I am getting the following output from the code below:

Null
HDFS : null
Local : null

Can anyone point out the mistake? Here is the code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;


public class fileCheckSum {

    /**
     * @param args
     * @throws IOException 
     */
    public static void main(String[] args) throws IOException {
        // TODO Auto-generated method stub

        Configuration conf = new Configuration();

        FileSystem hadoopFS = FileSystem.get(conf);
        // Path hdfsPath = new Path("/derby.log");

        LocalFileSystem localFS = LocalFileSystem.getLocal(conf);
        // Path localPath = new Path("file:///home/ubuntu/derby.log");

        // System.out.println("HDFS PATH : "+hdfsPath.getName());
        // System.out.println("Local PATH : "+localPath.getName());

        FileChecksum hdfsChecksum = hadoopFS.getFileChecksum(new Path("/derby.log"));
        FileChecksum localChecksum = localFS.getFileChecksum(new Path("file:///home/ubuntu/derby.log"));


        if(null!=hdfsChecksum || null!=localChecksum){
            System.out.println("HDFS Checksum : "+hdfsChecksum.toString()+"	"+hdfsChecksum.getLength());
            System.out.println("Local Checksum : "+localChecksum.toString()+"	"+localChecksum.getLength());

            if(hdfsChecksum.toString().equals(localChecksum.toString())){
                System.out.println("Equal");
            }else{
                System.out.println("UnEqual");

            }
        }else{
            System.out.println("Null");
            System.out.println("HDFS : "+hdfsChecksum);
            System.out.println("Local : "+localChecksum);

        }

    }

}

Answer

Since you aren't setting a remote NameNode address (fs.defaultFS) on the conf, and are essentially using the same configuration for both, hadoopFS and localFS both point to an instance of LocalFileSystem.
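For example (a minimal sketch: the NameNode host and port below are placeholders for your cluster's address), pointing the configuration at HDFS via fs.defaultFS makes FileSystem.get(conf) return a DistributedFileSystem, while FileSystem.getLocal(conf) always returns a LocalFileSystem:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; on Hadoop 1.x the key is "fs.default.name".
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        FileSystem hadoopFS = FileSystem.get(conf);       // DistributedFileSystem
        FileSystem localFS  = FileSystem.getLocal(conf);  // LocalFileSystem

        System.out.println("hadoopFS : " + hadoopFS.getClass().getName());
        System.out.println("localFS  : " + localFS.getClass().getName());
    }
}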

getFileChecksum isn't implemented for LocalFileSystem and returns null. It should work for DistributedFileSystem, though: if your conf points to a distributed cluster, FileSystem.get(conf) returns an instance of DistributedFileSystem, whose getFileChecksum returns an MD5 of MD5s of the CRC32 checksums of chunks of size bytes.per.checksum. The value therefore depends on the block size and on the cluster-wide bytes.per.checksum setting, which is why both parameters are also encoded in the algorithm name of the returned checksum: MD5-of-xxxMD5-of-yyyCRC32, where xxx is the number of CRC checksums per block and yyy is the bytes.per.checksum parameter.
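As a rough illustration (a sketch: the NameNode address is a placeholder, and the algorithm name in the comment assumes the common defaults of a 128 MB block and 512 bytes per checksum), this is how the checksum returned by a real cluster can be inspected:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsChecksumDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");   // placeholder address

        FileSystem fs = FileSystem.get(conf);                    // DistributedFileSystem
        FileChecksum checksum = fs.getFileChecksum(new Path("/derby.log"));

        if (checksum != null) {
            // e.g. "MD5-of-262144MD5-of-512CRC32": 262144 CRCs per block, 512 bytes per checksum
            System.out.println("Algorithm : " + checksum.getAlgorithmName());
            System.out.println("Length    : " + checksum.getLength());
            System.out.println("Checksum  : " + checksum);
        } else {
            System.out.println("No checksum returned (e.g. LocalFileSystem)");
        }
    }
}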

getFileChecksum isn't designed to be comparable across filesystems. Although it's possible to simulate the distributed checksum locally, or to hand-craft MapReduce jobs that calculate equivalents of local hashes, I suggest relying on Hadoop's own integrity checks, which happen whenever a file is written to or read from Hadoop.
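If you still need an end-to-end comparison across the two filesystems, one simple single-process alternative (a sketch, not the distributed checksum above; the NameNode address is a placeholder and the paths are the ones from the question) is to stream both copies and compare a plain MD5 digest:

import java.io.InputStream;
import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CompareByMd5 {

    // Stream the file through the client and compute a plain MD5 of its contents.
    static byte[] md5(FileSystem fs, Path path) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[64 * 1024];
        try (InputStream in = fs.open(path)) {
            int n;
            while ((n = in.read(buf)) > 0) {
                md.update(buf, 0, n);
            }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");   // placeholder address

        byte[] hdfsMd5  = md5(FileSystem.get(conf), new Path("/derby.log"));
        byte[] localMd5 = md5(FileSystem.getLocal(conf), new Path("file:///home/ubuntu/derby.log"));

        System.out.println(MessageDigest.isEqual(hdfsMd5, localMd5) ? "Equal" : "UnEqual");
    }
}

Since this pulls the whole file through the client, it is only practical for reasonably small files.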
