HDFS file checksum
I am trying to check the consistency of a file after copying it to HDFS, using the Hadoop API DFSClient.getFileChecksum().
I am getting the following output for the code below:
Null
HDFS : null
Local : null
Can anyone point out the error or mistake? Here is the code:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

public class fileCheckSum {

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        // TODO Auto-generated method stub
        Configuration conf = new Configuration();
        FileSystem hadoopFS = FileSystem.get(conf);
        // Path hdfsPath = new Path("/derby.log");
        LocalFileSystem localFS = LocalFileSystem.getLocal(conf);
        // Path localPath = new Path("file:///home/ubuntu/derby.log");
        // System.out.println("HDFS PATH : "+hdfsPath.getName());
        // System.out.println("Local PATH : "+localPath.getName());

        FileChecksum hdfsChecksum = hadoopFS.getFileChecksum(new Path("/derby.log"));
        FileChecksum localChecksum = localFS.getFileChecksum(new Path("file:///home/ubuntu/derby.log"));

        if (null != hdfsChecksum || null != localChecksum) {
            System.out.println("HDFS Checksum : " + hdfsChecksum.toString() + " " + hdfsChecksum.getLength());
            System.out.println("Local Checksum : " + localChecksum.toString() + " " + localChecksum.getLength());

            if (hdfsChecksum.toString().equals(localChecksum.toString())) {
                System.out.println("Equal");
            } else {
                System.out.println("UnEqual");
            }
        } else {
            System.out.println("Null");
            System.out.println("HDFS : " + hdfsChecksum);
            System.out.println("Local : " + localChecksum);
        }
    }
}
Recommended answer
Since you aren't setting a remote address on the conf and are essentially using the same configuration, both hadoopFS and localFS are pointing to an instance of LocalFileSystem.
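A minimal sketch of what that means in practice (the NameNode URI below is a placeholder, not something taken from the question): pointing the configuration at an actual cluster makes FileSystem.get(conf) return a DistributedFileSystem instead of falling back to LocalFileSystem.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class WhichFileSystem {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address: use the fs.defaultFS value
        // from your cluster's core-site.xml here.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        FileSystem fs = FileSystem.get(conf);
        // With an hdfs:// default this prints
        // org.apache.hadoop.hdfs.DistributedFileSystem; with the empty
        // default configuration it prints LocalFileSystem, which is why
        // getFileChecksum() returned null in the question.
        System.out.println(fs.getClass().getName() + " : " + fs.getUri());
    }
}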
getFileChecksum isn't implemented for LocalFileSystem and returns null. It should work for DistributedFileSystem, though: if your conf points to a distributed cluster, FileSystem.get(conf) should return an instance of DistributedFileSystem, which returns an MD5 of MD5s of CRC32 checksums computed over chunks of size bytes.per.checksum. This value depends on the block size and the cluster-wide config bytes.per.checksum. That's why these two params are also encoded in the return value of the distributed checksum as the name of the algorithm: MD5-of-xxxMD5-of-yyyCRC32, where xxx is the number of CRC checksums per block and yyy is the bytes.per.checksum parameter.
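A small sketch of inspecting that return value (the path and NameNode URI are assumptions carried over from the question): getAlgorithmName() exposes the MD5-of-xxxMD5-of-yyyCRC32 string, so two checksums are only meaningfully comparable when their algorithm names match as well.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class InspectChecksum {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // hypothetical cluster address

        FileSystem hdfs = FileSystem.get(conf);
        FileChecksum cs = hdfs.getFileChecksum(new Path("/derby.log"));
        if (cs != null) {
            // The algorithm name encodes the per-block CRC count and the
            // bytes.per.checksum setting described above.
            System.out.println("Algorithm : " + cs.getAlgorithmName());
            System.out.println("Length    : " + cs.getLength());
            System.out.println("Checksum  : " + cs);
        }
    }
}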
getFileChecksum isn't designed to be comparable across filesystems. Although it's possible to simulate the distributed checksum locally, or to hand-craft map-reduce jobs to calculate equivalents of local hashes, I suggest relying on Hadoop's own integrity checks, which happen whenever a file gets written to or read from Hadoop.
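If you still want a cross-filesystem comparison, one possible sketch (my own illustration under the question's paths, not part of the original answer) is to hash the full byte stream on both sides. Reading through the HDFS client also triggers Hadoop's own CRC verification, so a read that completes without a checksum error already implies the stored blocks are intact.

import java.io.InputStream;
import java.security.MessageDigest;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamMd5Compare {
    // Compute an MD5 over the full contents of a file on any FileSystem.
    static byte[] md5Of(FileSystem fs, Path p) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[64 * 1024];
        try (InputStream in = fs.open(p)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder NameNode URI

        FileSystem hdfs = FileSystem.get(conf);
        FileSystem local = FileSystem.getLocal(conf);

        byte[] hdfsMd5 = md5Of(hdfs, new Path("/derby.log"));
        byte[] localMd5 = md5Of(local, new Path("file:///home/ubuntu/derby.log"));
        System.out.println(MessageDigest.isEqual(hdfsMd5, localMd5) ? "Equal" : "UnEqual");
    }
}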