Processing a large number of small files with Hadoop

2022-01-13 00:00:00 hadoop mapreduce java hdfs

I am using the Hadoop example program WordCount to process a large set of small files (web pages of about 2-3 kB each). Since this is far from the optimal file size for Hadoop, the program is very slow. I guess this is because the cost of setting up and tearing down the job is far greater than the work itself. Such small files also deplete the namespace for file names.

I read that in this case I should use an HDFS archive (HAR), but I am not sure how to modify WordCount to read from such an archive. Can the program keep working without modification, or is some modification necessary?

Even if I pack a lot of files into archives, the question remains whether this will improve performance. I read that even if I pack multiple files together, the files inside one archive will not be processed by one mapper but by many, which in my case (I guess) will not improve performance.

If this question is too simple, please understand that I am new to Hadoop and have very little experience with it.

Recommended answer

Using HDFS won't change the fact that you are making Hadoop handle a large number of small files. The best option in this case is probably to cat the files into a single large file (or a few large files). This will reduce the number of mappers, and with it the number of tasks that have to be set up and run.
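As a rough illustration of the "cat the files" step, here is a minimal sketch that merges every file in a local directory into one large file before it is uploaded to HDFS. It uses only the JDK, not the Hadoop API; the class name `ConcatSmallFiles` and the method `concat` are hypothetical names chosen for this example, and it assumes the small files are plain text sitting on the local disk.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: merges many small local files into one large file.
public class ConcatSmallFiles {

    /**
     * Appends every regular file found directly in inputDir to outputFile,
     * in file-name order, and returns the number of files merged.
     * outputFile should live outside inputDir so it is not merged into itself.
     */
    public static int concat(Path inputDir, Path outputFile) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(inputDir)) {
            files = listing.filter(Files::isRegularFile)
                           .sorted()
                           .collect(Collectors.toList());
        }
        // With no options, newOutputStream creates the file or truncates an existing one.
        try (OutputStream out = Files.newOutputStream(outputFile)) {
            for (Path file : files) {
                Files.copy(file, out);
                // Separate files with a newline so the last word of one page
                // does not fuse with the first word of the next.
                out.write('\n');
            }
        }
        return files.size();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ConcatSmallFiles <inputDir> <outputFile>");
            return;
        }
        int merged = concat(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Merged " + merged + " files into " + args[1]);
    }
}
```

The merged file can then be copied into HDFS and passed to WordCount as its input path. Since WordCount only counts words, losing the boundaries between the original pages does not matter here; a job that needs to know which page a word came from would need a different packing scheme.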

Using HDFS can improve performance if you are running on a distributed system. If you are only running in pseudo-distributed mode (one machine), HDFS isn't going to improve performance; the limitation is the machine.

When you operate on a large number of small files, the job needs a large number of mappers (with the default input format, at least one per file). The setup and teardown of each task can be comparable to the time spent processing the file itself, which causes a large overhead. Concatenating the files should reduce the number of mappers Hadoop runs for the job, which should improve performance.

The benefit you could see from storing the files in HDFS would come in distributed mode, with multiple machines. The files would be stored in blocks (64 MB by default in older Hadoop releases, 128 MB in Hadoop 2 and later) spread across machines, and each machine would be able to process the blocks of data that reside on it. This reduces network bandwidth use, so the network doesn't become a bottleneck in processing.
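To put rough numbers on the overhead argument, the sketch below compares how many map tasks a job would need before and after merging. It is only a back-of-the-envelope estimate: the file count of 100,000 and the one-second per-task overhead are hypothetical figures chosen for illustration, the class and method names are made up for this example, and the split count is approximated as one per block, ignoring the finer details of how Hadoop computes splits.

```java
// Hypothetical back-of-the-envelope estimate, not part of the Hadoop API.
public class MapTaskEstimate {

    public static final long MB = 1024L * 1024L;

    /** Files smaller than one block each get their own split, so one map task per file. */
    public static long splitsForSeparateFiles(long fileCount) {
        return fileCount;
    }

    /** Approximate split count for one large file: ceil(fileBytes / blockBytes). */
    public static long splitsForMergedFile(long fileBytes, long blockBytes) {
        if (fileBytes <= 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long fileCount = 100_000;          // assumption: number of small web pages
        long bytesPerFile = 3 * 1024;      // about 3 kB each, as in the question
        long blockBytes = 64 * MB;         // block size quoted in the answer
        double overheadSeconds = 1.0;      // assumption: startup cost per map task

        long separate = splitsForSeparateFiles(fileCount);
        long merged = splitsForMergedFile(fileCount * bytesPerFile, blockBytes);

        System.out.printf("separate files: %d map tasks, ~%.0f s of task overhead%n",
                separate, separate * overheadSeconds);
        System.out.printf("merged file:    %d map tasks, ~%.0f s of task overhead%n",
                merged, merged * overheadSeconds);
    }
}
```

Under these assumptions the unmerged input needs 100,000 map tasks, while the same data merged into one file of roughly 293 MB needs about 5, so the fixed per-task cost stops dominating the run time.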

Archiving the files will not help if Hadoop is going to unarchive them: it will still end up with a large number of small files.

Hope this helps your understanding.
