Multiple lines of text to a single map
I've been trying to use Hadoop to send N lines at a time to a single mapper. I don't need the lines to be split beforehand.
I've tried to use NLineInputFormat, however that sends N lines of text from the data to each mapper one line at a time [giving up after the Nth line].
I have tried setting this option, but it still delivers the N lines of input one line at a time to each map:
job.setInt("mapred.line.input.format.linespermap", 10);
I've found a mailing list post recommending that I override LineRecordReader::next, but that isn't so simple, since the internal data members are all private.
I've just checked the source for NLineInputFormat and it hard codes LineReader, so overriding will not help.
Also, by the way, I'm using Hadoop 0.18 for compatibility with Amazon EC2 MapReduce.
Recommended answer
You have to implement your own input format. You then also have the option of defining your own record reader.
Unfortunately you have to define a getSplits() method. In my opinion this will be harder than implementing the record reader, since this method has to implement the logic for chunking the input data (see the sketch after the book excerpt below).
See the following excerpt from "Hadoop: The Definitive Guide" (a great book I would always recommend!):
The interface looks like this:
public interface InputFormat<K, V> {
  InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
  RecordReader<K, V> getRecordReader(InputSplit split,
                                     JobConf job,
                                     Reporter reporter) throws IOException;
}
The JobClient calls the getSplits() method, passing the desired number of map tasks as the numSplits argument. This number is treated as a hint, as InputFormat implementations are free to return a different number of splits to the number specified in numSplits. Having calculated the splits, the client sends them to the jobtracker, which uses their storage locations to schedule map tasks to process them on the tasktrackers.
On a tasktracker, the map task passes the split to the getRecordReader() method on InputFormat to obtain a RecordReader for that split. A RecordReader is little more than an iterator over records, and the map task uses one to generate record key-value pairs, which it passes to the map function. A code snippet (based on the code in MapRunner) illustrates the idea:
K key = reader.createKey();
V value = reader.createValue();
while (reader.next(key, value)) {
  mapper.map(key, value, output, reporter);
}
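To make this concrete, here is a minimal sketch of such an input format against the old org.apache.hadoop.mapred API that ships with 0.18. The names NLinesInputFormat and NLinesRecordReader are made up for illustration, and the sketch dodges the hard getSplits() work by simply inheriting FileInputFormat's default byte-oriented splits. It wraps LineRecordReader by composition rather than subclassing, which sidesteps the private-member problem, and concatenates N lines into each value handed to map():

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical input format that hands N consecutive lines to each map() call.
public class NLinesInputFormat extends FileInputFormat<LongWritable, Text> {

  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    // Reusing the same property name as NLineInputFormat is just for convenience.
    int linesPerRecord = job.getInt("mapred.line.input.format.linespermap", 1);
    return new NLinesRecordReader(
        new LineRecordReader(job, (FileSplit) split), linesPerRecord);
  }

  // Wraps LineRecordReader by composition and concatenates N lines per record.
  public static class NLinesRecordReader implements RecordReader<LongWritable, Text> {
    private final LineRecordReader lines;
    private final int linesPerRecord;
    private final LongWritable dummyKey = new LongWritable();
    private final Text line = new Text();

    public NLinesRecordReader(LineRecordReader lines, int linesPerRecord) {
      this.lines = lines;
      this.linesPerRecord = linesPerRecord;
    }

    public boolean next(LongWritable key, Text value) throws IOException {
      // The key becomes the byte offset of the first line in the group.
      if (!lines.next(key, line)) {
        return false;                       // this split is exhausted
      }
      StringBuilder block = new StringBuilder(line.toString());
      for (int i = 1; i < linesPerRecord && lines.next(dummyKey, line); i++) {
        block.append('\n').append(line.toString());
      }
      value.set(block.toString());
      return true;
    }

    public LongWritable createKey() { return lines.createKey(); }
    public Text createValue() { return lines.createValue(); }
    public long getPos() throws IOException { return lines.getPos(); }
    public float getProgress() throws IOException { return lines.getProgress(); }
    public void close() throws IOException { lines.close(); }
  }
}

Job setup would then look something like this (again, NLinesInputFormat is the hypothetical class above and MyJob is a placeholder driver class):

JobConf conf = new JobConf(MyJob.class);
conf.setInputFormat(NLinesInputFormat.class);
conf.setInt("mapred.line.input.format.linespermap", 10);

Because the default getSplits() is reused, the last group in a split may contain fewer than N lines; if you need exact N-line splits you would still have to implement getSplits() yourself, as described above.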