在运行 Hadoop MapReduce 作业时获取文件名/文件数据作为 Map 的键/值输入

2022-01-13 00:00:00 hadoop mapreduce java

我解决了这个问题如何在运行 Hadoop MapReduce 作业时获取文件名/文件内容作为 MAP 的键/值输入? 在这里.虽然它解释了这个概念，但我无法成功地将其转换为代码.

I went through the question How to get Filename/File Contents as key/value input for MAP when running a Hadoop MapReduce Job? here. Though it explains the concept, I am unable to successfully transform it to code.

基本上，我希望文件名作为键，文件数据作为值.为此，我按照上述问题中的建议编写了一个自定义 RecordReader .但是我不明白如何在这个类中获取文件名作为键.另外，在编写自定义 FileInputFormat 类时，我无法理解如何返回我之前编写的自定义 RecordReader.

Basically, I want the file name as key and the file data as value. For that I wrote a custom RecordReader as recommended in the aforementioned question. But I couldn't understand how to get the file name as the key in this class. Also, while writing the custom FileInputFormat class, I couldn't understand how to return the custom RecordReader I wrote previously.

RecordReader 代码为:

import java.io.IOException; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.InputSplit; import org.apache.hadoop.mapreduce.RecordReader; import org.apache.hadoop.mapreduce.TaskAttemptContext; public class CustomRecordReader extends RecordReader<Text, Text> { private static final String LINE_SEPARATOR = System.getProperty("line.separator"); private StringBuffer valueBuffer = new StringBuffer(""); private Text key = new Text(); private Text value = new Text(); private RecordReader<Text, Text> recordReader; public SPDRecordReader(RecordReader<Text, Text> recordReader) { this.recordReader = recordReader; } @Override public void close() throws IOException { recordReader.close(); } @Override public Text getCurrentKey() throws IOException, InterruptedException { return key; } @Override public Text getCurrentValue() throws IOException, InterruptedException { return value; } @Override public float getProgress() throws IOException, InterruptedException { return recordReader.getProgress(); } @Override public void initialize(InputSplit arg0, TaskAttemptContext arg1) throws IOException, InterruptedException { recordReader.initialize(arg0, arg1); } @Override public boolean nextKeyValue() throws IOException, InterruptedException { if (valueBuffer.equals("")) { while (recordReader.nextKeyValue()) { valueBuffer.append(recordReader.getCurrentValue()); valueBuffer.append(LINE_SEPARATOR); } value.set(valueBuffer.toString()); return true; } return false; } }

而不完整的FileInputFormat类是:

import java.io.IOException; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapred.FileInputFormat; import org.apache.hadoop.mapred.InputSplit; import org.apache.hadoop.mapred.JobConf; import org.apache.hadoop.mapred.RecordReader; import org.apache.hadoop.mapred.Reporter; public class CustomFileInputFormat extends FileInputFormat<Text, Text> { @Override protected boolean isSplitable(FileSystem fs, Path filename) { return false; } @Override public RecordReader<Text, Text> getRecordReader(InputSplit arg0, JobConf arg1, Reporter arg2) throws IOException { return null; } }

推荐答案

在你的 CustomRecordReader 类中有这个代码.

Have this code in your CustomRecordReader class.

private LineRecordReader lineReader; private String fileName; public CustomRecordReader(JobConf job, FileSplit split) throws IOException { lineReader = new LineRecordReader(job, split); fileName = split.getPath().getName(); } public boolean next(Text key, Text value) throws IOException { // get the next line if (!lineReader.next(key, value)) { return false; } key.set(fileName); value.set(value); return true; } public Text createKey() { return new Text(""); } public Text createValue() { return new Text(""); }

删除 SPDRecordReader 构造函数(这是一个错误).

Remove SPDRecordReader constructor (It is an error).

并在您的 CustomFileInputFormat 类中包含此代码

And have this code in your CustomFileInputFormat class

public RecordReader<Text, Text> getRecordReader( InputSplit input, JobConf job, Reporter reporter) throws IOException { reporter.setStatus(input.toString()); return new CustomRecordReader(job, (FileSplit)input); }

相关文章