In Hadoop, where does the framework save the output of a Map task in a normal Map-Reduce application?

I am trying to find out where the output of a Map task is saved to disk before it can be used by a Reduce task.

Note: the version used is Hadoop 0.20.204 with the new API.

For example, when overriding the map method in the Mapper class:

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    // word (a Text) and one (an IntWritable) are the usual WordCount
    // fields declared on the enclosing Mapper class.
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
    }

    // code that starts a new Job.

}
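Roughly, the job-starting code looks like this (a sketch: the nested job's name and output path are placeholders, only the input path matches what I describe below):

// Hypothetical reconstruction of the nested job; only the input path
// (the mapper's work path) is taken from my actual code.
Job nested = new Job(context.getConfiguration(), "nested-job");
FileInputFormat.addInputPath(nested, FileOutputFormat.getWorkOutputPath(context));
FileOutputFormat.setOutputPath(nested, new Path("/tmp/outputs/nested"));
nested.submit(); // submission fails with the InvalidInputException shown below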

I am interested in finding out where context.write() ends up writing the data. So far I've run into:

FileOutputFormat.getWorkOutputPath(context);

which gives me the following location on HDFS:

hdfs://localhost:9000/tmp/outputs/1/_temporary/_attempt_201112221334_0001_m_000000_0

When I try to use it as input for another job, it gives me the following error:

org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:9000/tmp/outputs/1/_temporary/_attempt_201112221334_0001_m_000000_0

Note: the job is started in the Mapper, so technically the temporary folder where the Mapper task is writing its output exists when the new job begins. Then again, it still says that the input path does not exist.
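A minimal way to check this, assuming the check is run from inside the Mapper, would be something like:

// Check from inside the map task whether the work path is visible.
Path work = FileOutputFormat.getWorkOutputPath(context);
FileSystem fs = work.getFileSystem(context.getConfiguration());
System.out.println(work + " exists: " + fs.exists(work));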

Any ideas where the temporary output is written to? Or, where can I find the output of a Map task during a job that has both a Map and a Reduce stage?

Recommended Answer

So, I've figured out what is really going on.

The mapper's output is buffered in memory; when the buffer fills to about 80% of its capacity, it starts spilling the contents to the task's local disk while continuing to accept records into the buffer.
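For reference, the buffer size and the spill threshold are configurable; a sketch using the 0.20-era property names:

// In-memory map output buffer size, in MB (default 100).
conf.setInt("io.sort.mb", 100);
// Fill fraction that triggers a background spill to local disk (default 0.80).
conf.setFloat("io.sort.spill.percent", 0.80f);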

I wanted to get the intermediate output of the mapper and use it as input for another job while the mapper was still running. It turns out that this is not possible without heavily modifying the Hadoop 0.20.204 deployment. The way the system works, even after the whole Mapper lifecycle specified in the map context has run (this is essentially the Mapper.run() skeleton):

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}

and cleanup() has been called, there is still no dump to the temporary folder.

Afterwards, the whole Map output eventually gets merged and dumped to disk, becoming the input for the Shuffle and Sort stages that precede the Reducer.
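Those merged spill files live under the node-local directories configured by mapred.local.dir, not under an HDFS path (the exact subdirectory layout is an implementation detail). A quick way to see the configured directories from inside a task:

// Intermediate map output goes to node-local disk, not HDFS.
String localDirs = context.getConfiguration().get("mapred.local.dir");
System.out.println("local spill directories: " + localDirs);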

From everything I've read and looked at so far, the temporary folder where the output should eventually end up is the one I was guessing beforehand:

FileOutputFormat.getWorkOutputPath(context)

I managed to do what I wanted in a different way. Anyway, if there are any questions about this, let me know.
