Secondary sorting in Map-Reduce

2022-01-13 00:00:00 parallel-processing hadoop mapreduce java

I understand how the values of a particular key can be sorted before the key reaches the reducer. I learned that it can be done by writing three components: a key comparator, a partitioner, and a value grouping comparator.
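For context, here is a minimal plain-Java sketch of the partitioner part of that setup. It does not use the Hadoop classes; `CompositeKey`, `partitionFor`, and the field names are hypothetical and only illustrate the idea that the partition is derived from the natural key alone, so all composite keys that share a natural key reach the same reducer. In a real job the equivalent logic would live in a class registered with `Job.setPartitionerClass`.

```java
public class NaturalKeyPartitionSketch {

    // Hypothetical composite key: the natural key plus the field used for secondary sorting.
    public static final class CompositeKey {
        public final String naturalKey;
        public final int secondary;

        public CompositeKey(String naturalKey, int secondary) {
            this.naturalKey = naturalKey;
            this.secondary = secondary;
        }
    }

    // Only the natural key decides the partition; the secondary field is ignored.
    // Masking with Integer.MAX_VALUE keeps the result non-negative for negative hash codes.
    public static int partitionFor(CompositeKey key, int numReducers) {
        return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int first = partitionFor(new CompositeKey("user1", 5), 4);
        int second = partitionFor(new CompositeKey("user1", 99), 4);
        // Same natural key, different secondary field: both go to the same partition.
        System.out.println(first == second); // prints "true"
    }
}
```

Because the secondary field never enters the hash, the sort and grouping steps on the reduce side see every record for a given natural key.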

Now, when the value grouping runs, it basically groups all the values associated with the natural key, right? So when it groups all the values for the natural key, what is the actual key that is sent to the reducer along with the set of sorted values? The natural key will have been associated with more than one entity (the second part of the composite key). Which composite key is sent to the reducer?


Recommended answer

This may be surprising, but each iteration of the values Iterable actually updates the contents of the key object too:

protected void reduce(K key, Iterable<V> values, Context context) {
    for (V value : values) {
        // key object contents will update for each iteration of this loop
    }
}

I know this works for the new mapreduce API; I haven't traced it for the old mapred API.

So, in answer to your question: all the composite keys are available. When reduce is called, the key holds the first key of the group in sort order, and its contents change as you iterate over the values.

EDIT: Some additional information as to how and why this works:

There are two comparators that the reducer uses to process the key/value pairs output by the map stage:

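Key sort comparator - this comparator is applied first and orders all the KV pairs. Conceptually, you are still dealing with serialized bytes at this stage.
Key group comparator - this comparator is responsible for determining when the previous key and the current key "differ", which marks the boundary between one group of KV pairs and the next.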
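To make the two roles concrete, here is a small plain-Java sketch. It is only a simulation on in-memory objects, not Hadoop code: `CompositeKey`, `SORT`, `GROUP`, and `sortAndGroup` are hypothetical names. In a real job the two comparators would be registered with `Job.setSortComparatorClass` and `Job.setGroupingComparatorClass`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TwoComparatorsSketch {

    // Hypothetical composite key: natural key plus the secondary sort field.
    public static final class CompositeKey {
        public final String naturalKey;
        public final int secondary;

        public CompositeKey(String naturalKey, int secondary) {
            this.naturalKey = naturalKey;
            this.secondary = secondary;
        }

        @Override
        public String toString() {
            return naturalKey + "#" + secondary;
        }
    }

    // Sort comparator: orders by natural key first, then by the secondary field.
    public static final Comparator<CompositeKey> SORT =
            Comparator.comparing((CompositeKey k) -> k.naturalKey)
                      .thenComparingInt(k -> k.secondary);

    // Grouping comparator: looks at the natural key only.
    public static final Comparator<CompositeKey> GROUP =
            Comparator.comparing((CompositeKey k) -> k.naturalKey);

    // Sorts with SORT, then starts a new group whenever GROUP says previous and current differ.
    public static List<List<CompositeKey>> sortAndGroup(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);

        List<List<CompositeKey>> groups = new ArrayList<>();
        CompositeKey previous = null;
        for (CompositeKey current : sorted) {
            if (previous == null || GROUP.compare(previous, current) != 0) {
                groups.add(new ArrayList<>()); // group boundary
            }
            groups.get(groups.size() - 1).add(current);
            previous = current;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = List.of(
                new CompositeKey("b", 2), new CompositeKey("a", 3),
                new CompositeKey("a", 1), new CompositeKey("b", 1));
        System.out.println(sortAndGroup(keys)); // prints [[a#1, a#3], [b#1, b#2]]
    }
}
```

Each inner list corresponds to one reduce call; the composite keys inside a group stay distinct and in sorted order.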

Under the hood, the references to the key and value objects never change; each call to next() on the values Iterable's Iterator advances the pointer in the underlying byte stream to the next KV pair. If the key grouper determines that the current key bytes and the previous key bytes compare as the same key, then the hasNext() method of the values iterator returns true, otherwise false. If true is returned, the bytes are deserialized into the existing Key and Value instances for consumption in your reduce method.
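The following plain-Java sketch imitates that behaviour so the key update can be observed directly. It is a simplified simulation, not the framework's actual code: `MutableKey`, `Record`, `GroupValues`, and `keysSeenDuringIteration` are hypothetical, a sorted in-memory list stands in for the serialized byte stream, and the grouping check is reduced to comparing natural keys.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class KeyReuseSketch {

    // Mutable key, standing in for a key object that is deserialized into repeatedly.
    public static final class MutableKey {
        public String naturalKey;
        public int secondary;

        @Override
        public String toString() {
            return naturalKey + "#" + secondary;
        }
    }

    // One already-sorted map output record (stand-in for serialized bytes).
    public static final class Record {
        public final String naturalKey;
        public final int secondary;
        public final String value;

        public Record(String naturalKey, int secondary, String value) {
            this.naturalKey = naturalKey;
            this.secondary = secondary;
            this.value = value;
        }
    }

    // Iterates over the values of one group and refreshes the shared key on every next().
    public static final class GroupValues implements Iterator<String> {
        private final List<Record> sorted;
        private final MutableKey key;
        private final String groupNaturalKey;
        private int pos;

        public GroupValues(List<Record> sorted, int start, MutableKey key) {
            this.sorted = sorted;
            this.pos = start;
            this.key = key;
            this.groupNaturalKey = sorted.get(start).naturalKey;
            load(sorted.get(start)); // the first key of the group is visible before iterating
        }

        private void load(Record r) {
            key.naturalKey = r.naturalKey; // same key instance, new contents
            key.secondary = r.secondary;
        }

        @Override
        public boolean hasNext() {
            // Grouping check: does the next record still belong to this group?
            return pos < sorted.size() && sorted.get(pos).naturalKey.equals(groupNaturalKey);
        }

        @Override
        public String next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            Record r = sorted.get(pos++);
            load(r);
            return r.value;
        }
    }

    // Plays the role of a reduce() body: records what the key looks like at each iteration.
    public static List<String> keysSeenDuringIteration(List<Record> sorted) {
        MutableKey key = new MutableKey();
        Iterator<String> values = new GroupValues(sorted, 0, key);
        List<String> seen = new ArrayList<>();
        while (values.hasNext()) {
            String value = values.next();
            seen.add(key + "=" + value);
        }
        return seen;
    }

    public static void main(String[] args) {
        List<Record> sorted = List.of(
                new Record("a", 1, "x"), new Record("a", 2, "y"), new Record("b", 1, "z"));
        System.out.println(keysSeenDuringIteration(sorted)); // prints [a#1=x, a#2=y]
    }
}
```

One practical consequence of this reuse: if you need a key or value after the loop moves on, copy it rather than keeping the reference.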
