Propagating custom configuration values in Hadoop
Is there any way to set and (later) get a custom configuration object in Hadoop, during Map/Reduce?
For example, consider an application that preprocesses a large file and dynamically determines some characteristics of that file. Furthermore, assume that those characteristics are saved in a custom Java object (e.g., a Properties object, but not exclusively, since some values may not be strings) and are subsequently needed by every map and reduce task.
How could the application "propagate" this configuration, so that each mapper and reducer function can access it, when needed?
One approach could be to use the set(String, String) method of the JobConf class and, for instance, pass the configuration object serialized as a JSON string via the second parameter. However, this may be too much of a hack, and each Mapper and Reducer would still have to access the appropriate JobConf instance anyway (e.g., following an approach like the one suggested in an earlier question).
Recommended answer
Unless I'm missing something, if you have a Properties object containing every property you need in your M/R job, you simply need to write the contents of the Properties object to the Hadoop Configuration object. For example, something like this:
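    import java.util.Map.Entry;
    import java.util.Properties;
    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    Properties params = getParameters(); // do whatever you need here to create your object
    for (Entry<Object, Object> entry : params.entrySet()) {
        // The casts assume every key and value in params is a String
        String propName = (String) entry.getKey();
        String propValue = (String) entry.getValue();
        conf.set(propName, propValue);
    }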
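One caveat with the cast-based loop: a Properties object can technically hold non-String keys or values, and the casts would then fail with a ClassCastException. Below is a minimal sketch of a more defensive copy that uses only the JDK. The class and method names (StringProps, stringEntries) are made up for illustration and are not part of Hadoop.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Hypothetical helper (not a Hadoop class): keeps only String-to-String entries. */
final class StringProps {
    private StringProps() {}

    static Map<String, String> stringEntries(Properties params) {
        // TreeMap gives a stable, sorted iteration order.
        Map<String, String> out = new TreeMap<>();
        // stringPropertyNames() skips entries whose key or value is not a String.
        for (String name : params.stringPropertyNames()) {
            out.put(name, params.getProperty(name));
        }
        return out;
    }
}
```

Each entry of the returned map can then be passed to conf.set(name, value) as in the loop above. Non-String values are silently dropped here, so they would need separate handling.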
Then, inside your M/R job (provided the job was created from that Configuration), you can use the Context object to get back your Configuration in both the mapper (the map function) and the reducer (the reduce function), like this:
    public void map(MD5Hash key, OverlapDataWritable value, Context context) {
        Configuration conf = context.getConfiguration();
        String someProperty = conf.get("something");
        ....
    }
Note that the Context, and therefore the Configuration object, is also available in the setup and cleanup methods, which is useful if you need to do some initialization.
It's also worth mentioning that you could probably call the addResource method of the Configuration object to add your properties directly as an InputStream or a file, but I believe this has to be an XML configuration like the regular Hadoop XML configs, so that might just be overkill.
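For reference, the regular Hadoop XML configs use a configuration root element containing property elements, each with a name and a value child. The following is a rough sketch of producing such a document from a map of String properties with the JDK's StAX writer. The class and method names (HadoopXml, toConfigurationXml) are hypothetical and not part of Hadoop.

```java
import java.io.StringWriter;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical helper (not a Hadoop class): renders properties as Hadoop-style XML. */
final class HadoopXml {
    private HadoopXml() {}

    static String toConfigurationXml(Map<String, String> props) throws XMLStreamException {
        StringWriter sink = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(sink);
        xml.writeStartDocument();
        xml.writeStartElement("configuration");
        // Sort the keys so the output is deterministic.
        for (Map.Entry<String, String> e : new TreeMap<>(props).entrySet()) {
            xml.writeStartElement("property");
            xml.writeStartElement("name");
            xml.writeCharacters(e.getKey());   // writeCharacters escapes &, < and >
            xml.writeEndElement();
            xml.writeStartElement("value");
            xml.writeCharacters(e.getValue());
            xml.writeEndElement();
            xml.writeEndElement();             // closes property
        }
        xml.writeEndElement();                 // closes configuration
        xml.writeEndDocument();
        xml.close();
        return sink.toString();
    }
}
```

The resulting string could then be wrapped in a ByteArrayInputStream and handed to addResource. Whether that is worth it compared with calling conf.set in a loop is a judgment call, as noted above.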
EDIT: In the case of non-String objects, I would advise using serialization: you can serialize your objects and then convert them to Strings (probably encoding them with, for example, Base64, as I'm not sure what would happen if you have unusual characters), and then on the mapper/reducer side deserialize the objects from the Strings you get from the properties inside the Configuration.
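The serialize-then-Base64 idea can be sketched with the JDK alone. This is one possible implementation, assuming the object implements java.io.Serializable. The class and method names (ObjectStrings, encode, decode) are hypothetical, and java.util.Base64 requires Java 8 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;

/** Hypothetical helper (not a Hadoop class): object <-> Base64 String round trip. */
final class ObjectStrings {
    private ObjectStrings() {}

    /** Serializes obj with Java serialization and encodes the bytes as Base64 text. */
    static String encode(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    /** Reverses encode(): decodes the Base64 text and deserializes the object. */
    static Object decode(String encoded) throws IOException, ClassNotFoundException {
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        }
    }
}
```

In the driver this would be used as conf.set("my.key", ObjectStrings.encode(obj)), and in the mapper or reducer as ObjectStrings.decode(context.getConfiguration().get("my.key")), where "my.key" is a placeholder name. The object's class has to be on the task classpath for deserialization to work.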
Another approach would be to use the same serialization technique, but write the result to HDFS instead and then add those files to the DistributedCache. It sounds like a bit of overkill, but it would probably work.