Propagating custom configuration values in Hadoop

2022-01-13 00:00:00 properties configuration hadoop mapreduce java

Is there any way to set and (later) get a custom configuration object in Hadoop during Map/Reduce?

For example, assume an application that preprocesses a large file and dynamically determines some characteristics of that file. Furthermore, assume that those characteristics are saved in a custom Java object (e.g., a Properties object, but not exclusively, since some values may not be strings) and are subsequently needed by each of the map and reduce tasks.

How could the application "propagate" this configuration so that each mapper and reducer function can access it when needed?

One approach could be to use the set(String, String) method of the JobConf class and, for instance, pass the configuration object serialized as a JSON string via the second parameter. But this may be too much of a hack, and the appropriate JobConf instance must then be accessed by each Mapper and Reducer anyway (e.g., following an approach like the one suggested in an earlier question).

Recommended answer

Unless I'm missing something, if you have a Properties object containing every property you need in your M/R job, you simply need to write the content of the Properties object to the Hadoop Configuration object. For example, something like this:

    Configuration conf = new Configuration();
    Properties params = getParameters(); // do whatever you need here to create your object
    for (Entry<Object, Object> entry : params.entrySet()) {
        String propName = (String) entry.getKey();
        String propValue = (String) entry.getValue();
        conf.set(propName, propValue);
    }
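The String casts above throw a ClassCastException if the Properties object holds a non-String key or value, which the question mentions as a possibility. A minimal sketch of one way around that, assuming a hypothetical helper named PropertyFlattener (it is not part of Hadoop) and assuming the values have a usable toString() form:

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: turns a Properties object into
// String pairs that can each be passed to Configuration.set(String, String).
final class PropertyFlattener {
    private PropertyFlattener() {}

    static Map<String, String> flatten(Properties props) {
        // TreeMap only gives a stable, sorted iteration order.
        Map<String, String> flat = new TreeMap<>();
        for (Map.Entry<Object, Object> entry : props.entrySet()) {
            // String.valueOf avoids the ClassCastException that a plain
            // (String) cast throws for non-String keys or values.
            flat.put(String.valueOf(entry.getKey()), String.valueOf(entry.getValue()));
        }
        return flat;
    }
}
```

Each entry of the returned map can then be written with conf.set(entry.getKey(), entry.getValue()). This only helps for values whose string form can be parsed back on the task side (numbers, booleans and the like); richer objects need real serialization.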

Then inside your M/R job, you can use the Context object to get your Configuration back in both the mapper (the map function) and the reducer (the reduce function), like this:

    public void map(MD5Hash key, OverlapDataWritable value, Context context) {
        Configuration conf = context.getConfiguration();
        String someProperty = conf.get("something");
        ....
    }

Note that the Context is also available in the setup and cleanup methods, so you can read the Configuration there as well, which is useful if you need to do some initialization.

It's also worth mentioning that you could probably call the addResource method of the Configuration object directly to add your properties as an InputStream or a file, but I believe this has to be an XML configuration like the regular Hadoop XML configs, so that might just be overkill.
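For reference, the regular Hadoop XML configs use a configuration root element with one property element (name plus value) per setting. A rough sketch of rendering a Properties object into that layout, assuming a hypothetical helper named HadoopXmlRenderer (not part of Hadoop) and assuming only String-valued properties matter:

```java
import java.util.Properties;
import java.util.TreeSet;

// Hypothetical helper, not part of Hadoop: renders String-valued properties
// in the <configuration>/<property> layout used by Hadoop's XML config files.
final class HadoopXmlRenderer {
    private HadoopXmlRenderer() {}

    static String render(Properties props) {
        StringBuilder xml = new StringBuilder("<?xml version=\"1.0\"?>\n<configuration>\n");
        // stringPropertyNames() skips entries whose key or value is not a String;
        // the TreeSet only makes the output order deterministic.
        for (String name : new TreeSet<>(props.stringPropertyNames())) {
            xml.append("  <property>\n")
               .append("    <name>").append(escape(name)).append("</name>\n")
               .append("    <value>").append(escape(props.getProperty(name))).append("</value>\n")
               .append("  </property>\n");
        }
        return xml.append("</configuration>\n").toString();
    }

    // Escapes the characters that would otherwise break element content.
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

The resulting string could be wrapped in a ByteArrayInputStream and handed to addResource. Compared with calling set for each property this is clearly more work, which matches the "overkill" assessment above.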

EDIT: For non-String objects, I would advise using serialization. You can serialize your objects and then convert them to Strings (probably encoding them with, for example, Base64, as I'm not sure what would happen if you have unusual characters), and then on the mapper/reducer side deserialize the objects from the Strings you get from the properties inside Configuration.
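The serialize-then-Base64 idea can be sketched with the standard library alone. The class name ConfigCodec and its methods are hypothetical, and the sketch assumes the object implements java.io.Serializable and that its class is on the classpath of the tasks:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Base64;

// Hypothetical helper, not part of Hadoop: converts a Serializable object to a
// Base64 String (safe to store as a configuration value) and back again.
final class ConfigCodec {
    private ConfigCodec() {}

    // Driver side: Java-serialize the object, then Base64-encode the bytes.
    static String encode(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Task side: Base64-decode the String, then deserialize the object.
    static Object decode(String encoded) {
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("class of the encoded object is not on the classpath", e);
        }
    }
}
```

On the driver this would be used as conf.set("my.object", ConfigCodec.encode(obj)), and in the mapper or reducer as ConfigCodec.decode(context.getConfiguration().get("my.object")), where "my.object" is a made-up property name. Since the configuration travels with the job to every task, this is probably best kept for small objects.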

Another approach would be to use the same serialization technique, but write the result to HDFS instead and then add those files to the DistributedCache. It sounds a bit like overkill, but it would probably work.
