How to specify mapred configurations and Java options with a custom jar in the CLI using Amazon's EMR?

2022-01-14 00:00:00 hadoop mapreduce java elastic-map-reduce emr

I would like to know how to specify MapReduce configurations such as mapred.task.timeout, mapred.min.split.size, etc. when running a streaming job using a custom jar.

When we run with external scripting languages such as Ruby or Python, we can specify these configurations as follows:

ruby elastic-mapreduce -j --stream --step-name "mystream" --jobconf mapred.task.timeout=0 --jobconf mapred.min.split.size=52880 --mapper s3://somepath/mapper.rb --reducer s3://somepath/reducer.rb --input s3://somepath/input --output s3://somepath/output

I tried the following two ways, but neither of them worked:

ruby elastic-mapreduce --jobflow --jar s3://somepath/job.jar --arg s3://somepath/input --arg s3://somepath/output --args -m,mapred.min.split.size=52880 -m,mapred.task.timeout=0

ruby elastic-mapreduce --jobflow --jar s3://somepath/job.jar --arg s3://somepath/input --arg s3://somepath/output --args -jobconf,mapred.min.split.size=52880 -jobconf,mapred.task.timeout=0

I would also like to know how to pass Java options to a streaming job using a custom jar in EMR. When running locally on Hadoop, we can pass them as follows:

bin/hadoop jar job.jar input_path output_path -D<some_java_parameter>=<some_value>

Recommended answer

I believe if you want to set these on a per-job basis, then you need to

A) for custom jars, pass them into your jar as arguments and process them yourself. I believe this can be automated as follows:

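// requires: org.apache.hadoop.conf.Configuration, org.apache.hadoop.util.GenericOptionsParser
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Let the parser apply generic options (e.g. -D key=value) to conf;
    // whatever is left over are the job's own arguments.
    args = new GenericOptionsParser(conf, args).getRemainingArgs();
    //....
}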

Then create the job in this manner (I haven't verified that it works, though):

> elastic-mapreduce --jar s3://mybucket/mycode.jar --args "-D,mapred.reduce.tasks=0" --arg s3://mybucket/input --arg s3://mybucket/output
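To make the command above easier to read, here is a minimal sketch of the argument list the jar's main method would presumably receive. It assumes that the CLI splits the --args value on commas and appends each --arg value in the order given; the class and the expand helper are hypothetical and only illustrate that assumption, they are not part of the elastic-mapreduce tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ArgsExpansion {
    // Hypothetical helper: splits a comma-separated --args value and then
    // appends the individual --arg values, preserving order.
    static List<String> expand(String commaSeparated, String... singleArgs) {
        List<String> argv = new ArrayList<>(Arrays.asList(commaSeparated.split(",")));
        argv.addAll(Arrays.asList(singleArgs));
        return argv;
    }

    public static void main(String[] args) {
        List<String> argv = expand("-D,mapred.reduce.tasks=0",
                "s3://mybucket/input", "s3://mybucket/output");
        // prints [-D, mapred.reduce.tasks=0, s3://mybucket/input, s3://mybucket/output]
        System.out.println(argv);
    }
}
```

Under that assumption, "-D" and "mapred.reduce.tasks=0" arrive as two separate arguments ahead of the input and output paths, which is the "-D key=value" form.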

The GenericOptionsParser should automatically transfer the -D parameters into Hadoop's job configuration (as far as I know, -jobconf is an option of the streaming jar rather than of GenericOptionsParser). More details: http://hadoop.apache.org/docs/r0.20.0/api/org/apache/hadoop/util/GenericOptionsParser.html
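For intuition about what "process them yourself" amounts to, here is a self-contained sketch of the -D handling only. It is a simplified stand-in, not Hadoop's implementation: a plain Map plays the role of the Configuration, the class name DOptionSketch and its parse method are hypothetical, and it accepts -D anywhere in the argument list, which may be more lenient than the real parser.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DOptionSketch {
    // Copies "-D key=value" and "-Dkey=value" pairs into conf and returns
    // the arguments that were not consumed (the job's own arguments).
    static String[] parse(Map<String, String> conf, String[] args) {
        List<String> remaining = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String pair = null;
            if (arg.equals("-D") && i + 1 < args.length) {
                pair = args[++i];          // form: -D key=value
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                pair = arg.substring(2);   // form: -Dkey=value
            }
            if (pair == null) {
                remaining.add(arg);        // not a -D option, keep it
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq > 0) {
                conf.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
            // a pair without '=' is silently dropped in this sketch
        }
        return remaining.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        String[] rest = parse(conf, new String[] {
            "-D", "mapred.reduce.tasks=0", "s3://mybucket/input", "s3://mybucket/output"
        });
        System.out.println(conf);          // prints {mapred.reduce.tasks=0}
        System.out.println(rest.length);   // prints 2
    }
}
```

In a real job the remaining arguments would then be read as the input and output paths, while the collected settings would already be in the Configuration used to build the job.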

B) for the Hadoop streaming jar, you likewise just pass the configuration change to the command:

> elastic-mapreduce --jobflow j-ABABABABA --stream --jobconf mapred.task.timeout=600000 --mapper s3://mybucket/mymapper.sh --reducer s3://mybucket/myreducer.sh --input s3://mybucket/input --output s3://mybucket/output --jobconf mapred.reduce.tasks=0

More details: https://forums.aws.amazon.com/thread.jspa?threadID=43872 and elastic-mapreduce --help
