Python Hadoop Streaming error "ERROR streaming.StreamJob: Job not Successful!" and stack trace: ExitCodeException exitCode=134

Problem Description

I am trying to run a Python script on a Hadoop cluster using Hadoop Streaming for sentiment analysis. The same script runs correctly on my local machine and produces output.
To run it on the local machine I use this command:

$ cat /home/MB/analytics/Data/input/* | ./new_mapper.py

And to run it on the Hadoop cluster I use the command below:

$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.5.0-mr1-cdh5.2.0.jar -mapper "python $PWD/new_mapper.py" -reducer "$PWD/new_reducer.py" -input /user/hduser/Test_04012015_Data/input/* -output /user/hduser/python-mr/out-mr-out

The sample code of my script is:

#!/usr/bin/env python
import re
import sys

# classifier, feature_select, referenceSets and testSets are defined
# elsewhere in the full script; only the relevant part is shown here.

def main(argv):
    i = 0  # row counter, used as an id for the reference/test sets
    for line in sys.stdin:
        line = line.split(',')
        # strip punctuation from the text column
        t_text = re.sub(r'[?|$|.|!|,|;]', r'', line[7])
        words = re.findall(r"[\w']+", t_text.rstrip())
        predicted = classifier.classify(feature_select(words))
        i = i + 1
        referenceSets[predicted].add(i)
        testSets[predicted].add(i)
        print line[7] + '\t' + predicted

if __name__ == "__main__":
    main(sys.argv)
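As an aside, the character class used in the `findall` call matters: `[w']+` is a class containing only the literal letter `w` and an apostrophe, whereas `[\w']+` matches runs of word characters, which is almost certainly what a tokenizer intends. A minimal illustration:

```python
import re

text = "hello world"

# [w']+ matches only the literal letter w and apostrophes,
# so it picks up just the 'w' in "world":
broken = re.findall(r"[w']+", text)   # ['w']

# [\w']+ matches runs of word characters and apostrophes --
# the usual quick-and-dirty tokenizer:
fixed = re.findall(r"[\w']+", text)   # ['hello', 'world']
```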

The stack trace of the exception is:

    15/04/22 12:55:14 INFO mapreduce.Job: Task Id : attempt_1429611942931_0010_m_000001_0, Status : FAILED
    Error: java.io.IOException: Stream closed at java.lang.ProcessBuilder$NullOutputStream.write(ProcessBuilder.java:434)
    ...

    Exit code: 134
    Exception message: /bin/bash: line 1:  1691 Aborted (core dumped) /usr/lib/jvm/java-7-oracle-cloudera/bin/java
    -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Djava.net.preferIPv4Stack=true -Xmx525955249
    -Djava.io.tmpdir=/yarn/nm/usercache/hduser/appcache/application_1429611942931_0010/container_1429611942931_0010_01_000016/tmp
    -Dlog4j.configuration=container-log4j.properties
    -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1429611942931_0010/container_1429611942931_0010_01_000016 -Dyarn.app.container.log.filesize=0
    -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 192.168.0.122 48725 attempt_1429611942931_0010_m_000006_1 16 > /var/log/hadoop-yarn/container/application_1429611942931_0010/container_1429611942931_0010_01_000016/stdout 2> /var/log/hadoop-yarn/container/application_1429611942931_0010/container_1429611942931_0010_01_000016/stderr
    ....

    15/04/22 12:55:47 ERROR streaming.StreamJob: Job not Successful!
    Streaming Command Failed!

I tried to view the logs, but in Hue it shows me this error. Please suggest what is going wrong.


Solution

It looks like you forgot to add the file new_mapper.py to your job.

Basically, your job tries to run the python script new_mapper.py, but this script is missing on the server running your mapper.

You must add this file to your job using the option -file <local_path_to_your_file>.

See documentation and example here: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html#Streaming_Command_Options
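A corrected invocation might look like the following sketch. The jar and HDFS paths are taken from the question; the added `-file` options ship both scripts to the task nodes, after which they can be referenced by bare filename in `-mapper` and `-reducer` (not verified against this particular cluster):

```shell
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.5.0-mr1-cdh5.2.0.jar \
    -file "$PWD/new_mapper.py" \
    -file "$PWD/new_reducer.py" \
    -mapper new_mapper.py \
    -reducer new_reducer.py \
    -input /user/hduser/Test_04012015_Data/input/* \
    -output /user/hduser/python-mr/out-mr-out
```

Make sure both scripts are executable (`chmod +x`) and start with a `#!/usr/bin/env python` shebang, since the task nodes execute them directly.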
