Hadoop Streaming: Mapper 'wrapping' a binary executable

2022-01-09 00:00:00 python binary hadoop streaming mapreduce

Problem description

I have a pipeline that I currently run on a large university computer cluster. For publication purposes I'd like to convert it into MapReduce format so that anyone with access to a Hadoop cluster, such as one on Amazon Web Services (AWS), can run it. The pipeline currently consists of a series of Python scripts that wrap different binary executables and manage the input and output using the Python subprocess and tempfile modules. Unfortunately I didn't write the binary executables, and many of them either don't take STDIN or don't emit STDOUT in a 'usable' fashion (e.g., they only write it to files). These problems are why I've wrapped most of them in Python.

So far I've been able to modify my Python code so that I have a mapper and a reducer that I can run on my local machine in the standard 'test format':

$ cat data.txt | mapper.py | reducer.py

The mapper formats each line of data the way the binary it wraps wants it, sends the text to the binary using subprocess.Popen (this also allows me to mask a lot of spurious STDOUT), then collects the STDOUT I want and formats it into lines of text appropriate for the reducer. The problems arise when I try to replicate the command on a local Hadoop install. I can get the mapper to execute, but it gives an error that suggests it can't find the binary executable:

File "/Users/me/Desktop/hadoop-0.21.0/./phyml.py", line 69, in <module>
    main()
File "/Users/me/Desktop/hadoop-0.21.0/./mapper.py", line 66, in main
    phyml(None)
File "/Users/me/Desktop/hadoop-0.21.0/./mapper.py", line 46, in phyml
    ft = Popen(cli_parts, stdin=PIPE, stderr=PIPE, stdout=PIPE)
File "/Library/Frameworks/Python.framework/Versions/6.1/lib/python2.6/subprocess.py", line 621, in __init__
    errread, errwrite)
File "/Library/Frameworks/Python.framework/Versions/6.1/lib/python2.6/subprocess.py", line 1126, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied

My Hadoop command looks like the following:

./bin/hadoop jar /Users/me/Desktop/hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar \
-input /Users/me/Desktop/Code/AWS/temp/data.txt \
-output /Users/me/Desktop/aws_test \
-mapper mapper.py \
-reducer reducer.py \
-file /Users/me/Desktop/Code/AWS/temp/mapper.py \
-file /Users/me/Desktop/Code/AWS/temp/reducer.py \
-file /Users/me/Desktop/Code/AWS/temp/binary

As I noted above, it looks to me like the mapper isn't aware of the binary. Perhaps it's not being sent to the compute node? Unfortunately I can't really tell what the problem is. Any help would be greatly appreciated. It would be particularly nice to see some Hadoop Streaming mappers/reducers written in Python that wrap binary executables. I can't imagine I'm the first one to try this! In fact, here is another post asking essentially the same question, but it hasn't been answered yet:

Hadoop/Elastic Map Reduce with binary executable?


Solution

After much googling (etc.) I figured out how to include executable binaries/scripts/modules that are accessible to your mappers/reducers. The trick is to upload all your files to Hadoop first.

$ bin/hadoop dfs -copyFromLocal /local/file/system/module.py module.py

Then you need to format your streaming command like the following template:

$ ./bin/hadoop jar /local/file/system/hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar \
-file /local/file/system/data/data.txt \
-file /local/file/system/mapper.py \
-file /local/file/system/reducer.py \
-cacheFile hdfs://localhost:9000/user/you/module.py#module.py \
-input data.txt \
-output output/ \
-mapper mapper.py \
-reducer reducer.py \
-verbose
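If you launch jobs from a script, the same template can be assembled programmatically. The following is only a sketch: `streaming_command` is a hypothetical helper name, and it uses nothing beyond the flags already shown in the template above.

```python
def streaming_command(streaming_jar, mapper, reducer, input_path, output_path,
                      files=(), cache_files=(), hadoop="./bin/hadoop"):
    """Return the argument list for a Hadoop Streaming job (hypothetical helper)."""
    cmd = [hadoop, "jar", streaming_jar]
    # Local files shipped with the job (mapper, reducer, data, ...).
    for path in files:
        cmd += ["-file", path]
    # Files already uploaded to HDFS; the part after '#' is the link name
    # the task sees in its working directory.
    for uri, link_name in cache_files:
        cmd += ["-cacheFile", "%s#%s" % (uri, link_name)]
    cmd += ["-input", input_path, "-output", output_path,
            "-mapper", mapper, "-reducer", reducer]
    return cmd
```

The returned list can be passed straight to subprocess.call, which avoids shell quoting problems with paths that contain spaces.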

If you're linking a Python module, you'll need to add the following code to your mapper/reducer scripts:

import sys
# Files passed with -cacheFile are linked into the task's working directory.
sys.path.append('.')
import module

If you're accessing a binary via subprocess, your command should look something like this:

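import shlex
from subprocess import Popen, PIPE

cli = "./binary %s" % (argument)  # argument: the command-line argument(s) for the binary
cli_parts = shlex.split(cli)
mp = Popen(cli_parts, stdin=PIPE, stderr=PIPE, stdout=PIPE)
output = mp.communicate()[0]  # STDOUT of the binary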
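One more thing worth checking: the traceback in the question reports Errno 13 (permission denied), which usually means the file was found but could not be executed, so it may be that the binary arrived on the task node without its execute bit. Below is a rough sketch of a mapper that guards against that before launching the binary. It assumes Python 3 and a binary that reads STDIN and writes STDOUT; the names `ensure_executable`, `run_binary` and `mapper` are made up for illustration.

```python
import os
import stat
import sys
from subprocess import Popen, PIPE


def ensure_executable(path):
    """Add the owner-execute bit if it is missing (hypothetical helper)."""
    mode = os.stat(path).st_mode
    if not mode & stat.S_IXUSR:
        os.chmod(path, mode | stat.S_IXUSR)


def run_binary(cli_parts, text):
    """Feed text to the wrapped program on STDIN and return its STDOUT."""
    proc = Popen(cli_parts, stdin=PIPE, stdout=PIPE, stderr=PIPE)
    out, err = proc.communicate(text.encode("utf-8"))
    if proc.returncode != 0:
        # Pass the program's STDERR along so the failure is visible.
        sys.stderr.write(err.decode("utf-8", "replace"))
        raise RuntimeError("%s exited with status %d"
                           % (cli_parts[0], proc.returncode))
    return out.decode("utf-8")


def mapper(lines, cli_parts):
    """Yield one tab-separated key/value line per non-empty input line."""
    ensure_executable(cli_parts[0])
    for line in lines:
        line = line.rstrip("\n")
        if line:
            result = run_binary(cli_parts, line + "\n").strip()
            yield "%s\t%s" % (line, result)
```

In an actual mapper script you would loop over `mapper(sys.stdin, ["./binary"])` and print each record. Note that `ensure_executable` expects a path to a file, not a bare command name looked up on PATH.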

Hope this helps.
