Streaming or custom Jar in Hadoop

2022-01-13 00:00:00 python hadoop streaming mapreduce java

I'm running a streaming job in Hadoop (on Amazon's EMR) with the mapper and reducer written in Python. I want to know what speed gains I could expect if I implemented the same mapper and reducer in Java (or used Pig).

In particular, I'm looking for people's experiences of migrating from streaming to custom jar deployments and/or Pig, as well as documents containing benchmark comparisons of these options. I found this question, but the answers are not specific enough for me. I'm not looking for comparisons between Java and Python, but for comparisons between custom jar deployment in Hadoop and Python-based streaming.

My job reads NGram counts from the Google Books Ngram dataset and computes aggregate measures. CPU utilization on the compute nodes seems to be close to 100%. (I would also like to hear your opinions on the difference between having a CPU-bound and an IO-bound job.)

Thanks!

Australia

Recommended answer

Why consider deploying custom jars?

  • Ability to use more powerful custom input formats. For streaming jobs, even if you use pluggable input/output as mentioned here, the keys and values passed to your mapper/reducer are limited to text/strings. You would need to spend some CPU cycles converting them to your required types.
  • I've also heard that Hadoop can be smart about reusing JVMs across multiple jobs, which won't be possible when streaming (I can't confirm this).
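To make the text-only limitation concrete, here is a minimal sketch of a streaming-style mapper and reducer in Python. It assumes tab-separated input records of the form ngram, year, match_count, followed by further columns; that layout, and the function names `map_lines` and `reduce_lines`, are illustrative assumptions rather than anything taken from the question. The point is only that every count arrives as a string and has to be parsed with `int()` before any arithmetic can happen.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: parse each text record and emit 'ngram<TAB>match_count'."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed records
        ngram, match_count = fields[0], fields[2]
        # Streaming hands over text only, so the count is parsed here,
        # once per record, on the CPU.
        try:
            count = int(match_count)
        except ValueError:
            continue  # skip records whose count is not an integer
        yield f"{ngram}\t{count}"


def reduce_lines(lines):
    """Reducer: sum counts per key; input is assumed to be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for ngram, group in groupby(pairs, key=lambda kv: kv[0]):
        # Values are text again on the reduce side and are re-parsed.
        total = sum(int(value) for _key, value in group)
        yield f"{ngram}\t{total}"
```

In a real streaming job each function would be fed from `sys.stdin` and its results printed to standard output, with Hadoop doing the sort between the two steps. A custom jar could instead receive typed keys and values directly from its input format and skip the repeated string parsing.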

When to use Pig?

  • Pig Latin is pretty cool and is a much higher-level data flow language than Java, Python, or Perl. Your Pig scripts will tend to be much smaller than an equivalent task written in any of the other languages.

When NOT to use Pig?

  • Pig is pretty good at figuring out by itself how many maps/reduces to run, when to spawn a map or a reduce, and a myriad of such things. However, if you are dead sure how many maps/reduces you need, you have some very specific computation to do within your map/reduce functions, and you are very particular about performance, then you should consider deploying your own jars. This link shows that Pig can lag native Hadoop M/R in performance. You could also look at writing your own Pig UDFs to isolate some compute-intensive function (and possibly even use JNI to call some native C/C++ code within the UDF).

A note on IO-bound and CPU-bound jobs:

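  • Technically speaking, the whole point of Hadoop and MapReduce is to parallelize compute-intensive functions, so I'd presume your map and reduce jobs are compute-intensive. The only time the Hadoop subsystem is busy doing IO is between the map and reduce phases, when data is sent across the network. IO also comes into play if you have a large amount of data and have manually configured too few maps and reduces, resulting in spills to disk (although too many tasks will result in too much time spent starting/stopping JVMs and in too many small files). A streaming job also has the additional overhead of starting a Python/Perl VM and of copying data back and forth between the JVM and the scripting VM.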
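One rough way to check locally whether a piece of mapper logic is CPU-bound or IO-bound is to compare the CPU time it consumes with the wall-clock time it takes. The sketch below does this with Python's standard `time` module; the helper names `cpu_fraction`, `parse_and_sum`, and `wait_for_io` are hypothetical stand-ins, not part of Hadoop or of the original job. A ratio near 1 suggests the work is limited by the CPU, while a ratio near 0 suggests it spends most of its time waiting.

```python
import time


def cpu_fraction(func, *args):
    """Return CPU time divided by wall-clock time for one call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0


def parse_and_sum(n):
    """Stand-in for mapper work: format and re-parse n integers as text."""
    return sum(int(str(i)) for i in range(n))


def wait_for_io(seconds):
    """Stand-in for a blocked read: sleeping consumes almost no CPU time."""
    time.sleep(seconds)
```

Calling `cpu_fraction(parse_and_sum, 300000)` should give a clearly higher ratio than `cpu_fraction(wait_for_io, 0.2)`. The exact numbers depend on the machine and its load, so treat them as an indication only.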
