Should I learn/use MapReduce, or some other type of parallelization, for this task?

Problem Description

After talking with a friend of mine from Google, I'd like to implement some kind of Job/Worker model for updating my dataset.

This dataset mirrors a 3rd party service's data, so, to do the update, I need to make several remote calls to their API. I think a lot of time will be spent waiting for responses from this 3rd party service. I'd like to speed things up, and make better use of my compute hours, by parallelizing these requests and keeping many of them open at once, as they wait for their individual responses.
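
To make "keeping many of them open at once" concrete, here is a minimal sketch of I/O-bound fan-out with a thread pool in Python; the URLs and response shapes are placeholders, not the real third-party API:

```python
# Minimal sketch: fan out I/O-bound API calls across a thread pool so many
# requests wait on the network at once. URLs and responses are placeholders.
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

def fetch(url):
    """One blocking request; the thread sleeps while waiting on the network."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return url, resp.json()

def fetch_many(urls, max_workers=50):
    """Keep up to `max_workers` requests in flight and collect the results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch, url) for url in urls]
        for future in as_completed(futures):
            url, body = future.result()
            results[url] = body
    return results
```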

Before I explain my specific dataset and get into the problem, I'd like to clarify what answers I'm looking for:

  1. Is this a flow that would be well suited to parallelizing with MapReduce?
  2. If yes, would this be cost effective to run on Amazon's MapReduce module, which bills by the hour and rounds up to the next hour when the job completes? (I'm not sure exactly what counts as a "Job", so I don't know exactly how I'll be billed.)
  3. If not, is there another system/pattern I should use? And is there a library that will help me do this in Python (on AWS, using EC2 + EBS)?
  4. Are there any problems you see with the way I've designed this job flow?

OK, now for the details:

The dataset consists of users who have favorite items and who follow other users. The aim is to be able to update each user's queue -- the list of items the user will see when they load the page, based on the favorite items of the users she follows. But, before I can crunch the data and update a user's queue, I need to make sure I have the most up-to-date data, which is where the API calls come in.

There are two calls I can make (thin wrappers for both are sketched after this list):

  • Get Followed Users -- Which returns all the users being followed by the requested user, and
  • Get Favorite Items -- Which returns all the favorite items of the requested user.
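
For reference, thin wrappers around those two calls might look like the sketch below; the base URL, paths, and response shapes are all hypothetical, since the actual third-party API isn't named:

```python
# Hypothetical wrappers for the two remote calls; endpoint paths are assumptions.
import requests

API_BASE = "https://api.example-service.com"  # placeholder for the 3rd-party API

def get_followed_users(user_id):
    """Get Followed Users: all users the requested user follows."""
    resp = requests.get(f"{API_BASE}/users/{user_id}/following", timeout=30)
    resp.raise_for_status()
    return resp.json()  # assumed to be a list of user ids

def get_favorite_items(user_id):
    """Get Favorite Items: all favorite items of the requested user."""
    resp = requests.get(f"{API_BASE}/users/{user_id}/favorites", timeout=30)
    resp.raise_for_status()
    return resp.json()  # assumed to be a list of item records
```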

After I call get followed users for the user being updated, I need to update the favorite items for each user being followed. Only when all of the favorites are returned for all the users being followed can I start processing the queue for that original user. This flow looks like:

The jobs in this flow are (a single-process sketch of these handlers follows the list):

  • Start Updating Queue for user -- kicks off the process by fetching the users followed by the user being updated, storing them, and then creating Get Favorites jobs for each user.
  • Get Favorites for user -- Requests, and stores, a list of favorites for the specified user, from the 3rd party service.
  • Calculate New Queue for user -- Processes a new queue, now that all the data has been fetched, and then stores the results in a cache which is used by the application layer.
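
Here is that single-process sketch of the three handlers. It assumes the hypothetical API wrappers from earlier, uses an in-memory queue as a stand-in for a real job queue, and tracks a pending-fetch counter so the fan-in point (all favorites fetched) is detected:

```python
# Single-process sketch of the three job types. get_followed_users and
# get_favorite_items are the hypothetical wrappers sketched earlier. A real
# multi-worker setup would need a shared queue (e.g. SQS) and an atomic
# counter (e.g. in Redis or a DB) instead of these in-memory dicts.
import queue

job_queue = queue.Queue()
pending = {}            # user_id -> Get Favorites jobs still outstanding
fetched_favorites = {}  # followee_id -> that user's favorite items
queue_cache = {}        # user_id -> computed queue, read by the app layer

def start_updating_queue(user_id):
    """Job 1: fetch and store followed users, then fan out one job each."""
    followed = get_followed_users(user_id)          # remote call
    pending[user_id] = len(followed)
    if not followed:                                # nothing to wait for
        job_queue.put(("calculate_new_queue", user_id))
    for followee in followed:
        job_queue.put(("get_favorites", user_id, followee))

def get_favorites(user_id, followee):
    """Job 2: fetch one followee's favorites, then check the fan-in count."""
    fetched_favorites[followee] = get_favorite_items(followee)  # remote call
    pending[user_id] -= 1
    if pending[user_id] == 0:   # last fetch finished: now it's safe to crunch
        job_queue.put(("calculate_new_queue", user_id))

def calculate_new_queue(user_id):
    """Job 3: runs only once every Get Favorites job has completed."""
    queue_cache[user_id] = [item
                            for favs in fetched_favorites.values()
                            for item in favs]

def worker():
    """Drain the queue, dispatching each job to its handler."""
    handlers = {"get_favorites": get_favorites,
                "calculate_new_queue": calculate_new_queue}
    while not job_queue.empty():
        name, *args = job_queue.get()
        handlers[name](*args)
```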

So, my questions are:

  1. Is this a flow that would be well suited to parallelizing with MapReduce? I don't know if it would let me start the process for UserX, fetch all the related data, and come back to processing UserX's queue only after that's all done.
  2. If yes, would this be cost effective to run on Amazon's MapReduce module, which bills by the hour and rounds up to the next hour when the job completes? Is there a limit on how many "threads" I can have waiting on open API requests if I use their module?
  3. If not, is there another system/pattern I should use? And is there a library that will help me do this in Python (on AWS, using EC2 + EBS)?
  4. Are there any problems you see with the way I've designed this job flow?

Thanks for reading, I'm looking forward to some discussion with you all.

Edit, in response to JimR:

Thanks for a solid reply. In my reading since I wrote the original question, I've leaned away from using MapReduce. I haven't decided for sure yet how I want to build this, but I'm beginning to feel MapReduce is better for distributing / parallelizing computing load when I'm really just looking to parallelize HTTP requests.

What would have been my "reduce" task, the part that takes all the fetched data and crunches it into results, isn't that computationally intensive. I'm pretty sure it's going to wind up being one big SQL query that executes for a second or two per user.
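
For illustration only, that per-user query could look roughly like the sketch below; the schema (follows, favorites, user_queue and their columns) is entirely invented:

```python
# Sketch of the per-user "reduce" as one SQL query. Table and column names
# are invented; sqlite3 stands in for whatever SQL database is actually used.
import sqlite3

REBUILD_QUEUE_SQL = """
INSERT INTO user_queue (user_id, item_id, favorited_at)
SELECT f.user_id, fav.item_id, fav.created_at
FROM follows AS f
JOIN favorites AS fav ON fav.user_id = f.followed_id
WHERE f.user_id = ?
ORDER BY fav.created_at DESC
"""

def rebuild_queue(conn, user_id):
    """Recompute one user's queue from the freshly fetched data."""
    conn.execute("DELETE FROM user_queue WHERE user_id = ?", (user_id,))
    conn.execute(REBUILD_QUEUE_SQL, (user_id,))
    conn.commit()
```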

So, I'm leaning toward:

  • A non-MapReduce Job/Worker model, written in Python. A Google friend of mine turned me on to learning Python for this, since it's low overhead and scales well.
  • Using Amazon EC2 as a compute layer. I think this means I also need an EBS slice to store my database.
  • Possibly using Amazon's Simple Queue Service (SQS). It sounds like this third Amazon widget is designed to keep track of job queues, move results from one task into the inputs of another, and gracefully handle failed tasks. It's very cheap, and may be worth implementing instead of a custom job-queue system (a minimal worker-loop sketch follows this list).
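
If SQS does replace a custom job-queue system, the worker side reduces to a poll loop. A minimal sketch with boto3, where the queue URL and message format are assumptions:

```python
# Minimal SQS worker loop using boto3; the queue URL and message schema are
# placeholders. Messages whose handler raises are left on the queue and
# retried after the visibility timeout -- SQS's built-in failure handling.
import json

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/update-jobs"

def handle(job):
    """Dispatch one job dict, e.g. {"type": "get_favorites", "user_id": 42}."""
    print("processing", job)  # replace with the real job handlers

def run_worker():
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling keeps the idle loop cheap
        )
        for msg in resp.get("Messages", []):
            handle(json.loads(msg["Body"]))
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg["ReceiptHandle"],
            )
```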

Solution

Seems that we're going with Node.js and the Seq flow control library. It was very easy to move from my map/flowchart of the process to a stub of the code, and now it's just a matter of filling out the code to hook into the right APIs.

Thanks for the answers; they were a big help in finding the solution I was looking for.
