What is the best way to count unique visitors using Hadoop?

2022-01-13 00:00:00 python hadoop mapreduce

Question

Hey all, just getting started on Hadoop and curious what the best way in MapReduce would be to count unique visitors if your log files looked like this...

DATE       siteID  action   username
05-05-2010 siteA   pageview jim
05-05-2010 siteB   pageview tom
05-05-2010 siteA   pageview jim
05-05-2010 siteB   pageview bob
05-05-2010 siteA   pageview mike

...and for each site you wanted to find out the number of unique visitors?

I was thinking the mapper would emit siteID username, and the reducer would keep a set() of the unique usernames per key and then emit the length of that set. However, that would potentially store millions of usernames in memory, which doesn't seem right. Does anyone have a better way?
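In Hadoop streaming terms, that single-stage reducer would look something like the sketch below (the `unique_visitors` name and tab-separated `siteID<TAB>username` input format are illustrative assumptions; Hadoop streaming delivers reducer input already sorted by key):

```python
from itertools import groupby

def unique_visitors(sorted_lines):
    """Single-stage reducer sketch: one set of usernames per siteID.

    Because reducer input arrives sorted by key, all lines for one
    siteID are contiguous. The per-site set can still grow to millions
    of entries for a popular site, which is the memory concern above.
    """
    keyed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for site, group in groupby(keyed, key=lambda kv: kv[0]):
        users = {user for _, user in group}  # held fully in memory
        yield f"{site}\t{len(users)}"
```

In a real streaming job this function would sit in a script that reads `sys.stdin` and prints each emitted line.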

I'm using Python streaming, by the way.

Thanks


Solution

You could do it as a two-stage operation:

First step: emit (username => siteID), and have the reducer just collapse multiple occurrences of siteID using a set. Since you'd typically have far fewer sites than users, this should be fine.
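A sketch of that first stage in Python streaming (function names and the whitespace-separated log layout are assumptions based on the sample log above; the reducer's set is bounded by the number of sites one user visits, not the number of users):

```python
from itertools import groupby

def stage1_map(lines):
    """Stage-1 mapper: one username<TAB>siteID pair per pageview line."""
    for line in lines:
        fields = line.split()
        if len(fields) != 4 or fields[0] == "DATE":  # skip header/bad lines
            continue
        _date, site, _action, user = fields
        yield f"{user}\t{site}"

def stage1_reduce(sorted_lines):
    """Stage-1 reducer: per username, emit each distinct siteID once."""
    keyed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for _user, group in groupby(keyed, key=lambda kv: kv[0]):
        # Small set: one user's distinct sites, not all usernames.
        for site in sorted({site for _, site in group}):
            yield site
```

Each function would be wrapped in its own script reading `sys.stdin` and printing each line, then passed to Hadoop streaming as the `-mapper` and `-reducer` of the first job.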

Then in the second step, you can emit (siteID => username) and do a simple count, since the duplicates have been removed.
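The second stage then reduces to counting lines per siteID, a sketch of which might be (this assumes the first stage's output, one siteID line per unique visitor, is fed in as the sorted reducer input):

```python
from itertools import groupby

def stage2_reduce(sorted_site_lines):
    """Stage-2 reducer: count lines per siteID; each line is one unique visitor."""
    sites = (line.rstrip("\n") for line in sorted_site_lines)
    for site, group in groupby(sites):
        yield f"{site}\t{sum(1 for _ in group)}"
```

Since the stage-1 output already has siteID as the key, the second job can use an identity mapper (e.g. `-mapper cat`).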
