如何在 Google BigQuery 中为数千个类别创建虚拟变量列?
我有一个包含 2 列的简单表格:UserID 和 Category,每个 UserID 可以重复几个类别,如下所示:
I have a simple table with 2 columns: UserID and Category, and each UserID can repeat with a few categories, like so:
UserID Category
------ --------
1 A
1 B
2 C
3 A
3 C
3 B
我想虚拟化"这个表:即创建一个输出表,其中每个类别都有一个唯一的列,该列由虚拟变量组成(0/1 取决于用户 ID 是否属于该特定类别):
I want to "dummify" this table: i.e. to create an output table that has a unique column for each Category consisting of dummy variables (0/1 depending on whether the UserID belongs to that particular Category):
UserID A B C
------ -- -- --
1 1 1 0
2 0 0 1
3 1 1 1
我的问题是我有数千个类别(不仅仅是本示例中的 3 个),因此使用 CASE WHEN 语句无法有效地实现这一点.
My problem is that I have THOUSANDS of categories (not just 3 as in this example) and so this cannot be efficiently accomplished using CASE WHEN statement.
所以我的问题是:
1) 有没有一种方法可以在不使用数千个 CASE WHEN 语句的情况下模拟"Google BigQuery 中的 Category 列.
1) Is there a way to "dummify" the Category column in Google BigQuery without using thousands of CASE WHEN statements.
2) 这是 UDF 功能运行良好的情况吗?似乎确实如此,但我对 BigQuery 中的 UDF 不够熟悉,无法解决这个问题.有人可以帮忙吗?
2) Is this a situation where the UDF functionality works well? It seems like it would be the case but I am not familiar enough with UDF in BigQuery to solve this problem. Would someone be able to help out?
谢谢.
推荐答案
您可以使用下面的技术"
You can use below "technic"
首先运行查询 #1.它生成您需要运行以获得所需结果的查询(查询#2).请在使用数千个类别狂野"之前仍然考虑 Mosha 的评论:o)
First run query #1. It produces the query (query #2) that you need to run to get result you need. Please, still consider Mosha's comments before going "wild" with thousands categories :o)
查询 #1:
SELECT 'select UserID, ' +
GROUP_CONCAT_UNQUOTED(
'sum(if(category = "' + STRING(category) + '", 1, 0)) as ' + STRING(category)
)
+ ' from YourTable group by UserID'
FROM (
SELECT category
FROM YourTable
GROUP BY category
)
结果将如下所示 - 查询 #2
Resulted will be like below - Query #2
SELECT
UserID,
SUM(IF(category = "A", 1, 0)) AS A,
SUM(IF(category = "B", 1, 0)) AS B,
SUM(IF(category = "C", 1, 0)) AS C
FROM
YourTable
GROUP BY
UserID
当然对于三个类别 - 您可以手动完成,但对于数千个类别,它肯定会让您大开眼界!!
of course for three categories - you could do it manually, but for thousands it will definitelly will make day for you!!
查询 #2 的结果将如您所愿:
Result of query #2 will looks as you expect:
UserID A B C
1 1 1 0
2 0 0 1
3 1 1 1
相关文章