如何在 Google BigQuery 中为数千个类别创建虚拟变量列?

2021-11-20 00:00:00 sql google-bigquery mysql dummy-variable

我有一个包含 2 列的简单表格:UserID 和 Category,每个 UserID 可以重复几个类别,如下所示:

I have a simple table with 2 columns: UserID and Category, and each UserID can repeat with a few categories, like so:

UserID   Category
------   --------
1         A
1         B
2         C
3         A
3         C
3         B

我想虚拟化"这个表:即创建一个输出表,其中每个类别都有一个唯一的列,该列由虚拟变量组成(0/1 取决于用户 ID 是否属于该特定类别):

I want to "dummify" this table: i.e. to create an output table that has a unique column for each Category consisting of dummy variables (0/1 depending on whether the UserID belongs to that particular Category):

UserID    A  B  C
------    -- -- --
1         1  1  0
2         0  0  1
3         1  1  1

我的问题是我有数千个类别(不仅仅是本示例中的 3 个),因此使用 CASE WHEN 语句无法有效地实现这一点.

My problem is that I have THOUSANDS of categories (not just 3 as in this example) and so this cannot be efficiently accomplished using CASE WHEN statement.

所以我的问题是:

1) 有没有一种方法可以在不使用数千个 CASE WHEN 语句的情况下模拟"Google BigQuery 中的 Category 列.

1) Is there a way to "dummify" the Category column in Google BigQuery without using thousands of CASE WHEN statements.

2) 这是 UDF 功能运行良好的情况吗?似乎确实如此,但我对 BigQuery 中的 UDF 不够熟悉,无法解决这个问题.有人可以帮忙吗?

2) Is this a situation where the UDF functionality works well? It seems like it would be the case but I am not familiar enough with UDF in BigQuery to solve this problem. Would someone be able to help out?

谢谢.

推荐答案

您可以使用下面的技术"

You can use below "technic"

首先运行查询 #1.它生成您需要运行以获得所需结果的查询(查询#2).请在使用数千个类别狂野"之前仍然考虑 Mosha 的评论:o)

First run query #1. It produces the query (query #2) that you need to run to get result you need. Please, still consider Mosha's comments before going "wild" with thousands categories :o)

查询 #1:

SELECT 'select UserID, ' + 
   GROUP_CONCAT_UNQUOTED(
    'sum(if(category = "' + STRING(category) + '", 1, 0)) as ' + STRING(category)
   ) 
   + ' from YourTable group by UserID'
FROM (
  SELECT category 
  FROM YourTable  
  GROUP BY category
)

结果将如下所示 - 查询 #2

Resulted will be like below - Query #2

SELECT
  UserID,
  SUM(IF(category = "A", 1, 0)) AS A,
  SUM(IF(category = "B", 1, 0)) AS B,
  SUM(IF(category = "C", 1, 0)) AS C
FROM
  YourTable
GROUP BY
  UserID

当然对于三个类别 - 您可以手动完成,但对于数千个类别,它肯定会让您大开眼界!!

of course for three categories - you could do it manually, but for thousands it will definitelly will make day for you!!

查询 #2 的结果将如您所愿:

Result of query #2 will looks as you expect:

UserID  A   B   C    
1       1   1   0    
2       0   0   1    
3       1   1   1    

相关文章