How to transpose rows to columns with a large amount of data in BigQuery/SQL?

I am having trouble transposing a large table in BigQuery (1.5 billion rows) from rows to columns. I can do it for a small amount of data by hardcoding the columns, but not at this scale. A snapshot of the table looks like this:

+--------------------------------+
| CustomerID   Feature   Value   |
+--------------------------------+
| 1            A123      3       |
| 1            F213      7       |
| 1            F231      8       |
| 1            B789      9.1     |
| 2            A123      4       |
| 2            U123      4       |
| 2            B789      12      |
| ..           ..        ..      |
| ..           ..        ..      |
| 400000       A123      8       |
| 400000       U123      7       |
| 400000       R231      6       |
+--------------------------------+

So there are approximately 400,000 distinct CustomerIDs and 3,000 features, and not every CustomerID has the same features: some may have 2,000 features while others have all 3,000. The result I would like is a table in which each row represents one distinct CustomerID, with 3,000 columns covering all the features. Like this:

CustomerID Feature1 Feature2 ... Feature3000

So some of the cells may have missing values.

Does anyone have an idea how to do this in BigQuery or SQL?

Thanks in advance.

Recommended answer

STEP #1

In the query below, replace yourTable with the real name of your table and run it.

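-- BigQuery legacy SQL: builds the text of the pivot query, one MAX(IF(...)) per distinct Feature.
-- The AS alias names each output column and assumes every Feature value is a valid column name.
SELECT 'SELECT CustomerID, ' + 
   GROUP_CONCAT_UNQUOTED(
      'MAX(IF(Feature = "' + STRING(Feature) + '", Value, NULL)) AS ' + STRING(Feature)
   ) 
   + ' FROM yourTable GROUP BY CustomerID'
FROM (SELECT Feature FROM yourTable GROUP BY Feature)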

As a result you will get a string to be used in the next step.
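To make the shape of that string concrete, here is a minimal Python sketch that builds the same kind of query text outside BigQuery. It is only an illustration: the function name build_pivot_query and the AS alias on each column are assumptions added here, and the feature names are the ones visible in the sample rows of the question.

```python
def build_pivot_query(features, table="yourTable"):
    """Return pivot query text with one MAX(IF(...)) column per distinct feature.

    Hypothetical helper, not part of BigQuery. The AS alias assumes each
    feature name is also a valid column name.
    """
    columns = [
        f'MAX(IF(Feature = "{name}", Value, NULL)) AS {name}'
        for name in sorted(set(features))  # de-duplicate, like GROUP BY Feature
    ]
    return (
        "SELECT CustomerID, "
        + ", ".join(columns)
        + f" FROM {table} GROUP BY CustomerID"
    )


# Feature names taken from the sample rows in the question.
sample_features = ["A123", "F213", "F231", "B789", "U123", "R231"]
query = build_pivot_query(sample_features)
print(query)
```

For a CustomerID that has no row for a given feature, the IF never matches, so MAX sees only NULLs and the cell comes out as NULL. That is how the missing values mentioned in the question end up represented.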

STEP #2

Take the string you got from Step 1 and execute it as a query.
The output is the pivot you asked for in the question.
