Pandas column containing lists: how to pivot unique list elements into columns?

2022-01-22 00:00:00 python numpy pandas dataframe pivot

Problem description

I wrote a web scraper to pull information from a table of products and build a dataframe. The data table has a Description column which contains a comma separated string of attributes describing the product. I want to create a column in the dataframe for every unique attribute and populate the row in that column with the attribute's substring. Example df below.

PRODUCTS   DATE       DESCRIPTION
Product A  2016-9-12  Steel, Red, High Hardness
Product B  2016-9-11  Blue, Lightweight, Steel
Product C  2016-9-12  Red

I figure the first step is to split the description into a list.

In: df2 = df['DESCRIPTION'].str.split(',')

Out:
DESCRIPTION
['Steel', 'Red', 'High Hardness']
['Blue', 'Lightweight', 'Steel']
['Red']
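One detail worth checking at this step: splitting on a bare comma keeps the space that follows each comma, so the list items come back as ' Red' rather than 'Red'. The sketch below rebuilds the example frame by hand (the construction of df is an assumption, since the scraper itself is not shown) and compares the bare-comma split with a split on a comma followed by optional whitespace.

```python
import pandas as pd

# Example frame rebuilt by hand from the table in the question (assumed construction).
df = pd.DataFrame({
    "PRODUCTS": ["Product A", "Product B", "Product C"],
    "DATE": ["2016-9-12", "2016-9-11", "2016-9-12"],
    "DESCRIPTION": ["Steel, Red, High Hardness", "Blue, Lightweight, Steel", "Red"],
})

# Splitting on a bare comma keeps the space that follows each comma.
naive = df["DESCRIPTION"].str.split(",")

# A multi-character pattern is treated as a regular expression,
# so this splits on a comma plus any whitespace after it.
clean = df["DESCRIPTION"].str.split(r",\s*")

print(naive.iloc[0])
print(clean.iloc[0])
```

If the scraped strings can also carry leading or trailing whitespace, a .str.strip() before the split would probably be needed as well.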

My desired output looks like the table below. The column names are not particularly important.

PRODUCTS   DATE       STEEL_COL  RED_COL  HIGH HARDNESS_COL  BLUE COL  LIGHTWEIGHT_COL
Product A  2016-9-12  Steel      Red      High Hardness
Product B  2016-9-11  Steel                                  Blue      Lightweight
Product C  2016-9-12             Red

I believe the columns can be set up using a pivot, but I'm not sure of the most Pythonic way to populate them once they exist. Any help is appreciated.

Thank you very much for the answers. I selected @MaxU's response as correct since it seems slightly more flexible, but @piRSquared's gets a very similar result and may even be considered the more Pythonic approach. I tested both versions and both do what I needed. Thanks!

Solution

You can build up a sparse matrix, with one indicator column per unique attribute:

In [27]: df
Out[27]:
    PRODUCTS       DATE                DESCRIPTION
0  Product A  2016-9-12  Steel, Red, High Hardness
1  Product B  2016-9-11   Blue, Lightweight, Steel
2  Product C  2016-9-12                        Red

In [28]: (df.set_index(['PRODUCTS','DATE'])
   ....:    .DESCRIPTION.str.split(r',\s*', expand=True)
   ....:    .stack()
   ....:    .reset_index()
   ....:    .pivot_table(index=['PRODUCTS','DATE'], columns=0, fill_value=0, aggfunc='size')
   ....: )
Out[28]:
0                    Blue  High Hardness  Lightweight  Red  Steel
PRODUCTS  DATE
Product A 2016-9-12     0              1            0    1      1
Product B 2016-9-11     1              0            1    0      1
Product C 2016-9-12     0              0            0    1      0

In [29]: (df.set_index(['PRODUCTS','DATE'])
   ....:    .DESCRIPTION.str.split(r',\s*', expand=True)
   ....:    .stack()
   ....:    .reset_index()
   ....:    .pivot_table(index=['PRODUCTS','DATE'], columns=0, fill_value='', aggfunc='size')
   ....: )
Out[29]:
0                    Blue  High Hardness  Lightweight  Red  Steel
PRODUCTS  DATE
Product A 2016-9-12                    1               1      1
Product B 2016-9-11     1                           1         1
Product C 2016-9-12                                    1
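The pivot above fills the cells with counts, whereas the desired output in the question shows the attribute name itself in each cell. As a rough sketch of one way to get there (a different technique from the pivot_table answer, not the answerer's method), Series.str.get_dummies can build the same 0/1 indicators, and each 1 can then be swapped for its column's name. This assumes the separator is always exactly a comma followed by one space; the _COL suffix and the variable names flags, labels and result are illustrative only.

```python
import pandas as pd

# Example frame rebuilt by hand from the question (assumed construction).
df = pd.DataFrame({
    "PRODUCTS": ["Product A", "Product B", "Product C"],
    "DATE": ["2016-9-12", "2016-9-11", "2016-9-12"],
    "DESCRIPTION": ["Steel, Red, High Hardness", "Blue, Lightweight, Steel", "Red"],
})

# One 0/1 indicator column per unique attribute.
# Assumption: attributes are always separated by exactly ", ".
flags = df["DESCRIPTION"].str.get_dummies(sep=", ")

# Swap every nonzero flag for the attribute name and every zero for an empty string.
labels = flags.apply(lambda col: col.map(lambda v: col.name if v else ""))

# Hypothetical naming scheme, loosely following the question's example headers.
labels = labels.add_suffix("_COL")

# Attach the attribute columns to the identifying columns.
result = df[["PRODUCTS", "DATE"]].join(labels)
print(result)
```

Unlike the pivot, this keeps PRODUCTS and DATE as ordinary columns and preserves the original row order, which may or may not be what you want.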
