Python DataFrame: cumulative sum of a column until a condition is reached, then return the index

2022-01-09 00:00:00 python pandas dataframe sum

Problem description

I am new to Python and am currently facing an issue I can't solve. I really hope you can help me out. English is not my native language, so I am sorry if I do not express myself properly.

Say I have a simple data frame with two columns:

index  Num_Albums  Num_authors
0      10          4
1      1           5
2      4           4
3      7           1000
4      1           44
5      3           8

Num_Albums_tot = sum(Num_Albums) = 30

I need to take a cumulative sum of the data in Num_Albums until a certain condition is reached, record the index at which the condition is met, and get the corresponding value from Num_authors.

Example: cumulative sum of Num_Albums until the sum falls within 50% ± 1/15 of 30 (--> 15±2):

10 = 15±2? No, then continue;
10+1 = 15±2? No, then continue;
10+1+4 = 15±2? Yes, stop.

Condition reached at index 2. Then get Num_authors at that index: Num_authors(2) = 4
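
For reference, the walk-through above can be written as an explicit loop. This is only a sketch of the manual approach: the total of 30 is taken as stated in the question, and the variable names (target, tol, running, hit) are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

total = 30             # total as stated in the question
target = 0.5 * total   # 50% of the total -> 15.0
tol = total / 15       # 1/15 of the total -> 2.0

running = 0
hit = None             # index label where the condition is first met
for idx, n in df["Num_Albums"].items():
    running += n
    if abs(running - target) <= tol:
        hit = idx
        break

# hit stays None if the running total never lands inside the window
print(hit, df.at[hit, "Num_authors"] if hit is not None else None)
```

With the data above the loop stops at index 2, where the running total is 15.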

I would like to know whether pandas already implements a function for this before I start working out how to do it with a while/for loop.

[I would like to specify the column from which to retrieve the value at the relevant index. This comes in handy when I have, e.g., 4 columns: I sum the elements in column 1 until the condition is met, then get the corresponding value in column 2, and then do the same with columns 3 and 4.]


Solution

Option - 1:

You could compute the cumulative sum using cumsum. Then use np.isclose with its built-in tolerance parameter (atol) to check whether each value in this series lies within the specified threshold of 15 +/- 2. This returns a boolean array.

Then np.flatnonzero returns the ordinal positions at which the condition is True. We select the first such position.

Finally, use .iloc to retrieve the value of the column you require at the position computed earlier.

val = np.flatnonzero(np.isclose(df.Num_Albums.cumsum().values, 15, atol=2))[0]
df['Num_authors'].iloc[val]      # for faster access, use .iat 
4

The boolean array that np.isclose returns for the cumulative sum (converted to an array with .values):

np.isclose(df.Num_Albums.cumsum().values, 15, atol=2)
array([False, False,  True, False, False, False], dtype=bool)
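
The steps of this option can also be run one at a time. The sketch below rebuilds the question's data frame so it is self-contained, and adds two things that are assumptions of this example rather than part of the answer: rtol=0, so that only the absolute tolerance applies, and a guard for the case where no running total falls inside the window (indexing an empty result with [0] would raise an IndexError).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

csum = df["Num_Albums"].cumsum().to_numpy()    # running totals: 10, 11, 15, 22, 23, 26
mask = np.isclose(csum, 15, rtol=0, atol=2)    # True where |csum - 15| <= 2
hits = np.flatnonzero(mask)                    # positions of the True entries

# hits is empty when nothing matches, so check before taking the first element
result = df["Num_authors"].iat[hits[0]] if hits.size else None
print(result)
```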

Option - 2:

Use pd.Index.get_indexer on an index built from the cumulative sum; with method='nearest' it also supports a tolerance parameter. (Older pandas releases accepted the same arguments on pd.Index.get_loc; they were removed in pandas 2.0.)

val = pd.Index(df.Num_Albums.cumsum()).get_indexer([15], method='nearest', tolerance=2)[0]
df['Num_authors'].iat[val]       # val is a position, so use .iat
4
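
A nearest-match lookup behaves differently from the other options in two ways that are easier to see in code: it returns the position closest to the target rather than the first position inside the window, and it reports -1 when nothing lies within the tolerance. The sketch below uses Index.get_indexer, which accepts method and tolerance; the counts are made up for illustration and are not the question's data.

```python
import pandas as pd

counts = pd.Series([10, 4, 1, 7])   # made-up counts for illustration
csum = counts.cumsum()              # running totals: 10, 14, 15, 22
idx = pd.Index(csum)                # monotonic and unique because every count is positive

# Closest running total to 15, accepted only if within 2 of it
nearest = idx.get_indexer([15], method="nearest", tolerance=2)[0]

# First running total inside 15 +/- 2, for comparison
first_in_window = int(csum.sub(15).abs().le(2).to_numpy().argmax())

# Nothing is within 2 of 100, so get_indexer returns -1
no_match = idx.get_indexer([100], method="nearest", tolerance=2)[0]

print(nearest, first_in_window, no_match)
```

Here the nearest match is position 2 (total 15), while the first total inside the window is at position 1 (total 14). Which one is wanted depends on how "until the condition is reached" is read.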

Option - 3:

Use idxmax to find the first index of a True value in the boolean mask created by the sub, abs and le operations on the cumsum series:

# DataFrame.get_value was removed in pandas 1.0; .at is the label-based scalar accessor
df.at[df.Num_Albums.cumsum().sub(15).abs().le(2).idxmax(), 'Num_authors']
4
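
To choose the summed column and the lookup column freely, as the question asks, the mask-based approach could be wrapped in a small helper. This is a sketch only: value_at_cumsum_target and its parameter names are hypothetical, and it assumes unique index labels. It also guards against one pitfall of idxmax, which returns the first label when the mask contains no True value at all.

```python
import pandas as pd

def value_at_cumsum_target(df, sum_col, value_col, target, tol):
    """Return value_col at the first row where cumsum(sum_col) is within target +/- tol, else None."""
    mask = df[sum_col].cumsum().sub(target).abs().le(tol)
    if not mask.any():   # all False: idxmax would still return the first label
        return None
    return df.at[mask.idxmax(), value_col]

df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

found = value_at_cumsum_target(df, "Num_Albums", "Num_authors", target=15, tol=2)
missing = value_at_cumsum_target(df, "Num_Albums", "Num_authors", target=100, tol=2)
print(found, missing)
```

With four columns, the helper can be called once per pair, for example once for columns 1 and 2 and once for columns 3 and 4.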
