Python dataframe: cumulative sum of column until condition is reached and return the index
Problem description
I am new to Python and am currently facing an issue I can't solve. I really hope you can help me out. English is not my native language, so I am sorry if I do not express myself properly.
Say I have a simple data frame with two columns:
index Num_Albums Num_authors
0 10 4
1 1 5
2 4 4
3 7 1000
4 1 44
5 3 8
Num_Albums_tot = sum(Num_Albums) = 30
I need to do a cumulative sum of the data in Num_Albums until a certain condition is reached, record the index at which the condition is met, and get the corresponding value from Num_authors.
Example: cumulative sum of Num_Albums until the sum equals 50% ± 1/15 of 30 (--> 15±2):
10 = 15±2? No, then continue;
10+1 = 15±2? No, then continue;
10+1+4 = 15±2? Yes, stop.
The condition is reached at index 2. Then get Num_authors at that index: Num_authors(2) = 4
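The arithmetic of the example can be checked with a few lines of plain Python. This is only a sketch of the walk-through above, not a pandas solution; the variable names (running, hit, target, tol) are chosen here for illustration. Note that the six listed values add up to 26, so the target of 15 is taken from the question as written rather than recomputed.

```python
from itertools import accumulate

# Data copied from the table in the question
num_albums = [10, 1, 4, 7, 1, 3]
num_authors = [4, 5, 4, 1000, 44, 8]

target, tol = 15, 2  # 15 ± 2, as stated in the example

# Running totals of Num_Albums: 10, 11, 15, 22, 23, 26
running = list(accumulate(num_albums))

# First position whose running total lies within target ± tol (None if there is none)
hit = next((i for i, total in enumerate(running) if abs(total - target) <= tol), None)

print(running)
print(hit, num_authors[hit] if hit is not None else None)
```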
I would like to see if there's a function already implemented in pandas before I start thinking about how to do it with a while/for loop.
[I would like to specify the column from which I retrieve the value at the relevant index. This comes in handy when I have, e.g., 4 columns and want to sum the elements in column 1 and, once the condition is met, get the corresponding value in column 2, then do the same with columns 3 and 4.]
Solution
Option 1:
You could compute the cumulative sum using cumsum, then use np.isclose with its built-in tolerance parameter (atol) to check whether the values in this series lie within the specified threshold of 15 ± 2. This returns a boolean array.
np.flatnonzero then returns the ordinal positions at which the condition is True. We select the first of these.
Finally, use .iloc to retrieve the value of the column you need at the position computed earlier.
val = np.flatnonzero(np.isclose(df.Num_Albums.cumsum().values, 15, atol=2))[0]
df['Num_authors'].iloc[val] # for faster access, use .iat
4
For reference, this is the boolean array that np.isclose produces once the series has been converted to an array:
np.isclose(df.Num_Albums.cumsum().values, 15, atol=2)
array([False, False, True, False, False, False], dtype=bool)
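The question also asks for a way to choose which column is summed and which column the value is read from. A minimal sketch of Option 1 wrapped that way is shown below, assuming the data frame from the question. The helper name value_at_cumsum_target and its parameters are hypothetical, and returning None when no running total falls inside the tolerance is a choice made here, not something the answer specifies.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

def value_at_cumsum_target(frame, sum_col, value_col, target, tol):
    """Hypothetical helper: value of value_col at the first row where the
    cumulative sum of sum_col is within target ± tol, or None if no row is."""
    running = frame[sum_col].cumsum().to_numpy()
    # rtol=0 so that only the absolute tolerance atol is applied
    hits = np.flatnonzero(np.isclose(running, target, rtol=0, atol=tol))
    if hits.size == 0:
        return None  # assumption: signal "condition never reached" with None
    return frame[value_col].iat[hits[0]]

result = value_at_cumsum_target(df, "Num_Albums", "Num_authors", target=15, tol=2)
print(result)
```

With four columns, the same helper could be called once per pair, for example once for columns 1 and 2 and once for columns 3 and 4.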
Option 2:
Use pd.Index.get_indexer on the cumsum series; it supports a tolerance parameter with the nearest method. (On pandas older than 2.0, pd.Index.get_loc(15, 'nearest', tolerance=2) did the same job; those get_loc arguments were removed in 2.0.)
val = pd.Index(df.Num_Albums.cumsum()).get_indexer([15], method='nearest', tolerance=2)[0]
df['Num_authors'].iat[val]  # val is a position; DataFrame.get_value was removed in pandas 1.0
4
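Two details of the nearest lookup are easy to miss, and the sketch below tries to make them visible. It uses Index.get_indexer, which accepts method and tolerance and returns -1 when nothing lies within the tolerance. It also shows that "nearest" picks the closest running total, which is not always the first one inside the band. The helper name nearest_within and the second data frame are made up for illustration, and the approach assumes the running totals are strictly increasing (no zeros or negative values in the summed column).

```python
import numpy as np
import pandas as pd

def nearest_within(frame, sum_col, value_col, target, tol):
    """Hypothetical helper: value_col at the row whose running total of
    sum_col is nearest to target, if that total is within tol; else None."""
    running = pd.Index(frame[sum_col].cumsum())
    # get_indexer returns -1 when no entry lies within the tolerance
    pos = running.get_indexer([target], method="nearest", tolerance=tol)[0]
    return None if pos == -1 else frame[value_col].iat[pos]

df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})
print(nearest_within(df, "Num_Albums", "Num_authors", 15, 2))

# Made-up data: running totals are 10, 13, 15, 20
other = pd.DataFrame({"a": [10, 3, 2, 5], "b": [1, 2, 3, 4]})
nearest = nearest_within(other, "a", "b", 15, 2)  # row with total 15
in_band = ((other["a"].cumsum() - 15).abs() <= 2).to_numpy()
first = other["b"].iat[np.flatnonzero(in_band)[0]]  # row with total 13
print(nearest, first)
```

If the first row inside the band is what is wanted, Option 1 or Option 3 matches that requirement more directly.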
Option 3:
Use idxmax to find the first index of a True value in the boolean mask created by applying the sub, abs and le operations to the cumsum series:
df.at[df.Num_Albums.cumsum().sub(15).abs().le(2).idxmax(), 'Num_authors']  # DataFrame.get_value was removed in pandas 1.0
4
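One caveat with the idxmax approach: when the mask contains no True value at all, idxmax still returns the first index label, so the lookup silently returns the value from row 0. A minimal sketch with a guard is shown below, assuming the question's data and a unique index. The helper name first_within is hypothetical, and returning None for "no match" is a choice made here.

```python
import pandas as pd

df = pd.DataFrame({
    "Num_Albums": [10, 1, 4, 7, 1, 3],
    "Num_authors": [4, 5, 4, 1000, 44, 8],
})

def first_within(frame, sum_col, value_col, target, tol):
    """Hypothetical helper: value_col at the first row where the running
    total of sum_col is within target ± tol, or None if there is no such row."""
    mask = frame[sum_col].cumsum().sub(target).abs().le(tol)
    if not mask.any():
        return None  # without this guard, idxmax would point at the first row
    # idxmax returns an index label, so use label-based access with .at
    return frame.at[mask.idxmax(), value_col]

result = first_within(df, "Num_Albums", "Num_authors", target=15, tol=2)
no_match = first_within(df, "Num_Albums", "Num_authors", target=100, tol=2)
print(result, no_match)
```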