如何根据根据条件重置的累积总和进行分组

问题描述

我有一个 pandas df,其字数与文章相对应.我希望能够添加另一列 MERGED,该列基于具有最小累积总和min_words"的文章组.

I have a pandas df with word counts corresponding to articles. I want to be able to be able to add another column MERGED that is based on groups of articles that have a minimum cumulative sum of 'min_words'.

df = pd.DataFrame([[  0,  6],
       [  1,  10],
       [  3,   5],
       [  4,   7],
       [  5,  26],
       [  6,   7],
       [  9,   4],
       [ 10, 133],
       [ 11,  42],
       [ 12,   1]], columns=['ARTICLE', 'WORD_COUNT'])

df
Out[15]: 
   ARTICLE  WORD_COUNT
0        0           6
1        1          10
2        3           5
3        4           7
4        5          26
5        6           7
6        9           4
7       10         133
8       11          42
9       12           1

那么如果 min_words = 20 这是所需的输出:

So then if min_words = 20 this is the desired output:

    df
Out[17]: 
   ARTICLE  WORD_COUNT  MERGED
0        0           6       0
1        1          10       0
2        3           5       0
3        4           7       1
4        5          26       1
5        6           7       2
6        9           4       2
7       10         133       2
8       11          42       3
9       12           1       4

如上所示,最终文章可能不满足 min_words 条件,这没关系.

As seen above, it is possible that the final article(s) won't satisfy the min_words condition, and that's ok.


解决方案

只能做self def功能

We can only do self def function

def dymcumsum(v, limit):
     idx = []
     sums = 0
     for i in range(len(v)):
         sums += v[i]
         if sums >= limit:
             idx.append(i)
             sums = 0
     return(idx)
df['New']=np.nan
df.loc[dymcumsum(df.WORD_COUNT,20),'New']=1
df.New=df.New.iloc[::-1].eq(1).cumsum()[::-1].factorize()[0]+1
 
df
   ARTICLE  WORD_COUNT  New
0        0           6    1
1        1          10    1
2        3           5    1
3        4           7    2
4        5          26    2
5        6           7    3
6        9           4    3
7       10         133    3
8       11          42    4
9       12           1    5

相关文章