COUNTIF in pandas (Python) over multiple columns with multiple conditions

2022-01-21 00:00:00 python pandas dataset

Problem description

I have a dataset wherein I am trying to determine the number of risk factors per person. So I have the following data:

Person_ID  Age  Smoker  Diabetes
      001   30       Y         N
      002   45       N         N
      003   27       N         Y
      004   18       Y         Y
      005   55       Y         Y

Each attribute (Age, Smoker, Diabetes) has its own condition to determine whether it is a risk factor. So if Age >= 45, it's a risk factor. Smoker and Diabetes are risk factors if they are "Y". What I would like is to add a column that adds up the number of risk factors for each person based on those conditions. So the data would look like this:

Person_ID  Age  Smoker  Diabetes  Risk_Factors
      001   30       Y         N             1
      002   45       N         N             0
      003   27       N         Y             1
      004   18       Y         Y             2
      005   55       Y         Y             3

I have a sample dataset that I was fooling around with in Excel, and the way I did it there was to use the COUNTIF formula like so:

=COUNTIF(B2,">45") + COUNTIF(C2,"=Y") + COUNTIF(D2,"=Y")

However, the actual dataset that I will be using is way too large for Excel, so I'm learning pandas for python. I wish I could provide examples of what I've already tried, but frankly I don't even know where to start. I looked at this question, but it doesn't really address what to do about applying it to an entire new column using different conditions from multiple columns. Any suggestions?


Solution

If you want to stick with pandas, you can use the following:

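# 1 if the cell holds 'Y', otherwise 0
isY = lambda x: int(x == 'Y')
# Add up the per-row flags; Age uses a strict > 45 test, as in the Excel formula
countRiskFactors = lambda row: isY(row['Smoker']) + isY(row['Diabetes']) + int(row["Age"] > 45)

# Apply the function to every row (axis=1) and store the result as a new column
df['Risk_Factors'] = df.apply(countRiskFactors, axis=1)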

How it works

isY - a stored lambda function that checks whether a cell's value is Y; it returns 1 if it is and 0 otherwise.

countRiskFactors - adds up the risk factors for a single row.

The final line uses the apply method with the axis parameter set to 1, which applies the function (the first argument) row-wise along the DataFrame and returns a Series. That Series is then assigned to the DataFrame as the new Risk_Factors column.
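To try the approach end to end, here is a minimal, self-contained sketch. The DataFrame is built by hand from the question's sample table (with Person_ID as plain integers), and the names is_y and count_risk_factors are illustrative stand-ins for the lambdas above, not part of the original answer. It keeps the strict "> 45" comparison used in the answer and the Excel formula; if the "Age >= 45" wording in the question is the intended rule, change the comparison accordingly.

```python
import pandas as pd

# Sample data copied from the question's table (Person_ID kept as plain integers).
df = pd.DataFrame({
    "Person_ID": [1, 2, 3, 4, 5],
    "Age": [30, 45, 27, 18, 55],
    "Smoker": ["Y", "N", "N", "Y", "Y"],
    "Diabetes": ["N", "N", "Y", "Y", "Y"],
})


def is_y(value):
    """Return 1 if the cell holds 'Y', otherwise 0."""
    return int(value == "Y")


def count_risk_factors(row):
    """Add up the risk-factor flags for one row (strict Age > 45, as in the answer)."""
    return is_y(row["Smoker"]) + is_y(row["Diabetes"]) + int(row["Age"] > 45)


# axis=1 hands each row to the function as a Series indexed by column name.
df["Risk_Factors"] = df.apply(count_risk_factors, axis=1)
print(df)
```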

Output of printing df

   Person_ID  Age Smoker Diabetes  Risk_Factors
0          1   30      Y        N             1
1          2   45      N        N             0
2          3   27      N        Y             1
3          4   18      Y        Y             2
4          5   55      Y        Y             3
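Since the question mentions a dataset too large for Excel, a vectorized variant may be worth considering: apply with axis=1 calls a Python function once per row, whereas whole-column comparisons are evaluated in bulk. The sketch below is an alternative to the answer's method, not a restatement of it. AGE_THRESHOLD is a hypothetical constant introduced here for illustration, and the sample frame is again rebuilt from the question's table.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "Person_ID": [1, 2, 3, 4, 5],
    "Age": [30, 45, 27, 18, 55],
    "Smoker": ["Y", "N", "N", "Y", "Y"],
    "Diabetes": ["N", "N", "Y", "Y", "Y"],
})

# Hypothetical constant; the answer and the Excel formula use a strict "> 45" test.
AGE_THRESHOLD = 45

# Each comparison yields a boolean Series; casting to int gives 1/0 flags
# that can be added column-wise without a Python-level loop over rows.
df["Risk_Factors"] = (
    (df["Age"] > AGE_THRESHOLD).astype(int)
    + (df["Smoker"] == "Y").astype(int)
    + (df["Diabetes"] == "Y").astype(int)
)
print(df)
```

Each condition maps onto one COUNTIF term from the Excel formula, so adding a further risk factor only needs one more comparison in the sum.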
