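Populate a new column based on conditions defined in a string
Problem description
I need to populate a new column based on conditions that are defined in a string.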
condition_string = "colA='yes' & colB='yes' & (colC='yes' | colD='yes'): 'Yes', colA='no' & colB='no' & (colC='no' | colD='no'): 'No', ELSE : 'UNKNOWN'"
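The string can be rewritten or restructured into any other format (for example a dictionary) and then fed into the code to get the final result.
The DataFrame is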
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'ID': ['AB01', 'AB02', 'AB03', 'AB03', 'AB04', 'AB05', 'AB06'],
        'colA': ["yes", "yes", 'yes', "no", "no", 'yes', np.nan],
        'colB': [np.nan, 'yes', 'yes', "no", 'no', np.nan, "yes"],
        'colC': ["yes", 'yes', 'yes', "no", "no", np.nan, np.nan],
        'colD': ["yes", 'no', 'yes', "no", np.nan, "no", np.nan],
    }
)
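The final result should look like this
How can I do this without hard-coding the contents of condition_string? Or do you have a way to restructure condition_string and then apply it to the DataFrame?
Update: what if the condition string looks like this?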
condition_string = "colA='yes' & (colB='yes' | colB='no') & (colC='yes' | colD='yes'): 'Yes', colA='no' & colB='no' & (colC='no' | colD='no'): 'No', ELSE : 'UNKNOWN'"
with a DataFrame like
df = pd.DataFrame(
    {
        'ID': ['AB01', 'AB02', 'AB03', 'AB03', 'AB04', 'AB05', 'AB06'],
        'colA': ["yes", "yes", 'yes', "no", "no", 'yes', np.nan],
        'colB': ["no", 'yes', 'yes', "no", 'no', np.nan, "yes"],
        'colC': ["yes", 'yes', 'yes', "no", "no", np.nan, np.nan],
        'colD': ["yes", 'no', 'yes', "no", np.nan, "no", np.nan]
    }
)
Solution
Here is a solution that converts your conditions into a Python function and then applies it to the rows of the DataFrame:
import re

# assumes df from the question is already defined
condition_string = "colA='yes' & colB='yes' & (colC='yes' | colD='yes'): 'Yes', colA='no' & colB='no' & (colC='no' | colD='no'): 'No', ELSE : 'UNKNOWN'"

# formatting string as python function apply_cond
for col in df.columns:
    # turn each bare column name into a row lookup: colA -> row['colA']
    condition_string = re.sub(rf"(\W|^){col}(\W|$)", rf"\1row['{col}']\2", condition_string)
    # turn the single '=' after a column into the comparison operator '=='
    condition_string = re.sub(rf"row\['{col}'\]\s*=(?!=)", f"row['{col}']==", condition_string)

# every "cond: value, " that is not followed by ELSE starts a new elif branch
cond_form = (
    re.sub(r'(:[^[(]+), (?!ELSE)', r'\1\n    elif ', condition_string)
    .replace(": ", ":\n        return ")
    .replace("&", "and")
    .replace('|', 'or')
)
cond_form = re.sub(r", ELSE\s*:", "\n    else:", cond_form)
function_def = "def apply_cond(row):\n    if " + cond_form

#print(function_def) # uncomment to see how the function is defined

# executing the function definition of apply_cond
exec(function_def)

# applying the function to each row
df["result"] = df.apply(apply_cond, axis=1)
print(df)
Output:
     ID colA colB colC colD   result
0  AB01  yes  NaN  yes  yes  UNKNOWN
1  AB02  yes  yes  yes   no      Yes
2  AB03  yes  yes  yes  yes      Yes
3  AB03   no   no   no   no       No
4  AB04   no   no   no  NaN       No
5  AB05  yes  NaN  NaN   no  UNKNOWN
6  AB06  NaN  yes  NaN  NaN  UNKNOWN
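For reference, with the first condition_string, uncommenting the print line should show a generated function along these lines. I traced the substitutions by hand, so treat the exact whitespace as approximate:

```python
# Function that the string conversion is expected to generate
# for the first condition_string (reconstructed by hand).
def apply_cond(row):
    if row['colA']=='yes' and row['colB']=='yes' and (row['colC']=='yes' or row['colD']=='yes'):
        return 'Yes'
    elif row['colA']=='no' and row['colB']=='no' and (row['colC']=='no' or row['colD']=='no'):
        return 'No'
    else:
        return 'UNKNOWN'
```

Any row with a NaN in colA or colB falls through to the else branch, because NaN never compares equal to 'yes' or 'no'.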
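You may need to adjust the string formatting depending on your condition_string (I wrote this quickly, so some combinations are probably not supported), but if you receive these strings automatically, it saves you from having to redefine them by hand.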
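If you are free to restructure the conditions, as the question suggests, a dictionary could avoid both the regex parsing and exec. The sketch below is only one possible layout and not part of the code above: the names RULES, DEFAULT, rule_mask and apply_rules are made up for illustration. Each label maps to a list of OR-groups that are ANDed together, and the masks are combined with numpy.select, which picks the first matching rule for each row.

```python
import numpy as np
import pandas as pd

# Hypothetical rule layout: label -> list of OR-groups, all groups ANDed.
# Each group is a list of (column, expected value) pairs.
RULES = {
    "Yes": [[("colA", "yes")], [("colB", "yes")], [("colC", "yes"), ("colD", "yes")]],
    "No": [[("colA", "no")], [("colB", "no")], [("colC", "no"), ("colD", "no")]],
}
DEFAULT = "UNKNOWN"  # plays the role of the ELSE branch


def rule_mask(df, groups):
    """Boolean mask of the rows that satisfy every OR-group."""
    mask = pd.Series(True, index=df.index)
    for group in groups:
        any_match = pd.Series(False, index=df.index)
        for col, value in group:
            # NaN never equals a string, so missing values count as no match
            any_match |= df[col].eq(value)
        mask &= any_match
    return mask


def apply_rules(df, rules, default):
    """Label each row with the first rule it matches, else the default."""
    masks = [rule_mask(df, groups) for groups in rules.values()]
    return np.select(masks, list(rules.keys()), default=default)


df = pd.DataFrame(
    {
        "ID": ["AB01", "AB02", "AB03", "AB03", "AB04", "AB05", "AB06"],
        "colA": ["yes", "yes", "yes", "no", "no", "yes", np.nan],
        "colB": [np.nan, "yes", "yes", "no", "no", np.nan, "yes"],
        "colC": ["yes", "yes", "yes", "no", "no", np.nan, np.nan],
        "colD": ["yes", "no", "yes", "no", np.nan, "no", np.nan],
    }
)
df["result"] = apply_rules(df, RULES, DEFAULT)
print(df)
```

For the updated condition, the "Yes" entry would only need its colB group changed to [("colB", "yes"), ("colB", "no")]. The trade-off is that this layout can only express an AND of OR-groups, whereas the string version accepts arbitrary nesting.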