Pandas: select rows and columns based on a boolean condition

2022-01-20 00:00:00 python pandas dataframe conditional

Problem description

I have a pandas dataframe with about 50 columns and >100 rows. I want to select columns 'col_x', 'col_y' where 'col_z' < m. Is there a simple way to do this, similar to df[df['col3'] < m] and df[['colx','coly']], but combined?


Solution

Let's break down your problem. You want to

  1. Filter rows based on some boolean condition
  2. Select a subset of columns from the result.

For the first point, the condition you need is:

df["col_z"] < m

For the second requirement, you specify the list of columns that you need:

["col_x", "col_y"]

How would you combine these two to produce the expected output with pandas? The most straightforward way is to use loc:

df.loc[df["col_z"] < m, ["col_x", "col_y"]]

The first argument selects rows, and the second argument selects columns.
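As a minimal runnable sketch of the loc call above: the column names follow the question, while the sample values and the threshold m are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "col_x": [1, 2, 3, 4],
    "col_y": [10, 20, 30, 40],
    "col_z": [5, 15, 25, 35],
})
m = 20  # hypothetical threshold

# Rows where col_z < m, restricted to the columns col_x and col_y.
result = df.loc[df["col_z"] < m, ["col_x", "col_y"]]
print(result)
```

Only the first two rows satisfy col_z < 20, so the result keeps those rows and drops col_z.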

More on loc

Think of this in terms of the relational algebra operations selection and projection. If you come from the SQL world, this should be a familiar equivalent. In SQL syntax, the operation above would look like this:

SELECT col_x, col_y     -- projection on columns
FROM df
WHERE col_z < m         -- selection on rows
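To make the selection/projection analogy concrete, the sketch below performs the two steps separately, as in the question, and compares the result with the single loc call. The data and the value of m are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"col_x": [1, 2, 3], "col_y": [4, 5, 6], "col_z": [7, 8, 9]})
m = 9  # hypothetical threshold

selected = df[df["col_z"] < m]              # selection: WHERE col_z < m
projected = selected[["col_x", "col_y"]]    # projection: SELECT col_x, col_y

# The same result in one step with loc.
combined = df.loc[df["col_z"] < m, ["col_x", "col_y"]]
print(projected.equals(combined))
```

Both routes return the same frame here. The loc form does it in a single indexing operation, which is also the safer choice if you later want to assign to the selected cells.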

loc also lets you select rows by their index labels. For example, given this dataframe:

   col_x  col_y
a      1      4
b      2      5
c      3      6

To select the index labels a and c, and the column col_x, you'd use:

df.loc[['a', 'c'], ['col_x']]

   col_x
a      1
c      3
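The label-based example can be reproduced with the sketch below, which rebuilds the small dataframe shown above. It also shows that passing the column as a list returns a DataFrame, while passing a single label returns a Series.

```python
import pandas as pd

# Rebuild the example frame with string index labels.
df = pd.DataFrame({"col_x": [1, 2, 3], "col_y": [4, 5, 6]}, index=["a", "b", "c"])

as_frame = df.loc[["a", "c"], ["col_x"]]   # list of columns -> DataFrame
as_series = df.loc[["a", "c"], "col_x"]    # single column label -> Series

print(type(as_frame).__name__)
print(type(as_series).__name__)
```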

Alternatively, to select by a boolean condition (using a series/array of bool values, as your original question asks), for example the rows where the value in col_x is odd:

df.loc[(df.col_x % 2).ne(0), ['col_y']]

   col_y
a      4
c      6

In detail, df.col_x % 2 computes the remainder of each value modulo 2. ne(0) then compares each remainder to 0 and returns True where it is not equal, which is how the odd values are selected. Here is what that expression evaluates to:

(df.col_x % 2).ne(0)

a     True
b    False
c     True
Name: col_x, dtype: bool
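For completeness, a short sketch on the same example frame: it builds the mask explicitly, checks that the method form ne(0) matches the operator form != 0, and then passes the mask to loc.

```python
import pandas as pd

df = pd.DataFrame({"col_x": [1, 2, 3], "col_y": [4, 5, 6]}, index=["a", "b", "c"])

mask = (df["col_x"] % 2).ne(0)      # True where col_x is odd
mask_op = df["col_x"] % 2 != 0      # operator form of the same comparison

print(mask.equals(mask_op))
odd_rows = df.loc[mask, ["col_y"]]  # rows a and c, column col_y only
print(odd_rows)
```

The method form is mainly convenient when chaining; both spellings produce the same boolean Series.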


Further reading

  • 10 minutes to pandas - Selection by label
  • Indexing and selecting data
    • Boolean indexing
