Removing NEARLY duplicate observations - Python

2022-01-10 00:00:00 python pandas duplicates

Problem description

I am trying to remove observations from a pandas DataFrame that are ALMOST, but not quite, 100% identical to other rows. See the frame below:

Notice how "John", "Mary", and "Wesley" each have nearly identical observations that differ in a single column. The real data set has 15 columns and more than 215,000 observations. In every case I could verify visually, the pattern was the same: out of 15 columns, the other observation matched on up to 14. For the purposes of this project, I have decided to remove the repeated observations (and store them in another DataFrame in case my boss asks to see them).

I have of course thought of drop_duplicates(keep='something'), but that would not work because the observations are not ENTIRELY identical. Has anyone ever encountered such an issue? Any ideas for a remedy?
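To make the problem concrete, here is a minimal sketch. It assumes the small four-column sample frame from the solution's code stands in for the real 15-column data. It shows that an exact duplicate check flags nothing, while ignoring a single column (here 'Salary', chosen only for illustration) exposes a near duplicate.

```python
import pandas as pd

# Sample frame copied from the solution's code; the real data has 15 columns.
df = pd.DataFrame(
    [
        ['John', 45, 85000, 'DC'],
        ['Netcha', 25, 48000, 'NYC'],
        ['Mary', 45, 85000, 'DC'],
        ['Wesley', 36, 72500, 'LA'],
        ['Porter', 22, 98750, 'Seattle'],
        ['John', 45, 105500, 'DC'],
        ['Mary', 28, 85000, 'DC'],
        ['Wesley', 36, 72500, 'Boston'],
    ],
    columns=['Name', 'Age', 'Salary', 'City'],
)

# No row matches another on every column, so an exact check flags nothing.
exact_dupes = int(df.duplicated().sum())

# Ignoring one column ('Salary') flags the second "John" row as a repeat.
near_dupes = df[df.duplicated(subset=['Name', 'Age', 'City'], keep='first')]

print(exact_dupes)
print(near_dupes)
```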

Solution

What about a simple loop over subsets of the columns:

    import pandas as pd

    df = pd.DataFrame(
        [
            ['John', 45, 85000, 'DC'],
            ['Netcha', 25, 48000, 'NYC'],
            ['Mary', 45, 85000, 'DC'],
            ['Wesley', 36, 72500, 'LA'],
            ['Porter', 22, 98750, 'Seattle'],
            ['John', 45, 105500, 'DC'],
            ['Mary', 28, 85000, 'DC'],
            ['Wesley', 36, 72500, 'Boston'],
        ],
        columns=['Name', 'Age', 'Salary', 'City'])

    # 'Name' must always match, so it is never the column left out.
    cols = df.columns.tolist()
    cols.remove('Name')

    for col in cols:
        # Compare rows on every column except the current one.
        observed_cols = df.drop(col, axis=1).columns.tolist()
        df.drop_duplicates(observed_cols, keep='first', inplace=True)

    print(df)

Returns:

         Name  Age  Salary     City
    0    John   45   85000       DC
    1  Netcha   25   48000      NYC
    2    Mary   45   85000       DC
    3  Wesley   36   72500       LA
    4  Porter   22   98750  Seattle
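The question also asks for the removed rows to be kept in a separate DataFrame. One possible variant of the loop above is sketched here. The helper name split_near_duplicates and its key parameter are hypothetical, introduced only for illustration. It uses DataFrame.duplicated to build a mask on each pass instead of dropping rows in place, so the flagged rows can be collected.

```python
import pandas as pd


def split_near_duplicates(frame, key='Name'):
    """Hypothetical helper: return (kept, removed).

    A row goes to `removed` when it repeats an earlier kept row on every
    column except one; the `key` column is always part of the comparison.
    """
    kept = frame.copy()
    removed_parts = []
    for col in frame.columns:
        if col == key:
            continue  # never leave the key column out of the comparison
        subset = [c for c in frame.columns if c != col]
        mask = kept.duplicated(subset=subset, keep='first')
        if mask.any():
            removed_parts.append(kept[mask])
            kept = kept[~mask]
    # Fall back to an empty frame with the same columns if nothing was flagged.
    removed = pd.concat(removed_parts) if removed_parts else frame.iloc[0:0]
    return kept, removed.sort_index()


df = pd.DataFrame(
    [
        ['John', 45, 85000, 'DC'],
        ['Netcha', 25, 48000, 'NYC'],
        ['Mary', 45, 85000, 'DC'],
        ['Wesley', 36, 72500, 'LA'],
        ['Porter', 22, 98750, 'Seattle'],
        ['John', 45, 105500, 'DC'],
        ['Mary', 28, 85000, 'DC'],
        ['Wesley', 36, 72500, 'Boston'],
    ],
    columns=['Name', 'Age', 'Salary', 'City'],
)

kept, removed = split_near_duplicates(df)
print(kept)
print(removed)
```

On the sample frame, kept holds the same five rows as the output above and removed holds the other three. As with the original loop, keep='first' means the earliest row in each group survives, so the result depends on row order.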
