Pandas: How to easily share a sample dataframe using df.to_dict()?
Problem description
Despite the clear guidance in How do I ask a good question? and How to create a Minimal, Reproducible Example, many people neglect to include a reproducible data sample in their question. So what is a practical and easy way to reproduce a data sample when a simple pd.DataFrame(np.random.random(size=(5, 5))) is not enough? How can you, for example, use df.to_dict() and include the output in a question?
Solution
The answer:
In many situations, an approach based on df.to_dict() will do the job perfectly! Here are two cases that come to mind:
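Case 1: You've got a dataframe built or loaded in Python from a local source
Case 2: You've got a table in another application (like Excel)
Case 1: You've got a dataframe built or loaded from a local source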
Given that you've got a pandas dataframe named df, just
- run df.to_dict() in your console or editor,
- copy the output, which is formatted as a dictionary, and
- paste the content into pd.DataFrame(<output>) and include that chunk in your now reproducible code snippet.
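As a minimal sketch of those three steps, assuming a small made-up dataframe (the column names a and b and the name df_shared are only for illustration), the round trip could look like this:

```python
import pandas as pd

# Hypothetical stand-in for a dataframe you already have locally
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# Step 1: produce the dictionary representation
as_dict = df.to_dict()
print(as_dict)
# {'a': {0: 1, 1: 2, 2: 3}, 'b': {0: 'x', 1: 'y', 2: 'z'}}

# Steps 2 and 3: paste the printed dictionary into pd.DataFrame(...)
df_shared = pd.DataFrame({'a': {0: 1, 1: 2, 2: 3},
                          'b': {0: 'x', 1: 'y', 2: 'z'}})
```

Anyone who copies the last statement gets the same values without needing access to your original data source.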
Case 2: You've got a table in another application (like Excel)
Depending on the source and the separator (such as ',', ';' or '\s+', where the latter means any whitespace), you can simply:
- Ctrl+C the contents,
- run df = pd.read_clipboard(sep=r'\s+') in your console or editor,
- run df.to_dict(), and
- include the output in df = pd.DataFrame(<output>).
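The clipboard itself is hard to show in a self-contained snippet, so here is a rough sketch that substitutes a string for the copied text. The table contents are made up, and pd.read_csv on an io.StringIO object stands in for pd.read_clipboard, which hands the clipboard text to the same parser:

```python
import io
import pandas as pd

# Hypothetical text as it might sit on the clipboard after copying a small table
clipboard_text = """name score
alice 1.5
bob 2.5
"""

# With a real clipboard this would be: df = pd.read_clipboard(sep=r'\s+')
df = pd.read_csv(io.StringIO(clipboard_text), sep=r'\s+')

# The printed dictionary is what you would paste into pd.DataFrame(...)
print(df.to_dict())
```

The raw string r'\s+' avoids an invalid-escape warning in recent Python versions while meaning the same thing to pandas.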
In this case, the start of your question would look something like this:
import pandas as pd
df = pd.DataFrame({0: {0: 0.25474768796402636, 1: 0.5792136563952824, 2: 0.5950396800676201},
1: {0: 0.9071073567355232, 1: 0.1657288354283053, 2: 0.4962367707789421},
2: {0: 0.7440601352930207, 1: 0.7755487356392468, 2: 0.5230707257648775}})
Of course, this gets a little clumsy with larger dataframes. But very often, all that anyone trying to answer your question needs is a small sample of your real-world data, so that they can take the structure of your data into account. Two things help with larger dataframes:
- run df.head(20).to_dict() to only include the first 20 rows, and
- change the format of your dict using, for example, df.to_dict('split') (there are other options besides 'split') to reshape your output to a dict that requires fewer lines.
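As one illustration of those other options, here is a small sketch using orient='list' on a made-up dataframe. This orient drops the index labels and stores one list per column, which is compact and can still be passed straight to pd.DataFrame (the default integer index is recreated):

```python
import pandas as pd

# Hypothetical dataframe standing in for your real data
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [0.1, 0.2, 0.3, 0.4]})

# Keep only the first rows and use one list per column
compact = df.head(3).to_dict(orient='list')
print(compact)

# A dict of lists is accepted by the DataFrame constructor as-is
df_shared = pd.DataFrame(compact)
```

This only fits when the index carries no information; if it does, the default orient or 'split' keeps it.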
Here's an example using the iris dataset, available from plotly express among other places.
If you just run:
import plotly.express as px
import pandas as pd
df = px.data.iris()
df.to_dict()
This will produce an output of nearly 1000 lines, which is not very practical as a reproducible sample. But if you include .head(10), you'll get:
{'sepal_length': {0: 5.1, 1: 4.9, 2: 4.7, 3: 4.6, 4: 5.0, 5: 5.4, 6: 4.6, 7: 5.0, 8: 4.4, 9: 4.9},
'sepal_width': {0: 3.5, 1: 3.0, 2: 3.2, 3: 3.1, 4: 3.6, 5: 3.9, 6: 3.4, 7: 3.4, 8: 2.9, 9: 3.1},
'petal_length': {0: 1.4, 1: 1.4, 2: 1.3, 3: 1.5, 4: 1.4, 5: 1.7, 6: 1.4, 7: 1.5, 8: 1.4, 9: 1.5},
'petal_width': {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.4, 6: 0.3, 7: 0.2, 8: 0.2, 9: 0.1},
'species': {0: 'setosa', 1: 'setosa', 2: 'setosa', 3: 'setosa', 4: 'setosa', 5: 'setosa', 6: 'setosa', 7: 'setosa', 8: 'setosa', 9: 'setosa'},
'species_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}}
And now we're getting somewhere. But depending on the structure and content of the data, this may not cover the complexity of the contents in a satisfactory manner. You can, however, fit more data on fewer lines by using to_dict('split') like this:
import plotly.express as px
df = px.data.iris().head(10)
df.to_dict('split')
Now your output will look like:
{'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
'columns': ['sepal_length',
'sepal_width',
'petal_length',
'petal_width',
'species',
'species_id'],
'data': [[5.1, 3.5, 1.4, 0.2, 'setosa', 1],
[4.9, 3.0, 1.4, 0.2, 'setosa', 1],
[4.7, 3.2, 1.3, 0.2, 'setosa', 1],
[4.6, 3.1, 1.5, 0.2, 'setosa', 1],
[5.0, 3.6, 1.4, 0.2, 'setosa', 1],
[5.4, 3.9, 1.7, 0.4, 'setosa', 1],
[4.6, 3.4, 1.4, 0.3, 'setosa', 1],
[5.0, 3.4, 1.5, 0.2, 'setosa', 1],
[4.4, 2.9, 1.4, 0.2, 'setosa', 1],
[4.9, 3.1, 1.5, 0.1, 'setosa', 1]]}
And now you can easily increase the number in .head(10) without cluttering your question too much. But there's one minor drawback: you can no longer pass the output directly to pd.DataFrame. If you specify index, columns, and data explicitly, though, you'll be just fine. So for this particular dataset, my preferred approach would be:
import pandas as pd
import plotly.express as px
sample = {'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
'columns': ['sepal_length',
'sepal_width',
'petal_length',
'petal_width',
'species',
'species_id'],
'data': [[5.1, 3.5, 1.4, 0.2, 'setosa', 1],
[4.9, 3.0, 1.4, 0.2, 'setosa', 1],
[4.7, 3.2, 1.3, 0.2, 'setosa', 1],
[4.6, 3.1, 1.5, 0.2, 'setosa', 1],
[5.0, 3.6, 1.4, 0.2, 'setosa', 1],
[5.4, 3.9, 1.7, 0.4, 'setosa', 1],
[4.6, 3.4, 1.4, 0.3, 'setosa', 1],
[5.0, 3.4, 1.5, 0.2, 'setosa', 1],
[4.4, 2.9, 1.4, 0.2, 'setosa', 1],
[4.9, 3.1, 1.5, 0.1, 'setosa', 1],
[5.4, 3.7, 1.5, 0.2, 'setosa', 1],
[4.8, 3.4, 1.6, 0.2, 'setosa', 1],
[4.8, 3.0, 1.4, 0.1, 'setosa', 1],
[4.3, 3.0, 1.1, 0.1, 'setosa', 1],
[5.8, 4.0, 1.2, 0.2, 'setosa', 1]]}
df = pd.DataFrame(index=sample['index'], columns=sample['columns'], data=sample['data'])
df
Now you'll have this dataframe to work with:
    sepal_length  sepal_width  petal_length  petal_width  species  species_id
0            5.1          3.5           1.4          0.2   setosa           1
1            4.9          3.0           1.4          0.2   setosa           1
2            4.7          3.2           1.3          0.2   setosa           1
3            4.6          3.1           1.5          0.2   setosa           1
4            5.0          3.6           1.4          0.2   setosa           1
5            5.4          3.9           1.7          0.4   setosa           1
6            4.6          3.4           1.4          0.3   setosa           1
7            5.0          3.4           1.5          0.2   setosa           1
8            4.4          2.9           1.4          0.2   setosa           1
9            4.9          3.1           1.5          0.1   setosa           1
10           5.4          3.7           1.5          0.2   setosa           1
11           4.8          3.4           1.6          0.2   setosa           1
12           4.8          3.0           1.4          0.1   setosa           1
13           4.3          3.0           1.1          0.1   setosa           1
14           5.8          4.0           1.2          0.2   setosa           1
This will increase your chances of receiving useful answers significantly!
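Since the keys of a 'split' dictionary (index, columns, data) match keyword arguments of the pd.DataFrame constructor, a shorter variant is to unpack the dictionary directly. A minimal sketch, using a trimmed-down sample with only two of the iris columns:

```python
import pandas as pd

# Trimmed-down 'split' dictionary; a real one would come from df.to_dict('split')
sample = {'index': [0, 1, 2],
          'columns': ['sepal_length', 'species'],
          'data': [[5.1, 'setosa'],
                   [4.9, 'setosa'],
                   [4.7, 'setosa']]}

# Equivalent to pd.DataFrame(index=..., columns=..., data=...)
df = pd.DataFrame(**sample)
print(df)
```

Spelling the three arguments out is more explicit; unpacking just saves some typing when the dictionary has exactly those three keys.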
Note that the output of df.to_dict() cannot be read back in if it contains timestamps like 1: Timestamp('2020-01-02 00:00:00') unless you also include from pandas import Timestamp
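A short sketch of that situation, assuming a made-up dataframe with one datetime column (the names date, value and df_shared are illustrative only):

```python
import pandas as pd
from pandas import Timestamp  # without this, the pasted Timestamp(...) calls raise NameError

# Hypothetical dataframe with a datetime column
df = pd.DataFrame({'date': pd.to_datetime(['2020-01-01', '2020-01-02']),
                   'value': [1, 2]})

# The printed dictionary contains Timestamp(...) entries
print(df.to_dict())

# Pasting that output back in only works because Timestamp was imported above
df_shared = pd.DataFrame({'date': {0: Timestamp('2020-01-01 00:00:00'),
                                   1: Timestamp('2020-01-02 00:00:00')},
                          'value': {0: 1, 1: 2}})
```

So if your sample contains dates, put the extra import at the top of the snippet you share.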