How to make good reproducible pandas examples

2022-01-29 00:00:00 python pandas r

Problem description

Having spent a decent amount of time watching both the r and pandas tags on SO, the impression that I get is that pandas questions are less likely to contain reproducible data. This is something that the R community has been pretty good about encouraging, and thanks to guides like this, newcomers are able to get some help on putting together these examples. People who are able to read these guides and come back with reproducible data will often have much better luck getting answers to their questions.

How can we create good reproducible examples for pandas questions? Simple dataframes can be put together, e.g.:

import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'], 
                   'income': [40000, 50000, 42000]})

But many example datasets need more complicated structure, e.g.:

  • datetime indices or data
  • Multiple categorical variables (is there an equivalent to R's expand.grid() function, which produces all possible combinations of some given variables?)
  • MultiIndex or Panel data

For datasets that are hard to mock up using a few lines of code, is there an equivalent to R's dput() that allows you to generate copy-pasteable code to regenerate your datastructure?
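
For readers who don't know the R functions mentioned, here is a rough sketch of the kinds of structures meant above. It assumes `itertools.product` as a stand-in for expand.grid() and `DataFrame.to_dict(orient="list")` as a stand-in for dput(); these are illustrative choices rather than equivalents proposed in the question, and the column names `group`, `year`, `flag` and `value` are made up.

```python
import itertools

import pandas as pd

# Rough analogue of expand.grid(): one row per combination of the given values.
grid = pd.DataFrame(
    list(itertools.product(["a", "b"], [2021, 2022], [True, False])),
    columns=["group", "year", "flag"],
)

# A small frame with a datetime index.
ts = pd.DataFrame(
    {"value": range(4)},
    index=pd.date_range("2022-01-01", periods=4, freq="D"),
)

# Rough analogue of dput(): a plain literal that rebuilds the frame when pasted.
# Note that orient="list" keeps the columns only, not the index.
df = pd.DataFrame({"user": ["Bob", "Jane", "Alice"],
                   "income": [40000, 50000, 42000]})
literal = df.to_dict(orient="list")
rebuilt = pd.DataFrame(literal)

print(grid)
print(literal)
```

Printing `literal` gives something that can be pasted straight into `pd.DataFrame(...)`, which is the property the question is after.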

Solution

Note: The ideas here are pretty generic for Stack Overflow, and indeed for questions in general.

Disclaimer: Writing a good question is hard.

The Good:

  • do include a small* example DataFrame, either as runnable code:

      In [1]: df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
    

    or make it "copy and pasteable" using pd.read_clipboard(sep='\s\s+'). You can format the text for Stack Overflow by highlighting it and using Ctrl+K (or prepending four spaces to each line), or by placing three backticks (```) above and below your code with your code unindented:

      In [2]: df
      Out[2]:
         A  B
      0  1  2
      1  1  3
      2  4  6
    

    Test pd.read_clipboard(sep='\s\s+') yourself.

    * I really do mean small: the vast majority of example DataFrames could be fewer than 6 rows [citation needed], and I bet I can do it in 5 rows. Can you reproduce the error with df = df.head()? If not, fiddle around to see if you can make up a small DataFrame which exhibits the issue you are facing.

    * Every rule has an exception, the obvious one is for performance issues (in which case definitely use %timeit and possibly %prun), where you should generate (consider using np.random.seed so we have the exact same frame): df = pd.DataFrame(np.random.randn(100000000, 10)). Saying that, "make this code fast for me" is not strictly on topic for the site...

  • write out the outcome you desire (similarly to above)

      In [3]: iwantthis
      Out[3]:
         A  B
      0  1  5
      1  4  6
    

    Explain where the numbers come from: the 5 is the sum of the B column for the rows where A is 1.

  • do show the code you've tried:

      In [4]: df.groupby('A').sum()
      Out[4]:
         B
      A
      1  5
      4  6
    

    But say what's incorrect: the A column is in the index rather than a column.

  • do show you've done some research (search the documentation, search Stack Overflow), and give a summary:

    The docstring for sum simply states "Compute sum of group values"

    The groupby documentation doesn't give any examples for this.

    Aside: the answer here is to use df.groupby('A', as_index=False).sum().

  • if it's relevant that you have Timestamp columns, e.g. you're resampling or something, then be explicit and apply pd.to_datetime to them for good measure**.

      df['date'] = pd.to_datetime(df['date']) # this column ought to be date..
    

    ** Sometimes this is the issue itself: they were strings.
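
The points above can be put together in one short script. This is a minimal sketch, not the only way to do it: since a clipboard isn't always available, it reads the printed frame from a string buffer with `pd.read_csv` and the same `\s\s+` separator, as an assumed stand-in for pd.read_clipboard. The `dates` frame and its `date` column are made up for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=["A", "B"])

# The printed form of df, i.e. what would be pasted into the question.
pasted = df.to_string()

# Stand-in for pd.read_clipboard(sep=r"\s\s+"): same separator, but reading
# from a string buffer so that no clipboard is needed.
parsed = pd.read_csv(io.StringIO(pasted), sep=r"\s\s+", engine="python")

# What was tried: A ends up in the index.
attempt = df.groupby("A").sum()

# What was wanted: A stays an ordinary column.
iwantthis = df.groupby("A", as_index=False).sum()

# Dates that arrive as strings, converted explicitly.
dates = pd.DataFrame({"date": ["2022-01-01", "2022-01-02"]})
dates["date"] = pd.to_datetime(dates["date"])

print(parsed)
print(iwantthis)
```

Someone answering can run this as is, see the difference between `attempt` and `iwantthis`, and doesn't have to guess at the data.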

The Bad:

  • don't include a MultiIndex, which we can't copy and paste (see above). This is kind of a grievance with Pandas' default display, but nonetheless annoying:

      In [11]: df
      Out[11]:
           C
      A B
      1 2  3
        2  6
    

    The correct way is to include an ordinary DataFrame with a set_index call:

      In [12]: df = pd.DataFrame([[1, 2, 3], [1, 2, 6]], columns=['A', 'B', 'C']).set_index(['A', 'B'])
    
      In [13]: df
      Out[13]:
           C
      A B
      1 2  3
        2  6
    

  • do provide insight into what the result is when giving the outcome you want:

         B
      A
      1  1
      5  0
    

    Be specific about how you got the numbers (what are they)... double check they're correct.

  • If your code throws an error, do include the entire stack trace (this can be edited out later if it's too noisy). Show the line number (and the corresponding line of your code which it's raising against).
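
As a small sketch of the two points above: build the MultiIndex from a flat frame so it can be pasted, and capture the whole stack trace rather than just the last line. The lookup of a `missing` column is a made-up way to trigger an error, and using `traceback.format_exc()` is one option among several.

```python
import traceback

import pandas as pd

# A flat frame is easy to paste; the MultiIndex is added by an explicit call.
flat = pd.DataFrame([[1, 2, 3], [1, 2, 6]], columns=["A", "B", "C"])
df = flat.set_index(["A", "B"])

# Going the other way: flatten an existing MultiIndex frame before posting it.
roundtrip = df.reset_index()

# Capture the entire stack trace of a failing line, not just the message.
try:
    df["missing"]  # hypothetical failing line: there is no such column
except KeyError:
    trace = traceback.format_exc()
    print(trace)
```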

The Ugly:

  • don't link to a CSV file we don't have access to (ideally don't link to an external source at all...)

      df = pd.read_csv('my_secret_file.csv')  # ideally with lots of parsing options
    

    Most data is proprietary, we get that: make up similar data and see if you can reproduce the problem (with something small).

  • don't explain the situation vaguely in words, like you have a DataFrame which is "large", mention some of the column names in passing (be sure not to mention their dtypes). Try and go into lots of detail about something which is completely meaningless without seeing the actual context. Presumably no one is even going to read to the end of this paragraph.

    Essays are bad, it's easier with small examples.

  • don't include 10+ (100+??) lines of data munging before getting to your actual question.

    Please, we see enough of this in our day jobs. We want to help, but not like this.... Cut the intro, and just show the relevant DataFrames (or small versions of them) in the step which is causing you trouble.
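
Instead of the secret CSV, something like the following sketch is usually enough. The column names, value ranges and seed are all made up; the idea is only that a seeded, generated frame gives everyone the exact same data, and printing the dtypes answers the question people would otherwise have to ask.

```python
import numpy as np
import pandas as pd

# Fixed seed so that everyone who runs this gets the exact same frame.
np.random.seed(0)

n = 5  # small on purpose
fake = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=n, freq="D"),
    "user": np.random.choice(["Bob", "Jane", "Alice"], size=n),
    "income": np.random.randint(40000, 50001, size=n),  # upper bound is exclusive
})

# Show the dtypes alongside the data rather than describing them in words.
print(fake.dtypes)
print(fake)
```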

Anyway, have fun learning Python, NumPy and Pandas!
