How do I plot a kernel density estimate of dates in Pandas?

Problem description

I have a pandas DataFrame in which each observation has a date, stored in a column of datetime64 values. The dates are spread over a period of about 5 years. I would like to plot a kernel density estimate of the dates of all the observations, with the years labelled on the x-axis.

I have figured out how to create a time delta relative to some reference date and then plot the density of the number of hours/days/years between each observation and that reference date:

df['relativeDate'].astype('timedelta64[D]').plot(kind='kde')

But this isn't exactly what I want. If I convert to year deltas, the x-axis is right but I lose the within-year variation. If I use a smaller unit of time such as hours or days, the x-axis labels are much harder to interpret.

What's the simplest way to make this work in Pandas?


Solution

Inspired by @JohnE's answer, an alternative way to convert the dates to numeric values is to use .toordinal(), plot the density on those numbers, and then relabel the ticks as dates.

import datetime  # needed for datetime.datetime.fromordinal() when relabelling the ticks

import pandas as pd
import numpy as np

# simulate some artificial data
# ===============================
np.random.seed(0)
dates = pd.date_range('2010-01-01', periods=31, freq='D')
df = pd.DataFrame(np.random.choice(dates,100), columns=['dates'])
# use toordinal() to get datenum
df['ordinal'] = [x.toordinal() for x in df.dates]

print(df)

        dates  ordinal
0  2010-01-13   733785
1  2010-01-16   733788
2  2010-01-22   733794
3  2010-01-01   733773
4  2010-01-04   733776
5  2010-01-28   733800
6  2010-01-04   733776
7  2010-01-08   733780
8  2010-01-10   733782
9  2010-01-20   733792
..        ...      ...
90 2010-01-19   733791
91 2010-01-28   733800
92 2010-01-01   733773
93 2010-01-15   733787
94 2010-01-04   733776
95 2010-01-22   733794
96 2010-01-13   733785
97 2010-01-26   733798
98 2010-01-11   733783
99 2010-01-21   733793

[100 rows x 2 columns]    

# plot non-parametric kde on numeric datenum
ax = df['ordinal'].plot(kind='kde')
# rename the xticks with labels
x_ticks = ax.get_xticks()
ax.set_xticks(x_ticks[::2])
xlabels = [datetime.datetime.fromordinal(int(x)).strftime('%Y-%m-%d') for x in x_ticks[::2]]
ax.set_xticklabels(xlabels)
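
The example above covers a single month, so daily tick labels work well. For data spanning several years, as in the question, one option is to put a tick on 1 January of each year instead. The helper below is only a sketch of that idea: the name year_ticks is hypothetical (it is not part of pandas or matplotlib), and it assumes the values being plotted are proleptic Gregorian ordinals, as produced by .toordinal().

```python
import datetime


def year_ticks(ordinals):
    """Return (positions, labels) with one tick on 1 January of each year.

    `ordinals` is any iterable of date ordinals (see date.toordinal()).
    The ticks run from the year of the earliest date to the year after
    the latest date, so the whole data range is bracketed.
    """
    ordinals = list(ordinals)
    first_year = datetime.date.fromordinal(int(min(ordinals))).year
    last_year = datetime.date.fromordinal(int(max(ordinals))).year + 1
    years = range(first_year, last_year + 1)
    positions = [datetime.date(year, 1, 1).toordinal() for year in years]
    labels = [str(year) for year in years]
    return positions, labels


# Illustrative input: two dates roughly four years apart
sample = [
    datetime.date(2010, 6, 15).toordinal(),
    datetime.date(2014, 2, 1).toordinal(),
]
positions, labels = year_ticks(sample)
print(labels)
```

With the ax object from the answer, the result could then be applied as ax.set_xticks(positions) followed by ax.set_xticklabels(labels). Because the density is still estimated on daily ordinals, the within-year variation is preserved and only the labels change.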
