Converting dates to floats for linear regression on a Pandas data frame

2022-01-11 00:00:00 python pandas time-series

Question

It seems that for OLS linear regression to work well in Pandas, the arguments must be floats. I'm starting with a csv (called "gameAct.csv") of the form:

date, city, players, sales
2014-04-28,London,111,1091.28
2014-04-29,London,100,1100.44
2014-04-28,Paris,87,1001.33
...

I want to perform a linear regression of how sales depend on date (as time moves forward, how do sales move?). The problem with my code below seems to be that the dates are not float values. I would appreciate help on how to resolve this indexing problem in Pandas.

My current code (it runs, but does not work correctly):

import pandas as pd
from pandas import DataFrame, Series
import statsmodels.formula.api as sm

df = pd.read_csv('gameAct.csv')
df.columns = ['date', 'city', 'players', 'sales']
city_data = df[df['city'] == 'London']
result = sm.ols(formula = 'sales ~ date', data = city_data).fit()

As I vary the city value, I get R^2 = 1 results, which is wrong. I have also tried index_col = 0, parse_dates = True when defining the dataframe df, but without success.

I suspect there is a better way to read in such csv files to perform basic regression over dates, and also for more general time series analysis. Help, examples, and resources are appreciated!

Note: with the above code, if I convert the dates index (for a given city) to an array, the values in this array are of the form:

'\xef\xbb\xbf2014-04-28'

How does one produce an AIC analysis over all of the non-sales parameters? (For example, the result might be that sales depend most linearly on date and city.)


Solution

For this kind of regression, I usually convert the dates or timestamps to a number of days since the start of the data.

This does the trick nicely:

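import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

df = pd.read_csv('test.csv')
df['date'] = pd.to_datetime(df['date'])
# Days elapsed since the earliest date in the data, as a float
df['date_delta'] = (df['date'] - df['date'].min()) / np.timedelta64(1, 'D')
city_data = df[df['city'] == 'London']
result = sm.ols(formula='sales ~ date_delta', data=city_data).fit()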

The advantage of this method is that you're sure of the units involved in the regression (days), whereas an automatic conversion may implicitly use other units, creating confusing coefficients in your linear model. It also allows you to combine data from multiple sales campaigns that started at different times into one regression (say you're interested in the effectiveness of a campaign as a function of days into the campaign). You could also pick Jan 1st as your 0 if you're interested in measuring the day-of-year trend. Picking your own 0 date puts you in control of all that.
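As a rough illustration of those three choices of 0 date, here is a minimal sketch on made-up data. The campaign column, the derived column names, and the sales figures are all hypothetical, and np.polyfit stands in for the statsmodels fit only to keep the example free of extra dependencies.

```python
import numpy as np
import pandas as pd

# Made-up data: two hypothetical campaigns that start on different dates.
df = pd.DataFrame({
    "campaign": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime([
        "2014-04-28", "2014-04-29", "2014-04-30",
        "2014-06-01", "2014-06-02", "2014-06-03",
    ]),
    "sales": [1000.0, 1002.5, 1005.0, 1000.0, 1002.5, 1005.0],
})

one_day = pd.Timedelta(days=1)

# Zero = earliest date in the whole data set (same idea as date_delta above).
df["days_since_start"] = (df["date"] - df["date"].min()) / one_day

# Zero = January 1st of each row's year, for a day-of-year trend.
jan_first = pd.to_datetime(df["date"].dt.year.astype(str) + "-01-01")
df["days_since_jan1"] = (df["date"] - jan_first) / one_day

# Zero = each campaign's own first day, so the campaigns can be pooled.
campaign_start = df.groupby("campaign")["date"].transform("min")
df["days_into_campaign"] = (df["date"] - campaign_start) / one_day

# Least-squares line through the pooled data; the slope is in sales per day.
slope, intercept = np.polyfit(df["days_into_campaign"], df["sales"], deg=1)
```

Because every derived column is an explicit number of days, the fitted slope reads directly as a change in sales per day, whichever zero you choose.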

There's also evidence that statsmodels supports time series from pandas. You may be able to apply this to linear models as well: http://statsmodels.sourceforge.net/stable/examples/generated/ex_dates.html

Also, a quick note: you should be able to read column names directly out of the csv automatically, as in the sample code I posted. In your example there are spaces after the commas in the first line of the csv file, which results in column names with a leading space, like ' city'. Remove the spaces and automatic csv header reading should just work.
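If editing the file is not an option, the header can probably be cleaned up at read time instead. The sketch below assumes the file is UTF-8 with a byte-order mark (which would explain the '\xef\xbb\xbf' prefix seen in the question) and uses an in-memory stand-in for gameAct.csv. skipinitialspace, encoding and parse_dates are standard pandas.read_csv parameters.

```python
import io

import pandas as pd

# In-memory stand-in for gameAct.csv: a UTF-8 byte-order mark at the start
# and a space after each comma in the header line, as in the question.
raw = (
    "\ufeffdate, city, players, sales\n"
    "2014-04-28,London,111,1091.28\n"
    "2014-04-29,London,100,1100.44\n"
    "2014-04-28,Paris,87,1001.33\n"
).encode("utf-8")

df = pd.read_csv(
    io.BytesIO(raw),
    encoding="utf-8-sig",    # drops the byte-order mark if one is present
    skipinitialspace=True,   # ignores the space after each comma
    parse_dates=["date"],    # parses the date column as datetimes
)

london = df[df["city"] == "London"]
```

With a real file, the io.BytesIO(raw) argument would be replaced by the file path, and the manual df.columns assignment from the question should no longer be needed.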
