How to deal with multi-level column names downloaded with yfinance

2022-01-31 00:00:00 python python-3.x pandas dataframe yfinance

Question

I have a list of tickers (tickerStrings) that I download all at once. When I read the saved file back with pandas' read_csv, the result does not have the same structure as the data I get directly from yfinance.

I usually access my data by ticker like this: data['AAPL'] or data['AAPL'].Close, but when I read the data from the csv file it does not let me do that.

if path.exists(data_file):
    data = pd.read_csv(data_file, low_memory=False)
    data = pd.DataFrame(data)
    print(data.head())
else:
    data = yf.download(tickerStrings, group_by="Ticker", period=prd, interval=intv)
    data.to_csv(data_file)

Here's the print output:

                  Unnamed: 0                 OLN               OLN.1               OLN.2               OLN.3  ...                 W.1                 W.2                 W.3                 W.4     W.5
0                        NaN                Open                High                 Low               Close  ...                High                 Low               Close           Adj Close  Volume
1                   Datetime                 NaN                 NaN                 NaN                 NaN  ...                 NaN                 NaN                 NaN                 NaN     NaN
2  2020-06-25 09:30:00-04:00    11.1899995803833  11.220000267028809  11.010000228881836  11.079999923706055  ...   201.2899932861328   197.3000030517578  197.36000061035156  197.36000061035156  112156
3  2020-06-25 09:45:00-04:00  11.130000114440918  11.260000228881836  11.100000381469727   11.15999984741211  ...  200.48570251464844  196.47999572753906  199.74000549316406  199.74000549316406   83943
4  2020-06-25 10:00:00-04:00  11.170000076293945  11.220000267028809  11.119999885559082  11.170000076293945  ...  200.49000549316406  198.19000244140625   200.4149932861328   200.4149932861328   88771

The error I get when trying to access the data:

Traceback (most recent call last):
  File "getdata.py", line 49, in processData
    avg = data[x].Close.mean()
AttributeError: 'Series' object has no attribute 'Close'

Answer

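Download all tickers into a single dataframe with single-level column headers
Option 1
When downloading data for a single ticker, the returned dataframe has single-level column names, but no ticker column.
The following downloads the data for each ticker, adds a ticker column, and builds a single dataframe from all of the desired tickers.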

import yfinance as yf
import pandas as pd

tickerStrings = ['AAPL', 'MSFT']
df_list = list()
for ticker in tickerStrings:
    data = yf.download(ticker, group_by="Ticker", period='2d')
    data['ticker'] = ticker  # add this column because the dataframe doesn't contain a column with the ticker
    df_list.append(data)

# combine all dataframes into a single dataframe
df = pd.concat(df_list)

# save to csv
df.to_csv('ticker.csv')

Option 2

Download all the tickers at once and stack the ticker level of the columns into rows

group_by='Ticker' puts the ticker at level=0 of the column names

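import yfinance as yf

tickerStrings = ['AAPL', 'MSFT']
df = yf.download(tickerStrings, group_by='Ticker', period='2d')

# move the ticker level of the columns into the index, then turn it into a 'Ticker' column
df = df.stack(level=0).rename_axis(['Date', 'Ticker']).reset_index(level=1)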
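Once the data is in this long, single-level shape, the per-ticker access from the question (data['AAPL'].Close) becomes a filter or a groupby on the ticker column. The following is a minimal sketch of that idea, assuming a small hand-made frame in place of a real download: the prices are invented for illustration, and the column is named Ticker as in Option 2 (Option 1 names it ticker).

```python
import pandas as pd

# Hypothetical sample standing in for the long frame produced above;
# the values are invented, not real prices.
dates = pd.to_datetime(
    ["2020-01-02", "2020-01-02", "2020-01-03", "2020-01-03"]
).rename("Date")
df = pd.DataFrame(
    {
        "Ticker": ["AAPL", "MSFT", "AAPL", "MSFT"],
        "Close": [10.0, 20.0, 12.0, 24.0],
    },
    index=dates,
)

# One ticker at a time: filter the rows on the Ticker column.
aapl_mean = df.loc[df["Ticker"] == "AAPL", "Close"].mean()

# All tickers at once: group the rows by the Ticker column.
mean_by_ticker = df.groupby("Ticker")["Close"].mean()

print(aapl_mean)
print(mean_by_ticker)
```

The groupby form avoids looping over the ticker list, which is usually the more convenient choice when the same statistic is needed for every ticker.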

Read yfinance csv already stored with multi-level column names

If you wish to keep, and read in, a file with a multi-level column index, use the following code, which will return the dataframe to its original form.

import pandas as pd

df = pd.read_csv('test.csv', header=[0, 1])
df.drop([0], axis=0, inplace=True)  # drop this row because it only has one column with Date in it

date_col = ('Unnamed: 0_level_0', 'Unnamed: 0_level_1')  # name pandas gives the unlabeled first column
df[date_col] = pd.to_datetime(df[date_col], format='%Y-%m-%d')  # convert the first column to a datetime
df.set_index(date_col, inplace=True)  # set the first column as the index
df.index.name = None  # remove the index name
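As an alternative to the manual cleanup, read_csv can be told up front which rows are headers and which column is the index. The sketch below is not part of the original answer: it round-trips a small hand-made frame through an in-memory buffer (standing in for test.csv) using header=[0, 1] together with index_col=0 and parse_dates=True. The sample values are invented.

```python
import io

import pandas as pd

# Hypothetical two-ticker frame shaped like a group_by='Ticker' download.
columns = pd.MultiIndex.from_product([["AAPL", "MSFT"], ["Open", "Close"]])
index = pd.to_datetime(["2020-01-02", "2020-01-03"]).rename("Date")
original = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5]],
    index=index,
    columns=columns,
)

# Write the frame to an in-memory csv instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)

# Two header rows become the column MultiIndex; the first column becomes
# the (parsed) date index.
restored = pd.read_csv(buffer, header=[0, 1], index_col=0, parse_dates=True)

# The access pattern from the question works again.
print(restored["AAPL"].Close.mean())
```

With index_col=0, pandas treats the "Date,,,," line as the index name rather than as a data row, so no row has to be dropped afterwards.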

The issue is that tickerStrings is a list of tickers, which results in a final dataframe with multi-level column names:

                AAPL                                                      MSFT
                Open      High       Low     Close  Adj Close     Volume  Open  High  Low  Close  Adj Close  Volume
Date
1980-12-12  0.513393  0.515625  0.513393  0.513393   0.405683  117258400   NaN   NaN  NaN    NaN        NaN     NaN
1980-12-15  0.488839  0.488839  0.486607  0.486607   0.384517   43971200   NaN   NaN  NaN    NaN        NaN     NaN
1980-12-16  0.453125  0.453125  0.450893  0.450893   0.356296   26432000   NaN   NaN  NaN    NaN        NaN     NaN
1980-12-17  0.462054  0.464286  0.462054  0.462054   0.365115   21610400   NaN   NaN  NaN    NaN        NaN     NaN
1980-12-18  0.475446  0.477679  0.475446  0.475446   0.375698   18362400   NaN   NaN  NaN    NaN        NaN     NaN

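When this is saved to a csv, it looks like the following example, and reading it back with the default settings produces a dataframe like the one you're having trouble with.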

,AAPL,AAPL,AAPL,AAPL,AAPL,AAPL,MSFT,MSFT,MSFT,MSFT,MSFT,MSFT
,Open,High,Low,Close,Adj Close,Volume,Open,High,Low,Close,Adj Close,Volume
Date,,,,,,,,,,,,
1980-12-12,0.5133928656578064,0.515625,0.5133928656578064,0.5133928656578064,0.40568336844444275,117258400,,,,,,
1980-12-15,0.4888392984867096,0.4888392984867096,0.4866071343421936,0.4866071343421936,0.3845173120498657,43971200,,,,,,
1980-12-16,0.453125,0.453125,0.4508928656578064,0.4508928656578064,0.3562958240509033,26432000,,,,,,
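To make the failure concrete, the sketch below reads a cut-down csv of the same layout with the default read_csv settings. The csv text is a hypothetical miniature with invented values, held in memory rather than on disk. Only the first line is treated as a header, so the duplicate ticker names get numeric suffixes and the second header row ends up as data.

```python
import io

import pandas as pd

# Hypothetical miniature of the csv above: two header rows, then the
# index-name row, then the data rows.
csv_text = (
    ",AAPL,AAPL,MSFT,MSFT\n"
    ",Open,Close,Open,Close\n"
    "Date,,,,\n"
    "2020-01-02,1.0,2.0,3.0,4.0\n"
    "2020-01-03,1.5,2.5,3.5,4.5\n"
)

# Default settings: only the first line is used as the header.
flat = pd.read_csv(io.StringIO(csv_text))

print(flat.columns.tolist())
print(type(flat["AAPL"]).__name__)   # a single column, not a sub-frame
print(hasattr(flat["AAPL"], "Close"))
```

Because flat["AAPL"] is a single column rather than a frame of Open/Close columns, the attribute lookup .Close fails, which is the AttributeError shown in the question.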

Flatten multi-level columns into a single level and add a ticker column

If the ticker symbol is level=0 (top) of the column names

When group_by='Ticker' is used

df.stack(level=0).rename_axis(['Date', 'Ticker']).reset_index(level=1)

If the ticker symbol is level=1 (bottom) of the column names

df.stack(level=1).rename_axis(['Date', 'Ticker']).reset_index(level=1)
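Both one-liners can be tried without a download. The sketch below builds a small hypothetical frame with the ticker at level 0 (values invented), flattens it, then swaps the column levels to imitate the ticker-at-level-1 layout and flattens that as well. Recent pandas versions may emit a FutureWarning about the stack implementation, and the order of the price columns can differ between versions, so the checks here do not rely on column order.

```python
import pandas as pd

# Hypothetical frame with the ticker as the top column level.
columns = pd.MultiIndex.from_product([["AAPL", "MSFT"], ["Open", "Close"]])
index = pd.to_datetime(["2020-01-02", "2020-01-03"])
wide = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5]],
    index=index,
    columns=columns,
)

# Ticker at level=0: stack that level into the index, then expose it
# as a regular 'Ticker' column.
tidy = wide.stack(level=0).rename_axis(["Date", "Ticker"]).reset_index(level=1)

# Ticker at level=1: same idea, stacking the bottom level instead.
swapped = wide.swaplevel(axis=1)
tidy_from_bottom = (
    swapped.stack(level=1).rename_axis(["Date", "Ticker"]).reset_index(level=1)
)

print(tidy)
print(tidy_from_bottom)
```

Either way the result has one row per date and ticker, with Ticker, Open and Close as ordinary columns.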

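Download each ticker and save it to a separate file
I would also recommend downloading and saving each ticker individually, which would look something like the following: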

import yfinance as yf
import pandas as pd

tickerStrings = ['AAPL', 'MSFT']
for ticker in tickerStrings:
    # prd and intv are the period and interval variables from the question's code
    data = yf.download(ticker, group_by="Ticker", period=prd, interval=intv)
    data['ticker'] = ticker  # add this column because the dataframe doesn't contain a column with the ticker
    data.to_csv(f'ticker_{ticker}.csv')  # ticker_AAPL.csv for example

data will look like

                Open      High       Low     Close  Adj Close      Volume  ticker
Date
1986-03-13  0.088542  0.101562  0.088542  0.097222   0.062205  1031788800    MSFT
1986-03-14  0.097222  0.102431  0.097222  0.100694   0.064427   308160000    MSFT
1986-03-17  0.100694  0.103299  0.100694  0.102431   0.065537   133171200    MSFT
1986-03-18  0.102431  0.103299  0.098958  0.099826   0.063871    67766400    MSFT
1986-03-19  0.099826  0.100694  0.097222  0.098090   0.062760    47894400    MSFT

The resulting csv will look like

Date,Open,High,Low,Close,Adj Close,Volume,ticker
1986-03-13,0.0885416641831398,0.1015625,0.0885416641831398,0.0972222238779068,0.0622050017118454,1031788800,MSFT
1986-03-14,0.0972222238779068,0.1024305522441864,0.0972222238779068,0.1006944477558136,0.06442664563655853,308160000,MSFT
1986-03-17,0.1006944477558136,0.1032986119389534,0.1006944477558136,0.1024305522441864,0.0655374601483345,133171200,MSFT
1986-03-18,0.1024305522441864,0.1032986119389534,0.0989583358168602,0.0998263880610466,0.06387123465538025,67766400,MSFT
1986-03-19,0.0998263880610466,0.1006944477558136,0.0972222238779068,0.0980902761220932,0.06276042759418488,47894400,MSFT

Read in multiple files saved in the previous section and create a single dataframe

import pandas as pd
from pathlib import Path

# set the path to the files
p = Path('c:/path_to_files')

# find the files; this is a generator, not a list
files = p.glob('ticker_*.csv')

# read the files into a dataframe
df = pd.concat([pd.read_csv(file) for file in files])
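The snippet above reads Date as plain strings. A variation that may be more convenient is sketched below, with assumptions stated up front: it writes two tiny hypothetical ticker files (invented values) into a temporary directory so it can run anywhere, then reads them back with parse_dates so Date becomes a datetime column, and uses ignore_index so the combined frame gets a fresh row index.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Write two hypothetical per-ticker files in the ticker_<symbol>.csv layout.
    samples = [("AAPL", [10.0, 12.0]), ("MSFT", [20.0, 24.0])]
    for ticker, closes in samples:
        frame = pd.DataFrame(
            {"Date": ["2020-01-02", "2020-01-03"], "Close": closes, "ticker": ticker}
        )
        frame.to_csv(folder / f"ticker_{ticker}.csv", index=False)

    # Sort the glob results so the row order is predictable.
    files = sorted(folder.glob("ticker_*.csv"))

    # Parse Date while reading and renumber the rows after concatenating.
    combined = pd.concat(
        [pd.read_csv(file, parse_dates=["Date"]) for file in files],
        ignore_index=True,
    )

print(combined)
print(combined.groupby("ticker")["Close"].mean())
```

Keeping the ticker column in every file is what makes the final groupby possible after the files are concatenated.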
