Read a csv from Google Cloud Storage into a pandas DataFrame
Problem description
I am trying to read a csv file stored in a Google Cloud Storage bucket into a pandas DataFrame.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from io import BytesIO
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.get_bucket('createbucket123')
blob = bucket.blob('my.csv')
path = "gs://createbucket123/my.csv"
df = pd.read_csv(path)
It shows this error message:
FileNotFoundError: File b'gs://createbucket123/my.csv' does not exist
What am I doing wrong? I cannot find any solution that does not involve Google Datalab.
Solution
UPDATE
As of version 0.24 of pandas, read_csv supports reading directly from Google Cloud Storage. Simply provide a link to the file in the bucket like this:
df = pd.read_csv('gs://bucket/your_path.csv')
read_csv will then use the gcsfs module to read the DataFrame, which means gcsfs has to be installed (otherwise you will get an exception pointing at the missing dependency).
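If you would rather detect the missing dependency up front than wait for the exception, a minimal sketch like the following can help. The helper name is_module_available is hypothetical (it is not part of pandas or gcsfs), and the check uses only the standard library:

```python
import importlib.util


def is_module_available(name: str) -> bool:
    """Return True if a top-level module called `name` can be found.

    Hypothetical helper for illustration; it does not import the module,
    it only asks the import system whether the module can be located.
    """
    return importlib.util.find_spec(name) is not None


# Check for gcsfs before handing a gs:// URL to pd.read_csv.
if not is_module_available("gcsfs"):
    print("gcsfs is not installed; try `pip install gcsfs`")
```

The check only tells you whether gcsfs is present, not whether your credentials are valid, so read_csv can still fail for authorisation reasons.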
I leave three other options for the sake of completeness.
- Home-made code
- gcsfs
- dask
I will cover them below.
Home-made code
I have written some convenience functions to read from Google Storage. To make them more readable I added type annotations. If you happen to be on Python 2, simply remove the annotations and the code will work all the same.
It works equally well on public and private data sets, assuming you are authorised. With this approach you do not need to download the data to your local drive first.
Usage:
fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path')
df = pd.read_csv(fileobj)
Code:
from io import BytesIO, StringIO
from typing import Optional

from google.cloud import storage
from google.oauth2 import service_account


def get_byte_fileobj(project: str,
                     bucket: str,
                     path: str,
                     service_account_credentials_path: Optional[str] = None) -> BytesIO:
    """
    Retrieve data from a given blob on Google Storage and pass it as a file object.
    :param project: name of the project
    :param bucket: name of the bucket
    :param path: path within the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: file object (BytesIO)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    # Rewind so that the caller reads from the start of the data.
    byte_stream.seek(0)
    return byte_stream


def get_bytestring(project: str,
                   bucket: str,
                   path: str,
                   service_account_credentials_path: Optional[str] = None) -> bytes:
    """
    Retrieve data from a given blob on Google Storage and pass it as a byte-string.
    :param project: name of the project
    :param bucket: name of the bucket
    :param path: path within the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: byte-string (needs to be decoded)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    # Newer google-cloud-storage releases also offer blob.download_as_bytes().
    s = blob.download_as_string()
    return s


def _get_blob(bucket_name, path, project, service_account_credentials_path):
    # Fall back to the default credentials when no key file is given.
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob
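The functions above take the bucket and the path as separate arguments, whereas the question starts from a single gs:// string. A minimal sketch of a splitter, assuming the usual gs://bucket/path layout, could look like this. The name split_gcs_uri is hypothetical and not part of google-cloud-storage:

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/file.csv' into (bucket, path).

    Hypothetical helper for illustration only.
    """
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    # Everything up to the first '/' is the bucket, the rest is the blob path.
    bucket, _, path = uri[len(prefix):].partition("/")
    if not bucket or not path:
        raise ValueError(f"expected gs://bucket/path, got: {uri!r}")
    return bucket, path


bucket_name, blob_path = split_gcs_uri("gs://createbucket123/my.csv")
print(bucket_name, blob_path)
```

The two values can then be passed as the bucket and path arguments of get_byte_fileobj.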
gcsfs
gcsfs is a "Pythonic file-system for Google Cloud Storage".
Usage:
import pandas as pd
import gcsfs
fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)
dask
Dask "provides advanced parallelism for analytics, enabling performance at scale for the tools you love". It's great when you need to deal with large volumes of data in Python. Dask tries to mimic much of the pandas API, making it easy to use for newcomers.
Here is its read_csv.
Usage:
import dask.dataframe as dd
df = dd.read_csv('gs://bucket/data.csv')
df2 = dd.read_csv('gs://bucket/path/*.csv') # nice!
# df is now Dask dataframe, ready for distributed processing
# If you want to have the pandas version, simply:
df_pd = df.compute()