Read a CSV from Google Cloud Storage into a pandas DataFrame

2022-01-25 00:00:00 python pandas google-cloud-platform csv google-cloud-storage

Problem description

I am trying to read a CSV file stored in a Google Cloud Storage bucket into a pandas DataFrame.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from io import BytesIO
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('createbucket123')
blob = bucket.blob('my.csv')
path = "gs://createbucket123/my.csv"
df = pd.read_csv(path)

It shows this error message:

FileNotFoundError: File b'gs://createbucket123/my.csv' does not exist

What am I doing wrong? I have not been able to find any solution that does not involve Google Datalab.

Solution

UPDATE

As of version 0.24 of pandas, read_csv supports reading directly from Google Cloud Storage. Simply provide a link to the file in the bucket like this:

df = pd.read_csv('gs://bucket/your_path.csv')

read_csv will then use the gcsfs module to read the DataFrame, which means gcsfs has to be installed (otherwise you will get an exception pointing at the missing dependency).
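A minimal sketch of guarding against the missing dependency, assuming you would rather see a clear message before pandas is called. The helper names has_module and read_gcs_csv are hypothetical and not part of pandas or gcsfs:

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


def read_gcs_csv(url: str):
    """Read a gs:// CSV with pandas, failing early if gcsfs is not installed."""
    if not has_module("gcsfs"):
        raise ImportError(
            "gcsfs is required to read gs:// URLs with pandas; "
            "try `pip install gcsfs`"
        )
    import pandas as pd  # imported lazily so the dependency check runs first
    return pd.read_csv(url)
```

With gcsfs installed and credentials available, read_gcs_csv('gs://bucket/your_path.csv') behaves like the plain read_csv call above.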

I leave three other options for the sake of completeness.

Home-made code
gcsfs
dask

I will cover them below.

I have written some convenience functions to read from Google Storage. To make them more readable, I added type annotations. If you happen to be on Python 2, simply remove the annotations and the code will work all the same.

They work equally well on public and private data sets, assuming you are authorised. With this approach you don't need to download the data to your local drive first.

Usage:

fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path')
df = pd.read_csv(fileobj)
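If you start from a gs:// URL such as the one in the question, a small helper can split it into the bucket and path arguments used above. This is only a sketch; split_gs_url is a hypothetical name, not part of any Google library:

```python
from typing import Tuple


def split_gs_url(url: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/file.csv' into ('bucket', 'path/to/file.csv')."""
    prefix = "gs://"
    if not url.startswith(prefix):
        raise ValueError("expected a URL starting with gs://, got: " + url)
    # Everything up to the first slash is the bucket, the rest is the object path.
    bucket, _, path = url[len(prefix):].partition("/")
    if not bucket or not path:
        raise ValueError("URL must contain both a bucket and a path: " + url)
    return bucket, path
```

For example, bucket, path = split_gs_url('gs://createbucket123/my.csv') yields values that can be passed on as get_byte_fileobj('my-project', bucket, path).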

Code:

from io import BytesIO
from google.cloud import storage
from google.oauth2 import service_account


def get_byte_fileobj(project: str,
                     bucket: str,
                     path: str,
                     service_account_credentials_path: str = None) -> BytesIO:
    """
    Retrieve data from a given blob on Google Storage and pass it as a file object.
    :param project: name of the project
    :param bucket: name of the bucket
    :param path: path within the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: file object (BytesIO)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)  # rewind so the caller reads from the start
    return byte_stream


def get_bytestring(project: str,
                   bucket: str,
                   path: str,
                   service_account_credentials_path: str = None) -> bytes:
    """
    Retrieve data from a given blob on Google Storage and pass it as a byte-string.
    :param project: name of the project
    :param bucket: name of the bucket
    :param path: path within the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: byte-string (needs to be decoded)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    # Newer google-cloud-storage releases prefer blob.download_as_bytes()
    s = blob.download_as_string()
    return s


def _get_blob(bucket_name, path, project, service_account_credentials_path):
    # Fall back to default credentials when no service-account file is given
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob
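The seek(0) call and the "needs to be decoded" note in the code above can be illustrated without touching Google Storage. The sketch below uses a made-up in-memory byte-string as a stand-in for downloaded blob contents, and the standard csv module in place of pandas:

```python
import csv
from io import BytesIO, StringIO

# Stand-in for the bytes a blob download would produce (made-up data).
raw = b"name,score\nalice,1\nbob,2\n"

# Writing into a BytesIO leaves the cursor at the end of the stream,
# so an immediate read finds nothing.
stream = BytesIO()
stream.write(raw)
unread = stream.read()

# Rewinding first makes the full content readable again.
stream.seek(0)
rewound = stream.read()

# A byte-string has to be decoded before it can be parsed as text.
rows = list(csv.reader(StringIO(raw.decode("utf-8"))))
```

The same idea carries over to pandas: pd.read_csv accepts a rewound BytesIO directly, or a StringIO built from the decoded byte-string.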

gcsfs

gcsfs is a "Pythonic file-system for Google Cloud Storage".

Usage:

import pandas as pd
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)

dask

Dask "provides advanced parallelism for analytics, enabling performance at scale for the tools you love". It's great when you need to deal with large volumes of data in Python. Dask tries to mimic much of the pandas API, making it easy for newcomers to use.

See the Dask documentation for more on its read_csv.

Usage:

import dask.dataframe as dd

df = dd.read_csv('gs://bucket/data.csv')
df2 = dd.read_csv('gs://bucket/path/*.csv')  # nice!

# df is now a Dask dataframe, ready for distributed processing
# If you want to have the pandas version, simply:
df_pd = df.compute()
