How to open and process a CSV file stored in Google Cloud Storage using Python
Problem Description
I am using the Google Cloud Storage Client Library.
I am trying to open and process a CSV file (that was already uploaded to a bucket) using code like:
import csv
import cloudstorage as gcs  # the App Engine GCS client library

filename = '/<my_bucket>/data.csv'
with gcs.open(filename, 'r') as gcs_file:
    csv_reader = csv.reader(gcs_file, delimiter=',', quotechar='"')
I get the error "argument 1 must be an iterator" in response to the first argument to csv.reader (i.e. the gcs_file). Apparently the gcs_file doesn't support the iterator .next method.
Any ideas on how to proceed? Do I need to wrap the gcs_file and create an iterator on it or is there an easier way?
Solution
I think it's better to have your own wrapper/iterator designed for csv.reader. If gcs_file were to support the iterator protocol, it is not clear what next() should return to always accommodate its consumer.
According to the csv.reader documentation:
Return a reader object which will iterate over lines in the given csvfile. csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called — file objects and list objects are both suitable. If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.
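As a quick illustration of that contract (a minimal sketch with made-up sample data, not from the original answer), a plain list of comma-separated strings is itself a valid csvfile:

import csv

# A list of strings satisfies the iterator protocol the doc describes.
for row in csv.reader(['a,b,"c,d"', 'e,f,g'], delimiter=',', quotechar='"'):
    print(row)  # ['a', 'b', 'c,d'], then ['e', 'f', 'g']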
It expects a chunk of raw bytes from the underlying file, not necessarily a line. You can have a wrapper like this (not tested):
class CsvIterator(object):
    """Wraps a GCS file and yields fixed-size chunks of raw bytes."""

    def __init__(self, gcs_file, chunk_size):
        self.gcs_file = gcs_file
        self.chunk_size = chunk_size

    def __iter__(self):
        return self

    def next(self):  # Python 2 iterator protocol
        result = self.gcs_file.read(size=self.chunk_size)
        if not result:  # empty string means end of file
            raise StopIteration()
        return result
The key is to read a chunk at a time so that when you have a large file, you don't blow up memory or experience timeout from urlfetch.
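Wiring the wrapper into csv.reader might look like this (untested, like the wrapper itself; the bucket path is the question's placeholder and the 4 KB chunk size is an arbitrary choice):

import csv
import cloudstorage as gcs

with gcs.open('/<my_bucket>/data.csv', 'r') as gcs_file:
    # Feed csv.reader raw bytes in 4 KB chunks via the wrapper.
    for row in csv.reader(CsvIterator(gcs_file, 4096)):
        print(row)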
Or even simpler, use the iter() built-in:
csv.reader(iter(gcs_file.readline, ''))
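A fuller sketch of that one-liner in context (assuming the same placeholder bucket path as above); readline returns an empty string at end of file, which iter() treats as the sentinel that stops iteration:

import csv
import cloudstorage as gcs

with gcs.open('/<my_bucket>/data.csv', 'r') as gcs_file:
    # iter(callable, sentinel) calls readline until it returns ''.
    for row in csv.reader(iter(gcs_file.readline, '')):
        print(row)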