Google Dataflow cannot read and write in different locations (Python SDK v0.5.5)
Problem description
I'm writing a very basic Dataflow pipeline using the Python SDK v0.5.5. The pipeline uses a BigQuerySource with a query passed in, which queries BigQuery tables from datasets that reside in the EU.
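For concreteness, the failing setup looks roughly like the sketch below. This is not the exact pipeline, just a minimal illustration assuming the apache_beam namespace of the 0.5.x SDK; the project, dataset, and table names are placeholders.

```python
import apache_beam as beam

# Minimal sketch of a query-based read on the local runner.
# 'my-project' and 'eu_dataset.my_table' are hypothetical names.
p = beam.Pipeline('DirectRunner')

rows = p | beam.io.Read(beam.io.BigQuerySource(
    # The query reads from a dataset located in the EU.
    query='SELECT field1 FROM [my-project:eu_dataset.my_table]'))

p.run()
```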
When executing the pipeline I get the following error (project name anonymized):
```
HttpError: HttpError accessing <https://www.googleapis.com/bigquery/v2/projects/XXXXX/queries/93bbbecbc470470cb1bbb9c22bd83e9d?alt=json&maxResults=10000>: response: <{'status': '400', 'content-length': '292', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'expires': 'Thu, 09 Feb 2017 10:28:04 GMT', 'vary': 'Origin, X-Origin', 'server': 'GSE', '-content-encoding': 'gzip', 'cache-control': 'private, max-age=0', 'date': 'Thu, 09 Feb 2017 10:28:04 GMT', 'x-frame-options': 'SAMEORIGIN', 'alt-svc': 'quic=":443"; ma=2592000; v="35,34"', 'content-type': 'application/json; charset=UTF-8'}>, content <{
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "invalid",
        "message": "Cannot read and write in different locations: source: EU, destination: US"
      }
    ],
    "code": 400,
    "message": "Cannot read and write in different locations: source: EU, destination: US"
  }
}
```
The error also occurs when specifying a project, dataset, and table name instead of a query. However, there is no error when selecting data from the publicly available datasets (which reside in the US, like shakespeare). I also have jobs running v0.4.4 of the SDK that don't hit this error.
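The table-based variant that fails the same way would look like this (again just a sketch; the 'project:dataset.table' reference is a placeholder):

```python
import apache_beam as beam

# Same pipeline, but reading a whole table instead of a query result.
# 'my-project:eu_dataset.my_table' stands in for an EU-located table.
p = beam.Pipeline('DirectRunner')

rows = p | beam.io.Read(
    beam.io.BigQuerySource(table='my-project:eu_dataset.my_table'))

p.run()
```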
The difference between these versions is the creation of a temp dataset, as shown by the warning at pipeline startup:
```
WARNING:root:Dataset does not exist so we will create it
```
I've briefly looked at the different versions of the SDK, and the difference seems to be around this temp dataset: the current version creates a temp dataset by default with a location in the US (taken from master):
- dataset creation
- default dataset location
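As an aside, one way to confirm where a given dataset lives is to inspect its location metadata. The sketch below uses the standalone google-cloud-bigquery client (a separate library from the Dataflow SDK discussed here), and the project and dataset names are placeholders:

```python
from google.cloud import bigquery

# Sketch: check a dataset's location with the google-cloud-bigquery client.
# 'my-project' and 'eu_dataset' are hypothetical names.
client = bigquery.Client(project='my-project')
dataset = client.get_dataset('my-project.eu_dataset')
print(dataset.location)  # e.g. 'EU' or 'US'
```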
I haven't found a way to disable the creation of these temp datasets. Am I overlooking something, or does this indeed no longer work when selecting data from EU datasets?
Solution
Thanks for reporting this issue. I assume you are using DirectRunner. We changed the implementation of the BigQuery read transform for DirectRunner to create a temporary dataset (for SDK versions 0.5.1 and later) in order to support large datasets. It seems we are not setting the region correctly here; we'll look into fixing this.
This issue should not occur if you use DataflowRunner, which creates temporary datasets in the correct region.
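Until the fix lands, switching runners works as a workaround. Below is a sketch of the invocation, assuming the standard Dataflow pipeline flags of this SDK era; the project, bucket, and job name are placeholders:

```python
import apache_beam as beam

# Workaround sketch: run on the Dataflow service instead of locally.
# Project, bucket, and job name below are hypothetical.
argv = [
    '--project=my-project',
    '--job_name=eu-bigquery-read',
    '--staging_location=gs://my-bucket/staging',
    '--temp_location=gs://my-bucket/temp',
    '--runner=DataflowRunner',
]

p = beam.Pipeline(argv=argv)

rows = p | beam.io.Read(beam.io.BigQuerySource(
    query='SELECT field1 FROM [my-project:eu_dataset.my_table]'))

p.run()
```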