打通谷歌Cloud Storage至bigquery的数据流

2022-04-02 00:00:00 创建 专区 服务 密钥 验证

背景

由于谷歌的bigquery不支持直接上传大于1G的数据,因此需要通过先上传到GCS(google cloud storage),再从GCS将数据load进bigquery。

年初谷歌的验证规则作了更新,基于googleAuthR的gar_auth()函数已经失效,再也不能通过点一下进行验证了,会报如下错误:

但是hadley大神开发的bigrquery包,却能够通过验证,

可能tidyverse API用的是Rstudio的官方邮箱,官方认证,所以能够用来验证。

但是,用bigrquery包有两个局限性:

  1. 不支持直接上传大数据集;
  2. 该验证只能使用bigquery服务。

下面介绍下GCS、bigquery通用的验证方法,

依赖

  1. bigquery、GCS服务已开通;
  2. BigQuery API、Google Cloud Storage JSON API已启用:
  1. 服务账号已创建,且为owner权限:

失败的尝试

  1. 使用gar_set_client()无法通过验证;
  1. 创建客户端id,并使用该凭据,无法通过验证。

成功的尝试

  1. 点击服务账号;
  1. 创建新密钥;
  1. 创建json私钥;

创建完后会自动下载新密钥的json文件。

  1. 使用私钥json文件完成GCS、bigquery验证。
  1. 官方例子测试*:

可自行测试文件 -> GCS -> bigquery的数据流,

## Not run: 

 library(googleCloudStorageR)
 library(bigQueryR)
 
 gcs_global_bucket("your-project")
 
 ## custom upload function to ignore quotes and column headers
 f <- function(input, output) {
   write.table(input, sep = ",", col.names = FALSE, row.names = FALSE, 
               quote = FALSE, file = output, qmethod = "double")}
   
 ## upload files to Google Cloud Storage
 gcs_upload(mtcars, name = "mtcars_test1.csv", object_function = f)
 gcs_upload(mtcars, name = "mtcars_test2.csv", object_function = f)
 
 ## create the schema of the files you just uploaded
 user_schema <- schema_fields(mtcars)
 
 ## load files from Google Cloud Storage into BigQuery
 bqr_upload_data(projectId = "your-project", 
                datasetId = "test", 
                tableId = "from_gcs_mtcars", 
                upload_data = c("gs://your-project/mtcars_test1.csv", 
                                "gs://your-project/mtcars_test2.csv"),
                schema = user_schema)
 
 ## for big files, its helpful to create your schema on a small sample
 ## a quick way to do this on the command line is:
 # "head bigfile.csv > head_bigfile.csv"

## upload nested lists as JSON
the_list <- list(list(col1 = "yes", col2 = "no", 
                      col3 = list(nest1 = 1, nest2 = 3), col4 = "oh"),
                 list(col1 = "yes2", 
                      col2 = "n2o", col3 = list(nest1 = 5, nest2 = 7), 
                      col4 = "oh2"), 
                 list(col1 = "yes3", col2 = "no3", 
                      col3 = list(nest1 = 7, nest2 = 55), col4 = "oh3"))
   
bqr_upload_data(datasetId = "test", 
                tableId = "nested_list_json", 
                upload_data = the_list, 
                autodetect = TRUE)


## End(Not run)


来源 https://zhuanlan.zhihu.com/p/157385761

相关文章