从选项(来自 rds - mysql)创建动态框架，提供带有 where 子句的自定义查询

2021-09-24 00:00:00 dataframe aws-glue where-clause mysql amazon-aurora

我想在我的 Glue 作业中从 Aurora-rds mysql 表创建一个 DynamicFrame.我可以使用自定义查询从我的 rds 表创建 DynamicFrame - 有 where 子句吗?我不想每次都在我的 DynamicFrame 中读取整个表格，然后再进行过滤.看了这个网站，但没有在这里或其他地方找到任何选项，https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html

I want to create a DynamicFrame in my Glue job from an Aurora-rds mysql table. Can I create DynamicFrame from my rds table using a custom query - having a where clause? I dont want to read the entire table every time in my DynamicFrame and then filter later. Looked at this website but didnt find any option here or elsewhere, https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html

构造 JDBC 连接选项

connection_mysql5_options = {"url": "jdbc:mysql://:3306/db","dbtable": "测试",用户":管理员"，"密码": "密码"}

connection_mysql5_options = { "url": "jdbc:mysql://:3306/db", "dbtable": "test", "user": "admin", "password": "pwd"}

从 MySQL 5 读取 DynamicFrame

df_mysql5 =glueContext.create_dynamic_frame.from_options(connection_type="mysql",connection_options=connection_mysql5_options)

df_mysql5 = glueContext.create_dynamic_frame.from_options(connection_type="mysql", connection_options=connection_mysql5_options)

有什么方法可以给出一个 where 子句并说只从测试表中选择前 100 行，说它有一个名为id"的列，我想使用这个查询来获取:

Is there any way to give a where clause and say select only top 100 rows from test table, say it has a column named "id" and I want to fetch using this query:

select * from test where id<100;

select * from test where id<100;

感谢任何帮助.谢谢！

推荐答案

抱歉，我本来可以发表评论的，但我没有足够的声誉.我能够在 AWS Glue 中使用 Guillermo AMS 提供的解决方案，但它确实需要两个更改:

Apologies, I would have made a comment but I do not have sufficient reputation. I was able to make the solution that Guillermo AMS provided work within AWS Glue, but it did require two changes:

jdbc"格式无法识别(提供的错误是:py4j.protocol.Py4JJavaError:调用 o79.load 时发生错误.:java.lang.ClassNotFoundException:无法找到数据源:jbdc.请在 http://spark.apache.org/third-party- 找到软件包project.html") -- 我必须使用全名:org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider"
查询选项对我不起作用(提供的错误是:"py4j.protocol.Py4JJavaError:调用 o72.load 时发生错误.: java.sql.SQLSyntaxErrorException: ORA-00911: invalid character")，但幸运的是，"dbtable"选项支持传入表或子查询 - 即在查询周围使用括号.

The "jdbc" format was unrecognized (the provided error was: "py4j.protocol.Py4JJavaError: An error occurred while calling o79.load. : java.lang.ClassNotFoundException: Failed to find data source: jbdc. Please find packages at http://spark.apache.org/third-party-projects.html") -- I had to use the full name: "org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider"

The query option was not working for me (the provided error was: "py4j.protocol.Py4JJavaError: An error occurred while calling o72.load. : java.sql.SQLSyntaxErrorException: ORA-00911: invalid character"), but fortunately, the "dbtable" option supports passing in either a table or a subquery -- that is using parentheses around a query.

在我下面的解决方案中，我还围绕所需的对象和导入添加了一些上下文.
我的解决方案最终看起来像:

In my solution below I have also added a bit of context around the needed objects and imports.
My solution ended up looking like:

from awsglue.context import GlueContext from pyspark.context import SparkContext glue_context = GlueContext(SparkContext.getOrCreate()) tmp_data_frame = glue_context.spark_session.read\ .format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider")\ .option("url", jdbc_url)\ .option("user", username)\ .option("password", password)\ .option("dbtable", "(select * from test where id<100)")\ .load()

相关文章