Moving data from Oracle to HDFS, processing it, and moving it from HDFS to Teradata
My requirement is to
- Move data from Oracle to HDFS
- Process the data on HDFS
- Move processed data to Teradata.
It is also required to do this entire processing every 15 minutes. The volume of source data may be close to 50 GB and the processed data also may be the same.
After searching a lot on the internet, I found the following:
- ORAOOP to move data from Oracle to HDFS (have the code within a shell script and schedule it to run at the required interval).
- Do large-scale processing with custom MapReduce, Hive, or Pig.
- SQOOP with the Teradata connector to move data from HDFS to Teradata (again, have a shell script with the code and then schedule it).
Is this the right approach in the first place, and is it feasible within the required time period (please note that this is not a daily batch)?
Other options that I found are the following:
- STORM (for real-time data processing). But I could not find an out-of-the-box Oracle spout or Teradata bolt.
- Any open-source ETL tool such as Talend or Pentaho.
Please share your thoughts on these options as well and any other possibilities.
Solution
Looks like you have several questions, so let's try to break them down.
Importing into HDFS
It seems you are looking for Sqoop. Sqoop is a tool that lets you easily transfer data in/out of HDFS, and can connect to various databases including Oracle natively. Sqoop is compatible with the Oracle JDBC thin driver. Here is how you would transfer from Oracle to HDFS:
sqoop import --connect jdbc:oracle:thin:@myhost:1521/db --username xxx --password yyy --table tbl --target-dir /path/to/dir
For more information, see the Sqoop documentation. Note that you can also import directly into a Hive table with Sqoop, which can be convenient for your analysis.
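Since the goal is to process the data in Hive afterwards, it may help to see the Hive-import variant spelled out. A minimal sketch, reusing the same placeholder connection details as above; `my_hive_db.tbl` is a hypothetical destination table name:

```shell
# Sketch: same placeholder Oracle connection as above, but landing the data
# directly in a Hive table instead of a plain HDFS directory.
# --hive-import tells Sqoop to load the imported files into Hive.
sqoop import \
  --connect jdbc:oracle:thin:@myhost:1521/db \
  --username xxx --password yyy \
  --table tbl \
  --hive-import \
  --hive-table my_hive_db.tbl
```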
Processing
As you noted, since your data is initially relational, it is a good idea to use Hive for your analysis, since you might be more familiar with SQL-like syntax. Pig is closer to pure relational algebra and its syntax is NOT SQL-like; it is more a matter of preference, but both approaches should work fine.
Since you can import data into Hive directly with Sqoop, your data should be directly ready to be processed after it is imported.
In Hive you could run your query and tell it to write the results to HDFS:
hive -e "insert overwrite directory '/path/to/output' select * from mytable ..."
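Since the results will later be re-exported with Sqoop, it can help to control the field delimiter of the output files. A sketch under the assumption of Hive 0.11 or later, where `INSERT OVERWRITE DIRECTORY` accepts a `ROW FORMAT` clause; `mytable` is the placeholder table from above:

```shell
# Sketch: write the results as comma-delimited text so a later export step
# can parse the fields. Requires Hive 0.11+ for ROW FORMAT on a directory.
hive -e "INSERT OVERWRITE DIRECTORY '/path/to/output'
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         SELECT * FROM mytable"
```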
Exporting into Teradata
Last year Cloudera released a Teradata connector for Sqoop (as described here), so you should take a look; this looks like exactly what you want. Here is how you would do it:
sqoop export --connect jdbc:teradata://localhost/DATABASE=MY_BASE --username sqooptest --password xxxxx --table MY_DATA --export-dir /path/to/hive/output
The whole thing is definitely doable in whatever time period you want. In the end, what will matter is the size of your cluster; if you want it fast, scale your cluster up as needed. The good thing about Hive and Sqoop is that processing will be distributed across your cluster, so you have total control over the schedule.
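To run the whole pipeline every 15 minutes, as the question requires, the three steps above can be chained in a wrapper script and scheduled with cron. This is a sketch only; every connection string and path is a placeholder carried over from the examples:

```shell
#!/bin/sh
# pipeline.sh - hypothetical wrapper chaining the three steps.
set -e  # abort the run if any step fails

# 1. Oracle -> HDFS
sqoop import --connect jdbc:oracle:thin:@myhost:1521/db \
  --username xxx --password yyy --table tbl --target-dir /path/to/dir

# 2. Process with Hive, writing results back to HDFS
hive -e "insert overwrite directory '/path/to/output' select * from mytable"

# 3. HDFS -> Teradata
sqoop export --connect jdbc:teradata://localhost/DATABASE=MY_BASE \
  --username sqooptest --password xxxxx --table MY_DATA \
  --export-dir /path/to/output
```

This could then be scheduled with a crontab entry such as `*/15 * * * * /path/to/pipeline.sh`. Note that with roughly 50 GB per run you would want to make sure one run finishes before the next starts (for example with a lock file), otherwise runs will overlap.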