Optimizing an SSIS package for millions of rows with ORDER BY/sort in the SQL command and Merge Join

2021-12-30 00:00:00 merge sorting azure-data-factory mysql ssis

Hi, I am currently trying to optimize an SSIS package that performs an upsert/delete on about 93+ million rows from a table in a remote MariaDB source. The table also contains approximately 63 columns.

Currently I'm using Sort and Merge Join in my package, but some guides I've read recommend doing the sorting on the database server rather than with the Sort transformation in the SSIS data flow, because the Sort transformation puts a load on the SSIS server's memory.

As I'm currently running this solution in Azure Data Factory, the package fails (most often it times out, even though I've increased the timeout properties both on the package side and in Azure Data Factory).

What is the recommended way to tackle this?

If I've understood it right, and as I mentioned before, I can avoid the load on the SSIS server by sorting on the database server side. But as I'm new to the whole SQL and SSIS stuff, I'm not quite sure what a sort like that would look like in the SQL command.

I've also thought about batching, but even here I'm uncertain how that would work in SSIS.

What is recommended here?

My SSIS-Package looks like this right now:

I followed this type of example: Synchronize Table Data Using a Merge Join in SSIS

(FYI: The red error icons are there because I lost the connection while taking the screenshot; this is a fully working solution otherwise.)

Solution

I have two recommendations:

Server side sorting

In the OLE DB Source, change the data access mode to SQL command and use an ORDER BY clause:

Select * from table ORDER BY col1, col2

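After that, open the Advanced Editor for the OLE DB Source (right-click the source and choose Show Advanced Editor), go to the Input and Output Properties tab, select the source output, and set its IsSorted property to True. Then, under Output Columns, set SortKeyPosition for each column used in the ORDER BY clause (1 for the first column, 2 for the second, and so on). These properties only tell SSIS that the data is already sorted; they do not sort anything, so they must match the actual ORDER BY.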

  • SSIS sorted data flows
  • Where is the IsSorted property?
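To see why the inputs have to arrive sorted, here is a minimal Python sketch of what a merge join does when it compares a source stream with a target stream. It is not SSIS code; the function name `merge_join_diff`, the two-column row shape, and the assumption of unique integer keys are all made up for illustration. Because both sides are already ordered by key, only one row per side is held in memory at a time, which is the point of pushing the sort to the database.

```python
from typing import Iterable, Iterator, Optional, Tuple

Row = Tuple[int, str]  # hypothetical row shape: (key, payload)


def merge_join_diff(source: Iterable[Row], target: Iterable[Row]) -> Iterator[Tuple[str, Row]]:
    """Full-outer merge join of two inputs that are ALREADY sorted by unique key.

    Yields ("insert", row) for keys only in source, ("delete", row) for keys
    only in target, and ("update", row) when the key matches but the payload
    differs.
    """
    src_iter, tgt_iter = iter(source), iter(target)
    src: Optional[Row] = next(src_iter, None)
    tgt: Optional[Row] = next(tgt_iter, None)
    while src is not None or tgt is not None:
        if tgt is None or (src is not None and src[0] < tgt[0]):
            # Key exists only in the source: it must be inserted.
            yield ("insert", src)
            src = next(src_iter, None)
        elif src is None or tgt[0] < src[0]:
            # Key exists only in the target: it must be deleted.
            yield ("delete", tgt)
            tgt = next(tgt_iter, None)
        else:
            # Same key on both sides: update only if the payload changed.
            if src[1] != tgt[1]:
                yield ("update", src)
            src = next(src_iter, None)
            tgt = next(tgt_iter, None)


source_rows = [(1, "a"), (2, "b2"), (4, "d")]
target_rows = [(1, "a"), (2, "b"), (3, "c")]
actions = list(merge_join_diff(source_rows, target_rows))
print(actions)
```

If either input were not really sorted, the comparisons above would silently produce wrong inserts and deletes, which is why IsSorted must only be set when the query actually orders the data.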

Read data in chunks

I don't have good knowledge of MariaDB SQL syntax, but I will provide some examples for SQLite and Oracle:

  • Reading Huge volume of data from Sqlite to SQL Server fails at pre-execute
  • Getting top n to n rows from db2
  • SSIS failing to save packages and reboots Visual Studio
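As a rough illustration of the chunking idea, the sketch below reads a table in key order, a fixed number of rows at a time, remembering the last key it saw. It uses Python with an in-memory SQLite database as a stand-in for the remote source; the table `big_table`, its columns, and the function `read_in_chunks` are hypothetical, and it assumes a positive integer primary key. The `WHERE id > ? ORDER BY id LIMIT ?` form should also be valid in MariaDB, though the parameter placeholder style depends on the driver.

```python
import sqlite3
from typing import Iterator, List, Tuple


def read_in_chunks(conn: sqlite3.Connection, chunk_size: int) -> Iterator[List[Tuple[int, str]]]:
    """Yield rows of big_table in primary-key order, chunk_size rows at a time."""
    last_id = 0  # assumption: ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM big_table WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            break
        yield rows
        # Continue after the last key seen instead of using OFFSET,
        # so each query can seek directly on the primary key.
        last_id = rows[-1][0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO big_table VALUES (?, ?)",
    [(i, f"row{i}") for i in range(1, 11)],
)
chunks = list(read_in_chunks(conn, chunk_size=4))
print([len(c) for c in chunks])
```

In a package, the same pattern would typically be driven by a loop with a variable holding the last key (or an offset) that is fed into the source query, so each iteration only pulls one chunk.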

Update 1 - Package problems

There are some problems in the package:

  • You are reading and writing from the same table
  • You are performing updates and deletes on a large amount of data
  • You are using Merge Join

Some recommendations:

  • Try using a staging table instead of reading from and writing to the same table, since you are currently reading, inserting, updating, and deleting against the same destination table.
  • Use partitioning on the destination table, which allows records to be deleted and updated within a specific partition instead of across the entire table.
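One way to read the staging-table recommendation: bulk-load the source rows into a staging table first, then apply the changes to the destination with a few set-based statements instead of row-by-row operations in the data flow. The sketch below shows that idea in Python with an in-memory SQLite database standing in for the real destination; the table names `target` and `staging` and their columns are hypothetical, and the exact SQL would need adapting to the actual destination server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target  (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE staging (id INTEGER PRIMARY KEY, payload TEXT);
    INSERT INTO target  VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO staging VALUES (1, 'a'), (2, 'b2'), (4, 'd');
""")

with conn:  # run the three statements in one transaction
    # 1) Update rows whose key exists in both tables but whose payload changed.
    conn.execute("""
        UPDATE target
        SET payload = (SELECT s.payload FROM staging s WHERE s.id = target.id)
        WHERE EXISTS (SELECT 1 FROM staging s
                      WHERE s.id = target.id AND s.payload <> target.payload)
    """)
    # 2) Insert rows that exist only in staging.
    conn.execute("""
        INSERT INTO target (id, payload)
        SELECT s.id, s.payload FROM staging s
        WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.id = s.id)
    """)
    # 3) Delete rows that no longer exist in staging.
    conn.execute("""
        DELETE FROM target
        WHERE NOT EXISTS (SELECT 1 FROM staging s WHERE s.id = target.id)
    """)

result = conn.execute("SELECT id, payload FROM target ORDER BY id").fetchall()
print(result)
```

With this layout the data flow only has to load the staging table, and the comparison work happens inside the database engine rather than in SSIS memory.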
