如何拆分平面文件数据并加载到数据库中的父子表中?

2021-12-25 00:00:00 import etl sql-server-2008 sql-server ssis

我有需要导入父子表的非规范化数据(来自文件).源数据是这样的:

I have denormalized data (coming from a file) that needs to be imported into parent-child tables. The source data is something like this:

Account#    Name        Membership    Email
101         J Burns     Gold          alpha@foo.com
101         J Burns     Gold          bravo@foo.com
101         J Burns     Gold          charlie@yay.com
227         H Gordon    Silver        red@color.com
350         B Clyde     Silver        italian@food.com
350         B Clyde     Silver        mexican@food.com

我应该使用 SSIS 的哪些部分、部分或策略将前三列读入父表,将第四列(电子邮件)读入子表?我有几个可以选择的父密钥选项:

What are the pieces, parts, or tactics of SSIS I should use to read the first three columns into a parent table, and the 4th column (Email) into a child table? I have several options for the parent key which I am permitted to take:

  • 直接使用 Account# 作为主键
  • 在导入过程中使用 SSIS 生成的代理密钥
  • 配置身份主键

我确定我已经按照难度递增的顺序列出了我的主键选项.我很想知道如何做第一个和最后一个选项 - 我将推断如何实现中间选项.再次强调,我对一个明确的 SSIS 解决方案感兴趣;我正在寻找使用 SSIS 语言的答案,而不是程序性的、技术中立的答案.

I'm sure I've listed my primary key options in increasing order of difficulty. I'd be interested in knowing how to do the first and the last option - I'll infer how to achieve the middle option. To emphasize again, I'm interested in a decidedly SSIS solution; I'm looking for an answer that uses the language of SSIS, rather than a procedural, technology neutral answer.

我的问题有点类似于 另一个 SO 问题,答案含糊不清.希望能给予更详细的指导.我已经知道如何通过创建一个暂存"中间步骤来解决这个问题,其中父子分离实际上是用直接的 SQL 处理的.但是,我很好奇如何在没有这种中间步骤的情况下做到这一点.

My question is somewhat similar to another SO question, having an answer of vague viability. I'm hoping more detailed guidance could be given. I already know how to solve this problem by creating a "staging" middle-step, where the parent-child separation is actually handled with straight SQL. However, I'm curious about how this can be done without that kind of middle-step.

在我看来,这种导入会非常普遍,以至于会有一种很好发布的公式化方法来处理它——一种 SSIS 擅长的技术.到目前为止,我还没有看到任何直接的答案.

It seems to me this kind of import would be so common, that there would be a well-published formulaic way to handle it - a technique that SSIS excels at. As yet, I've not quite seen any straight up answer to this.

Update #1:根据评论,我已将样本数据调整为更明显的非规范化.我还从平面文件"中删除了平面",以便语义不会干扰问题.

Update #1: Based on comments, I've adjusted the sample data to be more obviously denormalized. I also removed "flat" from "flat file," so that semantics don't interfere with the question.

更新 #2:我对使用 SSIS 语言的解决方案产生了浓厚的兴趣.

Update #2: I've amplified my interest in a solution spoken in the language of SSIS.

推荐答案

这是您在加载父子数据时可以考虑的一种可能选项.此选项包括两个步骤.在第一步中,读取源文件并将数据写入父表.在第二步中,再次读取源文件并使用查找转换获取父信息,以便将数据写入子表.以下示例使用问题中提供的数据.此示例是使用 SSIS 2008 R2 和 SQL Server 2008 数据库创建的.

Here is one possible option that you can consider in loading parent-child data. This option consists of two steps. In the first step, read the source file and write data to parent table. In the second step, read the source file again and use lookup transformation to fetch the parent info in order to write data to the child table. Following example uses the data provided in the question. This example was created using SSIS 2008 R2 and SQL Server 2008 database.

分步流程:

  1. 创建一个名为 Source.txt 的示例平面文件,如屏幕截图 #1 所示.

  1. Create a sample flat file named Source.txt as shown in screenshot #1.

在 SQL 数据库中,使用 SQL 脚本 下给出的脚本创建名为 dbo.Parentdbo.Child 的两个表部分.这两个表都有一个自动生成的标识列.

In the SQL database, create two tables named dbo.Parent and dbo.Child using the scripts given under SQL Scripts section. Both the tables have an auto generated identity column.

在包上,放置一个OLE DB连接以连接到SQL Server和平面文件连接以读取源文件,如截图#<强>2.如屏幕截图 #3 - #9 中所示配置平面文件连接.

On the package, place an OLE DB connection to connect to the SQL Server and Flat File connection to read the source file as shown in screenshot #2. Configure the flat file connection as shown in screenshots #3 - #9.

在控制流"选项卡上,放置两个 数据流任务,如屏幕截图 #10 所示.

On the Control Flow tab, place two Data Flow Tasks as shown in screenshot #10.

在名为 Parent 的数据流任务中,放置平面文件源、排序转换和 OLE DB 目标,如屏幕截图 #11 所示.

Inside the data flow task named Parent, place a Flat File source, Sort transformation and an OLE DB destination as shown in screenshot #11.

配置平面文件源,如屏幕截图 #12 和 #13 所示.我们需要读取平面文件源.

Configure the flat file source as shown in screenshots #12 and #13. We need to read the flat file source.

配置排序转换,如屏幕截图 #14 所示.我们需要消除重复值,以便仅将唯一记录插入父表 dbo.Parent.

Configure the sort transformation as shown in screenshot #14. We need to eliminate the duplicate values so that only the unique records are inserted into the parent table dbo.Parent.

配置 ole db 目标,如屏幕截图 #15 和 #16 中所示.我们需要将数据插入父表dbo.Parent.

Configure the ole db destination as shown in screenshots #15 and #16. We need to insert the data into the parent table dbo.Parent.

在名为 Child 的数据流任务中,放置平面文件源、查找转换和 OLE DB 目标,如屏幕截图 #17 所示.

Inside the data flow task named Child, place a Flat File source, Lookup transformation and an OLE DB destination as shown in screenshot #17.

配置平面文件源,如屏幕截图 #12 和 #13 所示.此配置与之前数据流任务中的平面文件源相同.

Configure the flat file source as shown in screenshots #12 and #13. This configuration is same as the flat file source in the previous data flow task.

配置查找转换,如屏幕截图 #18 和 #20 中所示.我们需要使用文件中存在的其他键列从表 dbo.Parent 中找到父 ID.此处的关键列是帐户、姓名和电子邮件.如果文件碰巧有一个唯一的列,您可以单独使用该列来获取父 ID.

Configure the lookup transformation as shown in screenshots #18 and #20. We need to find the parent id from the table dbo.Parent using the other key columns present in the file. The key columns here are the Account, Name and Email. If the file happened to have a unique column, you could just use that column alone to fetch the parent id.

配置 ole db 目标,如屏幕截图 #21 和 #22 中所示.我们需要将电子邮件列与父 ID 一起插入表 dbo.Child.

Configure the ole db destination as shown in screenshots #21 and #22. We need to insert the Email column along with the Parent id into the table dbo.Child.

屏幕截图 #23 显示了包执行之前表中的数据.

Screenshot #23 shows data in the tables before the package execution.

屏幕截图 #24 和 #25 显示示例包执行.

Screenshots #24 and #25 show sample package execution.

屏幕截图 #26 显示包执行后表格中的数据.

Screenshot #26 shows data in the tables after the package execution.

希望有所帮助.

SQL 脚本:

CREATE TABLE [dbo].[Child](
    [ChildId] [int] IDENTITY(1,1) NOT NULL,
    [ParentId] [int] NULL,
    [Email] [varchar](21) NULL,
CONSTRAINT [PK_Child] PRIMARY KEY CLUSTERED ([ChildId] ASC)) ON [PRIMARY]
GO

CREATE TABLE [dbo].[Parent](
    [ParentId] [int] IDENTITY(1,1) NOT NULL,
    [Account] [varchar](12) NULL,
    [Name] [varchar](12) NULL,
    [Membership] [varchar](14) NULL,
CONSTRAINT [PK_Parent] PRIMARY KEY CLUSTERED ([ParentId] ASC)) ON [PRIMARY]
GO

屏幕截图 #1:

屏幕截图 #2:

屏幕截图 #3:

屏幕截图 #4:

屏幕截图 #5:

截图 #6:

屏幕截图 #7:

截图 #8:

屏幕截图 #9:

屏幕截图 #10:

屏幕截图 #11:

屏幕截图 #12:

屏幕截图 #13:

屏幕截图 #14:

屏幕截图 #15:

屏幕截图 #16:

屏幕截图 #17:

屏幕截图 #18:

屏幕截图 #19:

屏幕截图 #20:

屏幕截图 #21:

屏幕截图 #22:

屏幕截图 #23:

屏幕截图 #24:

屏幕截图 #25:

屏幕截图 #26:

相关文章