如何合成连接表的属性

2021-09-10 00:00:00 sql tsql sql-server

我有一个这样定义的视图:

I have a view defined like this:

CREATE VIEW [dbo].[PossiblyMatchingContracts] AS
SELECT 
    C.UniqueID,
    CC.UniqueID AS PossiblyMatchingContracts
FROM  [dbo].AllContracts AS C
    INNER JOIN [dbo].AllContracts AS CC
        ON C.SecondaryMatchCodeFB = CC.SecondaryMatchCodeFB
            OR C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeLB
            OR C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeBB
            OR C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeBB
            OR C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeLB
WHERE C.UniqueID NOT IN
    (
        SELECT UniqueID FROM [dbo].DefinitiveMatches
    )
    AND C.AssociatedUser IS NULL
    AND C.UniqueID <> CC.UniqueID

这基本上是寻找合同,其中 f.e.名字和生日匹配.这很好用.现在我想为每一行添加一个综合属性,其中的值仅来自一个源行.

Which is basically finding contracts where f.e. the first name and the birthday are matching. This works great. Now I want to add a synthetic attribute to each row with the value from only one source row.

让我举个例子让你更清楚.假设我有下表:

Let me give you an example to make it clearer. Suppose I have the following table:

UniqueID  | FirstName | LastName  | Birthday

1         | Peter     | Smith     | 1980-11-04
2         | Peter     | Gray      | 1980-11-04
3         | Peter     | Gray-Smith| 1980-11-04
4         | Frank     | May       | 1985-06-09
5         | Frank-Paul| May       | 1985-06-09
6         | Gina      | Ericson   | 1950-11-04

结果视图应如下所示:

UniqueID | PossiblyMatchingContracts | SyntheticID

1        | 2                         | PeterSmith1980-11-04
1        | 3                         | PeterSmith1980-11-04
2        | 1                         | PeterSmith1980-11-04
2        | 3                         | PeterSmith1980-11-04
3        | 1                         | PeterSmith1980-11-04
3        | 2                         | PeterSmith1980-11-04
4        | 5                         | FrankMay1985-06-09
5        | 4                         | FrankMay1985-06-09
6        | NULL                      | NULL [or] GinaEricson1950-11-04

请注意 SyntheticID 列仅使用匹配源行之一的值.哪一个都无所谓.我将此视图导出到另一个应用程序,之后需要能够识别每个匹配组".

Notice that the SyntheticID column uses ONLY values from one of the matching source rows. It doesn't matter which one. I am exporting this view to another application and need to be able to identify each "match group" afterwards.

明白我的意思了吗?任何想法如何在 sql 中完成?

Is it clear what I mean? Any ideas how this could be done in sql?

也许详细说明一下实际用例会有所帮助:

Maybe it helps to elaborate a bit on the actual use case:

我正在从不同系统导入合同.考虑到拼写错误或已婚但姓氏仅在一个系统中更新的人的可能性,我需要找到所谓的可能匹配".如果两个或多个合同包含相同的生日加上相同的名字、姓氏或出生姓名,则它们被视为可能匹配.这意味着,如果合约 A 匹配合约 B,则合约 B 也匹配合约 A.

I am importing contracts from different systems. To account for the possibility of typos or people that have married but the last name was only updated in one system, I need to find so called 'possible matches'. Two or more contracts are considered a possible match if they contain the same birthday plus the same first, last or birth name. That implies, that if contract A matches contract B, contract B also matches contract A.

目标系统使用多值引用属性来存储这些关系.最终目标是为这些合约创建用户对象.首先要注意的是,对于多个匹配的合约,它应该只是一个用户对象.因此,我在视图中创建这些匹配项.第二个问题是,用户对象的创建是通过工作流进行的,每个合约并行运行.为避免为匹配合约创建多个用户对象,每个工作流都需要检查是否已经存在匹配的用户对象或另一个工作流,即将创建该用户对象.由于工作流引擎与 sql 相比非常慢,因此工作流不应重复整个匹配测试.所以这个想法是,让工作流只检查syntheticID".

The target system uses multivalue reference attributes to store these relationships. The ultimate goal is to create user objects for these contracts. The catch first is, that the shall only be one user object for multiple matching contracts. Thus I'm creating these matches in the view. The second catch is, that the creation of user objects happens by workflows, which run parallel for each contract. To avoid creating multiple user objects for matching contracts, each workflow needs to check, if there is already a matching user object or another workflow, which is about to create said user object. Because the workflow engine is extremely slow compared to sql, the workflows should not repeat the whole matching test. So the idea is, to let the workflow check only for the 'syntheticID'.

推荐答案

我已经用多步方法解决了:

I have solved it with a multi step approach:

  1. 为每个合同创建可能的第一级匹配列表
  2. 创建基本组列表,为其分配不同的组每个合同(好像它们与任何人都没有关系)
  3. 当需要更多合约时,迭代匹配列表更新组列表加入群组
  4. 从最终组列表中递归构建 SyntheticID
  5. 输出结果

首先,让我解释一下我的理解,以便您判断我的方法是否正确.

First of all, let me explain what I have understood, so you can tell if my approach is correct or not.

1) 匹配在级联"中传播

1) matching propagates in "cascade"

我的意思是,如果Peter Smith"与Peter Gray"归为一组,则意味着所有 Smith 和所有 Gray 都是相关的(如果他们的出生日期相同),因此 Luke Smith 可以属于 John 的同一组灰色

I mean, if "Peter Smith" is grouped up with "Peter Gray", it means that all Smith and all Gray are related (if they have the same birth date) so Luke Smith can be in the same group of John Gray

2) 我不明白你说的出生名"是什么意思

2) I have not understood what you mean with "Birth Name"

你说合同匹配名字,姓氏或出生名",对不起,我是意大利人,我认为出生名和名字是一样的,在你的数据中也没有这样的列.也许它与名称之间的破折号有关?
当 FirstName 是 Frank-Paul 时,这意味着它应该同时匹配 Frank 和 Paul?
当姓氏是 Gray-Smith 时,这意味着它应该同时匹配 Gray 和 Smith?

You say contracts matches on "first, last or birth name", sorry, I'm italian, I thought birth name and first were the same, also in your data there is not such column. Maybe it is related to that dash symbol between names?
When FirstName is Frank-Paul it means it should match both Frank and Paul?
When LastName is Gray-Smith it means it should match both Gray and Smith?

在下面的代码中,我简单地忽略了这个问题,但如果需要的话可以处理它(我已经尝试过,打破名称,取消旋转它们并视为双重匹配).

In following code I have simply ignored this problem, but it could be handled if needed (I already did a try, breaking names, unpivoting them and treating as double match).

步骤零:一些声明和准备基础数据

declare @cli as table (UniqueID int primary key, FirstName varchar(20), LastName varchar(20), Birthday varchar(20))
declare @comb as table (id1 int, id2 int, done bit)
declare @grp as table (ix int identity primary key, grp int, id int, unique (grp,ix))
declare @str_id as table (grp int primary key, SyntheticID varchar(1000))
declare @id1 as int, @g int

;with
t as (
    select *
    from (values
    (1         , 'Peter'     , 'Smith'     , '1980-11-04'),
    (2         , 'Peter'     , 'Gray'      , '1980-11-04'),
    (3         , 'Peter'     , 'Gray-Smith', '1980-11-04'),
    (4         , 'Frank'     , 'May'       , '1985-06-09'),
    (5         , 'Frank-Paul', 'May'       , '1985-06-09'),
    (6         , 'Gina'      , 'Ericson'   , '1950-11-04')
    ) x (UniqueID  , FirstName , LastName  , Birthday)
)
insert into @cli
select * from t

第一步:为每个合同创建可能的第一级匹配列表

;with
p as(select UniqueID, Birthday, FirstName, LastName from @cli),
m as (
    select p.UniqueID UniqueID1, p.FirstName FirstName1, p.LastName LastName1, p.Birthday Birthday1, pp.UniqueID UniqueID2, pp.FirstName FirstName2, pp.LastName LastName2, pp.Birthday Birthday2
    from p
    join p pp on (pp.Birthday=p.Birthday) and (pp.FirstName = p.FirstName or pp.LastName = p.LastName)
    where p.UniqueID<=pp.UniqueID
)
insert into @comb
select UniqueID1,UniqueID2,0
from m

第二步:创建基本组列表

insert into @grp
select ROW_NUMBER() over(order by id1), id1 from @comb where id1=id2

第三步:迭代匹配列表更新组列表仅在需要时才循环具有可能匹配和更新的合约

Step Three: Iterate the matches list updating the group list Only loop on contracts that have possible matches and updates only if needed

set @id1 = 0
while not(@id1 is null) begin
    set @id1 = (select top 1 id1 from @comb where id1<>id2 and done=0)

    if not(@id1 is null) begin

        set @g = (select grp from @grp where id=@id1)
        update g set grp= @g
        from @grp g
        inner join @comb c on g.id = c.id2
        where c.id2<>@id1 and c.id1=@id1
        and grp<>@g

        update @comb set done=1 where id1=@id1
    end
end

第四步:建立 SyntheticID递归地将组的所有(不同的)名字和姓氏添加到 SyntheticID.
我使用_"作为出生日期、名字和姓氏的分隔符,并使用,"作为姓名列表的分隔符以避免冲突.

Step Four: Build up the SyntheticID Recursively add ALL (distinct) first and last names of group to SyntheticID.
I used '_' as separator for birth date, first names and last names, and ',' as separator for the list of names to avoid conflicts.

;with
c as(
    select c.*, g.grp
    from @cli c
    join @grp g on g.id = c.UniqueID
),
d as (
    select *, row_number() over (partition by g order by t,s) n1, row_number() over (partition by g order by t desc,s desc) n2
    from (
        select distinct c.grp g, 1 t, FirstName s from c
        union 
        select distinct c.grp, 2, LastName from c 
        ) l
),
r as (
    select d.*, cast(CONVERT(VARCHAR(10), t.Birthday, 112) + '_' + s as varchar(1000)) Names, cast(0 as bigint) i1, cast(0 as bigint) i2
    from d
    join @cli t on t.UniqueID=d.g
    where n1=1
    union all
    select d.*, cast(r.names + IIF(r.t<>d.t,'_',',') +  d.s as varchar(1000)), r.n1, r.n2
    from d
    join r on r.g = d.g and r.n1=d.n1-1 
)
insert into @str_id 
select g, Names
from r
where n2=1

第五步:输出结果

select c.UniqueID, case when id2=UniqueID then id1 else id2 end PossibleMatchingContract, s.SyntheticID
from @cli c
left join @comb cb on c.UniqueID in(id1,id2) and id1<>id2
left join @grp g on c.UniqueID = g.id
left join @str_id s on s.grp = g.grp

结果如下

UniqueID    PossibleMatchingContract    SyntheticID
1           2                           1980-11-04_Peter_Gray,Gray-Smith,Smith
1           3                           1980-11-04_Peter_Gray,Gray-Smith,Smith
2           1                           1980-11-04_Peter_Gray,Gray-Smith,Smith
2           3                           1980-11-04_Peter_Gray,Gray-Smith,Smith
3           1                           1980-11-04_Peter_Gray,Gray-Smith,Smith
3           2                           1980-11-04_Peter_Gray,Gray-Smith,Smith
4           5                           1985-06-09_Frank,Frank-Paul_May
5           4                           1985-06-09_Frank,Frank-Paul_May
6           NULL                        1950-11-04_Gina_Ericson

我认为这样产生的 SyntheticID 也应该是每个组的唯一"

I think that in this way the resulting SyntheticID should also be "unique" for each group

相关文章