从逗号或管道运算符字符串中删除重复项

2021-09-10 00:00:00 tsql sql-server-2008 sql-server

我已经研究了一段时间,但找不到从 SQL Server 中的逗号分隔字符串和管道分隔字符串中删除重复字符串的方法.

I have been looking into this for a while now and I cannot find a way to remove duplicate strings from a comma-separated as well as pipeline seperated string in SQL Server.

给定字符串

test1,test2,test1|test2,test3|test4,test4|test4

有谁知道你会如何返回 test1,test2,test3,test4?

does anyone know how would you return test1,test2,test3,test4?

推荐答案

方法

以下方法可用于对分隔的值列表进行重复数据删除.

The following approach can be used to de-duplicate a delimited list of values.

  1. 使用REPLACE() 函数将不同的分隔符转换为相同的分隔符.
  2. 使用 REPLACE() 函数注入 XML 结束和开始标记以创建 XML 片段
  3. 使用CAST(expr AS XML)函数将上述片段转换为XML数据类型
  4. 使用 OUTER APPLY 应用表值函数 nodes() 将 XML 片段拆分为其组成的 XML 标记.这将在单独的行中返回每个 XML 标记.
  5. 使用 value() 函数仅从 XML 标记中提取值,并使用指定的数据类型返回值.
  6. 在上述值后附加一个逗号.
  7. 请注意,这些值在不同的行中返回.DISTINCT 关键字的使用现在可以删除重复的行(即值).
  8. 使用 FOR XML PATH('') 子句将多行中的值连接成一行.
  1. Use the REPLACE() function to convert different delimiters into the same delimiter.
  2. Use the REPLACE() function to inject XML closing and opening tags to create an XML fragment
  3. Use the CAST(expr AS XML) function to convert the above fragment into the XML data type
  4. Use OUTER APPLY to apply the table-valued function nodes() to split the XML fragment into its constituent XML tags. This returns each XML tag on a separate row.
  5. Extract just the value from the XML tag using the value() function, and returns the value using the specified data type.
  6. Append a comma after the above-mentioned value.
  7. Note that these values are returned on separate rows. The usage of the DISTINCT keyword now removes duplicate rows (i.e. values).
  8. Use the FOR XML PATH('') clause to concatenate the values across multiple rows into a single row.

查询

将上述方法放在查询表单中:

Putting the above approach in query form:

SELECT DISTINCT PivotedTable.PivotedColumn.value('.','nvarchar(max)') + ',' 
FROM ( 
        -- This query returns the following in theDataXml column: 
        -- <tag>test1</tag><tag>test2</tag><tag>test1</tag><tag>test2</tag><tag>test3</tag><tag>test4</tag><tag>test4</tag><tag>test4</tag>
        -- i.e. it has turned the original delimited data into an XML fragment 
        SELECT 
          DataTable.DataColumn AS DataRaw 
        , CAST( 
            '<tag>' 
            -- First replace commas with pipes to have only a single delimiter 
            -- Then replace the pipe delimiters with a closing and opening tag 
            + replace(replace(DataTable.DataColumn, ',','|'), '|','</tag><tag>') 
            -- Add a final set of closing tags 
            + '</tag>' 
            AS XML) AS DataXml 
        FROM ( SELECT 'test1,test2,test1|test2,test3|test4,test4|test4' AS DataColumn) AS DataTable 
    ) AS x 
OUTER APPLY DataXml.nodes('tag') AS PivotedTable(PivotedColumn) 
-- Running the query without the following line will return the data in separate rows 
-- Running the query with the following line returns the rows concatenated, i.e. it returns: 
-- test1,test2,test3,test4, 
FOR XML PATH('') 

输入&结果

给定输入:

test1,test2,test1|test2,test3|test4,test4|test4

test1,test2,test1|test2,test3|test4,test4|test4

上面的查询会返回结果:

The above query will return the result:

测试1,测试2,测试3,测试4,

test1,test2,test3,test4,

注意末尾的尾随逗号.我将把它作为练习留给你来删除它.

Notice the trailing comma at the end. I'll leave it as an exercise to you to remove that.

重复次数

OP 在评论中请求我如何获得重复的计数?在单独的列中".

OP requested in a comment "how do i get t5he count of duplicates as well? in a seperate column".

最简单的方法是使用上述查询,但删除最后一行 FOR XML PATH('').然后,计算上述查询中 SELECT 表达式返回的所有值和不同值(即 PivotedTable.PivotedColumn.value('.','nvarchar(max)')).所有值的计数与不同值的计数之间的差值是重复值的计数.

The simplest way would be to use the above query but remove the last line FOR XML PATH(''). Then, counting all values and distinct values returned by the SELECT expression in the above query (i.e. PivotedTable.PivotedColumn.value('.','nvarchar(max)')). The difference between the count of all values and the count of distinct values is the count of duplicate values.

SELECT 
    COUNT(PivotedTable.PivotedColumn.value('.','nvarchar(max)'))            AS CountOfAllValues 
  , COUNT(DISTINCT PivotedTable.PivotedColumn.value('.','nvarchar(max)'))   AS CountOfUniqueValues 
    -- The difference of the previous two counts is the number of duplicate values 
  , COUNT(PivotedTable.PivotedColumn.value('.','nvarchar(max)')) 
    - COUNT(DISTINCT PivotedTable.PivotedColumn.value('.','nvarchar(max)')) AS CountOfDuplicateValues 
FROM ( 
        -- This query returns the following in theDataXml column: 
        -- <tag>test1</tag><tag>test2</tag><tag>test1</tag><tag>test2</tag><tag>test3</tag><tag>test4</tag><tag>test4</tag><tag>test4</tag>
        -- i.e. it has turned the original delimited data into an XML fragment 
        SELECT 
          DataTable.DataColumn AS DataRaw 
        , CAST( 
            '<tag>' 
            -- First replace commas with pipes to have only a single delimiter 
            -- Then replace the pipe delimiters with a closing and opening tag 
            + replace(replace(DataTable.DataColumn, ',','|'), '|','</tag><tag>') 
            -- Add a final set of closing tags 
            + '</tag>' 
            AS XML) AS DataXml 
        FROM ( SELECT 'test1,test2,test1|test2,test3|test4,test4|test4' AS DataColumn) AS DataTable 
    ) AS x 
OUTER APPLY DataXml.nodes('tag') AS PivotedTable(PivotedColumn) 

对于上面显示的相同输入,此查询的输出是:

For the same input shown above, the output of this query is:

CountOfAllValues CountOfUniqueValues CountOfDuplicateValues
---------------- ------------------- ----------------------
8                4                   4

相关文章