SQL中两个字符串的公共子串

2021-09-10 00:00:00 string tsql sql-server

我需要在 SQL 中找到两个字符串的公共子字符串(没有空格).

查询:

选择 *从 tbl 为 a，tbl 为 b其中 a.str <>b.str

示例数据:

str1 |str2 |不带空格的最大子串----------+-----------+---------------------------——aabcdfbas |里克迪瓦 |发展基金aaab akuc |aaabir |aaabab akuc |ab atr |AB

解决方案

我不同意那些说 SQL 不是这项工作的工具的人.在有人向我展示比我的任何编程语言的解决方案更快的方法之前，我会断言 SQL(以基于集合的方式编写，无副作用，仅使用不可变变量)是唯一em> 用于此工作的工具(处理 varchar(8000)- 或 nvarchar(4000) 时).下面的解决方案是针对 varchar(8000).

1.正确索引的计数(数字)表.

-- (1) 构建并填充一个持久化的(数字)计数如果 OBJECT_ID('dbo.tally') 不是 NULL DROP TABLE dbo.tally;CREATE TABLE dbo.tally (n int not null);WITH DummyRows(V) AS(SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))) t(N))插入数据库SELECT TOP (8000) ROW_NUMBER() OVER (ORDER BY (SELECT 1))FROM DummyRows a CROSS JOIN DummyRows b CROSS JOIN DummyRows c CROSS JOIN DummyRows d;-- (2) 为性能添加必需的约束(和索引)ALTER TABLE dbo.tally添加约束 pk_tally PRIMARY KEY CLUSTERED(N) WITH FILLFACTOR = 100;ALTER TABLE dbo.tally添加约束 uq_tally UNIQUE NONCLUSTERED(N);

请注意，计数表功能的性能将不佳.

2.使用我们的计数表将所有可能的子串返回一个字符串

以abcd"为例，让我们获取它的所有子字符串.注意我的评论.

声明@s1 varchar(8000) = 'abcd';选择位置 = t.N,tokenSize = x.N,string = substring(@s1, t.N, x.N)FROM dbo.tally t -- 令牌位置CROSS JOIN dbo.tally x -- 令牌长度WHERE t.N <= len(@s1) -- 所有位置AND x.N <= len(@s1) -- 所有长度AND len(@s1) - t.N - (x.N-1) >= 0 -- 过滤不必要的行 [e.g.substring('abcd',3,2)]

position tokenSize 字符串----------- ----------- -------1 1 一个2 1 乙3 1 c4 1 天1 2 抗体2 2 公元前3 2 cd1 3 美国广播公司2 3 bcd1 4 abcd

3.dbo.getshortstring8K

这个功能是什么?第一个重大优化.我们将把两个字符串中较短的字符串分成每个可能的子字符串，然后查看它是否存在于较长的字符串中.如果您有两个字符串(S1 和 S2)并且 S1 比 S2 长，我们知道 S1 的任何一个比 S2 长的子字符串都不是 S2 的子字符串.这就是 dbo.getshortstring 的目的:确保我们不执行任何不必要的子字符串比较.这会更有意义.

这非常重要，因为可以使用

...没有排序或不必要的操作.速度而已.对于更长的字符串(例如 50 个字符以上)，我有一个更快的技术，您可以阅读关于这里.

I need to find common substring (without space) of two strings in SQL.

Query:

select *
from tbl as a, tbl as b
where a.str <> b.str

Sample data:

str1      | str2      | max substring without spaces
----------+-----------+-----------------------------
aabcdfbas | rikcdfva  | cdf
aaab akuc | aaabir a  | aaab
ab akuc   | ab atr    | ab

解决方案

I disagree with those who say that SQL is not the tool for this job. Until someone can show me a faster way than my solution in ANY programming language, I will aver that SQL (written in a set-based, side-effect free, using only immutable variables) is the ONLY tool for this job (when dealing with varchar(8000)- or nvarchar(4000)). The solution below is for varchar(8000).

1. A correctly indexed tally (numbers) table.

-- (1) build and populate a persisted (numbers) tally 
IF OBJECT_ID('dbo.tally') IS NOT NULL DROP TABLE dbo.tally;
CREATE TABLE dbo.tally (n int not null);

WITH DummyRows(V) AS(SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(N))
INSERT dbo.tally
SELECT TOP (8000) ROW_NUMBER() OVER (ORDER BY (SELECT 1))
FROM DummyRows a CROSS JOIN DummyRows b CROSS JOIN DummyRows c CROSS JOIN DummyRows d;

-- (2) Add Required constraints (and indexes) for performance
ALTER TABLE dbo.tally 
ADD CONSTRAINT pk_tally PRIMARY KEY CLUSTERED(N) WITH FILLFACTOR = 100;

ALTER TABLE dbo.tally 
ADD CONSTRAINT uq_tally UNIQUE NONCLUSTERED(N);

Note that a tally table function will not perform as well.

2. Using our tally table to return all possible substrings a string

Using "abcd" as an example, let's get all of it's substrings. Note my comments.

DECLARE @s1 varchar(8000) = 'abcd';

SELECT
  position  = t.N,
  tokenSize = x.N,
  string    = substring(@s1, t.N, x.N)  
FROM       dbo.tally t -- token position
CROSS JOIN dbo.tally x -- token length
WHERE t.N <=  len(@s1) -- all positions
AND   x.N <=  len(@s1) -- all lengths
AND   len(@s1) - t.N - (x.N-1) >= 0 -- filter unessesary rows [e.g.substring('abcd',3,2)]

This returns

position    tokenSize   string
----------- ----------- -------
1           1           a
2           1           b
3           1           c
4           1           d
1           2           ab
2           2           bc
3           2           cd
1           3           abc
2           3           bcd
1           4           abcd

3. dbo.getshortstring8K

What's this function about? The first major optimization. We're going to break the shorter of the two strings into every possible substring then see if it exists in the longer string. If you have two strings (S1 and S2) and S1 is longer than S2, we know that none of the substrings of S1, that are longer than S2, will be a substring of S2. That's the purpose of dbo.getshortstring: to ensure that we don't perform any unnecessary substring comparisons. This will make more sense in a moment.

This is hugely important because, the number of substrings in a string can be calculated using a Triangle Number Function. With N as the length (number of characters) in a string, the number of substrings can be calculated as N*(N+1)/2. E.g. "abc" has 6 substrings: 3*(3+1)/2 = 6; a,b,c,ab,bc,abc. If we're comparing "abc" to "abcdefgh" we don't need to check if "abcd" is a substring of "abc".

Breaking "abcdefgh" (length=8) into all possible substrings requires 8*(8+1)/2 = 36 operations (vs 6 for "abc").

IF OBJECT_ID('dbo.getshortstring8k') IS NOT NULL DROP FUNCTION dbo.getshortstring8k;
GO
CREATE FUNCTION dbo.getshortstring8k(@s1 varchar(8000), @s2 varchar(8000))
RETURNS TABLE WITH SCHEMABINDING AS RETURN 
SELECT s1 = CASE WHEN LEN(@s1) < LEN(@s2) THEN @s1 ELSE @s2 END,
       s2 = CASE WHEN LEN(@s1) < LEN(@s2) THEN @s2 ELSE @s1 END;

4. Finding all subsrings of the shorter string that exist in the longer string:

DECLARE @s1 varchar(8000) = 'bcdabc', @s2 varchar(8000) = 'abcd';

SELECT
  s.s1, -- test to make sure s.s1 is the shorter of the two strings
  position  = t.N,
  tokenSize = x.N,
  string    = substring(s.s1, t.N, x.N)
FROM dbo.getshortstring8k(@s1, @s2) s --<< get the shorter string
CROSS JOIN dbo.tally t  
CROSS JOIN dbo.tally x
WHERE t.N between 1 and len(s.s1)
AND   x.N between 1 and len(s.s1)
AND   len(s.s1) - t.N - (x.N-1) >= 0
AND   charindex(substring(s.s1, t.N, x.N), s.s2) > 0;

5. Retrieving ONLY the longest common substring(s)

This is the easy part. We simply Add TOP (1) WITH TIES to our SELECT statement and we're all set. Here, the longest common substring is "bc" and "xx"

DECLARE @s1 varchar(8000) = 'xxabcxx', @s2 varchar(8000) = 'bcdxx';

SELECT TOP (1) WITH TIES 
  position  = t.N,
  tokenSize = x.N,
  string    = substring(s.s1, t.N, x.N)
FROM dbo.getshortstring8k(@s1, @s2) s
CROSS JOIN dbo.tally t  
CROSS JOIN dbo.tally x
WHERE t.N between 1 and len(s.s1)
AND   x.N between 1 and len(s.s1)
AND   len(s.s1) - t.N - (x.N-1) >= 0
AND   charindex(substring(s.s1, t.N, x.N), s.s2) > 0
ORDER BY x.N DESC;

6. Applying this logic to your table

Using APPLY we replace my variables @s1 and @s2 with the t.str1 & t.str2. I add a filter to exclude matches that contain spaces (see my comments)... And we're off:

-- easily consumbable sample data
DECLARE @yourtable TABLE (str1 varchar(8000), str2 varchar(8000));
INSERT @yourtable 
VALUES ('aabcdfbas','rikcdfva'),('aaab akuc','aaabir a'),('ab akuc','ab atr');

SELECT str1, str2,  [max substring without spaces] = string
FROM @yourtable t
CROSS APPLY 
(
    SELECT TOP (1) WITH TIES 
      position  = t.N,
      tokenSize = x.N,
      string    = substring(s.s1, t.N, x.N)
    FROM dbo.getshortstring8k(t.str1, t.str2) s -- @s1 & @s2 replaced with str1 & str2 
    CROSS JOIN dbo.tally t  
    CROSS JOIN dbo.tally x
    WHERE t.N between 1 and len(s.s1)
    AND   x.N between 1 and len(s.s1)
    AND   len(s.s1) - t.N - (x.N-1) >= 0
    AND   charindex(substring(s.s1, t.N, x.N), s.s2) > 0
    AND   charindex(' ',substring(s.s1, t.N, x.N)) = 0 -- exclude substrings with spaces
    ORDER BY x.N DESC
) lcss;

Results:

str1        str2      max substring without spaces
----------- --------- ------------------------------
aabcdfbas   rikcdfva  cdf
aaab akuc   aaabir a  aaab
ab akuc     ab atr    ab

And the execution plan:

... No sorts or unnecessary operations. Just speed. For longer strings (e.g. 50 characters+) I have an even faster technique you can read about here.

相关文章