Compute MD5 hash of a UTF8 string

2021-12-10 00:00:00 encoding hash tsql sql-server sql-server-2008-r2

I have an SQL table in which I store large string values that must be unique. In order to ensure the uniqueness, I have a unique index on a column in which I store a string representation of the MD5 hash of the large string.

The C# app that saves these records uses the following method to do the hashing:

    public static string CreateMd5HashString(byte[] input)
    {
        var hashBytes = MD5.Create().ComputeHash(input);
        return string.Join("", hashBytes.Select(b => b.ToString("X")));
    }

In order to call this, I first convert the string to byte[] using the UTF-8 encoding:

    // this is what I use in my app
    CreateMd5HashString(Encoding.UTF8.GetBytes("abc"))
    // result: 90150983CD24FB0D6963F7D28E17F72
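
One detail worth noticing in the result above: it is 31 characters long, not 32. The sketch below is a cross-check in Python (not part of the app; `app_md5_string` is a hypothetical name). It assumes that `ToString("X")` does not zero-pad, so a byte such as 0x01 is written as "1". Any SQL-side comparison against the stored string would have to reproduce that same formatting.

```python
import hashlib

def app_md5_string(data: bytes) -> str:
    """Mimic the C# method: MD5, then each byte as unpadded uppercase hex."""
    digest = hashlib.md5(data).digest()
    # format(b, "X") writes 0x01 as "1", not "01", like ToString("X")
    return "".join(format(b, "X") for b in digest)

plain_hex = hashlib.md5("abc".encode("utf-8")).hexdigest().upper()
app_value = app_md5_string("abc".encode("utf-8"))

print(plain_hex)  # zero-padded hex, 32 characters
print(app_value)  # unpadded form, as stored by the app
```

The second byte of the digest is 0x01, which is why the app's string loses one character compared with the zero-padded form.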

Now I would like to be able to implement this hashing function in SQL, using the HASHBYTES function, but I get a different value:

    print hashbytes('md5', N'abc')
    -- result: 0xCE1473CF80C6B3FDA8E3DFC006ADC315

This is because SQL computes the MD5 of the UTF-16 representation of the string. I get the same result in C# if I do CreateMd5HashString(Encoding.Unicode.GetBytes("abc")).
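
To make the difference concrete, here is a small Python sketch (an illustration only; `md5_hex` is a hypothetical helper). It assumes NVARCHAR data reaches HASHBYTES as UTF-16 little-endian bytes without a byte order mark, which is also what `Encoding.Unicode.GetBytes` produces.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the MD5 digest as zero-padded uppercase hex."""
    return hashlib.md5(data).hexdigest().upper()

text = "abc"
utf8_bytes = text.encode("utf-8")       # one byte per ASCII character
utf16_bytes = text.encode("utf-16-le")  # two bytes per character, no BOM

print(utf8_bytes.hex(), md5_hex(utf8_bytes))
print(utf16_bytes.hex(), md5_hex(utf16_bytes))
```

The same three characters become 3 bytes in one encoding and 6 in the other, so the two digests cannot match.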

I cannot change the way hashing is done in the application.

Is there a way to get SQL Server to compute the MD5 hash of the UTF-8 bytes of the string?

I looked up similar questions and tried using collations, but have had no luck so far.

Recommended answer

You need to create a UDF that converts the NVARCHAR data to its UTF-8 byte representation. Say it is called dbo.NCharToUTF8Binary; then you can do:

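    hashbytes('md5', dbo.NCharToUTF8Binary(N'abc', 0))

Pass 0 for @modified to get standard UTF-8, which is what Encoding.UTF8 produces in C#; 1 selects modified UTF-8, which encodes U+0000 as two bytes.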

Here is a UDF which will do that:

    create function dbo.NCharToUTF8Binary(@txt nvarchar(max), @modified bit)
    returns varbinary(max)
    as
    begin
        -- Note: This is not the fastest possible routine.
        -- If you want a fast routine, use SQLCLR.
        set @modified = isnull(@modified, 0)

        -- First shred into a table, one row per UTF-16 code unit.
        declare @chars table (
            ix int identity primary key,
            codepoint int,
            utf8 varbinary(6)
        )
        declare @ix int
        set @ix = 0
        while @ix < datalength(@txt)/2 -- datalength, not len, so trailing spaces are counted
        begin
            set @ix = @ix + 1
            insert @chars(codepoint)
            select unicode(substring(@txt, @ix, 1))
        end

        -- Now look for surrogate pairs.
        -- If we find a pair (lead followed by trail) we will pair them.
        -- High surrogate is U+D800 to U+DBFF
        -- Low surrogate is U+DC00 to U+DFFF
        -- Each half carries 10 payload bits, so mask with 0x03ff and shift the lead by 0x0400.
        update c1
        set codepoint = ((c1.codepoint & 0x03ff) * 0x0400) + (c2.codepoint & 0x03ff) + 0x10000
        from @chars c1
        inner join @chars c2 on c1.ix = c2.ix - 1
        where c1.codepoint >= 0xD800 and c1.codepoint <= 0xDBFF
          and c2.codepoint >= 0xDC00 and c2.codepoint <= 0xDFFF

        -- Get rid of the trailing half of the pair where found.
        delete c2
        from @chars c1
        inner join @chars c2 on c1.ix = c2.ix - 1
        where c1.codepoint >= 0x10000

        -- Now we UTF-8 encode each codepoint.
        -- Lone surrogate halves will still be here,
        -- so they will be encoded as if they were not surrogate pairs.
        update c
        set utf8 = case
            -- One-byte encodings (modified UTF-8 outputs zero as a two-byte encoding)
            when codepoint <= 0x7f and (@modified = 0 or codepoint <> 0)
            then cast(substring(cast(codepoint as binary(4)), 4, 1) as varbinary(6))
            -- Two-byte encodings
            when codepoint <= 0x07ff
            then substring(cast((0x00C0 + ((codepoint/0x40) & 0x1f)) as binary(4)), 4, 1)
               + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)), 4, 1)
            -- Three-byte encodings
            when codepoint <= 0x0ffff
            then substring(cast((0x00E0 + ((codepoint/0x1000) & 0x0f)) as binary(4)), 4, 1)
               + substring(cast((0x0080 + ((codepoint/0x40) & 0x3f)) as binary(4)), 4, 1)
               + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)), 4, 1)
            -- Four-byte encodings
            when codepoint <= 0x1FFFFF
            then substring(cast((0x00F0 + ((codepoint/0x00040000) & 0x07)) as binary(4)), 4, 1)
               + substring(cast((0x0080 + ((codepoint/0x1000) & 0x3f)) as binary(4)), 4, 1)
               + substring(cast((0x0080 + ((codepoint/0x40) & 0x3f)) as binary(4)), 4, 1)
               + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)), 4, 1)
            end
        from @chars c

        -- Finally concatenate them all and return.
        declare @ret varbinary(max)
        set @ret = cast('' as varbinary(max))
        select @ret = @ret + utf8 from @chars c order by ix
        return @ret
    end
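
If you want to sanity-check the approach outside SQL Server, the same three steps (split into UTF-16 code units, combine surrogate pairs, encode each code point by hand) can be sketched in Python and compared against the built-in encoder. This is an illustration only, not the UDF itself, and all function names below are hypothetical. Pairs are combined with the standard formula ((lead & 0x3FF) << 10) + (trail & 0x3FF) + 0x10000.

```python
import struct

def utf16_units(text):
    """Split a string into 16-bit code units, as NVARCHAR stores them."""
    raw = text.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def combine_surrogates(units):
    """Merge each lead/trail surrogate pair into one code point."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xD800 <= u <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            out.append(((u & 0x3FF) << 10) + (nxt & 0x3FF) + 0x10000)
            i += 2
        else:
            out.append(u)  # ordinary unit, or a lone surrogate left as-is
            i += 1
    return out

def encode_codepoint(cp, modified=False):
    """UTF-8 encode one code point; modified=True writes U+0000 as C0 80."""
    if cp <= 0x7F and (not modified or cp != 0):
        return bytes([cp])
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def nchar_to_utf8(text, modified=False):
    """Hand-rolled UTF-16 to UTF-8 conversion, mirroring the steps above."""
    cps = combine_surrogates(utf16_units(text))
    return b"".join(encode_codepoint(cp, modified) for cp in cps)

for sample in ["abc", "h\u00e9llo", "\u65e5\u672c\u8a9e", "\U0001F600 ok"]:
    print(sample.encode("unicode_escape").decode(),
          nchar_to_utf8(sample) == sample.encode("utf-8"))
```

Samples that include characters outside the Basic Multilingual Plane are the useful ones here, since they are the only inputs that exercise the surrogate-pair step and the four-byte branch.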
