SQL中二进制字符串的汉明距离
我的数据库中有一个表,我将 SHA256 哈希值存储在 BINARY(32) 列中.我正在寻找一种方法来计算列中条目与提供的值的汉明距离,即:
SELECT * FROM 表ORDER BY HAMMINGDISTANCE(hash, UNHEX(<在此处插入提供的 sha256 hash>)) ASC限制 10
(如果您想知道,字符串 A 和 B 的汉明距离定义为 BIT_COUNT(A^B)
,其中 ^ 是按位异或运算符,BIT_COUNT 返回 1 的数量在二进制字符串中).
现在,我知道 ^ 运算符和 BIT_COUNT 函数都只适用于整数,所以我想说可能唯一的方法是分解子字符串中的二进制字符串,将每个二进制子字符串转换为整数,逐串计算汉明距离,然后将它们相加.问题在于它听起来非常复杂,效率低下,绝对不优雅.因此,我的问题是:你能提出更好的方法吗?(请注意,我在共享主机上,因此我无法修改数据库服务器或加载库)
edit(1):显然可以在 PHP 中加载整个表并在那里进行计算,但我宁愿避免它,因为该表可能会变得非常大.
edit(2): 数据库服务器是 MySQL 5.1
edit(3):我在下面的回答中包含我刚刚在上面描述的代码.
edit(4):我刚刚发现使用 4 个 BIGINT 来存储散列而不是 BINARY(32) 会产生巨大的速度改进(快 100 倍以上).请参阅下面对我的回答的评论.
解决方案将数据存储在 BINARY
列中似乎是一种性能不佳的方法.获得良好性能的唯一快速方法是将 BINARY
列的内容拆分为多个 BIGINT
列,每个列包含原始数据的 8 字节子字符串.>
在我的例子中(32 字节),这意味着使用 4 个 BIGINT
列并使用这个函数:
创建函数汉明距离(A0 BIGINT, A1 BIGINT, A2 BIGINT, A3 BIGINT,B0 BIGINT、B1 BIGINT、B2 BIGINT、B3 BIGINT)返回 INT 确定性返回BIT_COUNT(A0 ^ B0) +BIT_COUNT(A1 ^ B1) +BIT_COUNT(A2 ^ B2) +BIT_COUNT(A3 ^ B3);
在我的测试中,使用这种方法比使用 BINARY
方法快 100 多倍.
FWIW,这是我在解释问题时暗示的代码.欢迎使用更好的方法来完成同样的事情(我特别不喜欢二进制 > 十六进制 > 十进制转换):
CREATE FUNCTION HAMMINGDISTANCE(A BINARY(32), B BINARY(32))返回 INT 确定性返回BIT_COUNT(转换(十六进制(子字符串(A,1,8)),16,10)^Conv(HEX(SUBSTRING(B, 1, 8)), 16, 10)) +BIT_COUNT(转换(十六进制(子字符串(A,9,8)),16,10)^Conv(HEX(SUBSTRING(B, 9, 8)), 16, 10)) +BIT_COUNT(转换(十六进制(子字符串(A,17,8)),16,10)^Conv(HEX(SUBSTRING(B, 17, 8)), 16, 10)) +BIT_COUNT(转换(十六进制(子字符串(A,25,8)),16,10)^Conv(HEX(SUBSTRING(B, 25, 8)), 16, 10));
I have a table in my DB where I store SHA256 hashes in a BINARY(32) column. I'm looking for a way to compute the Hamming distance of the entries in the column to a supplied value, i.e. something like:
SELECT * FROM table
ORDER BY HAMMINGDISTANCE(hash, UNHEX(<insert supplied sha256 hash here>)) ASC
LIMIT 10
(in case you're wondering, the Hamming distance of strings A and B is defined as BIT_COUNT(A^B)
, where ^ is the bitwise XOR operator and BIT_COUNT returns the number of 1s in the binary string).
Now, I know that both the ^ operator and BIT_COUNT function only work on INTEGERs and so I'd say that probably the only way to do it would be to break up the binary strings in substrings, cast each binary substring to integer, compute the Hamming distance substring-wise and then add them. The problem with this is that it sounds terribly complicated, not efficient and definitely not elegant. My question therefore is: could you suggest any better way? (please note that I'm on shared hosting and therefore I can't modify the DB server or load libraries)
edit(1): Obviously loading the whole table in PHP and doing the computations there would be possible but I'd rather avoid it because this table will probably grow quite large.
edit(2): The DB server is MySQL 5.1
edit(3): My answer below contains the code that I just described above.
edit(4): I just found out that using 4 BIGINTs to store the hash instead of a BINARY(32) yields massive speed improvements (more than 100 times faster). See the comments to my answer below.
解决方案It appears that storing the data in a BINARY
column is an approach bound to perform poorly. The only fast way to get decent performance is to split the content of the BINARY
column in multiple BIGINT
columns, each containing an 8-byte substring of the original data.
In my case (32 bytes) this would mean using 4 BIGINT
columns and using this function:
CREATE FUNCTION HAMMINGDISTANCE(
A0 BIGINT, A1 BIGINT, A2 BIGINT, A3 BIGINT,
B0 BIGINT, B1 BIGINT, B2 BIGINT, B3 BIGINT
)
RETURNS INT DETERMINISTIC
RETURN
BIT_COUNT(A0 ^ B0) +
BIT_COUNT(A1 ^ B1) +
BIT_COUNT(A2 ^ B2) +
BIT_COUNT(A3 ^ B3);
Using this approach, in my testing, is over 100 times faster than using the BINARY
approach.
FWIW, this is the code I was hinting at while explaining the problem. Better ways to accomplish the same thing are welcome (I especially don't like the binary > hex > decimal conversions):
CREATE FUNCTION HAMMINGDISTANCE(A BINARY(32), B BINARY(32))
RETURNS INT DETERMINISTIC
RETURN
BIT_COUNT(
CONV(HEX(SUBSTRING(A, 1, 8)), 16, 10) ^
CONV(HEX(SUBSTRING(B, 1, 8)), 16, 10)
) +
BIT_COUNT(
CONV(HEX(SUBSTRING(A, 9, 8)), 16, 10) ^
CONV(HEX(SUBSTRING(B, 9, 8)), 16, 10)
) +
BIT_COUNT(
CONV(HEX(SUBSTRING(A, 17, 8)), 16, 10) ^
CONV(HEX(SUBSTRING(B, 17, 8)), 16, 10)
) +
BIT_COUNT(
CONV(HEX(SUBSTRING(A, 25, 8)), 16, 10) ^
CONV(HEX(SUBSTRING(B, 25, 8)), 16, 10)
);
相关文章