Microsoft SQL 2005 中的自然(人类字母数字)排序
我们有一个大型数据库,上面有 DB 端分页.这很快,在几分之一秒内从数百万条记录中返回 50 行的页面.
We have a large database on which we have DB side pagination. This is quick, returning a page of 50 rows from millions of records in a small fraction of a second.
用户可以定义自己的排序方式,基本上是选择要排序的列.列是动态的 - 有些列包含数值、一些日期和一些文本.
Users can define their own sort, basically choosing what column to sort by. Columns are dynamic - some have numeric values, some dates and some text.
虽然大多数按预期排序的文本以愚蠢的方式排序.好吧,我说愚蠢,它对计算机有意义,但让用户感到沮丧.
While most sort as expected text sorts in a dumb way. Well, I say dumb, it makes sense to computers, but frustrates users.
例如,按字符串记录 id 排序给出类似:
For instance, sorting by a string record id gives something like:
rec1
rec10
rec14
rec2
rec20
rec3
rec4
...等等.
我希望考虑到这个数字,所以:
I want this to take account of the number, so:
rec1
rec2
rec3
rec4
rec10
rec14
rec20
我无法控制输入(否则我只会在前导 000 中格式化)并且我不能依赖单一格式 - 有些是诸如{alpha 代码}-{dept 代码}-{rec id"之类的东西}".
I can't control the input (otherwise I'd just format in leading 000s) and I can't rely on a single format - some are things like "{alpha code}-{dept code}-{rec id}".
我知道在 C# 中执行此操作的几种方法,但无法下拉所有记录对其进行排序,因为那样会很慢.
I know a few ways to do this in C#, but can't pull down all the records to sort them, as that would be to slow.
有谁知道在Sql server中快速应用自然排序的方法吗?
Does anyone know a way to quickly apply a natural sort in Sql server?
我们正在使用:
ROW_NUMBER() over (order by {field name} asc)
然后我们通过它进行分页.
And then we're paging by that.
我们可以添加触发器,但我们不会.他们所有的输入都是参数化的等等,但我不能改变格式——如果他们输入rec2"和rec10",他们希望它们像那样返回,并且以自然顺序返回.
We can add triggers, although we wouldn't. All their input is parametrised and the like, but I can't change the format - if they put in "rec2" and "rec10" they expect them to be returned just like that, and in natural order.
我们有针对不同客户采用不同格式的有效用户输入.
We have valid user input that follows different formats for different clients.
可能是 rec1, rec2, rec3, ... rec100, rec101
One might go rec1, rec2, rec3, ... rec100, rec101
虽然另一个可能是:grp1rec1、grp1rec2、... grp20rec300、grp20rec301
While another might go: grp1rec1, grp1rec2, ... grp20rec300, grp20rec301
当我说我们不能控制输入时,我的意思是我们不能强迫用户改变这些标准——他们有一个像 grp1rec1 这样的值,我不能把它重新格式化为 grp01rec001,因为这会改变使用的东西用于查找和链接到外部系统.
When I say we can't control the input I mean that we can't force users to change these standards - they have a value like grp1rec1 and I can't reformat it as grp01rec001, as that would be changing something used for lookups and linking to external systems.
这些格式变化很大,但通常是字母和数字的混合.
These formats vary a lot, but are often mixtures of letters and numbers.
在 C# 中对它们进行排序很容易 - 只需将其分解为 { "grp", 20, "rec", 301 }
然后依次比较序列值.
Sorting these in C# is easy - just break it up into { "grp", 20, "rec", 301 }
and then compare sequence values in turn.
但是可能有数百万条记录并且数据是分页的,我需要在 SQL 服务器上进行排序.
However there may be millions of records and the data is paged, I need the sort to be done on the SQL server.
SQL 服务器按值排序,而不是比较 - 在 C# 中,我可以将值分开进行比较,但在 SQL 中,我需要一些逻辑(非常快)获得一致排序的单个值.
SQL server sorts by value, not comparison - in C# I can split the values out to compare, but in SQL I need some logic that (very quickly) gets a single value that consistently sorts.
@moebius - 您的答案可能会奏效,但为所有这些文本值添加排序键确实感觉像是一种丑陋的妥协.
@moebius - your answer might work, but it does feel like an ugly compromise to add a sort-key for all these text values.
推荐答案
我见过的大多数基于 SQL 的解决方案在数据变得足够复杂时(例如,其中包含一两个以上的数字)就会失效.最初,我尝试在 T-SQL 中实现符合我要求的 NaturalSort 函数(除其他外,处理字符串中任意数量的数字),但性能方式太慢了.
Most of the SQL-based solutions I have seen break when the data gets complex enough (e.g. more than one or two numbers in it). Initially I tried implementing a NaturalSort function in T-SQL that met my requirements (among other things, handles an arbitrary number of numbers within the string), but the performance was way too slow.
最终,我在 C# 中编写了一个标量 CLR 函数以实现自然排序,即使使用未优化的代码,从 SQL Server 调用它的性能也非常快.它具有以下特点:
Ultimately, I wrote a scalar CLR function in C# to allow for a natural sort, and even with unoptimized code the performance calling it from SQL Server is blindingly fast. It has the following characteristics:
- 将对前 1,000 个左右的字符进行正确排序(易于在代码中修改或制成参数)
- 正确排序小数,因此 123.333 排在 123.45 之前
- 由于上述原因,可能无法正确对 IP 地址等内容进行排序;如果您希望不同的行为,请修改代码
- 支持对包含任意数量数字的字符串进行排序
- 将正确排序长达 25 位的数字(易于在代码中修改或制成参数)
代码在这里:
using System;
using System.Data.SqlTypes;
using System.Text;
using Microsoft.SqlServer.Server;
public class UDF
{
[SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic=true)]
public static SqlString Naturalize(string val)
{
if (String.IsNullOrEmpty(val))
return val;
while(val.Contains(" "))
val = val.Replace(" ", " ");
const int maxLength = 1000;
const int padLength = 25;
bool inNumber = false;
bool isDecimal = false;
int numStart = 0;
int numLength = 0;
int length = val.Length < maxLength ? val.Length : maxLength;
//TODO: optimize this so that we exit for loop once sb.ToString() >= maxLength
var sb = new StringBuilder();
for (var i = 0; i < length; i++)
{
int charCode = (int)val[i];
if (charCode >= 48 && charCode <= 57)
{
if (!inNumber)
{
numStart = i;
numLength = 1;
inNumber = true;
continue;
}
numLength++;
continue;
}
if (inNumber)
{
sb.Append(PadNumber(val.Substring(numStart, numLength), isDecimal, padLength));
inNumber = false;
}
isDecimal = (charCode == 46);
sb.Append(val[i]);
}
if (inNumber)
sb.Append(PadNumber(val.Substring(numStart, numLength), isDecimal, padLength));
var ret = sb.ToString();
if (ret.Length > maxLength)
return ret.Substring(0, maxLength);
return ret;
}
static string PadNumber(string num, bool isDecimal, int padLength)
{
return isDecimal ? num.PadRight(padLength, '0') : num.PadLeft(padLength, '0');
}
}
要注册它以便您可以从 SQL Server 调用它,请在查询分析器中运行以下命令:
To register this so that you can call it from SQL Server, run the following commands in Query Analyzer:
CREATE ASSEMBLY SqlServerClr FROM 'SqlServerClr.dll' --put the full path to DLL here
go
CREATE FUNCTION Naturalize(@val as nvarchar(max)) RETURNS nvarchar(1000)
EXTERNAL NAME SqlServerClr.UDF.Naturalize
go
然后,您可以像这样使用它:
Then, you can use it like so:
select *
from MyTable
order by dbo.Naturalize(MyTextField)
注意:如果您在 SQL Server 中遇到类似在 .NET Framework 中执行用户代码被禁用的错误.启用clr enabled"配置选项.,按照说明此处 启用它.确保在这样做之前考虑安全隐患.如果您不是数据库管理员,请确保在对服务器配置进行任何更改之前与您的管理员进行讨论.
Note: If you get an error in SQL Server along the lines of Execution of user code in the .NET Framework is disabled. Enable "clr enabled" configuration option., follow the instructions here to enable it. Make sure you consider the security implications before doing so. If you are not the db admin, make sure you discuss this with your admin before making any changes to the server configuration.
注意2:此代码未正确支持国际化(例如,假设小数点为.",未针对速度进行优化等.欢迎提出改进建议!
Note2: This code does not properly support internationalization (e.g., assumes the decimal marker is ".", is not optimized for speed, etc. Suggestions on improving it are welcome!
将该函数重命名为 Naturalize 而不是 NaturalSort,因为它不进行任何实际排序.
Renamed the function to Naturalize instead of NaturalSort, since it does not do any actual sorting.
相关文章