Fastest way to remove non-numeric characters from a VARCHAR in SQL Server

2021-12-02 00:00:00 performance optimization sql sql-server

I'm writing an import utility that uses phone numbers as a unique key within the import.

I need to check that the phone number does not already exist in my DB. The problem is that phone numbers in the DB could contain dashes, parentheses, and possibly other characters. I wrote a function to remove these, but it is slow: with thousands of records in my DB and thousands of records to import at once, the process can be unacceptably slow. I've already indexed the phone number column.

I tried using the script from this post:
T-SQL trim &nbsp (and other non-alphanumeric characters)

But that did not speed things up.

Is there a faster way to remove non-numeric characters? I need something that performs well when 10,000 to 100,000 records have to be compared.

Whatever is done needs to be fast.

Update
Given what people responded with, I think I'm going to have to clean the fields before I run the import utility.

To answer the question of what I'm writing the import utility in: it is a C# app. I'm now comparing BIGINT to BIGINT, with no need to alter the DB data, and I'm still taking a performance hit with a very small set of data (about 2000 records).

Could comparing BIGINT to BIGINT be slowing things down?

I've optimized the code side of my app as much as I can (removed regexes, removed unnecessary DB calls). Although I can no longer isolate SQL as the source of the problem, I still feel like it is.

Recommended answer

I may misunderstand, but you've got two sets of data to strip the characters from: the current data in the database, and then a new set whenever you import.

For updating the existing records, I would just use SQL, since that only has to happen once.

However, SQL isn't optimized for this sort of string operation. Since you said you are writing an import utility, I would do those updates in the context of the import utility itself, not in SQL. That should perform much better. What are you writing the utility in?
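As a rough sketch of what doing this inside the import utility could look like, assuming the existing phone numbers are read from the database once and held in memory: the class and method names below (PhoneKeys, StripNonDigits, FindNew) are made up for illustration and are not from the question. It strips everything except the digits 0-9 without a regex and uses a HashSet for the existence check, so there is no per-record database call.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class, not part of the original question or answer.
public static class PhoneKeys
{
    // Keeps only the ASCII digits 0-9 and drops everything else
    // (dashes, parentheses, spaces, letters, and so on).
    public static string StripNonDigits(string input)
    {
        if (string.IsNullOrEmpty(input)) return string.Empty;

        var buffer = new char[input.Length];
        int count = 0;
        foreach (char c in input)
        {
            if (c >= '0' && c <= '9') buffer[count++] = c;
        }
        return new string(buffer, 0, count);
    }

    // Returns the digits-only form of each incoming number that is not
    // already present in the existing set. Duplicates inside the import
    // and inputs with no digits at all are skipped.
    public static List<string> FindNew(IEnumerable<string> existing,
                                       IEnumerable<string> incoming)
    {
        // Assumption: existing numbers were loaded from the DB in one query.
        var known = new HashSet<string>();
        foreach (string phone in existing) known.Add(StripNonDigits(phone));

        var fresh = new List<string>();
        foreach (string phone in incoming)
        {
            string key = StripNonDigits(phone);
            // HashSet.Add returns false when the key is already known.
            if (key.Length > 0 && known.Add(key)) fresh.Add(key);
        }
        return fresh;
    }
}
```

For example, with "555-123-4567" already in the database, importing "(555) 123-4567" would be treated as a duplicate, because both reduce to the same digits-only key.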

Also, I may be completely misunderstanding the process, so I apologize if this is off-base.


For the initial update, if you are using SQL Server 2005 or later, you could try a CLR function. Here's a quick one using a regex. I'm not sure how the performance would compare; I've never used this myself except for a quick test just now.

using System.Data.SqlTypes;
using System.Text.RegularExpressions;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    // \D matches any character that is not a digit.
    // Built once and reused instead of being recreated for every row.
    private static readonly Regex NonDigit = new Regex(@"\D");

    [SqlFunction]
    public static SqlString StripNonNumeric(SqlString input)
    {
        // Pass SQL NULL through; reading input.Value on a NULL would throw.
        if (input.IsNull) return SqlString.Null;
        return NonDigit.Replace(input.Value, "");
    }
}

After this is deployed, to update you could just use:

UPDATE table SET phoneNumber = dbo.StripNonNumeric(phoneNumber)
