特殊字符(Hawaiian 'Okina)导致奇怪的字符串行为

2021-09-10 00:00:00 unicode tsql sql-server collation

夏威夷语引用在使用 T-SQL 时有一些奇怪的行为结合字符串函数.这里发生了什么?我错过了什么吗?其他角色是否也有同样的问题?

The Hawaiian quote has some weird behavior in T-SQL when using it in conjunction with string functions. What's going on here? Am I missing something? Do other characters suffer from this same problem?

SELECT UNICODE(N'ʻ') -- Returns 699 as expected.

SELECT REPLACE(N'"ʻ', '"', '_') -- Returns "ʻ, I expected _ʻ

SELECT REPLACE(N'aʻ', 'a', '_') -- Returns aʻ, I expected _ʻ

SELECT REPLACE(N'"ʻ', N'ʻ', '_') -- Returns __, I expected "_

SELECT REPLACE(N'-', N'ʻ', '_') -- Returns -, I expected -

另外,在 LIKE 中使用时很奇怪,例如:

Also, strange when used in a LIKE for example:

DECLARE @table TABLE ([Name] NVARCHAR(MAX))
INSERT INTO
    @table
VALUES
    ('John'),
    ('Jane')

SELECT
    *
FROM
    @table
WHERE
    [Name] LIKE N'%ʻ%' -- This returns both records. I expected none.

推荐答案

夏威夷语引号在与字符串函数结合使用时在 T-SQL 中有一些奇怪的行为....其他角色是否也有同样的问题?

The Hawaiian quote has some weird behavior in T-SQL when using it in conjunction with string functions. ... Do other characters suffer from this same problem?

一些事情:

  1. 这不是夏威夷语的引用":它是声门停顿"影响发音.
  2. 这不是奇怪"的行为:这不是您所期望的.
  3. 这种行为并不是一个问题",尽管是的,还有其他角色表现出类似的行为.例如,以下字符(U+02DA 环上方)的行为略有不同,具体取决于它位于字符的哪一侧:

  1. This is not a Hawaiian "quote": it's a "glottal stop" which affects pronunciation.
  2. It is not "weird" behavior: it's just not what you were expecting.
  3. This behavior is not specifically a "problem", though yes, there are other characters that exhibit similar behavior. For example, the following character (U+02DA Ring Above) behaves slightly differently depending on which side of a character it is on:

SELECT REPLACE(N'a˚aa' COLLATE Latin1_General_100_CI_AS, N'˚a',  N'_'); -- Returns a_a
SELECT REPLACE(N'a˚aa' COLLATE Latin1_General_100_CI_AS, N'a˚',  N'_'); -- Returns _aa

现在,任何使用 SQL Server 2008 或更新版本的人都应该使用 100(或更新)级别的排序规则.他们在 100 系列中添加了很多排序权重和大写/小写映射,这些映射不在 90 系列、非编号系列或大部分过时的 SQL Server 排序规则(名称以 SQL_).

Now, anyone using SQL Server 2008 or newer should be using a 100 (or newer) level collation. They added a lot of sort weights and uppercase/lowercase mappings in the 100 series that aren't in the 90 series, or the non-numbered series, or the mostly obsolete SQL Server collations (those with names starting with SQL_).

这里的问题不是它不等同于任何其他字符(在二进制排序规则之外),实际上它确实等同于另一个字符(U+0312 组合上方的转逗号):

The issue here is not that it doesn't equate to any other character (outside of a binary collation), and in fact it actually does equate to one other character (U+0312 Combining Turned Comma Above):

;WITH nums AS
(
  SELECT TOP (65536) (ROW_NUMBER() OVER (ORDER BY @@MICROSOFTVERSION) - 1) AS [num]
  FROM   [master].sys.all_columns ac1
  CROSS JOIN   [master].sys.all_columns ac2
)
SELECT nums.[num] AS [INTvalue],
       CONVERT(BINARY(2), nums.[num]) AS [BINvalue],
       NCHAR(nums.[num]) AS [Character]
FROM   nums
WHERE  NCHAR(nums.[num]) = NCHAR(0x02BB) COLLATE Latin1_General_100_CI_AS;
/*
INTvalue    BINvalue    Character
699         0x02BB      ʻ
786         0x0312      ̒
*/

问题在于这是一个间距修饰符"字符,因此它会附加到它之前或之后的字符并修改其含义/发音,具体取决于您正在处理的修饰符字符.

The issue is that this is a "spacing modifier" character, and so it attaches to, and modifies the meaning / pronunciation of, the character before or after it, depending on which modifier character you are dealing with.

根据 Unicode 标准第 7 章(欧洲-I),第 7.8 节(修饰符),第 323 页(文档的,不是 PDF 的):

According to the Unicode Standard, Chapter 7 (Europe-I), Section 7.8 (Modifier Letters), Page 323 (of the document, not of the PDF):

修饰字母,在 Unicode 标准中使用的意义上,是通常与其他字母相邻书写的字母或符号,并以某种方式修改它们的用法.它们不是正式的组合标记(gc = Mn 或 gc = Mc),也没有与它们修改的基本字母以图形方式组合.他们本身就是基本角色.他们修改其他字母的意义更多的是他们在使用中的语义问题;它们的功能往往就像变音符号一样,表示字母发音的变化,或以其他方式区分字母的用法.通常,这种变音符号修饰适用于修饰符字母之前的字符,但修饰符字母有时可能会修饰后面的字符.有时,修饰字母可能只是单独代表它自己的声音.
...

7.8 Modifier Letters

Modifier letters, in the sense used in the Unicode Standard, are letters or symbols that are typically written adjacent to other letters and which modify their usage in some way. They are not formally combining marks (gc = Mn or gc = Mc) and do not graphically combine with the base letter that they modify. They are base characters in their own right. The sense in which they modify other letters is more a matter of their semantics in usage; they often tend to function as if they were diacritics, indicating a change in pronunciation of a letter, or otherwise distinguishing a letter’s use. Typically this diacritic modification applies to the character preceding the modifier letter, but modifier letters may sometimes modify a following character. Occasionally a modifier letter may simply stand alone representing its own sound.
...

拼音用法.此块中的大多数修饰字母都是拼音修饰符,包括覆盖国际音标所需的字符.在许多情况下,修饰字母用于表示相邻字母的发音在某些方面有所不同——因此得名修饰符".它们也用于标记重音或音调,或者可能只是代表他们自己的声音.

Phonetic Usage. The majority of the modifier letters in this block are phonetic modifiers, including the characters required for coverage of the International Phonetic Alphabet. In many cases, modifier letters are used to indicate that the pronunciation of an adjacent letter is different in some way—hence the name "modifier." They are also used to mark stress or tone, or may simply represent their own sound.

 
下面的例子应该有助于说明.我使用的是 100 级排序规则,它需要区分重音(即名称包含 _AS):

SELECT REPLACE(N'ʻ'    COLLATE Latin1_General_100_CI_AS, N'ʻ',   N'_'); -- Returns _
SELECT REPLACE(N'ʻa'   COLLATE Latin1_General_100_CI_AS, N'ʻ',   N'_'); -- Returns _a
SELECT REPLACE(N'ʻaa'  COLLATE Latin1_General_100_CI_AS, N'ʻ',   N'_'); -- Returns _aa
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'ʻ',   N'_'); -- Returns __aa

SELECT REPLACE(N'ʻaa'  COLLATE Latin1_General_100_CI_AS, N'ʻa',  N'_'); -- Returns ʻ__
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'ʻa',  N'_'); -- Returns aʻ__

SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'aʻ',  N'_'); -- Returns _aa
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'aʻa', N'_'); -- Returns _a

SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'a',   N'_'); -- Returns aʻ__
SELECT REPLACE(N'אʻaa' COLLATE Latin1_General_100_CI_AS, N'א',   N'_'); -- Returns אʻaa
SELECT REPLACE(N'ffʻaa' COLLATE Latin1_General_100_CI_AS, N'ff',   N'_'); -- Returns ffʻaa
SELECT REPLACE(N'ffaa'  COLLATE Latin1_General_100_CI_AS, N'ff',   N'_'); -- Returns _aa



SELECT CHARINDEX(N'a', N'aʻa' COLLATE Latin1_General_100_CI_AS); -- 3
SELECT CHARINDEX(N'a', N'aʻa' COLLATE Latin1_General_100_CI_AI); -- 1



SELECT 1 WHERE N'a' = N'aʻ' COLLATE Latin1_General_100_CI_AS; -- (0 rows returned)
SELECT 2 WHERE N'a' = N'aʻ' COLLATE Latin1_General_100_CI_AI; -- 2

如果您需要以忽略其预期语言行为的方式处理此类字符,那么是的,您必须使用二进制排序规则.在这种情况下,请使用最新的排序规则和 BIN2 而不是 BIN(假设您使用的是 SQL Server 2005 或更新版本).含义:

If you need to deal with such characters in a way that ignores their intended linguistic behavior, then yes, you must use a binary collation. In such cases, please use the most recent level of collation, and BIN2 instead of BIN (assuming you are using SQL Server 2005 or newer). Meaning:

  • SQL Server 2000:Latin1_General_BIN
  • SQL Server 2005:Latin1_General_BIN2
  • SQL Server 2008、2008 R2、2012、2014 和 2016:Latin1_General_100_BIN2
  • SQL Server 2017 及更新版本:Japanese_XJIS_140_BIN2

如果您想知道我为什么提出这个建议,请参阅:

If you are curious why I make that recommendation, please see:

各种二进制排序规则之间的差异(文化、版本和 BIN 与 BIN2)

此外,有关排序规则/Unicode/编码等的更多信息,请访问:排序规则信息

And, for more information on collations / Unicode / encodings / etc, please visit: Collations Info

相关文章