What is normalized UTF-8?

2021-12-26 00:00:00 unicode c php unicode-normalization

The ICU project (which also now has a PHP library) contains the classes needed to help normalize UTF-8 strings to make it easier to compare values when searching.

However, I'm trying to figure out what this means for applications. For example, in which cases do I want "Canonical Equivalence" instead of "Compatibility Equivalence", or vice versa?

Recommended answer

Everything You Never Wanted to Know about Unicode Normalization

Canonical Normalization

Unicode includes multiple ways to encode some characters, most notably accented characters. Canonical normalization changes the code points into a canonical encoding form. The resulting code points should appear identical to the original ones barring any bugs in the fonts or rendering engine.
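As a small illustration of the point above, here is a sketch using Python's standard `unicodedata` module (the question is about ICU and PHP; Python is used here only because it makes the idea easy to try). The helper name `canonically_equal` is made up for this example.

```python
import unicodedata

# Two encodings of the same visible character "é".
composed = "\u00e9"     # single code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical normalization (NFC here)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)                   # False: different code points
print(canonically_equal(composed, decomposed))  # True: same canonical form
```

A plain code-point comparison treats the two strings as different even though they should render identically; normalizing both sides first removes that difference.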

Because the results appear identical, it is always safe to apply canonical normalization to a string before storing or displaying it, as long as you can tolerate the result not being bit for bit identical to the input.

Canonical normalization comes in two forms: NFD and NFC. The two are equivalent in the sense that one can convert between them without loss. Comparing two strings under NFC will always give the same result as comparing them under NFD.
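The equivalence of the two forms can be spot-checked on a handful of strings. This is only a sketch over a few hand-picked samples (chosen for this example), not a proof, and `equal_under` is a hypothetical helper name.

```python
import unicodedata
from itertools import product

# A few hand-picked samples: composed and decomposed "é", three spellings
# of "Å" (ANGSTROM SIGN, precomposed letter, A + combining ring), plain "e".
samples = ["\u00e9", "e\u0301", "\u212b", "\u00c5", "A\u030a", "e"]


def equal_under(form: str, a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the given form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Comparing under NFC gives the same verdict as comparing under NFD.
verdicts_agree = all(
    equal_under("NFC", a, b) == equal_under("NFD", a, b)
    for a, b in product(samples, repeat=2)
)

# Going through NFD and back to NFC loses nothing relative to NFC directly.
round_trip_ok = all(
    unicodedata.normalize("NFC", unicodedata.normalize("NFD", s))
    == unicodedata.normalize("NFC", s)
    for s in samples
)

print(verdicts_agree, round_trip_ok)  # True True
```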

NFD has the characters fully expanded out. This is the faster normalization form to calculate, but it results in more code points (i.e. it uses more space).

If you just want to compare two strings that are not already normalized, this is the preferred normalization form unless you know you need compatibility normalization.

NFC recombines code points when possible after running the NFD algorithm. This takes a little longer, but results in shorter strings.
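To make the space trade-off concrete, the sketch below normalizes one sample word (picked for this example) both ways and counts code points and UTF-8 bytes. It says nothing about speed, which depends on the implementation and the input.

```python
import unicodedata

word = "r\u00e9sum\u00e9"  # "résumé" written with precomposed é

nfd = unicodedata.normalize("NFD", word)  # each é becomes e + U+0301
nfc = unicodedata.normalize("NFC", word)  # each é stays one code point

print(len(nfd), len(nfc))  # 8 6 (code points)
print(len(nfd.encode("utf-8")), len(nfc.encode("utf-8")))  # 10 8 (bytes)
```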

Compatibility Normalization

Unicode also includes many characters that really do not belong, but were used in legacy character sets. Unicode added these to allow text in those character sets to be processed as Unicode, and then be converted back without loss.

Compatibility normalization converts these to the corresponding sequence of "real" characters, and also performs canonical normalization. The results of compatibility normalization may not appear identical to the originals.

Characters that include formatting information are replaced with ones that do not. For example, the character ⁹ gets converted to 9. Other conversions don't involve formatting differences. For example, the Roman numeral character Ⅸ is converted to the regular letters IX.

Obviously, once this transformation has been performed, it is no longer possible to losslessly convert back to the original character set.
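The two examples above, and the loss of information, can be reproduced with Python's `unicodedata` module. This is a sketch; the short wrappers `nfc` and `nfkc` are just conveniences defined for this example.

```python
import unicodedata

superscript_nine = "\u2079"  # ⁹ SUPERSCRIPT NINE
roman_nine = "\u2168"        # Ⅸ ROMAN NUMERAL NINE


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)


# Canonical normalization leaves compatibility characters alone.
print(nfc(superscript_nine) == superscript_nine)  # True

# Compatibility normalization replaces them with ordinary characters.
print(nfkc(superscript_nine))  # 9
print(nfkc(roman_nine))        # IX

# After NFKC, "x⁹" and "x9" are the same string, so the original
# distinction cannot be recovered from the normalized text.
print(nfkc("x\u2079") == nfkc("x9"))  # True
```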

The Unicode Consortium suggests thinking of compatibility normalization like a ToUpperCase transform. It is something that may be useful in some circumstances, but you should not just apply it willy-nilly.

An excellent use case would be a search engine since you would probably want a search for 9 to match ⁹.
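A toy version of that search-engine idea might look like the sketch below. The names `search_key` and `search`, the sample documents, and the extra case folding are all assumptions made for this example; a real search engine would do far more.

```python
import unicodedata


def search_key(text: str) -> str:
    """Build a matching key: NFKC plus case folding (case folding is an
    extra assumption of this sketch, not part of normalization)."""
    return unicodedata.normalize("NFKC", text).casefold()


def search(query: str, docs: list[str]) -> list[str]:
    """Return the ORIGINAL documents whose key contains the query's key."""
    q = search_key(query)
    return [d for d in docs if q in search_key(d)]


documents = ["x\u2079 + 1", "Chapter \u2168", "plain text"]

print(search("9", documents))   # ['x⁹ + 1']
print(search("ix", documents))  # ['Chapter Ⅸ']
```

Note that the keys are used only for matching; the function returns the untouched originals, so nothing compatibility-normalized is ever shown to the user.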

One thing you should probably not do is display the result of applying compatibility normalization to the user.

Compatibility normalization comes in two forms: NFKD and NFKC. They have the same relationship to each other as NFD and NFC.

Any string in NFKC is inherently also in NFC, and the same holds for NFKD and NFD. Thus NFKD(x)=NFD(NFKC(x)), and NFKC(x)=NFC(NFKD(x)), etc.
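These identities can be spot-checked with a few sample strings. Again, this is a sketch over hand-picked inputs rather than a proof, and the helper `norm` exists only in this example.

```python
import unicodedata


def norm(form: str, s: str) -> str:
    return unicodedata.normalize(form, s)


# Hand-picked samples: é (composed and decomposed), ⁹, Ⅸ, the "fi"
# ligature, and ANGSTROM SIGN.
samples = ["\u00e9", "e\u0301", "\u2079", "\u2168", "\ufb01", "\u212b"]

# NFKD(x) = NFD(NFKC(x)) and NFKC(x) = NFC(NFKD(x))
identities_hold = all(
    norm("NFKD", x) == norm("NFD", norm("NFKC", x))
    and norm("NFKC", x) == norm("NFC", norm("NFKD", x))
    for x in samples
)

# A string in NFKC is already in NFC; a string in NFKD is already in NFD.
already_canonical = all(
    norm("NFC", norm("NFKC", x)) == norm("NFKC", x)
    and norm("NFD", norm("NFKD", x)) == norm("NFKD", x)
    for x in samples
)

print(identities_hold, already_canonical)  # True True
```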

If in doubt, go with canonical normalization. Choose NFC or NFD based on the space/speed trade-off applicable, or based on what is required by something you are inter-operating with.
