Unicode encoding for string literals in C++11
Following a related question, I'd like to ask about the new character and string literal types in C++11. It seems that we now have four kinds of characters and five kinds of string literals. The character types:
char a = '\x30'; // character, no semantics
wchar_t b = L'\xFFEF'; // wide character, no semantics
char16_t c = u'\u00F6'; // 16-bit, assumed UTF16?
char32_t d = U'\U0010FFFF'; // 32-bit, assumed UCS-4
And the string literals:
char A[] = "Hello\x0A"; // byte string, "narrow encoding"
wchar_t B[] = L"Hell\xF6\x0A"; // wide string, impl-def'd encoding
char16_t C[] = u"Hell\u00F6"; // (1)
char32_t D[] = U"Hell\U000000F6\U0010FFFF"; // (2)
char E[] = u8"\u00F6\U0010FFFF"; // (3), element type is char in C++11 (char8_t from C++20)
The question is this: Are the \x/\u/\U character references freely combinable with all string types? Are all the string types fixed-width, i.e. do the arrays contain precisely as many elements as appear in the literal, or do \x/\u/\U references get expanded into a variable number of bytes? Do u"" and u8"" strings have encoding semantics, e.g. can I say char16_t x[] = u"\U0010FFFF", and the non-BMP codepoint gets encoded into a two-unit UTF16 sequence? And similarly for u8? In (1), can I write lone surrogates with \u? Finally, are any of the string functions encoding aware (i.e. they are character-aware and can detect invalid byte sequences)?
This is a bit of an open-ended question, but I'd like to get as complete a picture as possible of the new UTF-encoding and type facilities of C++11.
Recommended answer
Are the \x/\u/\U character references freely combinable with all string types?
Not quite. \x can be used in anything. \u and \U are also accepted in narrow and wide literals, but there the code point is converted to the implementation-defined execution character set, so the result is only well specified in strings that are specifically UTF-encoded (u8"", u"", U""). For any UTF-encoded string, \u and \U can be used as you see fit.
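As a small illustration of the difference between the two kinds of escape, the sketch below compares array sizes at compile time. It assumes a C++11 or later compiler; nothing here comes from the original post beyond the escape syntax itself.

```cpp
// \x writes exactly one code unit with the given value, whatever the encoding.
static_assert(sizeof("\xF6") == 2, "one byte plus the null terminator");

// \u names a code point; the literal's encoding decides how many units it needs.
// U+00F6 is two bytes in UTF-8 (C3 B6) and a single unit in UTF-16.
static_assert(sizeof(u8"\u00F6") == 3, "two bytes plus the null terminator");
static_assert(sizeof(u"\u00F6") == 2 * sizeof(char16_t),
              "one code unit plus the null terminator");
```

Because these are static_asserts, a mismatch would show up as a compile error rather than at run time.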
Are all the string types fixed-width, i.e. do the arrays contain precisely as many elements as appear in the literal, or do \x/\u/\U references get expanded into a variable number of bytes?
Not in the way you mean. \u and \U name code points, which are converted based on the string encoding; \x supplies a single code unit value directly. The number of "code units" (to use the Unicode term; a char16_t is a UTF-16 code unit) depends on the encoding of the containing string. The literal u8"\u1024" would create a string containing 3 chars plus a null terminator, because U+1024 needs three bytes in UTF-8. The literal u"\u1024" would create a string containing 1 char16_t plus a null terminator.
The number of code units used is based on the Unicode encoding.
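The counts for U+1024 can be checked at compile time. The following is a minimal sketch; code_units is a hypothetical helper introduced only for this illustration, not part of the standard library.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the null terminator.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+1024 lies in the range U+0800..U+FFFF, so it takes three UTF-8 bytes
// but only one UTF-16 or UTF-32 code unit.
static_assert(code_units(u8"\u1024") == 3, "UTF-8: three code units");
static_assert(code_units(u"\u1024") == 1, "UTF-16: one code unit");
static_assert(code_units(U"\u1024") == 1, "UTF-32: one code unit");
```

The helper is a template over the element type so that it also accepts u8"" literals under C++20, where their element type is char8_t rather than char.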
Do u"" and u8"" strings have encoding semantics, e.g. can I say char16_t x[] = u"\U0010FFFF", and the non-BMP codepoint gets encoded into a two-unit UTF16 sequence?
u"" creates a UTF-16 encoded string. u8"" creates a UTF-8 encoded string. They will be encoded per the Unicode specification, so u"\U0010FFFF" yields the two-unit surrogate pair 0xDBFF 0xDFFF followed by the null terminator.
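To make the non-BMP case concrete, here is a short sketch using the array from the question. combine_surrogates is a hypothetical helper added for illustration; it assumes its arguments really are a valid high/low surrogate pair and does no checking.

```cpp
#include <cassert>

// U+10FFFF lies outside the BMP, so UTF-16 stores it as a surrogate pair.
const char16_t x[] = u"\U0010FFFF";

static_assert(sizeof(x) / sizeof(x[0]) == 3,
              "two code units plus the null terminator");

// Hypothetical helper: rebuild the code point from a high/low surrogate pair.
constexpr char32_t combine_surrogates(char16_t hi, char16_t lo) {
    return 0x10000 + ((static_cast<char32_t>(hi) - 0xD800) << 10)
                   + (static_cast<char32_t>(lo) - 0xDC00);
}

// Runtime check that the compiler produced the expected pair.
inline void check_surrogate_pair() {
    assert(x[0] == 0xDBFF);  // high surrogate
    assert(x[1] == 0xDFFF);  // low surrogate
    assert(combine_surrogates(x[0], x[1]) == 0x10FFFF);
}
```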
In (1), can I write lone surrogates with \u?
Absolutely not. The specification expressly forbids using the UTF-16 surrogate code points (0xD800-0xDFFF) as codepoints for \u or \U.
Finally, are any of the string functions encoding aware (i.e. they are character-aware and can detect invalid byte sequences)?
Absolutely not. Well, allow me to rephrase that.
std::basic_string doesn't deal with Unicode encodings. It can certainly store UTF-encoded strings, but it can only think of them as sequences of char, char16_t, or char32_t; it can't think of them as a sequence of Unicode codepoints that are encoded with a particular mechanism. basic_string::length() will return the number of code units, not code points. And obviously, the C standard library string functions are totally useless for this.
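The gap between code units and code points is easy to see with a UTF-8 string. The sketch below assumes the input is already well-formed UTF-8 and performs no validation; count_code_points_utf8 is a hypothetical helper, not a standard function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a well-formed UTF-8 string by
// skipping continuation bytes (bit pattern 10xxxxxx). No validation is done.
std::size_t count_code_points_utf8(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

inline void check_length_vs_code_points() {
    // "Hellö" written as explicit UTF-8 bytes: U+00F6 is C3 B6.
    const std::string s = "Hell\xC3\xB6";
    assert(s.length() == 6);                 // code units (bytes)
    assert(count_code_points_utf8(s) == 5);  // code points
}
```

The bytes are spelled out with \x rather than a u8"" literal so that the same code compiles under C++20, where u8"" no longer converts to std::string.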
It should be noted, however, that "length" for a Unicode string does not mean the number of codepoints. Some code points are combining "characters" (an unfortunate name), which combine with the previous codepoint. So multiple codepoints can map to a single visual character.
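A brief sketch of the combining-character point, using UTF-32 so that each element is exactly one code point. The variable names are made up for this example.

```cpp
#include <cassert>
#include <string>

// "e" followed by U+0301 COMBINING ACUTE ACCENT is displayed as a single "é".
const std::u32string decomposed = U"e\u0301";

// U+00E9 is the precomposed form of the same visual character.
const std::u32string precomposed = U"\u00E9";

inline void check_combining() {
    assert(decomposed.length() == 2);   // two code points, one visual character
    assert(precomposed.length() == 1);  // one code point, one visual character
    assert(decomposed != precomposed);  // operator== compares code units only
}
```

Treating the two strings as equal would require Unicode normalization, which the standard library does not provide.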
Iostreams can in fact read/write Unicode-encoded values. To do so, you will have to use a locale to specify the encoding and properly imbue it into the various places. This is easier said than done, and I don't have any code on me to show you how.
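One possible sketch of the imbue approach follows. It is not from the original answer. It assumes the std::codecvt_utf8 facet from the C++11 header <codecvt>, which has been deprecated since C++17, so it may produce warnings on newer compilers. On platforms with a 16-bit wchar_t this facet only covers the BMP. The function names are hypothetical.

```cpp
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Write a wide string to a file as UTF-8 by imbuing a UTF-8 conversion facet.
void write_utf8(const std::string& path, const std::wstring& text) {
    std::wofstream out;
    // Imbue before opening so the facet applies from the first byte written.
    // The locale takes ownership of the facet allocated with new.
    out.imbue(std::locale(out.getloc(), new std::codecvt_utf8<wchar_t>));
    out.open(path, std::ios::binary);
    out << text;
}

// Read the file back as raw bytes, without any conversion.
std::string read_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Writing L"Hell\u00F6" this way and reading the raw bytes back should give "Hell" followed by the two bytes C3 B6.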