What are the different character sets used for?

2021-12-27 00:00:00 utf-8 encoding c++

The C++ standard mentions several different character sets. In particular, it mentions the following:

In 2.2 [lex.phases] bullet 1, physical source file characters and their mapping to the basic source character set are mentioned.

In 2.2 [lex.phases] bullet 2, the execution character set is mentioned.

In 2.3 [lex.charset] paragraph 3, a basic execution character set and a basic execution wide-character set are mentioned.

The same section 2.3 [lex.charset] paragraph 3 also mentions an execution character set and an execution wide-character set.

When reading or writing files, some other character set may be in use.

What are all those different character sets used for, how are conversions between them done, and which of these values are locale dependent? In particular, how are string literals represented?

Recommended answer

Here is a breakdown of the different character sets used by the compiler itself (all references to the standard are actually for C++14):

The physical source file characters are those used in the C++ source. Most likely these are now encoded using some Unicode encoding, e.g., UTF-8 or UTF-16. If you are from a European or an American background you may be using ASCII, whose characters are conveniently encoded identically in UTF-8 (every ASCII file is a UTF-8 file but not the other way around). The physical source file characters may also be something unusual like EBCDIC.

The basic source character set is what the compiler, at least conceptually, consumes. It is produced from the physical source file characters by mapping each of them either to its respective basic character or to a sequence of basic characters representing the physical source character using a universal character name (see 2.2 [lex.phases] paragraph 1). The basic source character set is just a set of 96 characters (2.3 [lex.charset] paragraph 1):

a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9 _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '

and the 5 special characters space (' '), horizontal tab ('\t'), vertical tab ('\v'), form feed ('\f'), and newline ('\n').

The mapping between the physical source character set and the basic source character set is implementation defined.

The basic execution character set and the basic execution wide-character set are character sets capable of representing the basic source character set, extended by a few special characters:

alert ('\a'), backspace ('\b'), carriage return ('\r'), and a null character ('\0')

The difference between the non-wide and the wide version is whether the characters are represented using char or wchar_t.
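The char versus wchar_t distinction, and two guarantees the standard gives for the basic execution character sets (a null character with value zero, and contiguous decimal digits), can be checked at compile time. This is only a sketch; the `digit_value` overloads are hypothetical helpers introduced for the example.

```cpp
#include <cassert>
#include <type_traits>

// An ordinary character literal has type char, a wide one has type wchar_t.
static_assert(std::is_same<decltype('a'), char>::value, "narrow literal is char");
static_assert(std::is_same<decltype(L'a'), wchar_t>::value, "wide literal is wchar_t");

// The null (wide) character has value zero in both basic execution sets.
static_assert('\0' == 0 && L'\0' == 0, "null characters have value zero");

// The decimal digits are guaranteed to be contiguous and increasing in both
// sets; no such guarantee exists for letters (EBCDIC is a counter-example).
static_assert('9' - '0' == 9 && L'9' - L'0' == 9, "digits are contiguous");

// Hypothetical helpers relying on the digit guarantee.
inline int digit_value(char c) {
    assert('0' <= c && c <= '9');
    return c - '0';
}

inline int digit_value(wchar_t c) {
    assert(L'0' <= c && c <= L'9');
    return static_cast<int>(c - L'0');
}
```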

The execution character set and the execution wide-character set are implementation-defined extensions of the basic execution character set and the basic execution wide-character set. In 2.3 [lex.charset] paragraph 3 it is stated that the additional members of the execution character sets and the values of those additional members are locale specific. It isn't clear which locale is referred to, but I suspect the locale used during compilation is meant. In any case, the execution character sets are implementation defined (also according to 2.3 [lex.charset] paragraph 3).

Character and string literals are originally represented using the basic source character set, with some characters possibly using universal character names. All of these are converted at compile time into the execution character set. According to 2.14.3 [lex.ccon], character literals representable as one char in the execution character set just work. If multiple chars are needed, the character literal is conditionally supported (and has type int). For string literals the conversion is described in 2.14.5 [lex.string]. Paragraph 9 states that UTF-8 string literals (e.g. u8"hello") result in a sequence of values corresponding to the code units of the UTF-8 string. Otherwise the translation of characters and universal character names is the same as that for character literals (in particular, it is implementation defined), although a character that requires a multi-byte sequence in a narrow string simply results in multiple chars (a case that need not be supported for character literals).
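To make the difference between UTF-8 literals and ordinary narrow literals concrete, here is a small sketch. The helper names `code_units` and `utf8_byte` are invented for the example. It assumes U+00E9 is representable in the compiler's execution character set; some compilers warn about or reject the ordinary literal otherwise.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of code units in a string literal, not
// counting the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// A u8 literal always holds UTF-8 code units: U+00E9 encodes as 0xC3 0xA9.
static_assert(code_units(u8"\u00e9") == 2, "U+00E9 takes two UTF-8 code units");

// For an ordinary literal the result depends on the implementation-defined
// execution character set: 2 for UTF-8, 1 for a single-byte code page.
constexpr std::size_t narrow_units = code_units("\u00e9");

// Hypothetical helper: the i-th UTF-8 code unit of U+00E9. The cast works
// whether u8 literals are made of char (C++11 to C++17) or char8_t (C++20).
inline unsigned char utf8_byte(std::size_t i) {
    assert(i < code_units(u8"\u00e9"));
    return static_cast<unsigned char>(u8"\u00e9"[i]);
}
```

Only the u8 literal has a portable size here; `narrow_units` is deliberately left unchecked because its value is up to the implementation.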

So far, only the result of compilation has been considered. Any character which isn't part of a character or string literal is used to specify what the code does. The interesting question is: what happens to the literals? They are all basically translated into an implementation-defined representation. Implementation defined means that what is supposed to happen is documented somewhere, but it can differ between implementations.

How does that help when dealing with characters or strings coming from somewhere? Well, any character or string which is read is converted to the corresponding execution character set. In particular, when a file is read, all characters are transformed to this common representation. Of course, for this transformation to work, the locale used for reading the file needs to be set up according to the encoding of that file. If no locale is specified explicitly, the global locale is used. The global locale starts out as the classic "C" locale; a locale reflecting the user's preferences, e.g., based on environment variables, can be obtained with std::locale("") and installed with std::locale::global(). If a file is read which uses a different encoding than the global locale, a different locale matching the encoding of the file needs to be used.
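A minimal sketch of reading with an explicitly chosen locale might look as follows. The helper names `read_first_line` and `global_locale_name` are hypothetical. Which locale names exist (for example "en_US.UTF-8") depends on the platform, and constructing a std::locale from an unknown name throws std::runtime_error.

```cpp
#include <cassert>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: reads the first line of the file at path, converting
// from the external encoding described by loc into wchar_t, i.e. into the
// execution wide-character set.
inline std::wstring read_first_line(const std::string& path,
                                    const std::locale& loc) {
    assert(!path.empty());
    std::wifstream in;
    in.imbue(loc);  // set before open() so the conversion facet applies from the first byte
    in.open(path);
    std::wstring line;
    std::getline(in, line);  // leaves line empty if the file could not be opened
    return line;
}

// Hypothetical helper: name of the current global locale. A default
// constructed std::locale is a copy of the global one, which is the classic
// "C" locale until std::locale::global() is called.
inline std::string global_locale_name() {
    return std::locale().name();
}
```

A call such as `read_first_line("data.txt", std::locale("en_US.UTF-8"))` would then decode a UTF-8 file, assuming that locale is installed; with `std::locale::classic()` only plain ASCII content is guaranteed to survive.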

Correspondingly, when writing characters using one of the execution character sets, these are converted according to the encoding specified by the current locale. Again, it may be necessary to replace the locale if a specific encoding is needed.
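The writing direction can be sketched the same way. `write_line` is a hypothetical helper, and the locale passed in decides which external encoding the wide characters are converted to.

```cpp
#include <cassert>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: writes text followed by a newline to the file at
// path, converting from wchar_t to the external encoding described by loc.
// Returns false if the file could not be opened or a character could not be
// converted.
inline bool write_line(const std::string& path, const std::wstring& text,
                       const std::locale& loc) {
    assert(!path.empty());
    std::wofstream out;
    out.imbue(loc);  // replace the stream's locale before any output happens
    out.open(path);
    out << text << L'\n';
    out.flush();     // conversion errors surface in the stream state here
    return static_cast<bool>(out);
}
```

With `std::locale::classic()` only characters from the basic execution character set are guaranteed to convert; for anything else, a locale whose encoding can represent the text would be needed.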

All this effectively means that inside a program all string and character processing happens using the implementation-defined execution character set. All characters read by a program need to be converted to this character set, and all characters written start as characters in this execution character set and need to be converted appropriately to the external encoding. Of course, in an ideal setup the conversion between the execution character set and the external representation is the identity, e.g., because the execution character set uses UTF-8 and the external representation also uses UTF-8. The same applies to the execution wide-character set, except that in this case UTF-16 would be used (in one of its two variations, as UTF-16 can use either big-endian or little-endian representation).
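Whether the execution wide-character set actually behaves like UTF-16 depends on the platform: wchar_t is commonly 16 bits wide on Windows and 32 bits wide on most Unix-like systems. The sketch below counts code units for a character outside the Basic Multilingual Plane to tell the two apart. The names `wide_units` and `wide_layout` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// U+1F600 lies outside the Basic Multilingual Plane.
static_assert(sizeof(u8"\U0001F600") - 1 == 4, "UTF-8: four code units");
static_assert(sizeof(u"\U0001F600") / sizeof(char16_t) - 1 == 2,
              "UTF-16: one surrogate pair");
static_assert(sizeof(U"\U0001F600") / sizeof(char32_t) - 1 == 1,
              "UTF-32: one code unit");

// For wide literals the count is implementation defined: 2 where wchar_t
// holds UTF-16 code units, 1 where it holds whole code points.
constexpr std::size_t wide_units = sizeof(L"\U0001F600") / sizeof(wchar_t) - 1;

// Hypothetical helper describing what the compiler did with the wide literal.
inline const char* wide_layout() {
    assert(wide_units == 1 || wide_units == 2);
    return wide_units == 2 ? "16-bit code units (surrogate pair)"
                           : "one code unit per code point";
}
```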
