What exactly can wchar_t represent?

2022-01-07 00:00:00 unicode character-encoding c++

According to cppreference.com's documentation on wchar_t:

wchar_t - type for wide character representation (see wide strings). Required to be large enough to represent any supported character code point (32 bits on systems that support Unicode. A notable exception is Windows, where wchar_t is 16 bits and holds UTF-16 code units). It has the same size, signedness, and alignment as one of the integer types, but is a distinct type.

The standard says in [basic.fundamental]/5:

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales. Type wchar_t shall have the same size, signedness, and alignment requirements as one of the other integral types, called its underlying type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in <cstdint>, called the underlying types.

So, if I want to deal with Unicode characters, should I use wchar_t?

Equivalently, how do I know if a specific Unicode character is "supported" by wchar_t?

Recommended answer

So, if I want to deal with Unicode characters, should I use wchar_t?

First of all, note that the encoding does not force you to use any particular type to represent a character. You may use char to represent Unicode characters just as well as wchar_t. You only have to remember that in UTF-8 up to 4 chars together form one code point, while with wchar_t a code point takes 1 unit (UTF-32, as on Linux) or up to 2 units working together (UTF-16, as on Windows).

Next, there is no single definitive Unicode encoding. Some Unicode encodings use a fixed width for representing code points (like UTF-32); others (such as UTF-8 and UTF-16) have variable lengths. In UTF-8, for instance, the letter 'a' takes just 1 byte, while characters outside the ASCII range take 2 to 4 bytes.

So you have to decide what kind of characters you want to represent and then choose your encoding accordingly, because this affects the number of bytes your data will take. E.g., using UTF-32 to represent mostly English text will lead to many zero bytes. UTF-8 is a better choice for many Latin-based languages, while UTF-16 is usually a better choice for East Asian languages, where most common characters take 3 bytes in UTF-8 but only 2 in UTF-16.

Once you have decided on this, you should minimize the number of conversions and stay consistent with your decision.

In the next step, you may decide what data type is appropriate to represent the data (or what kind of conversions you may need).

If you would like to do text manipulation or interpretation on a code-point basis, char certainly is not the way to go if you have, e.g., Japanese kanji. But if you just want to pass your data along and regard it as nothing more than a sequence of bytes, you may just go with char.

The link to UTF-8 Everywhere was already posted as a comment, and I suggest you have a look there as well. Another good read is What every programmer should know about encodings.

As of now, C++ has only rudimentary language support for Unicode (such as the char16_t and char32_t data types and the u8/u/U literal prefixes). So choosing a library to manage encodings (especially conversions) is certainly good advice.
