What is "wrong" with C++ wchar_t and wstrings? What are some alternatives to wide characters?

I have seen a lot of people in the C++ community (particularly ##c++ on freenode) resent the use of wstrings and wchar_t, and their use in the Windows API. What exactly is "wrong" with wchar_t and wstring, and if I want to support internationalization, what are some alternatives to wide characters?

Recommended answer

What is wchar_t?

wchar_t is defined such that any locale's char encoding can be converted to a wchar_t representation where every wchar_t represents exactly one codepoint:

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1).

(C++ [basic.fundamental] 3.9.1/5)

This does not require that wchar_t be large enough to represent any character from all locales simultaneously. That is, the encoding used for wchar_t may differ between locales, which means that you cannot necessarily convert a string to wchar_t using one locale and then convert back to char using another locale.[1]

Since using wchar_t as a common representation between all locales seems to be the primary use for wchar_t in practice, you might wonder what it's good for if not that.

The original intent and purpose of wchar_t was to make text processing simple by defining it such that it requires a one-to-one mapping from a string's code units to the text's characters, thus allowing the same simple algorithms that are used with ASCII strings to work with other languages.

Unfortunately, the wording of wchar_t's specification assumes a one-to-one mapping between characters and codepoints to achieve this. Unicode breaks that assumption[2], so you can't safely use wchar_t for simple text algorithms either.
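To make the broken assumption concrete, here is a minimal sketch using char32_t (one code unit per code point) rather than wchar_t, so the result does not depend on the platform's wchar_t width. The helper names are made up for this illustration. Both functions return what a reader would see as the single character "é", yet the strings have different lengths and do not compare equal.

```cpp
#include <string>

// "é" as one precomposed code point: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
inline std::u32string precomposed_e_acute() {
    return U"\u00E9";
}

// The same user-perceived character as two code points:
// 'e' followed by U+0301 COMBINING ACUTE ACCENT.
inline std::u32string decomposed_e_acute() {
    return U"e\u0301";
}
```

An algorithm that treats each element as one character would report a length of 1 for the first string and 2 for the second, and reversing the second string would detach the accent from its base letter.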

This means that portable software cannot use wchar_t either as a common representation for text between locales, or to enable the use of simple text algorithms.

So what is wchar_t good for? Not much, for portable code anyway. If __STDC_ISO_10646__ is defined, then values of wchar_t directly represent Unicode codepoints with the same values in all locales. That makes it safe to do the inter-locale conversions mentioned earlier. However, you can't rely on it alone to decide that you can use wchar_t this way because, while most Unix platforms define it, Windows does not, even though Windows uses the same wchar_t encoding in all locales.
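A rough sketch of how code might test for this guarantee at compile time. The function names are hypothetical; only the __STDC_ISO_10646__ macro itself comes from the text above. A standard header is included first because some toolchains define the macro through their library headers.

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>

// True when the implementation promises that wchar_t values are
// Unicode (ISO 10646) code points in every locale.
constexpr bool wchar_is_unicode_code_point() {
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}

// Width of wchar_t in bits, useful for logging or diagnostics.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}
```

As the paragraph above notes, a false result does not prove that wchar_t is locale dependent on that platform; it only means the implementation makes no such promise.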

The reason Windows doesn't define __STDC_ISO_10646__ is that Windows uses UTF-16 as its wchar_t encoding. UTF-16 uses surrogate pairs to represent codepoints greater than U+FFFF, which means that UTF-16 doesn't satisfy the requirements for __STDC_ISO_10646__.
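The surrogate pair issue can be shown without touching wchar_t at all. The sketch below uses char16_t and char32_t literals and assumes the compiler encodes them as UTF-16 and UTF-32, as the major implementations do. The function names are invented for the example.

```cpp
#include <string>

// U+1F600 lies above U+FFFF, so UTF-16 needs two code units for it.
inline std::u16string grinning_face_utf16() { return u"\U0001F600"; }

// In UTF-32 the same code point is a single code unit.
inline std::u32string grinning_face_utf32() { return U"\U0001F600"; }

// True if the code unit is a high (leading) surrogate, 0xD800..0xDBFF.
constexpr bool is_high_surrogate(char16_t u) {
    return u >= 0xD800 && u <= 0xDBFF;
}

// True if the code unit is a low (trailing) surrogate, 0xDC00..0xDFFF.
constexpr bool is_low_surrogate(char16_t u) {
    return u >= 0xDC00 && u <= 0xDFFF;
}
```

The UTF-16 string has two elements (0xD83D, 0xDE00) for one code point, which is exactly the one-value-per-character property that a 16-bit wchar_t cannot provide.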

For platform specific code wchar_t may be more useful. It's essentially required on Windows (e.g., some files simply cannot be opened without using wchar_t filenames), though Windows is the only platform where this is true as far as I know (so maybe we can think of wchar_t as 'Windows_char_t').

In hindsight wchar_t is clearly not useful for simplifying text handling, or as storage for locale independent text. Portable code should not attempt to use it for these purposes. Non-portable code may find it useful simply because some API requires it.

The alternative I like is to use UTF-8 encoded C strings, even on platforms not particularly friendly toward UTF-8.

This way one can write portable code using a common text representation across platforms, use standard datatypes for their intended purpose, get the language's support for those types (e.g. string literals, though some tricks are necessary to make it work for some compilers), some standard library support, debugger support (more tricks may be necessary), etc. With wide characters it's generally harder or impossible to get all of this, and you may get different pieces on different platforms.

One thing UTF-8 does not provide is the ability to use simple text algorithms such as are possible with ASCII. In this respect UTF-8 is no worse than any other Unicode encoding. In fact it may be considered better, because multi-code-unit representations are more common in UTF-8, and so bugs in code handling such variable-width representations of characters are more likely to be noticed and fixed than if you try to stick to UTF-32 with NFC or NFKC.
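As a small illustration of why byte-oriented algorithms stop working, the sketch below counts code points in a UTF-8 std::string by skipping continuation bytes. The function name is hypothetical, and the code assumes the input is already valid UTF-8. It counts code points only, so a decomposed character made of several code points would still count more than once.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string. Every code point starts with a
// byte that is NOT of the form 10xxxxxx, so counting those bytes is enough.
// Assumes valid UTF-8; malformed input gives a meaningless count.
std::size_t utf8_code_point_count(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For "h\xC3\xA9llo" ("héllo"), size() reports 6 bytes while this function reports 5 code points, a mismatch that shows up with almost any non-ASCII text.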

Many platforms use UTF-8 as their native char encoding and many programs do not require any significant text processing, so writing an internationalized program on those platforms is little different from writing code without considering internationalization. Writing more widely portable code, or writing on other platforms, requires inserting conversions at the boundaries of APIs that use other encodings.
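One possible shape for such a boundary conversion is sketched below: a hand-written UTF-8 to UTF-16 converter that would be called just before handing text to a UTF-16 API. The function name and the choice to throw on bad input are assumptions made for this example; real projects often use a platform call or a library such as ICU instead.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Converts UTF-8 to UTF-16. Throws std::invalid_argument on truncated,
// overlong, or otherwise malformed input.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;        // code point being decoded
        std::size_t extra = 0;  // continuation bytes expected
        char32_t min_cp = 0;    // smallest value allowed for this length
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; min_cp = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

With this approach the rest of the program keeps its text in UTF-8 std::string values, and only the thin layer that talks to a UTF-16 API sees the converted form.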

Another alternative used by some software is to choose a cross-platform representation, such as unsigned short arrays holding UTF-16 data, and then to supply all the library support and simply live with the costs in language support, etc.

C++11 adds new kinds of wide characters as alternatives to wchar_t: char16_t and char32_t, with attendant language/library features. These aren't actually guaranteed to be UTF-16 and UTF-32, but I don't imagine any major implementation will use anything else. C++11 also improves UTF-8 support, for example with UTF-8 string literals, so it won't be necessary to trick VC++ into producing UTF-8 encoded strings (although I may continue to do so rather than use the u8 prefix).

TCHAR: TCHAR is for migrating ancient Windows programs that assume legacy encodings from char to wchar_t, and is best forgotten unless your program was written in some previous millennium. It's not portable and is inherently unspecific about its encoding and even its data type, making it unusable with any non-TCHAR based API. Since its purpose is migration to wchar_t, which we've seen above isn't a good idea, there is no value whatsoever in using TCHAR.

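1. Characters which are representable in wchar_t strings but which are not supported in any locale are not required to be represented with a single wchar_t value. This means that wchar_t could use a variable-width encoding for certain characters, another clear violation of the intent of wchar_t. It is arguable, though, that a character being representable by wchar_t is enough to say that the locale "supports" that character, in which case variable-width encodings aren't legal and Windows' use of UTF-16 is non-conformant.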

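2. Unicode allows many characters to be represented with multiple code points, which creates the same problems for simple text algorithms as variable-width encodings. Even if one strictly maintains a composed normalization, some characters still require multiple code points. See: http://www.unicode.org/standard/where/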
