What is "wrong" with C++ wchar_t and wstrings? What are some alternatives to wide characters?

I have seen a lot of people in the C++ community (particularly ##c++ on freenode) resent the use of wstrings and wchar_t, and their use in the Windows API. What exactly is "wrong" with wchar_t and wstring, and if I want to support internationalization, what are some alternatives to wide characters?

Recommended answer

What is wchar_t?

wchar_t is defined such that any locale's char encoding can be converted to a wchar_t representation where every wchar_t represents exactly one code point:

    Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1).

    -- C++ [basic.fundamental] 3.9.1/5

This does not require that wchar_t be large enough to represent any character from all locales simultaneously. That is, the encoding used for wchar_t may differ between locales, which means that you cannot necessarily convert a string to wchar_t using one locale and then convert back to char using another locale. [1]

Since using wchar_t as a common representation between all locales seems to be the primary use for wchar_t in practice, you might wonder what it's good for if not that.

The original intent and purpose of wchar_t was to make text processing simple by defining it such that it requires a one-to-one mapping from a string's code units to the text's characters, thus allowing the same simple algorithms that are used with ASCII strings to work with other languages.

Unfortunately, the wording of wchar_t's specification assumes a one-to-one mapping between characters and code points to achieve this. Unicode breaks that assumption [2], so you can't safely use wchar_t for simple text algorithms either.
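As a minimal sketch of why the assumption fails (C++11; the helper name and the two sample strings are made up for illustration), the single visible character "é" can be spelled as one code point or as two. Even a fixed-width UTF-32 string therefore does not give one element per character:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of code points in a UTF-32 string.
// In UTF-32 every code unit is one code point, so this is just size().
std::size_t code_point_count(const std::u32string& s) {
    return s.size();
}

// The same user-perceived character, spelled two ways.
const std::u32string precomposed = U"\u00E9";   // U+00E9 LATIN SMALL LETTER E WITH ACUTE
const std::u32string decomposed  = U"e\u0301";  // U+0065 + U+0301 COMBINING ACUTE ACCENT
```

Here code_point_count(precomposed) is 1 and code_point_count(decomposed) is 2, and the two strings compare unequal although they would normally render identically.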

This means that portable software cannot use wchar_t either as a common representation for text between locales, or to enable the use of simple text algorithms.

What is wchar_t good for?

Not much, for portable code anyway. If __STDC_ISO_10646__ is defined, then values of wchar_t directly represent Unicode code points with the same values in all locales. That makes it safe to do the inter-locale conversions mentioned earlier. However, you can't rely on that macro alone to decide whether wchar_t can be used this way because, while most Unix platforms define it, Windows does not, even though Windows uses the same wchar_t encoding in all locales.
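A rough sketch of checking the macro at compile time might look like this. Only __STDC_ISO_10646__ comes from the text above; the function name and the returned strings are illustrative:

```cpp
#include <string>

// Reports what the implementation promises about wchar_t values.
// The function name and messages are hypothetical, chosen for this sketch.
std::string wchar_encoding_promise() {
#ifdef __STDC_ISO_10646__
    // The implementation states that wchar_t values are ISO 10646
    // (Unicode) code points, independent of the current locale.
    return "wchar_t values are Unicode code points in every locale";
#else
    // No such promise; this is the branch taken on Windows, for example.
    return "no portable guarantee about wchar_t values";
#endif
}
```

As the paragraph above notes, the second branch does not mean the encoding varies by locale on that platform, only that the macro cannot tell you.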

The reason Windows doesn't define __STDC_ISO_10646__ is that Windows uses UTF-16 as its wchar_t encoding, and UTF-16 uses surrogate pairs to represent code points greater than U+FFFF, which means that UTF-16 doesn't satisfy the requirements for __STDC_ISO_10646__.
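To make the surrogate-pair point concrete, here is a small sketch of UTF-16 encoding for a single code point. The function name is hypothetical, and the sketch assumes its input is a valid Unicode scalar value, so it does no validation:

```cpp
#include <string>

// Encodes one Unicode code point as UTF-16 code units.
// Assumes cp <= 0x10FFFF and cp is not itself a surrogate value.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Code points up to U+FFFF fit in a single 16-bit unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Anything above U+FFFF is split into a 20-bit value
        // carried by a high and a low surrogate.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```

For U+1F600 this yields the two units 0xD83D and 0xDE00, so one code point occupies two 16-bit elements, which is exactly what a one-wchar_t-per-code-point guarantee rules out.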

For platform-specific code wchar_t may be more useful. It's essentially required on Windows (e.g., some files simply cannot be opened without using wchar_t filenames), though Windows is the only platform where this is true as far as I know (so maybe we can think of wchar_t as 'Windows_char_t').

In hindsight, wchar_t is clearly not useful for simplifying text handling, or as storage for locale-independent text. Portable code should not attempt to use it for these purposes. Non-portable code may find it useful simply because some API requires it.

The alternative I like is to use UTF-8 encoded C strings, even on platforms not particularly friendly toward UTF-8.

This way one can write portable code using a common text representation across platforms, use standard datatypes for their intended purpose, get the language's support for those types (e.g. string literals, though some tricks are necessary to make it work for some compilers), some standard library support, debugger support (more tricks may be necessary), etc. With wide characters it's generally harder or impossible to get all of this, and you may get different pieces on different platforms.

One thing UTF-8 does not provide is the ability to use simple text algorithms such as are possible with ASCII. In this respect UTF-8 is no worse than any other Unicode encoding. In fact it may be considered better, because multi-code-unit representations are more common in UTF-8, so bugs in code handling such variable-width representations of characters are more likely to be noticed and fixed than if you try to stick to UTF-32 with NFC or NFKC.
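A short sketch of what "no simple algorithms" means in practice: the length of a UTF-8 std::string is a byte count, not a character count. The helper below is hypothetical and assumes its input is well-formed UTF-8:

```cpp
#include <cstddef>
#include <string>

// Counts code points in a string assumed to hold well-formed UTF-8.
// Every code point starts with exactly one byte that is NOT a
// continuation byte (continuation bytes have the bit pattern 10xxxxxx).
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the string "h\xC3\xA9llo" ("héllo" with a precomposed é), size() reports 6 while utf8_code_points reports 5. Even the code point count is not a character count once combining sequences are involved, which is why this is a problem for every Unicode encoding and not only for UTF-8.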

Many platforms use UTF-8 as their native char encoding and many programs do not require any significant text processing, so writing an internationalized program on those platforms is little different from writing code without considering internationalization. Writing more widely portable code, or writing on other platforms, requires inserting conversions at the boundaries of APIs that use other encodings.
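One way such a boundary conversion could look is sketched below for opening a file. The names open_utf8_path and widen are hypothetical. The Windows branch relies on the Win32 MultiByteToWideChar function and the Microsoft CRT's _wfopen; the other branch assumes the platform accepts UTF-8 bytes in narrow filenames. Treat it as an outline under those assumptions rather than a finished wrapper:

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Converts UTF-8 to UTF-16 for Win32 calls.
// Returns an empty string if the input is empty or not valid UTF-8.
static std::wstring widen(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    // First call: ask how many UTF-16 code units are needed.
    const int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                        utf8.data(), static_cast<int>(utf8.size()),
                                        nullptr, 0);
    if (len <= 0) return std::wstring();
    std::wstring wide(static_cast<std::size_t>(len), L'\0');
    // Second call: perform the conversion into the sized buffer.
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), static_cast<int>(utf8.size()),
                        &wide[0], len);
    return wide;
}
#endif

// Opens a file whose path is given as UTF-8 on every platform.
// Returns a null pointer on failure, like std::fopen.
std::FILE* open_utf8_path(const std::string& path, const std::string& mode) {
#ifdef _WIN32
    return _wfopen(widen(path).c_str(), widen(mode).c_str());
#else
    return std::fopen(path.c_str(), mode.c_str());
#endif
}
```

The rest of the program keeps UTF-8 in std::string throughout; wchar_t appears only inside the one function that talks to the platform API.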

Another alternative used by some software is to choose a cross-platform representation, such as unsigned short arrays holding UTF-16 data, and then to supply all the library support and simply live with the costs in language support, etc.

C++11 adds new kinds of wide characters as alternatives to wchar_t: char16_t and char32_t, with attendant language/library features. These aren't actually guaranteed to be UTF-16 and UTF-32, but I don't imagine any major implementation will use anything else. C++11 also improves UTF-8 support, for example with UTF-8 string literals, so it won't be necessary to trick VC++ into producing UTF-8 encoded strings (although I may continue to do so rather than use the u8 prefix).
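As a small illustration of the three C++11 literal prefixes, the sketch below measures how many code units the single code point U+1F600 needs in each form. The constant names are made up, and the counts assume the compiler uses UTF-8, UTF-16 and UTF-32 for these literals, as the major implementations do:

```cpp
#include <cstddef>

// Code units needed for U+1F600 in each literal form.
// sizeof includes the terminating null unit, hence the "- 1".
constexpr std::size_t utf8_units  = sizeof(u8"\U0001F600") - 1;
constexpr std::size_t utf16_units = sizeof(u"\U0001F600") / sizeof(char16_t) - 1;
constexpr std::size_t utf32_units = sizeof(U"\U0001F600") / sizeof(char32_t) - 1;

static_assert(utf8_units == 4,  "four UTF-8 code units");
static_assert(utf16_units == 2, "one surrogate pair in UTF-16");
static_assert(utf32_units == 1, "one UTF-32 code unit");
```

Only sizeof is used here on purpose: the element type of a u8 literal is char in C++11 but char8_t from C++20 onward, so code that assigns a u8 literal to a const char* may stop compiling under newer standards.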

TCHAR: TCHAR is for migrating ancient Windows programs that assume legacy encodings from char to wchar_t, and is best forgotten unless your program was written in some previous millennium. It's not portable and is inherently unspecific about its encoding and even its data type, making it unusable with any non-TCHAR based API. Since its purpose is migration to wchar_t, which we've seen above isn't a good idea, there is no value whatsoever in using TCHAR.

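1. Characters which are representable in wchar_t strings but which are not supported in any locale are not required to be represented with a single wchar_t value. This means that wchar_t could use a variable-width encoding for certain characters, another clear violation of the intent of wchar_t. It is arguable, however, that a character being representable by wchar_t is enough to say that the locale "supports" that character, in which case variable-width encodings aren't legal and Windows' use of UTF-16 is non-conformant.

2. Unicode allows many characters to be represented with multiple code points, which creates the same problems for simple text algorithms as variable-width encodings. Even if one strictly maintains a composed normalization, some characters still require multiple code points. See: http://www.unicode.org/standard/where/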
