Why does a wide file stream in C++ narrow the written data by default?

2021-12-26 00:00:00 unicode file c++ wofstream

Honestly, I just don't get the following design decision in the C++ Standard Library. When writing wide characters to a file, wofstream converts wchar_t into char characters:

    #include <fstream>
    #include <string>

    int main()
    {
        using namespace std;
        wstring someString = L"Hello StackOverflow!";
        wofstream file(L"Test.txt");
        file << someString; // the output file will consist of ASCII characters!
    }

I am aware that this has to do with the standard codecvt. There is a codecvt for UTF-8 in Boost. Also, there is a codecvt for UTF-16 by Martin York here on SO. The question is: why does the standard codecvt convert wide characters at all? Why not write the characters as they are?

Also, are we going to get real Unicode streams with C++0x, or am I missing something here?

Recommended answer

The model used by C++ for charsets is inherited from C, and so dates back to at least 1989.

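The main points:

IO is done in terms of char.
It is the job of the locale to determine how wide characters are serialized.
The default locale (named "C") is very minimal (I don't remember the constraints from the standard; here it can handle only 7-bit ASCII as both the narrow and the wide character set).
There is an environment-determined locale named "" (the empty string).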

So to write anything beyond plain ASCII, you have to set the locale.

Consider this simple program:

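    #include <locale>
    #include <fstream>
    #include <ostream>
    #include <iostream>

    int main()
    {
        wchar_t c = 0x00FF;
        // Use the locale named by the environment (LC_ALL, LANG, ...).
        std::locale::global(std::locale(""));
        // The stream picks up the global locale when it is constructed.
        std::wofstream os("test.dat");
        os << c << std::endl;
        if (!os) {
            std::cout << "Output failed\n";
        }
    }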

It uses the environment locale and writes the wide character with code 0x00FF to a file. If I ask for the "C" locale, I get:

    $ env LC_ALL=C ./a.out
    Output failed

The locale was unable to handle the wide character, and we are notified of the problem because the IO failed. If I run it with a UTF-8 locale, I get:

    $ env LC_ALL=en_US.utf8 ./a.out
    $ od -t x1 test.dat
    0000000 c3 bf 0a
    0000003

(od -t x1 just dumps the file in hex.) This is exactly what I expect for a UTF-8 encoded file: c3 bf is the UTF-8 encoding of U+00FF, and 0a is the newline.
