Why do wide file streams in C++ narrow written data by default?
Honestly, I just don't get the following design decision in the C++ Standard Library. When writing wide characters to a file, wofstream converts wchar_t into char characters:
#include <fstream>
#include <string>

int main()
{
    using namespace std;
    wstring someString = L"Hello StackOverflow!";
    // Narrow file name: the wide-string constructor is a non-portable extension.
    wofstream file("Test.txt");
    file << someString; // the output file will consist of ASCII characters!
}
I am aware that this has to do with the standard codecvt. There is a codecvt for utf8 in Boost. Also, there is a codecvt for utf16 by Martin York here on SO. The question is: why does the standard codecvt convert wide characters? Why not write the characters as they are?
Also, are we going to get real Unicode streams with C++0x, or am I missing something here?
Recommended answer
The model used by C++ for charsets is inherited from C, and so dates back to at least 1989.
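The two main points:
- IO is done in terms of characters.
- It is the job of the locale to determine how wide characters are serialized.
The default locale (named "C") is very minimal (I don't remember the constraints from the standard; here it can only handle 7-bit ASCII as both the narrow and the wide character set).
There is also an environment-determined locale, named "".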
So to get anything, you have to set the locale.
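Setting the locale does not have to be global. As a minimal sketch, a locale can also be attached to one stream with imbue; the helper name imbue_environment_locale is hypothetical, and the sketch assumes that std::locale("") may throw std::runtime_error when the environment names a locale the system does not provide.

```cpp
#include <fstream>
#include <locale>
#include <stdexcept>

// Hypothetical helper: attach the environment-determined locale ("") to a
// single wide file stream instead of changing the global locale.
// Call it before any output is written to the stream.
bool imbue_environment_locale(std::wofstream& os) {
    try {
        // std::locale("") asks for the locale selected by the environment.
        os.imbue(std::locale(""));
        return true;
    } catch (const std::runtime_error&) {
        // The requested locale is not available; the stream keeps the
        // locale it already had.
        return false;
    }
}
```

With this approach other streams in the program are unaffected, which can matter when only one file needs the environment's encoding.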
If I use the simple program
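#include <locale>
#include <fstream>
#include <ostream>
#include <iostream>

int main()
{
    wchar_t c = 0x00FF;
    // Use the locale selected by the environment (LANG, LC_ALL, ...).
    std::locale::global(std::locale(""));
    // The stream picks up the global locale when it is constructed.
    std::wofstream os("test.dat");
    os << c << std::endl;
    if (!os) {
        std::cout << "Output failed\n";
    }
}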
which uses the environment locale and outputs the wide character with code 0x00FF to a file. If I ask for the "C" locale, I get
$ env LC_ALL=C ./a.out
Output failed
The locale was unable to handle the wide character, and we are notified of the problem because the IO failed. If I ask for a UTF-8 locale, I get
$ env LC_ALL=en_US.utf8 ./a.out
$ od -t x1 test.dat
0000000 c3 bf 0a
0000003
(od -t x1 just dumps the file in hex), which is exactly what I expect for a UTF-8 encoded file.
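If the output should be UTF-8 regardless of the environment, one option is to imbue the stream with a locale that carries a UTF-8 conversion facet. The sketch below uses std::codecvt_utf8 from <codecvt>, which was added in C++11 and has been deprecated since C++17, so treat it as an illustration rather than a recommendation; the helper name write_utf8 is hypothetical.

```cpp
#include <codecvt>   // std::codecvt_utf8 (C++11; deprecated since C++17)
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: write `text` to `path` as UTF-8, independent of the
// environment locale. Returns false if the stream reports a failure.
bool write_utf8(const std::string& path, const std::wstring& text) {
    std::wofstream os;
    // Imbue before opening so the file buffer converts with the UTF-8 facet
    // from the first character. The locale takes ownership of the facet.
    os.imbue(std::locale(std::locale::classic(),
                         new std::codecvt_utf8<wchar_t>));
    // Binary mode avoids newline translation on platforms that perform it.
    os.open(path, std::ios::binary);
    os << text;
    os.flush();
    return static_cast<bool>(os);
}
```

Writing the single wide character 0x00FF followed by a newline through this helper should produce the same bytes as the UTF-8 locale run above: c3 bf 0a.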