Why does a wide file stream in C++ narrow written data by default?

2021-12-26 00:00:00 unicode file c++ wofstream

Honestly, I just don't get the following design decision in the C++ standard library. When writing wide characters to a file, wofstream converts wchar_t into char characters:

#include <fstream>
#include <string>

int main()
{
    using namespace std;

    wstring someString = L"Hello StackOverflow!";
    wofstream file(L"Test.txt");

    file << someString; // the output file will consist of ASCII characters!
}

I am aware that this has to do with the standard codecvt. There is a codecvt for UTF-8 in Boost. Also, there is a codecvt for UTF-16 by Martin York here on SO. The question is: why does the standard codecvt convert wide characters at all? Why not write the characters as they are?
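To make the codecvt step concrete, here is a minimal sketch of the conversion a wide file stream delegates to its locale. The helper name `narrow_with_locale` is hypothetical; the facet type and its `out` member are the standard `std::codecvt<wchar_t, char, std::mbstate_t>` interface. It only illustrates the mechanism and is not how any particular library implements its file buffer.

```cpp
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper: narrows a wide string with the codecvt facet of `loc`,
// the same kind of facet a wofstream imbued with `loc` consults when writing.
// Returns false if the facet reports an error or leaves input unconverted.
bool narrow_with_locale(const std::wstring& in, const std::locale& loc,
                        std::string& out)
{
    using facet_t = std::codecvt<wchar_t, char, std::mbstate_t>;
    const facet_t& cvt = std::use_facet<facet_t>(loc);

    std::mbstate_t state{};
    // max_length() bounds how many chars a single wchar_t can expand to.
    std::string buffer(in.size() * cvt.max_length() + 1, '\0');

    const wchar_t* from_next = nullptr;
    char* to_next = nullptr;
    const facet_t::result r = cvt.out(
        state,
        in.data(), in.data() + in.size(), from_next,
        &buffer[0], &buffer[0] + buffer.size(), to_next);

    if (r != facet_t::ok || from_next != in.data() + in.size())
        return false;

    out.assign(&buffer[0], to_next);  // keep only the bytes actually produced
    return true;
}
```

Calling it with `std::locale::classic()` and plain ASCII text should succeed and give back the same characters as narrow chars, which matches the ASCII file seen above.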

Also, are we gonna get real Unicode streams with C++0x, or am I missing something here?

Recommended answer

The model used by C++ for charsets is inherited from C, and so dates back to at least 1989.

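The main points:

  • IO is done in terms of char.
  • It is the job of the locale to determine how wide chars are serialized.
  • The default locale (named "C") is very minimal (I don't remember the constraints from the standard; here it is able to handle only 7-bit ASCII as narrow and wide character set).
  • There is an environment-determined locale named "".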

So to get anything, you have to set the locale.
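As a minimal sketch of setting the locale on one stream only, rather than globally, something like the following could be used. The helper `write_wide` and its parameters are hypothetical names for illustration; `imbue`, `std::locale(const char*)` and the `std::runtime_error` it throws for unknown names are standard.

```cpp
#include <fstream>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: writes `text` to `path` through a wofstream imbued
// with the locale called `locale_name` ("" selects the environment locale).
// Returns false if the locale is unknown or the stream reports a failure.
bool write_wide(const std::string& path, const std::wstring& text,
                const char* locale_name)
{
    std::locale loc;
    try {
        loc = std::locale(locale_name);  // throws std::runtime_error for unknown names
    } catch (const std::runtime_error&) {
        return false;
    }

    std::wofstream os;
    os.imbue(loc);   // imbue before opening, so nothing is converted with the old facet
    os.open(path);
    os << text;
    os.flush();      // force the conversion now so that errors show up in the stream state
    return static_cast<bool>(os);
}
```

For example, `write_wide("test.dat", L"\u00FF\n", "")` would use whatever locale the environment selects. Whether it succeeds depends on that locale, which is the point of the list above.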

Take this simple program:

#include <locale>
#include <fstream>
#include <ostream>
#include <iostream>

int main()
{
    wchar_t c = 0x00FF;
    std::locale::global(std::locale(""));
    std::wofstream os("test.dat");
    os << c << std::endl;
    if (!os) {
        std::cout << "Output failed\n";
    }
}

This uses the environment locale and writes the wide character with code 0x00FF to a file. If I ask for the "C" locale, I get

$ env LC_ALL=C ./a.out
Output failed

The locale was unable to handle the wide character, and we are notified of the problem because the IO failed. If I ask for a UTF-8 locale, I get

$ env LC_ALL=en_US.utf8 ./a.out
$ od -t x1 test.dat
0000000 c3 bf 0a
0000003

(od -t x1 just dumps the file in hex), which is exactly what I expect for a UTF-8 encoded file.
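To see where the bytes c3 bf come from, here is a small hand-written UTF-8 encoder for a single code point. The function name `utf8_encode` is hypothetical, and the sketch assumes the input is a valid Unicode scalar value and does no validation. It is not what the locale's codecvt does internally, only the same encoding rule written out.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
// Assumes cp is a valid scalar value (at most 0x10FFFF, not a surrogate).
std::string utf8_encode(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For 0x00FF the two-byte branch applies: the top bits give 0xC0 | 0x03 = 0xC3 and the low six bits give 0x80 | 0x3F = 0xBF, matching the dump above. The trailing 0a is the newline written by std::endl.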
