How do I read a UCS-2 file?

I'm writing a program to extract information from a *.rc file encoded in UCS-2 Little Endian.

int _tmain(int argc, _TCHAR* argv[]) {
  wstring csvLine(wstring sLine);
  wifstream fin("en.rc");
  wofstream fout("table.csv");
  wofstream fout_rm("temp.txt");
  wstring sLine;
  fout << "en\n";
  while(getline(fin,sLine)) {
    if (sLine.find(L"IDS") == -1)
      fout_rm << sLine << endl;
    else
      fout << csvLine(sLine);
  }
  fout << flush;
  system("pause");
  return 0;
}

The first line in "en.rc" is #include <windows.h>, but sLine shows up as below:

[0]     255 L'ÿ'
[1]     254 L'þ'
[2]     35  L'#'
[3]     0
[4]     105 L'i'
[5]     0
[6]     110 L'n'
[7]     0
[8]     99  L'c'
.       .
.       .
.       .

This program works correctly for UTF-8. How can I make it work for UCS-2?

Recommended answer

Wide streams use a wide stream buffer to access the file. The wide stream buffer reads bytes from the file and uses its codecvt facet to convert these bytes to wide characters. The default codecvt facet is std::codecvt<wchar_t, char, std::mbstate_t>, which converts between the native character sets for wchar_t and char (i.e., like mbstowcs() does).

Your file is not in the native char character set, so what you want is a codecvt facet that reads UCS-2 as a multibyte sequence and converts it to wide characters.

#include <fstream>
#include <string>
#include <codecvt>
#include <iostream>
#include <locale>

int main(int argc, char *argv[])
{
    std::wifstream fin("en.rc", std::ios::binary); // The file must be opened in binary mode

    // Imbue the file stream with a codecvt facet that uses UTF-16 as the external multibyte encoding
    fin.imbue(std::locale(fin.getloc(),
              new std::codecvt_utf16<wchar_t, 0xffff, std::consume_header>));

    // ^ We set 0xFFFF as the maxcode because that's the largest that will fit in a single wchar_t
    //   We use consume_header to detect and use the UTF-16 'BOM'

    // The following is not really the correct way to write Unicode output, but it's easy
    std::wstring sLine;
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
    while (getline(fin, sLine))
    {
        std::cout << convert.to_bytes(sLine) << '\n';
    }
}
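
To make the byte-to-character conversion concrete, here is a minimal sketch of the same decoding done by hand, without a facet. This may be useful because <codecvt> and std::wstring_convert were deprecated in C++17. The function name decode_utf16_bytes is hypothetical and not part of the original answer, and the sketch assumes little-endian input when no BOM is present.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the original answer): decodes a UTF-16
// byte buffer into 16-bit code units, honoring a leading BOM if present.
// Assumption: input without a BOM is treated as little-endian.
std::u16string decode_utf16_bytes(const std::vector<unsigned char>& bytes)
{
    std::size_t pos = 0;
    bool little_endian = true;

    if (bytes.size() >= 2) {
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) {        // UTF-16LE BOM
            pos = 2;
        } else if (bytes[0] == 0xFE && bytes[1] == 0xFF) { // UTF-16BE BOM
            little_endian = false;
            pos = 2;
        }
    }

    std::u16string units;
    units.reserve((bytes.size() - pos) / 2);
    // Combine each pair of bytes into one code unit; a trailing odd byte is ignored.
    for (; pos + 1 < bytes.size(); pos += 2) {
        const unsigned lo = little_endian ? bytes[pos] : bytes[pos + 1];
        const unsigned hi = little_endian ? bytes[pos + 1] : bytes[pos];
        units.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return units;
}
```

Given the bytes from the debugger dump in the question (255, 254, 35, 0, 105, 0, 110, 0, 99, 0), this sketch skips the BOM and produces the code units for "#inc". Like the facet-based version, it does not combine surrogate pairs.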

Note that there's an issue with UTF-16 here. The purpose of wchar_t is for one wchar_t to represent one code point. However, Windows uses UTF-16, which represents some code points as two wchar_ts. This means that the standard API doesn't work very well with Windows.

The consequence here is that when the file contains a surrogate pair, codecvt_utf16 will read that pair and convert it to a single code point value that needs more than 16 bits, which would then have to be truncated to fit in a 16-bit wchar_t. This means this code really is limited to UCS-2. I've set the maxcode template parameter to 0xFFFF to reflect this.
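
As an illustration of why a surrogate pair cannot fit in a 16-bit wchar_t, the following sketch works through the standard UTF-16 arithmetic for one pair. The helper name combine_surrogates is hypothetical, and it assumes its arguments are already a valid high/low surrogate pair.

```cpp
#include <cstdint>

// Hypothetical helper: combines a UTF-16 surrogate pair into one code point.
// Assumes high is in [0xD800, 0xDBFF] and low is in [0xDC00, 0xDFFF].
constexpr std::uint32_t combine_surrogates(std::uint16_t high, std::uint16_t low)
{
    return 0x10000u
         + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00u);
}

// U+1F600 is stored in UTF-16 as the pair D83D DE00.
static_assert(combine_surrogates(0xD83D, 0xDE00) == 0x1F600, "pair decodes to U+1F600");
// The result needs more than 16 bits, so it lies outside the UCS-2 range.
static_assert(combine_surrogates(0xD83D, 0xDE00) > 0xFFFF, "outside the UCS-2 range");
```

Every code point produced this way is at least 0x10000, which is why a maxcode of 0xFFFF effectively restricts the stream to UCS-2.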

There are a number of other problems with wchar_t, and you might want to avoid it entirely; see: What's "wrong" with C++ wchar_t?
