Does (w)ifstream support different encodings?
When I read a text file into a wide character string (std::wstring) using a wifstream, does the stream implementation support different encodings, i.e. can it be used to read e.g. ASCII, UTF-8, and UTF-16 files?
If not, what would I have to do?
(I need to read the entire file, if that makes a difference.)
Recommended answer
C++ supports character encodings by means of std::locale and the facet std::codecvt. The general idea is that a locale object describes the aspects of the system that might vary from culture to culture and from (human) language to language. These aspects are broken down into facets, which are classes installed in a locale that define how localization-dependent objects (including I/O streams) behave. When you read from an istream or write to an ostream, the actual reading or writing of each character is filtered through the locale's facets. The facets cover not only the encoding of Unicode types but also such varied features as how large numbers are written (e.g. with commas or periods), currency, time, capitalization, and a slew of other details.
However, just because the facilities exist to do encodings doesn't mean the standard library actually handles all encodings, nor does it make such code simple to get right. Even something as basic as the size of the character type you should be reading into (let alone the encoding part) is difficult, as wchar_t can be too small (mangling your data) or too large (wasting space), and the most common compilers (e.g. Visual C++ and GNU C++) differ on how big it is: 16 bits in Visual C++, typically 32 bits in GNU C++ on Linux. So you generally need to find external libraries to do the actual encoding.
- iconv is generally acknowledged to be correct, but examples of how to bind it to the C++ mechanism are hard to find.
- jla3ep mentions libICU, which is very thorough, but its C++ API does not try to play nicely with the standard (as far as I can tell; you can scan the examples to see if you can do better).
The most straightforward example I can find that covers all the bases is from Boost's UTF-8 codecvt facet, with an example that specifically tries to handle UTF-8 (with UCS-4 as the in-memory form) for use by I/O streams. It looks like this, though I don't suggest just copying it verbatim. It takes a little more digging in the source to understand it (and I don't claim to):
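// Treat wchar_t as the UCS-4 character type (only adequate where wchar_t is 32 bits).
typedef wchar_t ucs4_t;
// Copy the current global locale, swapping in Boost's UTF-8 codecvt facet.
std::locale old_locale;
std::locale utf8_locale(old_locale, new utf8_codecvt_facet<ucs4_t>);
...
// Imbue the stream so bytes read from the file are decoded as UTF-8.
std::wifstream input_file("data.utf8");
input_file.imbue(utf8_locale);
ucs4_t item = 0;
while (input_file >> item) { ... }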
To understand more about locales, and how they use facets (including codecvt), take a look at the following:
- Nathan Myers has a thorough explanation of locales and facets. Myers was one of the designers of the locale concept. He has more formal documentation if you want to wade through it.
- Apache's Standard Library implementation (formerly RogueWave's) has a full list of facets.
- Nicolai Josuttis' The C++ Standard Library devotes Chapter 14 to the subject.
- Angelika Langer and Klaus Kreft's Standard C++ IOStreams and Locales devotes a whole book to it.