C++ & Boost: encoding/decoding UTF-8

2021-12-24 00:00:00 unicode utf-8 c++ boost

I'm trying to do a very simple task: take a Unicode-aware wstring and convert it to a string encoded as UTF-8 bytes, and then the other way around: take a string containing UTF-8 bytes and convert it to a Unicode-aware wstring.

The problem is, I need it to be cross-platform and I need it to work with Boost... and I just can't seem to figure out a way to make it work. I've been toying with

  • http://www.edobashira.com/2010/03/using-boost-code-facet-for-reading-utf8.html and
  • http://www.boost.org/doc/libs/1_46_0/libs/serialization/doc/codecvt.html

I've been trying to convert that code to use stringstream/wstringstream instead of files or whatever, but nothing seems to work.

For instance, in Python it would look like so:

>>> u"שלום"
u'\u05e9\u05dc\u05d5\u05dd'
>>> u"שלום".encode("utf8")
'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>>> '\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'.decode("utf8")
u'\u05e9\u05dc\u05d5\u05dd'

What I'm ultimately after is this:

wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
wstring ws(uchars);
string s = encode_utf8(ws);
// s now holds "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d"
wstring ws2 = decode_utf8(s);
// ws2 now holds {0x5e9, 0x5dc, 0x5d5, 0x5dd}

I really don't want to add another dependency on ICU or something in that spirit... but to my understanding, it should be possible with Boost.

Some sample code would be greatly appreciated! Thanks

Solution

Thanks everyone, but ultimately I resorted to http://utfcpp.sourceforge.net/ -- it's a header-only library that's very lightweight and easy to use. I'm sharing some demo code here, should anyone find it useful:

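#include <iterator>
#include <string>
#include "utf8.h"  // utfcpp

// utf8to32/utf32to8 treat every element as one 32-bit code point, so this
// assumes a 32-bit wchar_t (e.g. Linux, macOS). Where wchar_t is 16 bits
// (Windows), utf8::utf8to16 / utf8::utf16to8 are the matching calls.
inline void decode_utf8(const std::string& bytes, std::wstring& wstr)
{
    utf8::utf8to32(bytes.begin(), bytes.end(), std::back_inserter(wstr));
}
inline void encode_utf8(const std::wstring& wstr, std::string& bytes)
{
    utf8::utf32to8(wstr.begin(), wstr.end(), std::back_inserter(bytes));
}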

Usage:

wstring ws(L"\u05e9\u05dc\u05d5\u05dd");
string s;
encode_utf8(ws, s);
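
For anyone who wants to avoid even a header-only dependency, here is a rough sketch of what the conversion does under the hood, using the value-returning signatures from the question. This code is an illustration only and is not taken from utfcpp or Boost. It assumes each wchar_t holds one full code point (a 32-bit wchar_t), and it does not reject overlong encodings, so treat it as a starting point rather than a validated implementation.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Illustrative helpers (hypothetical, not from utfcpp or Boost).
// Assumption: one wchar_t == one Unicode code point (32-bit wchar_t).
inline std::string encode_utf8(const std::wstring& wstr)
{
    std::string bytes;
    for (std::size_t i = 0; i < wstr.size(); ++i) {
        const unsigned long cp = static_cast<unsigned long>(wstr[i]);
        // Surrogates and values above U+10FFFF are not encodable.
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            throw std::invalid_argument("not a Unicode scalar value");
        if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
            bytes += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            bytes += static_cast<char>(0xC0 | (cp >> 6));
            bytes += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            bytes += static_cast<char>(0xE0 | (cp >> 12));
            bytes += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            bytes += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes: 11110xxx + 3 continuation bytes
            bytes += static_cast<char>(0xF0 | (cp >> 18));
            bytes += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            bytes += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            bytes += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return bytes;
}

inline std::wstring decode_utf8(const std::string& bytes)
{
    std::wstring wstr;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i++]);
        unsigned long cp = 0;
        int extra = 0;  // number of continuation bytes that must follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        for (; extra > 0; --extra) {
            if (i >= bytes.size())
                throw std::invalid_argument("truncated UTF-8 sequence");
            const unsigned char c = static_cast<unsigned char>(bytes[i++]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the low 6 payload bits
        }
        wstr += static_cast<wchar_t>(cp);
    }
    return wstr;
}
```

With these, `decode_utf8(encode_utf8(ws))` should give back `ws` for any string of valid code points. For production use, a maintained library such as utfcpp is still the safer choice, because it also validates the input.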
