How do I read UTF-8 characters via a pointer?

2022-01-07 00:00:00 unicode utf-8 character-encoding c++

Suppose I have UTF-8 content stored in memory. How do I read the characters using a pointer? I presume I need to watch for the 8th bit indicating a multi-byte character, but how exactly do I turn the sequence into a valid Unicode character? Also, is wchar_t the proper type to store a single Unicode character?

This is what I have in mind:


   wchar_t readNextChar (char*& p)
   { 
       wchar_t unicodeChar;
       char ch = *p++;

       if ((ch & 128) != 0)
       {
           // This is a multi-byte character, what do I do now?
           // char chNext = *p++; 
           // ... but how do I assemble the Unicode character?   
           ...
       }
       ...
       return unicodeChar;
   }  

Solution

You have to decode the UTF-8 bit pattern into its UTF-32 representation. If you want the actual Unicode code point, you have to use a 32-bit data type.
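
As a worked illustration of that bit-level decoding, here is a minimal sketch that assembles the three-byte sequence E2 82 AC (the euro sign) into U+20AC. The helper name decodeThreeByte is made up for this example, and it assumes the input is already known to be a well-formed three-byte sequence, so apart from debug assertions it does none of the validation a real decoder needs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the answer's code): combine a
// well-formed 3-byte UTF-8 sequence
//     1110xxxx 10yyyyyy 10zzzzzz
// into the code point xxxxyyyyyyzzzzzz.
// Assumes the caller has already validated the bytes.
std::uint32_t decodeThreeByte(const unsigned char* s)
{
    assert((s[0] & 0xF0) == 0xE0);   // lead byte of a 3-byte sequence
    assert((s[1] & 0xC0) == 0x80);   // continuation byte
    assert((s[2] & 0xC0) == 0x80);   // continuation byte

    std::uint32_t cp = s[0] & 0x0F;  // 4 payload bits from the lead byte
    cp = (cp << 6) | (s[1] & 0x3F);  // 6 payload bits from byte 2
    cp = (cp << 6) | (s[2] & 0x3F);  // 6 payload bits from byte 3
    return cp;
}
```

For E2 82 AC the payload bits are 0010, 000010 and 101100, which concatenate to 0x20AC.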

On Windows, wchar_t is NOT large enough, as it is only 16 bits wide. Use an unsigned int or unsigned long instead, and use wchar_t only when dealing with UTF-16 code units.

On other platforms, wchar_t is usually 32 bits wide. But when writing portable code, you should stay away from wchar_t except where absolutely needed (like std::wstring).
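
To see why a 16-bit type cannot hold every code point, here is a rough sketch of how a code point is written as UTF-16 code units. Anything above U+FFFF needs two units (a surrogate pair). The function name toUtf16 is hypothetical, and the choice of the C++11 types char32_t and char16_t in place of wchar_t is mine for this illustration, not something the answer prescribes.

```cpp
#include <cstddef>

// Hypothetical helper: encode one code point as UTF-16.
// Writes 1 or 2 code units to out and returns how many were written,
// or 0 if cp is not a valid Unicode scalar value.
std::size_t toUtf16(char32_t cp, char16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                              // out of range, or a surrogate

    if (cp < 0x10000)
    {
        out[0] = static_cast<char16_t>(cp);    // fits in one 16-bit unit
        return 1;
    }

    cp -= 0x10000;                             // 20 significant bits remain
    out[0] = static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate
    out[1] = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    return 2;
}
```

With this sketch, U+20AC comes out as the single unit 0x20AC, while U+1F600 comes out as the pair 0xD83D 0xDE00, so a single 16-bit wchar_t could only ever hold half of it.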

Try something more like this:

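// u_char and u_long are not standard C++ types. Many platforms provide them
// (e.g. via <sys/types.h> or the Winsock headers); if yours does not,
// substitute unsigned char and unsigned long.
#define IS_IN_RANGE(c, f, l)    (((c) >= (f)) && ((c) <= (l)))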

u_long readNextChar (char* &p) 
{  
    // TODO: since UTF-8 is a variable-length
    // encoding, you should pass in the input
    // buffer's actual byte length so that you
    // can determine if a malformed UTF-8
    // sequence would exceed the end of the buffer...

    u_char c1, c2, *ptr = (u_char*) p;
    u_long uc = 0;
    int seqlen;
    // int datalen = ... available length of p ...;    

    /*
    if( datalen < 1 )
    {
        // malformed data, do something !!!
        return (u_long) -1;
    }
    */

    c1 = ptr[0];

    if( (c1 & 0x80) == 0 )
    {
        uc = (u_long) (c1 & 0x7F);
        seqlen = 1;
    }
    else if( (c1 & 0xE0) == 0xC0 )
    {
        uc = (u_long) (c1 & 0x1F);
        seqlen = 2;
    }
    else if( (c1 & 0xF0) == 0xE0 )
    {
        uc = (u_long) (c1 & 0x0F);
        seqlen = 3;
    }
    else if( (c1 & 0xF8) == 0xF0 )
    {
        uc = (u_long) (c1 & 0x07);
        seqlen = 4;
    }
    else
    {
        // malformed data, do something !!!
        return (u_long) -1;
    }

    /*
    if( seqlen > datalen )
    {
        // malformed data, do something !!!
        return (u_long) -1;
    }
    */

    for(int i = 1; i < seqlen; ++i)
    {
        c1 = ptr[i];

        if( (c1 & 0xC0) != 0x80 )
        {
            // malformed data, do something !!!
            return (u_long) -1;
        }
    }

    switch( seqlen )
    {
        case 2:
        {
            c1 = ptr[0];

            if( !IS_IN_RANGE(c1, 0xC2, 0xDF) )
            {
                // malformed data, do something !!!
                return (u_long) -1;
            }

            break;
        }

        case 3:
        {
            c1 = ptr[0];
            c2 = ptr[1];

            switch (c1)
            {
                case 0xE0:
                    if (!IS_IN_RANGE(c2, 0xA0, 0xBF))
                    {
                        // malformed data, do something !!!
                        return (u_long) -1;
                    }
                    break;

                case 0xED:
                    if (!IS_IN_RANGE(c2, 0x80, 0x9F))
                    {
                        // malformed data, do something !!!
                        return (u_long) -1;
                    }
                    break;

                default:
                    if (!IS_IN_RANGE(c1, 0xE1, 0xEC) && !IS_IN_RANGE(c1, 0xEE, 0xEF))
                    {
                        // malformed data, do something !!!
                        return (u_long) -1;
                    }
                    break;
            }

            break;
        }

        case 4:
        {
            c1 = ptr[0];
            c2 = ptr[1];

            switch (c1)
            {
                case 0xF0:
                    if (!IS_IN_RANGE(c2, 0x90, 0xBF))
                    {
                        // malformed data, do something !!!
                        return (u_long) -1;
                    }
                    break;

                case 0xF4:
                    if (!IS_IN_RANGE(c2, 0x80, 0x8F))
                    {
                        // malformed data, do something !!!
                        return (u_long) -1;
                    }
                    break;

                default:
                    if (!IS_IN_RANGE(c1, 0xF1, 0xF3))
                    {
                        // malformed data, do something !!!
                        return (u_long) -1;
                    }
                    break;                
            }

            break;
        }
    }

    for(int i = 1; i < seqlen; ++i)
    {
        uc = ((uc << 6) | (u_long)(ptr[i] & 0x3F));
    }

    p += seqlen;
    return uc; 
}
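
The TODO at the top of the function above asks for the buffer length to be passed in. One possible way to do that, sketched under my own assumptions, is to pass an end pointer and check the sequence length against it before touching any continuation bytes. The names readNextCharBounded, decodeAll and kInvalid are invented for this sketch. So is the recovery policy of skipping one byte after an error and emitting U+FFFD for it, which is a simplification and not the only reasonable choice. The second-byte ranges are the same ones the switch statement above enforces.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Sentinel for malformed or truncated input (an assumption of this
// sketch; it is not a valid code point).
const std::uint32_t kInvalid = 0xFFFFFFFFu;

// Hypothetical length-aware decoder: reads one code point from [p, end)
// and advances p past it. On malformed or truncated input it returns
// kInvalid and advances p by one byte so the caller can resynchronize.
std::uint32_t readNextCharBounded(const unsigned char*& p, const unsigned char* end)
{
    if (p >= end)
        return kInvalid;

    const unsigned char c1 = p[0];
    std::uint32_t uc;
    int seqlen;

    if      (c1 < 0x80)                 { uc = c1;        seqlen = 1; }
    else if (c1 >= 0xC2 && c1 <= 0xDF)  { uc = c1 & 0x1F; seqlen = 2; }
    else if ((c1 & 0xF0) == 0xE0)       { uc = c1 & 0x0F; seqlen = 3; }
    else if (c1 >= 0xF0 && c1 <= 0xF4)  { uc = c1 & 0x07; seqlen = 4; }
    else                                { ++p; return kInvalid; }

    // The whole sequence must fit in the remaining buffer.
    if (seqlen > end - p)               { ++p; return kInvalid; }

    // The second byte has a narrower range after E0, ED, F0 and F4
    // (rejects overlong forms, surrogates, and values above U+10FFFF).
    if (seqlen > 1)
    {
        unsigned char lo = 0x80, hi = 0xBF;
        if      (c1 == 0xE0) lo = 0xA0;
        else if (c1 == 0xED) hi = 0x9F;
        else if (c1 == 0xF0) lo = 0x90;
        else if (c1 == 0xF4) hi = 0x8F;
        if (p[1] < lo || p[1] > hi)     { ++p; return kInvalid; }
    }

    // Check each continuation byte and fold in its 6 payload bits.
    for (int i = 1; i < seqlen; ++i)
    {
        if ((p[i] & 0xC0) != 0x80)      { ++p; return kInvalid; }
        uc = (uc << 6) | (p[i] & 0x3F);
    }

    p += seqlen;
    return uc;
}

// Decode a whole buffer, emitting U+FFFD for each rejected byte.
std::vector<std::uint32_t> decodeAll(const std::string& utf8)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(utf8.data());
    const unsigned char* end = p + utf8.size();
    std::vector<std::uint32_t> out;

    while (p < end)
    {
        const std::uint32_t cp = readNextCharBounded(p, end);
        out.push_back(cp == kInvalid ? 0xFFFDu : cp);
    }
    return out;
}
```

Calling decodeAll on the bytes 61 E2 82 AC F0 9F 98 80 should give the three code points U+0061, U+20AC and U+1F600. Because the length check happens before any continuation byte is read, a sequence cut off at the end of the buffer is reported as malformed instead of being read past the end.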
