Wrong output when using array indexing on a UTF-8 string

2022-01-12 00:00:00 string arrays char utf-8 php

I have run into a problem when using a UTF-8 string. I want to read a single character from the string, for example:

$string = "üÜöÖäÄ";
echo $string[0];

I expect to see ü, but I get � instead. Why?

Recommended answer

Use mb_substr($string, 0, 1, 'utf-8') to get the character instead.

In your code, the expression $string[0] returns the first byte of the UTF-8 encoded representation of your string, because PHP strings are effectively arrays of bytes (PHP is not aware of encodings internally).

Because the first character in your string is encoded as more than one byte (per the UTF-8 encoding rules), you are getting only part of the character. Under those same rules, the byte you retrieve is not valid as a character on its own, which is why you see the replacement character (�) or a question mark.
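To make the byte-level picture concrete, here is a minimal sketch that dumps the bytes involved. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
// Assumes this file is saved as UTF-8 and the mbstring extension is loaded.
$string = "üÜöÖäÄ";

// ü (U+00FC) is encoded in UTF-8 as the two bytes 0xC3 0xBC.
$firstChar = mb_substr($string, 0, 1, 'UTF-8');
$firstCharHex = bin2hex($firstChar);

// $string[0] is only the lead byte 0xC3.
$firstByteHex = bin2hex($string[0]);

// A lone lead byte is not valid UTF-8, so it cannot be shown as a character.
$firstByteIsValid = mb_check_encoding($string[0], 'UTF-8');

echo $firstCharHex, "\n";    // c3bc
echo $firstByteHex, "\n";    // c3
var_dump($firstByteIsValid); // bool(false)
```

The lone byte 0xC3 announces a two-byte sequence whose second byte is missing, which is why a terminal or browser substitutes a placeholder glyph.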

mb_substr knows the encoding rules, so it will not naively give you back just one byte; it returns as many bytes as are needed to encode the first character.
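As a rough illustration of what "as many bytes as needed" means, the lead byte of a UTF-8 sequence encodes how long the sequence is. The helper below, utf8_sequence_length, is hypothetical (it is not part of PHP or mbstring, and it is not a claim about how mb_substr is implemented). It only inspects the lead byte and does not validate continuation bytes or overlong forms.

```php
<?php
// Hypothetical helper, for illustration only: returns the number of bytes in
// the UTF-8 sequence announced by the given lead byte, or 0 if the byte
// cannot start a sequence. It does not validate the rest of the sequence.
function utf8_sequence_length(string $leadByte): int
{
    $value = ord($leadByte);
    if ($value < 0x80) { return 1; }            // 0xxxxxxx: ASCII
    if (($value & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    if (($value & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($value & 0xF8) === 0xF0) { return 4; } // 11110xxx
    return 0;                                   // continuation byte or other non-lead byte
}

// Assumes this file is saved as UTF-8.
$string = "üÜöÖäÄ";
$length = utf8_sequence_length($string[0]);
$firstChar = substr($string, 0, $length);

echo $length, "\n";    // 2
echo $firstChar, "\n"; // ü
```

In real code mb_substr remains the better choice, since it also copes with malformed input.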

You can confirm that $string[0] gives you back just one byte with:

$string = "üÜöÖäÄ";
echo strlen($string[0]); // prints 1

While mb_substr gives you back two bytes:

$string = "üÜöÖäÄ";
echo strlen(mb_substr($string, 0, 1, 'utf-8')); // prints 2

And these two bytes are in fact just one character (you need to use mb_strlen to see this):

$string = "üÜöÖäÄ";
echo mb_strlen(mb_substr($string, 0, 1, 'utf-8'), 'utf-8'); // prints 1

Finally, as Marwelln points out below, the situation becomes more tolerable if you call mb_internal_encoding to get rid of the redundant 'utf-8' argument:

$string = "üÜöÖäÄ";
mb_internal_encoding('utf-8');
echo mb_strlen(mb_substr($string, 0, 1));
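Building on the same idea, here is a short sketch that walks the whole string character by character instead of byte by byte. It assumes the mbstring extension is loaded, the file is saved as UTF-8, and the umlauts are stored in precomposed form (two bytes each).

```php
<?php
// Assumes mbstring is loaded and this file is saved as UTF-8.
mb_internal_encoding('UTF-8');

$string = "üÜöÖäÄ";
$chars = [];
$count = mb_strlen($string);      // number of characters, not bytes
for ($i = 0; $i < $count; $i++) {
    $chars[] = mb_substr($string, $i, 1);
}

echo $count, "\n";                // 6
echo strlen($string), "\n";       // 12
echo implode(' ', $chars), "\n";  // ü Ü ö Ö ä Ä
```

On PHP 7.4 or later, mb_str_split($string) should produce the same array in a single call.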

You can see most of the above in action by running these snippets yourself.
