strlen() and UTF-8 encoding

2021-12-28 00:00:00 unicode utf-8 php strlen

Assuming UTF-8 encoding and strlen() in PHP, is it possible for this string to have a length of 4?

I'm only interested in strlen(), not other functions.

This is the string:

$1ï¿½2

I have tested it on my own computer, verified that the encoding is UTF-8, and the answer I get is 6.

I don't see anything in the manual for strlen(), or in anything I've read about UTF-8, that would explain why some of the characters above would count for less than one.

PS: This question and its answer (4) come from a ZCE mock test I bought on eBay.

Recommended answer

The string you posted is six characters long: $1ï¿½2 (dollar sign, digit one, lowercase i with diaeresis, inverted question mark, one-half fraction, digit two).

If strlen() were called with a UTF-8 representation of that string, you would get a result of nine, because strlen() counts bytes and ï, ¿ and ½ each take two bytes in UTF-8 (probably nine, that is; there are multiple representations with different lengths).
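The byte count can be checked with a minimal sketch like the one below. It assumes PHP 7.0 or later (for the `\u{...}` escape) and the precomposed form of each character; the variable name `$sixChars` is made up for illustration. `preg_match_all()` with the `u` modifier is used only to show the character count next to the byte count.

```php
<?php
// U+00EF (ï), U+00BF (¿) and U+00BD (½) are written as escapes so the
// result does not depend on the encoding this file is saved in.
$sixChars = "\$1\u{00EF}\u{00BF}\u{00BD}2";

// strlen() counts bytes: 1 + 1 + 2 + 2 + 2 + 1
echo strlen($sixChars), "\n";                   // 9

// Counting code points instead: the "u" modifier makes PCRE treat the
// subject as UTF-8, so each match of "." is one character.
echo preg_match_all('/./su', $sixChars), "\n";  // 6
```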

However, if we were to store that string as ISO 8859-1 or CP1252, we would have a six-byte sequence that is also legal UTF-8. Reinterpreting those 6 bytes as UTF-8 would then result in 4 characters: $1�2 (dollar sign, digit one, Unicode replacement character, digit two). That is, the UTF-8 encoding of the single character '�' is identical to the ISO 8859-1 encoding of the three characters "ï¿½".
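A short sketch of that reinterpretation, again assuming PHP 7.0 or later; `$bytes` is an illustrative name, and the hex escapes stand in for the ISO 8859-1 storage described above.

```php
<?php
// The six characters stored one byte each, as ISO 8859-1 would store them:
// "$" = 0x24, "1" = 0x31, "ï" = 0xEF, "¿" = 0xBF, "½" = 0xBD, "2" = 0x32
$bytes = "\$1\xEF\xBF\xBD2";

echo strlen($bytes), "\n";                   // 6
echo bin2hex($bytes), "\n";                  // 2431efbfbd32

// The same bytes are the UTF-8 encoding of "$1", U+FFFD, "2".
var_dump($bytes === "\$1\u{FFFD}2");         // bool(true)

// Read as UTF-8, the sequence contains four code points.
echo preg_match_all('/./su', $bytes), "\n";  // 4
```

Note that strlen() itself still reports 6 for this sequence, since it counts bytes; the figure 4 only appears when the bytes are counted as UTF-8 characters.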

The replacement character often gets inserted when a UTF-8 decoder reads data that is not valid UTF-8.

It appears that the original string went through multiple layers of misinterpretation: first a UTF-8 decoder was applied to non-UTF-8 data (producing $1�2), and then whatever you used to analyze that data misread it again (producing $1ï¿½2).
