Purpose of the utf8_encode function

2022-01-07 00:00:00 regex utf-8 character-encoding php

Suppose I save my files encoded as UTF-8.

Within a PHP script, a string will be compared:

$string = "ぁ";
$string = utf8_encode($string); // Do I need this step?
if (preg_match('/ぁ/u', $string)) {
    // Do something if it matches...
}

Is that string really UTF-8 without the utf8_encode() function? If the files are already encoded as UTF-8, is this function unnecessary?

Recommended answer

If you read the manual entry for utf8_encode, you will see that it converts an ISO-8859-1 encoded string to UTF-8. The function name is a horrible misnomer, as it suggests some sort of automagic encoding that is necessary. That is not the case. If your source code is saved as UTF-8 and you assign "あ" to $string, then $string holds the character "あ" encoded in UTF-8. No further action is necessary. In fact, running utf8_encode() on a string that is already UTF-8 treats its bytes as ISO-8859-1 and re-encodes each one, which garbles the string.
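The garbling described above can be reproduced in a few lines. This is a minimal sketch, assuming the script file itself is saved as UTF-8 and the mbstring extension is available; mb_convert_encoding() from ISO-8859-1 to UTF-8 is used here as a stand-in for utf8_encode(), which is deprecated as of PHP 8.2, and the variable name $garbled is only illustrative.

```php
<?php
// Assumption: this file is saved as UTF-8.
$string = "あ";

// The literal already holds the three UTF-8 bytes of U+3042.
echo bin2hex($string), "\n";              // e38182
echo preg_match('/あ/u', $string), "\n";  // 1

// Stand-in for utf8_encode(): treat each byte as an ISO-8859-1
// character and re-encode it as UTF-8.
$garbled = mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');

// Each of the three original bytes has become a two-byte sequence.
echo bin2hex($garbled), "\n";             // c3a3c281c282
echo preg_match('/あ/u', $garbled), "\n"; // 0
```

The regular expression matches the untouched string and no longer matches the "converted" one, so the extra step in the question would break the comparison rather than enable it.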

To elaborate a little more, your source code is read as a byte sequence. PHP interprets the parts that matter to it (keywords, operators and so on) as ASCII. UTF-8 is backwards compatible with ASCII, which means that all the "normal" ASCII characters are represented by the same byte in both ASCII and UTF-8. So a " is interpreted as a " by PHP regardless of whether the file is supposed to be saved as ASCII or UTF-8. Anything between the quotes, PHP simply takes as a literal bit sequence. So PHP sees your "あ" as "11100011 10000001 10000010". It doesn't care what exactly is between the quotes; it just uses it as-is.
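To see the bit sequence mentioned above, the string can be split into its bytes and each byte printed in binary. This is a small sketch, again assuming the file is saved as UTF-8; the variable name $bits is only illustrative.

```php
<?php
// Assumption: this file is saved as UTF-8.
$string = "あ";

// strlen() counts bytes, not characters.
echo strlen($string), "\n"; // 3

// str_split() without a length argument splits the string byte by byte.
$bits = [];
foreach (str_split($string) as $byte) {
    $bits[] = str_pad(decbin(ord($byte)), 8, '0', STR_PAD_LEFT);
}
echo implode(' ', $bits), "\n"; // 11100011 10000001 10000010

// The literal is byte-for-byte identical to the escaped UTF-8 sequence.
var_dump($string === "\xE3\x81\x82"); // bool(true)
```

If the file were saved in a different encoding, the same literal would produce different bytes, which is why the encoding of the source file is what decides the content of the string.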
