What makes PHP Unicode-incompatible?

2021-12-26 00:00:00 unicode php

I am able to use UTF-8 characters just fine in my scripts.

In fact, variable names and function names can contain Unicode characters.
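A minimal sketch of that claim, assuming the script file is saved as UTF-8 (the names `$währung` and `écriture` are made up for illustration). It works because PHP accepts any byte from 0x80 to 0xFF in an identifier, which is a byte-level rule rather than real Unicode awareness:

```php
<?php
// Assumes this file is saved as UTF-8; both names are illustrative only.
$währung = 'EUR';

// PHP allows bytes 0x80-0xFF in identifiers, so the UTF-8 bytes of "é" are accepted.
function écriture(string $text): string
{
    return 'écrit: ' . $text;
}

echo écriture($währung), PHP_EOL;
```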

There is also the mbstring extension, which deals with multi-byte strings, yet countless articles criticize PHP for its lack of Unicode support.

I don't get it: why is PHP said not to support Unicode?

Recommended answer

When PHP was started, UTF-8 was not really supported. We are talking about a time when non-Unicode operating systems like Windows 98/Me were still current and when other big languages like Delphi were also non-Unicode. Not all languages were designed with Unicode in mind from day one, and completely changing a language to Unicode without breaking a lot of existing code is hard. Delphi, for example, only became Unicode compatible with Delphi 2009, while other languages like Java or C# were designed around Unicode from day one.

So as PHP grew and became PHP 3, PHP 4, and now PHP 5, simply no one decided to add Unicode. Why? Presumably to stay compatible with existing scripts, or because utf8_decode()/utf8_encode() and mbstring already existed and worked. I do not know for sure, but I strongly believe it has something to do with organic growth. Features do not simply exist by default; someone has to write them, and that has not happened for PHP yet.

OK, I misread the question. The real question is: how are strings stored internally? If I type "Währung" or "Écriture", which encoding is used to create the bytes? In PHP's case, it is effectively ASCII with a code page: a string is a sequence of bytes, and nothing in the string records which encoding those bytes are in. That means that if I encode a string as ISO-8859-15 and you decode it with some Chinese code page, you will get weird results. The alternative is languages like C# or Java, where everything is stored as Unicode: there is no code page anymore, and theoretically you cannot mess up. I recommend Joel's article about Unicode and character sets, but essentially it boils down to how strings are stored internally, and PHP's answer is "not in Unicode". You therefore have to be very careful and explicit when processing strings, so that each string stays in the proper encoding during input, storage (database), and output, which is very error-prone.
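To make the "bytes, not characters" point concrete, here is a minimal sketch. It assumes the script file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are illustrative only. It shows that the built-in `strlen()` counts bytes, that the same word has different bytes in ISO-8859-15, and that decoding with the wrong code page silently produces different text instead of an error:

```php
<?php
// Assumes this file is saved as UTF-8 and the mbstring extension is loaded.
$word = "Währung";

// strlen() counts bytes: "ä" occupies two bytes in UTF-8.
$byteCount = strlen($word);
// mb_strlen() counts characters, but only because we name the encoding.
$charCount = mb_strlen($word, 'UTF-8');

// The same word in ISO-8859-15: "ä" becomes the single byte 0xE4.
$latin    = mb_convert_encoding($word, 'ISO-8859-15', 'UTF-8');
$utf8Hex  = bin2hex($word);
$latinHex = bin2hex($latin);

// Wrong assumption: treat the UTF-8 bytes as if they were ISO-8859-15.
// PHP raises no error; each byte is simply mapped to its own character.
$garbled = mb_convert_encoding($word, 'UTF-8', 'ISO-8859-15');

echo "bytes: $byteCount, characters: $charCount", PHP_EOL;
echo "UTF-8 bytes:       $utf8Hex", PHP_EOL;
echo "ISO-8859-15 bytes: $latinHex", PHP_EOL;
echo "mis-decoded:       $garbled", PHP_EOL;
```

Nothing in `$word` or `$latin` says which encoding it holds, so the caller has to track that through input, storage, and output.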
