Does my PHP application support UTF-8 correctly?

2021-12-26 00:00:00 unicode utf-8 php

I would like to make sure that everything I know about UTF-8 is correct. I have been trying to use UTF-8 for a while now, but I keep stumbling across more and more bugs and other weird things that make it seem almost impossible to have a 100% UTF-8 site. There is always a gotcha somewhere that I seem to miss. Perhaps someone here can correct my list or OK it so I don't miss anything important.

Database

Every site has to store its data somewhere. No matter what your PHP settings are, you must also configure the DB. If you can't access the config files, then make sure to run "SET NAMES 'utf8'" as soon as you connect. Also, make sure to use utf8_unicode_ci on all of your tables. This assumes MySQL as the database; you will have to adapt this for others.
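As a rough sketch of the connection step, assuming PDO with the MySQL driver: the helper name, host, schema, and credentials below are all made up for illustration. The DSN's charset parameter does the same job as issuing SET NAMES right after connecting.

```php
<?php
// Hypothetical helper (not from the question): builds a PDO MySQL DSN
// that names the connection character set.
function build_mysql_dsn(string $host, string $dbname, string $charset = 'utf8'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

// Hypothetical host and schema names; replace with your own.
$dsn = build_mysql_dsn('localhost', 'example_db');

// Connecting is left commented out so the sketch runs without a server:
// $pdo = new PDO($dsn, 'example_user', 'example_password');
// $pdo->exec("SET NAMES 'utf8'"); // the explicit form mentioned above
```

On MySQL versions that offer it, utf8mb4 is the character set that also covers four-byte characters; in this sketch that would just mean passing 'utf8mb4' as the third argument.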

Regex

I do a LOT of regex that is more complex than your average search-replace. I have to remember to use the "/u" modifier so that PCRE doesn't corrupt my strings. Yet even then, there are apparently still problems.
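Here is a small sketch of what the "/u" modifier changes. The variable names are mine, and the sample string is written as escaped bytes so it doesn't depend on how the file is saved.

```php
<?php
$e_acute = "\xC3\xA9"; // "é" encoded as UTF-8 (two bytes)

// Without /u the dot matches a single byte, so a two-byte character
// does not count as "exactly one character".
$byte_mode = preg_match('/^.$/', $e_acute);

// With /u the subject is treated as UTF-8 and the dot matches the whole character.
$utf8_mode = preg_match('/^.$/u', $e_acute);

// With /u, a subject that is not valid UTF-8 makes the call fail:
// it returns false and preg_last_error() reports the reason.
$bad_input  = preg_match('/./u', "\xC3"); // lone lead byte, truncated sequence
$bad_reason = preg_last_error();
```

So a false return (as opposed to 0) from a "/u" pattern is worth checking for; it usually means the input was not valid UTF-8 rather than that nothing matched.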

String Functions

All of the default string functions (strlen(), strpos(), etc.) should be replaced with the Multibyte String Functions, which look at characters instead of bytes.
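To make the byte/character difference concrete, here is a minimal comparison. It assumes the mbstring extension is loaded; the sample word and variable names are just for illustration.

```php
<?php
$word = "na\xC3\xAFve"; // "naïve" in UTF-8: five characters, six bytes

$bytes = strlen($word);              // counts bytes
$chars = mb_strlen($word, 'UTF-8');  // counts characters

$byte_pos = strpos($word, 'v');                 // offset measured in bytes
$char_pos = mb_strpos($word, 'v', 0, 'UTF-8');  // offset measured in characters

$cut_bytes = substr($word, 0, 3);               // cuts the "ï" in half
$cut_chars = mb_substr($word, 0, 3, 'UTF-8');   // keeps the "ï" whole
```

Passing the encoding explicitly, as done here, avoids depending on whatever the internal mbstring encoding happens to be on a given server.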

Headers

You should make sure that your server is returning the correct header so the browser knows which charset you are trying to use (just like you must tell MySQL).

header('Content-Type: text/html; charset=utf-8');

It is also a good idea to put the correct <meta> tag in the page head, though the actual header will override it should they differ.

<meta http-equiv="Content-Type" content="text/html;charset=utf-8">

Questions

Do I need to convert everything that I receive from the user agent (HTML forms and URIs) to UTF-8 when the page loads, or can I just leave the strings/values as they are and still run them through these functions without a problem?

If I do need to convert everything to UTF-8 - then what steps should I take? mb_detect_encoding seems to be built for this but I keep seeing people complain that it doesn't always work. mb_check_encoding also seems to have a problem telling a good UTF-8 string from a malformed one.

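Does PHP store strings in memory differently depending on the encoding in use (like file types), or are they still stored like regular strings, with some characters just interpreted differently (like &amp; vs & in HTML)? chazomaticus answers this question: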

In PHP (up to PHP5, anyway), strings are just sequences of bytes. There is no implied or explicit character set associated with them; that's something the programmer must keep track of.
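A quick way to see the "just bytes" point for yourself, assuming mbstring is available (variable names are mine): the same character has different byte sequences in different encodings, and nothing in the string records which one it is.

```php
<?php
$latin1 = "\xE9"; // "é" as a single ISO-8859-1 byte

// Converting only works because we tell PHP what the source encoding is;
// the string itself carries no such information.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

$latin1_hex = bin2hex($latin1); // raw bytes before conversion
$utf8_hex   = bin2hex($utf8);   // raw bytes after conversion

// The Latin-1 byte on its own is not a valid UTF-8 sequence.
$latin1_is_utf8 = mb_check_encoding($latin1, 'UTF-8');
```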

If I give a non-UTF-8 string to an mb_* function, will it ever cause a problem?

If a UTF string is improperly encoded, will something go wrong (like a parsing error in a regex?) or will it just mark an entity as bad (HTML)? Is there ever a chance that improperly encoded strings will result in a function returning FALSE because the string is bad?

I have heard that you should mark your forms as UTF-8 also (accept-charset="UTF-8"), but I am not sure what the benefit is..?

Was UTF-16 written to address a limit in UTF-8? Like did UTF-8 run out of space for characters? (Y2(UTF)k?)

Functions

Here are a couple of the custom PHP functions I have found, but I don't have any way to verify that they actually work. Perhaps someone has an example which I can use. First is convertToUTF8() and then seems_utf8 from WordPress.

function seems_utf8($str) {
    $length = strlen($str);
    for ($i=0; $i < $length; $i++) {
        $c = ord($str[$i]);
        if ($c < 0x80) $n = 0; # 0bbbbbbb
        elseif (($c & 0xE0) == 0xC0) $n=1; # 110bbbbb
        elseif (($c & 0xF0) == 0xE0) $n=2; # 1110bbbb
        elseif (($c & 0xF8) == 0xF0) $n=3; # 11110bbb
        elseif (($c & 0xFC) == 0xF8) $n=4; # 111110bb
        elseif (($c & 0xFE) == 0xFC) $n=5; # 1111110b
        else return false; # Does not match any model
        for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
            if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
                return false;
        }
    }
    return true;
}

function is_utf8($str) {
    $c=0; $b=0;
    $bits=0;
    $len=strlen($str);
    for($i=0; $i<$len; $i++){
        $c=ord($str[$i]);
        if($c >= 128){ // 0x80 and above are non-ASCII; a bare 0x80 is a stray continuation byte
            if(($c >= 254)) return false;
            elseif($c >= 252) $bits=6;
            elseif($c >= 248) $bits=5;
            elseif($c >= 240) $bits=4;
            elseif($c >= 224) $bits=3;
            elseif($c >= 192) $bits=2;
            else return false;
            if(($i+$bits) > $len) return false;
            while($bits > 1){
                $i++;
                $b=ord($str[$i]);
                if($b < 128 || $b > 191) return false;
                $bits--;
            }
        }
    }
    return true;
}

If anyone is interested, I found a great example page to use when testing UTF-8.

Recommended answer

Do I need to convert everything that I receive from the user agent (HTML forms and URIs) to UTF-8 when the page loads

No. The user agent should be submitting data in UTF-8 format; if not, you are losing the benefit of Unicode.

The way to ensure a user agent submits in UTF-8 format is to serve the page that contains the form in UTF-8 encoding. Use the Content-Type header (and meta http-equiv too if you intend the form to be saved and work standalone).

I have heard that you should mark your forms as UTF-8 also (accept-charset="UTF-8")

Don't. It was a nice idea in the HTML standard, but IE never got it right. It was supposed to state an exclusive list of allowable charsets, but IE treats it as a list of additional charsets to try, on a per-field basis. So if you have an ISO-8859-1 page and an accept-charset="UTF-8" form, IE will first try to encode a field as ISO-8859-1, and if there's a non-8859-1 character in there, then it'll resort to UTF-8.

But since IE does not tell you whether it has used ISO-8859-1 or UTF-8, that's of absolutely no use to you. You would have to guess, for each field separately, which encoding was in use! Not useful. Omit the attribute and serve your pages as UTF-8; that's the best you can do at the moment.

If a UTF string is improperly encoded will something go wrong

If you let such a sequence get through to the browser you could be in trouble. There are ‘overlong sequences’, which encode a low-numbered codepoint in a longer sequence of bytes than is necessary. This means that if you are filtering ‘<’ by looking for that ASCII character in a sequence of bytes, you could miss one, and let a script element into what you thought was safe text.

Overlong sequences were banned back in the early days of Unicode, but it took Microsoft a very long time to get their shit together: IE would interpret the byte sequence ‘\xC0\xBC’ as a ‘<’ up until IE6 Service Pack 1. Opera also got it wrong up to (about, I think) version 7. Luckily these older browsers are dying out, but it's still worth filtering overlong sequences in case those browsers are still about now (or new idiot browsers make the same mistake in future). You can do this, and fix other bad sequences, with a regex that allows only proper UTF-8 through, such as this one from W3.
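For reference, here is a sketch of that kind of filter in PHP. The byte ranges follow the W3C's published UTF-8 form-validation pattern as I remember it; the function name is hypothetical, and the possessive quantifier is my own addition to keep PCRE from backtracking on long input.

```php
<?php
// Hypothetical helper name. Runs in byte mode (no /u) on purpose:
// the pattern itself spells out which byte sequences are acceptable.
function is_well_formed_utf8(string $bytes): bool
{
    $pattern = '/\A(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII: tab, LF, CR, printable
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*+\z/x';

    return preg_match($pattern, $bytes) === 1;
}

$overlong_rejected = !is_well_formed_utf8("\xC0\xBC"); // overlong "<"
$plain_accepted    = is_well_formed_utf8("<p>ok</p>");
```

Note that the ASCII branch also rejects most control characters, so this doubles as a first pass at the cleanup discussed further down.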

If you are using the mb_* functions in PHP, you might be insulated from these issues. I can't say for sure, as mb_* was unusably fragile when I was still writing PHP.
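One way to check this on your own install is to feed mb_check_encoding() the overlong sequence directly. This is only a probe, assuming mbstring is loaded; on current PHP releases I would expect the overlong form to be rejected, but older builds may well behave differently.

```php
<?php
$overlong_lt = "\xC0\xBC"; // overlong two-byte encoding of "<"
$plain_lt    = "<";        // the ordinary one-byte form

// false here means mbstring treats the overlong form as malformed UTF-8.
$overlong_ok = mb_check_encoding($overlong_lt, 'UTF-8');
$plain_ok    = mb_check_encoding($plain_lt, 'UTF-8');
```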

In any case, this is also a good time to remove control characters, which are a large and generally unappreciated source of bugs. I would remove chars 9 and 13 from submitted strings in addition to the others the W3 regex takes out; it is also worth removing plain newlines from strings that you know aren't supposed to come from multiline textboxes.
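A minimal sketch of that cleanup, with a hypothetical function name and flag. It works on bytes, which is safe for UTF-8 input because bytes below 0x80 never occur inside a multibyte sequence.

```php
<?php
// Hypothetical helper: strips ASCII control characters, including tab (9)
// and carriage return (13). Newline (10) is kept only when $keep_newlines is true.
function strip_control_chars(string $text, bool $keep_newlines = false): string
{
    $pattern = $keep_newlines
        ? '/[\x00-\x09\x0B-\x1F\x7F]/' // everything in the control range except \x0A
        : '/[\x00-\x1F\x7F]/';         // the whole control range, newline included

    return preg_replace($pattern, '', $text);
}

$multiline  = strip_control_chars("a\tb\r\nc\x00", true);
$singleline = strip_control_chars("a\tb\r\nc\x00");
```

Whether tabs should be dropped or turned into spaces depends on the field; dropping them is simply what the paragraph above describes.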

Was UTF-16 written to address a limit in UTF-8?

No. UTF-16 is an encoding built on two-byte units that's used to make indexing Unicode strings easier in memory (it dates from the days when all of Unicode would fit in two bytes; systems like Windows and Java still do it that way, with characters outside that range taking two units). Unlike UTF-8 it is not compatible with ASCII, and is of little-to-no use on the Web. But you occasionally meet it in saved files, usually ones saved by Windows users who have been misled by Windows's description of UTF-16LE as "Unicode" in Save-As menus.
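To illustrate both points (not ASCII-compatible, and no longer strictly two bytes per character), here is a small sketch assuming mbstring is loaded; the sample characters and variable names are arbitrary.

```php
<?php
// "A" is one byte in ASCII and UTF-8, but two bytes in UTF-16LE.
$a_utf16 = mb_convert_encoding("A", 'UTF-16LE', 'UTF-8');
$a_hex   = bin2hex($a_utf16);

// U+1F600 lies outside the original 16-bit range, so UTF-16 needs
// a surrogate pair (two 16-bit units, four bytes) to represent it.
$emoji_utf8  = "\xF0\x9F\x98\x80"; // U+1F600 in UTF-8
$emoji_utf16 = mb_convert_encoding($emoji_utf8, 'UTF-16LE', 'UTF-8');
$emoji_hex   = bin2hex($emoji_utf16);
```

The zero byte inside the UTF-16 form of "A" is the reason byte-oriented tools and ASCII-based protocols choke on it.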

seems_utf8

This is very inefficient compared to the regex!

Also, make sure to use utf8_unicode_ci on all of your tables.

You can actually sort of get away without this, treating MySQL as a store for nothing but bytes and only interpreting them as UTF-8 in your script. The advantage of using utf8_unicode_ci is that it will collate (sort and do case-insensitive compares) with knowledge about non-ASCII characters, so that, e.g., ‘ŕ’ and ‘Ŕ’ are treated as the same character. If you use a non-UTF8 collation you should stick to binary (case-sensitive) matching.

Whichever you choose, do it consistently: use the same character set for your tables as you do for your connection. What you want to avoid is a lossy character set conversion between your scripts and the database.
