Replacing invalid UTF-8 characters with question marks; mbstring.substitute_character seems to be ignored

2021-12-28 00:00:00 utf-8 character-encoding php mbstring

I would like to replace invalid UTF-8 characters with question marks (PHP 5.3.5).

So far I have this solution, but invalid characters are removed instead of being replaced by '?'.

    function replace_invalid_utf8($str)
    {
        return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    }

    echo mb_substitute_character()."\n";
    echo replace_invalid_utf8('éééaaaàààeeÃ©')."\n";
    echo replace_invalid_utf8('eeeaaaaaaeeÃ©')."\n";

It should output:

    63 // ASCII code for '?' character
    ???aaa???eé // or ??aa??eé
    eeeaaaaaaeeé

But it currently outputs:

    63
    aaaee // removed invalid characters
    eeeaaaaaaeeé

Any suggestions?

Would you do it another way (using preg_replace(), for example)?

Thanks.

Recommended answer

You can use mb_convert_encoding(), or htmlspecialchars() with its ENT_SUBSTITUTE option (available since PHP 5.4). Of course, you can use preg_match() too. If you use intl, you can use UConverter (available since PHP 5.5).
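Before picking one of these approaches, it can help to check whether a string contains any invalid sequence at all. The following is a minimal sketch, not part of the original answer: `is_valid_utf8` is a hypothetical helper name, and it relies on PCRE validating the subject string when the `u` modifier is used.

```php
<?php
// Hypothetical helper (not from the answer): reports whether $str is
// well-formed UTF-8. With the `u` modifier PCRE validates the subject,
// and preg_match() returns false when the subject is malformed.
function is_valid_utf8($str)
{
    return preg_match('//u', $str) === 1;
}

$ok = is_valid_utf8("caf\xC3\xA9"); // "café", well-formed
$bad = is_valid_utf8("caf\xC3");    // truncated two-byte sequence
```

Strings that pass this check can skip the replacement step entirely, which may matter given the cost differences between the approaches.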

The recommended substitute character for an invalid byte sequence is U+FFFD. See "3.1.2 Substituting for Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations for details.

When using mb_convert_encoding(), you can specify the substitute character by passing a Unicode code point to mb_substitute_character() or by setting the mbstring.substitute_character directive. The default substitute character is ? (QUESTION MARK, U+003F).
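Because mb_substitute_character() changes a process-wide setting, one option is to set it only for the duration of a conversion and restore it afterwards. This is a rough sketch under that assumption; `convert_with_substitute` is a hypothetical helper name, not an mbstring function, and the exact output for malformed input can differ between PHP versions.

```php
<?php
// U+FFFD encoded as UTF-8 is the three bytes EF BF BD.
$replacement = "\xEF\xBF\xBD";

// Hypothetical helper (not part of mbstring): converts with a temporary
// substitute character, then restores whatever setting was active before.
function convert_with_substitute($str, $codepoint)
{
    $previous = mb_substitute_character(); // int code point or a mode string
    mb_substitute_character($codepoint);
    $result = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);
    return $result;
}

// Well-formed input passes through unchanged.
$clean = convert_with_substitute("abc", 0xFFFD);
```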

    // REPLACEMENT CHARACTER (U+FFFD)
    mb_substitute_character(0xFFFD);

    function replace_invalid_byte_sequence($str)
    {
        return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    }

    function replace_invalid_byte_sequence2($str)
    {
        return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
    }

UConverter offers both a procedural and an object-oriented API.

    function replace_invalid_byte_sequence3($str)
    {
        return UConverter::transcode($str, 'UTF-8', 'UTF-8');
    }

    function replace_invalid_byte_sequence4($str)
    {
        return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
    }

When using preg_match(), you need to pay attention to the byte ranges in order to avoid the UTF-8 non-shortest-form vulnerability. The range of valid trail bytes changes depending on the lead byte.

    lead byte: 0x00 - 0x7F, 0xC2 - 0xF4
    trail byte: 0x80(or 0x90 or 0xA0) - 0xBF(or 0x8F)

You can refer to the following resources to check the byte ranges.

"Syntax of UTF-8 Byte Sequences" in RFC 3629
"Table 3-7. Well-Formed UTF-8 Byte Sequences" in the Unicode Standard 6.1
"Multilingual form encoding" in W3C Internationalization

The table of byte ranges is below.

    Code Points           First Byte   Second Byte   Third Byte   Fourth Byte
    U+0000 - U+007F       00 - 7F
    U+0080 - U+07FF       C2 - DF      80 - BF
    U+0800 - U+0FFF       E0           A0 - BF       80 - BF
    U+1000 - U+CFFF       E1 - EC      80 - BF       80 - BF
    U+D000 - U+D7FF       ED           80 - 9F       80 - BF
    U+E000 - U+FFFF       EE - EF      80 - BF       80 - BF
    U+10000 - U+3FFFF     F0           90 - BF       80 - BF      80 - BF
    U+40000 - U+FFFFF     F1 - F3      80 - BF       80 - BF      80 - BF
    U+100000 - U+10FFFF   F4           80 - 8F       80 - BF      80 - BF
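The second-byte column of the table can be expressed as a small lookup. This is only an illustrative sketch; `utf8_second_byte_range` is a hypothetical name and takes the lead byte as an integer (for example from ord()).

```php
<?php
// Hypothetical helper: returns [min, max] for the byte that must follow
// $lead (an integer 0-255), or null when $lead does not start a
// multi-byte sequence (ASCII, stray trail bytes, C0, C1, F5 and above).
function utf8_second_byte_range($lead)
{
    if ($lead >= 0xC2 && $lead <= 0xDF) { return [0x80, 0xBF]; }
    if ($lead === 0xE0) { return [0xA0, 0xBF]; } // excludes non-shortest forms
    if ($lead === 0xED) { return [0x80, 0x9F]; } // excludes surrogates
    if ($lead >= 0xE1 && $lead <= 0xEF) { return [0x80, 0xBF]; }
    if ($lead === 0xF0) { return [0x90, 0xBF]; } // excludes non-shortest forms
    if ($lead >= 0xF1 && $lead <= 0xF3) { return [0x80, 0xBF]; }
    if ($lead === 0xF4) { return [0x80, 0x8F]; } // stops at U+10FFFF
    return null;
}

$range = utf8_second_byte_range(ord("\xE0"));
```

Only the second byte is constrained by the lead byte; third and fourth bytes are always 80 - BF, as the table shows.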

How to replace an invalid byte sequence without breaking valid characters is described in "3.1.1 Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations and in "Table 3-8. Use of U+FFFD in UTF-8 Conversion" in the Unicode Standard.

The Unicode Standard shows an example:

    before: <61 F1 80 80 E1 80 C2 62 80 63 80 BF 64>
    after:  <0061 FFFD FFFD FFFD 0062 FFFD 0063 FFFD FFFD 0064>
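The example can be turned into a fixture for checking any replacement function. The sketch below assumes PHP 5.4 or later for hex2bin(); `matches_unicode_example` is a hypothetical helper name. In the input, F1 80 80 and E1 80 are truncated sequences and C2 is a lone lead byte (one U+FFFD each), while the stray trail bytes 80, 80 and BF each become their own U+FFFD.

```php
<?php
// Hypothetical helper: returns true when $replacer turns the "before"
// bytes of the Unicode Standard example into the "after" code points.
function matches_unicode_example($replacer)
{
    $fffd = "\xEF\xBF\xBD"; // U+FFFD as UTF-8
    $before = hex2bin(str_replace(' ', '', '61 F1 80 80 E1 80 C2 62 80 63 80 BF 64'));
    $after = 'a' . str_repeat($fffd, 3) . 'b' . $fffd . 'c' . str_repeat($fffd, 2) . 'd';

    return call_user_func($replacer, $before) === $after;
}

// A function that returns its input unchanged does not pass the check.
$identity_passes = matches_unicode_example(function ($s) { return $s; });
```

Any callable that takes a string and returns a string can be passed in, so the same check can be applied to each candidate implementation.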

Here is an implementation using preg_replace_callback() that follows the above rule.

    function replace_invalid_byte_sequence5($str)
    {
        // REPLACEMENT CHARACTER (U+FFFD)
        $substitute = "\xEF\xBF\xBD";
        $regex = '/
          ([\x00-\x7F]                       #   U+0000 -   U+007F
          |[\xC2-\xDF][\x80-\xBF]            #   U+0080 -   U+07FF
          | \xE0[\xA0-\xBF][\x80-\xBF]       #   U+0800 -   U+0FFF
          |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} #   U+1000 -   U+CFFF
          | \xED[\x80-\x9F][\x80-\xBF]       #   U+D000 -   U+D7FF
          | \xF0[\x90-\xBF][\x80-\xBF]{2}    #  U+10000 -  U+3FFFF
          |[\xF1-\xF3][\x80-\xBF]{3}         #  U+40000 -  U+FFFFF
          | \xF4[\x80-\x8F][\x80-\xBF]{2})   # U+100000 - U+10FFFF
          |(\xE0[\xA0-\xBF]                  #   U+0800 -   U+0FFF (invalid)
          |[\xE1-\xEC\xEE\xEF][\x80-\xBF]    #   U+1000 -   U+CFFF (invalid)
          | \xED[\x80-\x9F]                  #   U+D000 -   U+D7FF (invalid)
          | \xF0[\x90-\xBF][\x80-\xBF]?      #  U+10000 -  U+3FFFF (invalid)
          |[\xF1-\xF3][\x80-\xBF]{1,2}       #  U+40000 -  U+FFFFF (invalid)
          | \xF4[\x80-\x8F][\x80-\xBF]?)     # U+100000 - U+10FFFF (invalid)
          |(.)                               # invalid 1-byte
        /xs';

        // $matches[1]: valid character
        // $matches[2]: invalid 3-byte or 4-byte character
        // $matches[3]: invalid 1-byte
        $ret = preg_replace_callback($regex, function($matches) use($substitute) {
            if (isset($matches[2]) || isset($matches[3])) {
                return $substitute;
            }

            return $matches[1];
        }, $str);

        return $ret;
    }

Alternatively, you can compare bytes directly, which avoids preg_match's restriction on byte size.

    function replace_invalid_byte_sequence6($str)
    {
        $size = strlen($str);
        $substitute = "\xEF\xBF\xBD";
        $ret = '';

        $pos = 0;
        // Filled in by reference on each call to utf8_get_next_char().
        $char = '';
        $char_size = 0;
        $valid = false;

        while (utf8_get_next_char($str, $size, $pos, $char, $char_size, $valid)) {
            $ret .= $valid ? $char : $substitute;
        }

        return $ret;
    }

    function utf8_get_next_char($str, $str_size, &$pos, &$char, &$char_size, &$valid)
    {
        $valid = false;

        if ($str_size <= $pos) {
            return false;
        }

        if ($str[$pos] < "\x80") {
            // ASCII
            $valid = true;
            $char_size = 1;
        } else if ($str[$pos] < "\xC2") {
            // stray trail byte, or C0/C1 (non-shortest form)
            $char_size = 1;
        } else if ($str[$pos] < "\xE0") {
            // 2-byte sequence
            if (!isset($str[$pos+1]) || $str[$pos+1] < "\x80" || "\xBF" < $str[$pos+1]) {
                $char_size = 1;
            } else {
                $valid = true;
                $char_size = 2;
            }
        } else if ($str[$pos] < "\xF0") {
            // 3-byte sequence; the second byte range depends on the lead byte
            $left = "\xE0" === $str[$pos] ? "\xA0" : "\x80";
            $right = "\xED" === $str[$pos] ? "\x9F" : "\xBF";

            if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {
                $char_size = 1;
            } else if (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {
                $char_size = 2;
            } else {
                $valid = true;
                $char_size = 3;
            }
        } else if ($str[$pos] < "\xF5") {
            // 4-byte sequence; the second byte range depends on the lead byte
            $left = "\xF0" === $str[$pos] ? "\x90" : "\x80";
            $right = "\xF4" === $str[$pos] ? "\x8F" : "\xBF";

            if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {
                $char_size = 1;
            } else if (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {
                $char_size = 2;
            } else if (!isset($str[$pos+3]) || $str[$pos+3] < "\x80" || "\xBF" < $str[$pos+3]) {
                $char_size = 3;
            } else {
                $valid = true;
                $char_size = 4;
            }
        } else {
            // F5 and above never appear in well-formed UTF-8
            $char_size = 1;
        }

        $char = substr($str, $pos, $char_size);
        $pos += $char_size;

        return true;
    }

The test cases are here.

    function run(array $callables, array $arguments)
    {
        return array_map(function($callable) use($arguments) {
            return array_map($callable, $arguments);
        }, $callables);
    }

    $data = [
        // Table 3-8. Use of U+FFFD in UTF-8 Conversion
        // (http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf)
        "\x61"."\xF1\x80\x80"."\xE1\x80"."\xC2"."\x62"."\x80"."\x63"
        ."\x80"."\xBF"."\x64",

        // 'FULL MOON SYMBOL' (U+1F315) and invalid byte sequence
        "\xF0\x9F\x8C\x95"."\xF0\x9F\x8C"."\xF0\x9F\x8C"
    ];

    var_dump(run([
        'replace_invalid_byte_sequence',
        'replace_invalid_byte_sequence2',
        'replace_invalid_byte_sequence3',
        'replace_invalid_byte_sequence4',
        'replace_invalid_byte_sequence5',
        'replace_invalid_byte_sequence6'
    ], $data));

Note that mb_convert_encoding has a bug: it breaks a valid character that comes just after an invalid byte sequence, or removes an invalid byte sequence that follows valid characters without adding U+FFFD.

    $data = [
        // U+20AC
        "\xE2\x82\xAC"."\xE2\x82\xAC"."\xE2\x82\xAC",
        "\xE2\x82"    ."\xE2\x82\xAC"."\xE2\x82\xAC",

        // U+24B62
        "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
        "\xF0\xA4\xAD"    ."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
        "\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",

        // 'FULL MOON SYMBOL' (U+1F315)
        "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C",
        "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C" . "\xF0\x9F\x8C"
    ];

Although preg_match() can be used instead of preg_replace_callback(), this function has a limitation on byte size. See bug report #36463 for details. You can confirm it with the following test case.

str_repeat('a', 10000)
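Whatever the cause of a PCRE failure, the preg functions report it through their return value (null for preg_replace_callback()), so a long input can be lost silently if the result is used unchecked. A minimal sketch of a guard follows; `checked_preg_replace_callback` is a hypothetical wrapper name, not a PHP function.

```php
<?php
// Hypothetical wrapper: same arguments as preg_replace_callback(), but a
// failed call raises an exception instead of quietly returning null.
function checked_preg_replace_callback($pattern, $callback, $subject)
{
    $result = preg_replace_callback($pattern, $callback, $subject);

    if ($result === null) {
        // preg_last_error() returns one of the PREG_*_ERROR constants.
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }

    return $result;
}

$replaced = checked_preg_replace_callback('/a/', function ($m) { return 'b'; }, 'aaa');
```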

Finally, my benchmark results are as follows.

    mb_convert_encoding()      0.19628190994263
    htmlspecialchars()         0.082863092422485
    UConverter::transcode()    0.15999984741211
    UConverter::convert()      0.29843020439148
    preg_replace_callback()    0.63967490196228
    direct comparison          0.71933102607727

The benchmark code is here.

    function timer(array $callables, array $arguments, $repeat = 10000)
    {
        $ret = [];
        $save = $repeat;

        foreach ($callables as $key => $callable) {
            $start = microtime(true);

            do {
                array_map($callable, $arguments);
            } while($repeat -= 1);

            $stop = microtime(true);
            $ret[$key] = $stop - $start;
            $repeat = $save;
        }

        return $ret;
    }

    $functions = [
        'mb_convert_encoding()' => 'replace_invalid_byte_sequence',
        'htmlspecialchars()' => 'replace_invalid_byte_sequence2',
        'UConverter::transcode()' => 'replace_invalid_byte_sequence3',
        'UConverter::convert()' => 'replace_invalid_byte_sequence4',
        'preg_replace_callback()' => 'replace_invalid_byte_sequence5',
        'direct comparison' => 'replace_invalid_byte_sequence6'
    ];

    foreach (timer($functions, $data) as $description => $time) {
        echo $description, PHP_EOL,
             $time, PHP_EOL;
    }
