Is PHP str_word_count() multibyte safe?

2021-12-28 00:00:00 utf-8 php utf

I want to use str_word_count() on a UTF-8 string.

Is this safe in PHP? It seems to me that it should be (especially considering that there is no mb_str_word_count()).

But on php.net there are a lot of people muddying the water by presenting their own 'multibyte compatible' versions of the function.

So I guess I want to know...

Given that str_word_count simply counts all character sequences delimited by " " (space), it should be safe on multibyte strings, even though it's not necessarily aware of the character sequences, right?

Are there any equivalent 'space' characters in UTF-8 which are not ASCII " " (space)?

I guess that is where the problem lies.

Recommended answer

I'd say you guess right. And indeed there are space characters in UTF-8 which are not part of US-ASCII. To give you an example of such spaces:

Unicode Character 'NO-BREAK SPACE' (U+00A0): 2 Bytes in UTF-8: 0xC2 0xA0 (c2a0)

And maybe also:

Unicode Character 'NEXT LINE (NEL)' (U+0085): 2 Bytes in UTF-8: 0xC2 0x85 (c285)

Unicode Character 'LINE SEPARATOR' (U+2028): 3 Bytes in UTF-8: 0xE2 0x80 0xA8 (e280a8)

Unicode Character 'PARAGRAPH SEPARATOR' (U+2029): 3 Bytes in UTF-8: 0xE2 0x80 0xA9 (e280a9)
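The byte sequences listed above can be checked directly in PHP. The following is a minimal sketch, assuming PHP 7.0 or later for the "\u{...}" escape syntax; the array name and labels are only illustrative and not part of the original answer.

```php
<?php
// Illustrative only: $spaces and its labels are made up for this sketch.
// "\u{...}" in a double-quoted string yields the UTF-8 encoding of the code point.
$spaces = [
    'NO-BREAK SPACE (U+00A0)'      => "\u{00A0}",
    'NEXT LINE (U+0085)'           => "\u{0085}",
    'LINE SEPARATOR (U+2028)'      => "\u{2028}",
    'PARAGRAPH SEPARATOR (U+2029)' => "\u{2029}",
];

foreach ($spaces as $name => $char) {
    // strlen() counts bytes, bin2hex() shows them in hexadecimal.
    printf("%-30s %d bytes: %s\n", $name, strlen($char), bin2hex($char));
}
```

None of these characters is the single byte 0x20, so code that only looks for the ASCII space will not treat them as separators.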

Anyway, the first one - the 'NO-BREAK SPACE' (U+00A0) - is a good example as it is also part of Latin-X charsets. And the PHP manual already provides a hint that str_word_count would be locale dependent.

If we want to put this to a test, we can set the locale to UTF-8 and pass in an invalid string containing a \xA0 byte. If this still counts as a word-breaking character, the function is clearly not UTF-8 safe, hence not multibyte safe (a term that is just as undefined here as it is in the question):

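<?php
/**
 * is PHP str_word_count() multibyte safe?
 * @link https://stackoverflow.com/q/8290537/367456
 */

echo 'New Locale: ', setlocale(LC_ALL, 'en_US.utf8'), "\n";

// "\xA0" is a lone byte: NO-BREAK SPACE in Latin-1, but not valid UTF-8 on its own.
$test = "aword\xA0bword aword";

// Format 2 returns an array keyed by the byte offset of each word.
$result = str_word_count($test, 2);

var_dump($result);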

Output:

New Locale: en_US.utf8
array(3) {
  [0]=>
  string(5) "aword"
  [6]=>
  string(5) "bword"
  [12]=>
  string(5) "aword"
}

As this demo shows, the function totally fails on the locale promise it gives on the manual page (I neither wonder nor moan about this; most often, if you read that a function is locale specific in PHP, run for your life and find one that is not). I exploit that here to demonstrate that it does nothing at all regarding the UTF-8 character encoding.
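To see why the test string in the demo counts as invalid UTF-8, here is a minimal sketch of a validity check. It assumes the PCRE extension is available and relies on preg_match() returning false when the /u modifier is given a subject that is not valid UTF-8; the helper name is_valid_utf8 is hypothetical.

```php
<?php
// Hypothetical helper: the empty pattern matches any string, so with /u
// preg_match() returns 1 for valid UTF-8 and false for invalid UTF-8.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// A lone 0xA0 byte is a continuation byte without a lead byte: invalid.
var_dump(is_valid_utf8("aword\xA0bword aword"));     // bool(false)

// A properly encoded NO-BREAK SPACE (0xC2 0xA0) is valid UTF-8.
var_dump(is_valid_utf8("aword\u{00A0}bword aword")); // bool(true)
```

A function that really worked on UTF-8 would have no reason to treat the lone 0xA0 byte as a space, which is what makes it a useful probe.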

For UTF-8 you should instead take a look at the PCRE extension:

Matching Unicode letter characters in PCRE/PHP

PCRE in PHP specifically has a good understanding of Unicode and UTF-8. It can also be quite fast if you craft the regular expression pattern carefully.
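As a rough illustration of the PCRE approach, the sketch below counts words in a UTF-8 string. The function name utf8_word_count and the definition of a word used here (a Unicode letter followed by letters, combining marks, apostrophes or hyphens) are assumptions made for this example, not something the answer prescribes; adjust the pattern to your own notion of a word.

```php
<?php
// Hypothetical helper, not part of PHP: counts "words" in a UTF-8 string.
// Returns null when the input is not valid UTF-8 (preg_match_all() fails).
function utf8_word_count(string $text): ?int
{
    // \p{L} = any Unicode letter, \p{M} = combining mark; /u enables UTF-8 mode.
    $count = preg_match_all('/\p{L}[\p{L}\p{M}\'-]*/u', $text);

    return $count === false ? null : $count;
}

// A properly encoded NO-BREAK SPACE separates words here.
var_dump(utf8_word_count("aword\u{00A0}bword aword")); // int(3)

// Apostrophes and hyphens inside a word do not split it.
var_dump(utf8_word_count("it's well-known"));          // int(2)

// Invalid UTF-8 is reported rather than silently miscounted.
var_dump(utf8_word_count("aword\xA0bword aword"));     // NULL
```

Returning null for invalid input is a design choice of this sketch; throwing an exception would be just as reasonable.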