str_word_count() for non-Latin words?

2021-12-30 00:00:00 count php

I'm trying to count the number of words in a variable written in a non-Latin language (Bulgarian), but it seems that str_word_count() does not count non-Latin words. The encoding of the PHP file is UTF-8.

$str = "текст на кирилица";
echo 'Number of words: ' . str_word_count($str); // this returns 0

Recommended answer

You can do it with a regular expression:

$str = "текст на кирилица";
echo 'Number of words: ' . count(preg_split('/\s+/', $str)); // \s+ matches one or more whitespace characters

Here I'm defining the word delimiter as whitespace characters. If anything else should be treated as a word delimiter, you'll need to add it to your regex.
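A minimal sketch of adding more delimiters, assuming that commas, periods and similar ASCII punctuation should also separate words. The function name count_words_split is hypothetical, and the PREG_SPLIT_NO_EMPTY flag is an addition of mine so that leading or trailing delimiters don't produce empty strings that get counted as words:

```php
<?php
// Hypothetical helper, not part of the original answer.
// Splits on runs of whitespace and a few ASCII punctuation marks
// (the delimiter set is an assumption; adjust it to your text).
function count_words_split(string $str): int
{
    // Limit -1 means "no limit"; PREG_SPLIT_NO_EMPTY drops empty pieces.
    $words = preg_split('/[\s,.;:!?]+/', $str, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure, so guard before counting.
    return $words === false ? 0 : count($words);
}

echo 'Number of words: ' . count_words_split("текст, на кирилица!");
```

Since every delimiter in this pattern is a single-byte ASCII character, it never matches inside a multibyte Cyrillic letter.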

Also note that since there are no UTF-8 multibyte characters in the regex itself (they appear only in the subject string), the /u modifier isn't required. But if you want some multibyte characters to act as delimiters, you'll need to add this modifier.
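To illustrate the case where /u does become necessary, here is a rough sketch that assumes guillemets («») should act as delimiters in addition to whitespace. The function name count_words_utf_delims and the choice of delimiters are hypothetical. Without /u the pattern is matched byte by byte, so a multibyte delimiter in a character class can match part of another character:

```php
<?php
// Hypothetical helper, not part of the original answer.
// The pattern contains multibyte characters (« and »), so the /u modifier
// is used to make PCRE treat pattern and subject as UTF-8.
function count_words_utf_delims(string $str): int
{
    // PREG_SPLIT_NO_EMPTY drops the empty pieces produced by
    // delimiters at the start or end of the string.
    $words = preg_split('/[\s«»]+/u', $str, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure (e.g. invalid UTF-8 input).
    return $words === false ? 0 : count($words);
}

echo 'Number of words: ' . count_words_utf_delims("«текст» на «кирилица»");
```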

Update:

If you want only Cyrillic letters to count as part of a word, you can use:

$str = "текст на 12453 кирилица";
echo 'Number of words: ' . count(preg_split('/[^А-Яа-яЁё]+/u', $str));
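One thing to watch with the split approach: if the string starts or ends with non-Cyrillic characters, preg_split() returns an empty string at that end, which inflates the count. As a possible alternative (a sketch, not the original answer's method), you could count matches instead of splitting. The function name count_cyrillic_words is hypothetical, and \p{Cyrillic} is assumed to be an acceptable stand-in for the explicit letter range; it covers the whole Cyrillic script, so it is somewhat broader than А-Яа-яЁё:

```php
<?php
// Hypothetical helper, not part of the original answer.
// Counts maximal runs of Cyrillic-script characters instead of splitting.
function count_cyrillic_words(string $str): int
{
    // preg_match_all() returns the number of full matches, or false on error.
    // /u is required for \p{...} to work on UTF-8 input.
    return (int) preg_match_all('/\p{Cyrillic}+/u', $str);
}

echo 'Number of words: ' . count_cyrillic_words("текст на 12453 кирилица");
```

With an input such as "123 текст" this counts one word, whereas counting the pieces of the split would give two because of the leading empty string.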
