Unicode string with diacritics, split by chars
I have this Unicode string: Ааа́Ббб́Ввв́ГгҐґДд
I want to split it into chars.
Right now, if I loop through all the chars, I get something like this:
A a a ' Б ...
Is there a way to properly split this string into chars, like this?
А а а́ ...
Recommended answer
To do this properly, what you want is the algorithm for working out grapheme cluster boundaries, as defined in UAX #29 (Unicode Text Segmentation). Unfortunately, this requires knowing which characters are members of which classes, as listed in the Unicode Character Database, and JavaScript doesn't make that information available(*). So you'd have to include a copy of the UCD with your script, which would make it pretty bulky.
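As a hedged aside: JavaScript engines released after this answer was written may expose UAX #29 segmentation through `Intl.Segmenter`. The sketch below assumes such an engine and feature-detects it, falling back to the basic combining-marks regex otherwise. The function name `splitGraphemes` is made up for illustration.

```javascript
// Sketch only: relies on Intl.Segmenter where the runtime provides it,
// which is an assumption about your environment.
function splitGraphemes(s) {
    if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
        // granularity 'grapheme' asks for grapheme cluster boundaries
        var segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
        // Each item yielded by segment() has a .segment string property
        return Array.from(segmenter.segment(s), function (part) {
            return part.segment;
        });
    }
    // Fallback: one code unit plus any marks from U+0300-U+036F
    return s.match(/.[\u0300-\u036F]*/g) || [];
}

splitGraphemes('Ааа́Ббб́');
// ["А", "а", "а́", "Б", "б", "б́"]
```

Where the segmenter is missing, this is no better than the regex approach, so treat the fallback as a stopgap rather than an equivalent.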
An alternative, if you only need to worry about the basic accents used by Latin or Cyrillic, would be to handle only the Combining Diacritical Marks block (U+0300-U+036F). This would fail for other languages and symbols, but might be enough for what you want to do.
function findGraphemesNotVeryWell(s) {
    // One character (UTF-16 code unit), followed by any number of
    // combining marks from the U+0300-U+036F block.
    var re = /.[\u0300-\u036F]*/g;
    var match, matches = [];
    while ((match = re.exec(s)) !== null)
        matches.push(match[0]);
    return matches;
}

findGraphemesNotVeryWell('Ааа́Ббб́Ввв́ГгҐґДд');
// ["А", "а", "а́", "Б", "б", "б́", "В", "в", "в́", "Г", "г", "Ґ", "ґ", "Д", "д"]
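If your engine supports Unicode property escapes in regular expressions (an assumption; this needs the `u` flag), a slightly broader variant of the same idea is to match any combining mark via `\p{M}` instead of a single block. This is still only a sketch and not full UAX 29: it knows nothing about emoji ZWJ sequences, Hangul jamo and so on. The name `splitByCombiningMarks` is made up for illustration.

```javascript
// Sketch only: any code point followed by any number of combining marks.
function splitByCombiningMarks(s) {
    // u: treat the pattern and input as code points, so `.` consumes a
    //    whole astral character rather than half a surrogate pair
    // s: let `.` match line terminators as well
    // \p{M}: the Unicode general category "Mark" (Mn, Mc, Me)
    return s.match(/.\p{M}*/gsu) || [];
}

splitByCombiningMarks('Ааа́Ббб́');
// ["А", "а", "а́", "Б", "б", "б́"]
```

Unlike the fixed U+0300-U+036F range, this also keeps marks from other blocks (for example Hebrew points or enclosing marks) attached to their base character.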
(*: there might be a way to extract the information by letting the browser render the string and measuring the positions of selections in it, but it would surely be very messy and difficult to get working cross-browser.)