Convert all types of smart quotes with PHP

2021-12-25 00:00:00 unicode double-quotes replace php html

I am trying to convert all types of smart quotes to regular quotes when working with text. However, the following function I've compiled still seems to be lacking support and proper design.

Does anyone know how to properly get all quote characters converted?

function convert_smart_quotes($string)
{
    $quotes = array(
        "\xC2\xAB"     => '"', // « (U+00AB) in UTF-8
        "\xC2\xBB"     => '"', // » (U+00BB) in UTF-8
        "\xE2\x80\x98" => "'", // ‘ (U+2018) in UTF-8
        "\xE2\x80\x99" => "'", // ’ (U+2019) in UTF-8
        "\xE2\x80\x9A" => "'", // ‚ (U+201A) in UTF-8
        "\xE2\x80\x9B" => "'", // ‛ (U+201B) in UTF-8
        "\xE2\x80\x9C" => '"', // “ (U+201C) in UTF-8
        "\xE2\x80\x9D" => '"', // ” (U+201D) in UTF-8
        "\xE2\x80\x9E" => '"', // „ (U+201E) in UTF-8
        "\xE2\x80\x9F" => '"', // ‟ (U+201F) in UTF-8
        "\xE2\x80\xB9" => "'", // ‹ (U+2039) in UTF-8
        "\xE2\x80\xBA" => "'", // › (U+203A) in UTF-8
    );
    $string = strtr($string, $quotes);

    // Version 2
    $search = array(
        chr(145),
        chr(146),
        chr(147),
        chr(148),
        chr(151)
    );
    $replace = array("'","'",'"','"',' - ');
    $string = str_replace($search, $replace, $string);

    // Version 3
    $string = str_replace(
        array('‘','’','“','”'),
        array("'", "'", '"', '"'),
        $string
    );

    // Version 4
    $search = array(
        '‘', 
        '’', 
        '“', 
        '”', 
        '—',
        '–',
    );
    $replace = array("'","'",'"','"',' - ', '-');
    $string = str_replace($search, $replace, $string);

    return $string;
}

Note: This question is a complete query about the full gamut of quotes, including the "Microsoft" quotes asked about here. This is a "duplicate" in the same way that asking about all tire sizes is a "duplicate" of asking for a car tire size.

Recommended answer

You need something like this (assuming UTF-8 input, and ignoring CJK (Chinese, Japanese, Korean)):

$chr_map = array(
   // Windows codepage 1252
   "\xC2\x82" => "'", // U+0082⇒U+201A single low-9 quotation mark
   "\xC2\x84" => '"', // U+0084⇒U+201E double low-9 quotation mark
   "\xC2\x8B" => "'", // U+008B⇒U+2039 single left-pointing angle quotation mark
   "\xC2\x91" => "'", // U+0091⇒U+2018 left single quotation mark
   "\xC2\x92" => "'", // U+0092⇒U+2019 right single quotation mark
   "\xC2\x93" => '"', // U+0093⇒U+201C left double quotation mark
   "\xC2\x94" => '"', // U+0094⇒U+201D right double quotation mark
   "\xC2\x9B" => "'", // U+009B⇒U+203A single right-pointing angle quotation mark

   // Regular Unicode     // U+0022 quotation mark (")
                          // U+0027 apostrophe     (')
   "\xC2\xAB"     => '"', // U+00AB left-pointing double angle quotation mark
   "\xC2\xBB"     => '"', // U+00BB right-pointing double angle quotation mark
   "\xE2\x80\x98" => "'", // U+2018 left single quotation mark
   "\xE2\x80\x99" => "'", // U+2019 right single quotation mark
   "\xE2\x80\x9A" => "'", // U+201A single low-9 quotation mark
   "\xE2\x80\x9B" => "'", // U+201B single high-reversed-9 quotation mark
   "\xE2\x80\x9C" => '"', // U+201C left double quotation mark
   "\xE2\x80\x9D" => '"', // U+201D right double quotation mark
   "\xE2\x80\x9E" => '"', // U+201E double low-9 quotation mark
   "\xE2\x80\x9F" => '"', // U+201F double high-reversed-9 quotation mark
   "\xE2\x80\xB9" => "'", // U+2039 single left-pointing angle quotation mark
   "\xE2\x80\xBA" => "'", // U+203A single right-pointing angle quotation mark
);
$chr = array_keys  ($chr_map); // but: for efficiency you should
$rpl = array_values($chr_map); // pre-calculate these two arrays
$str = str_replace($chr, $rpl, html_entity_decode($str, ENT_QUOTES, "UTF-8"));

Here is the background:

Every Unicode character belongs to exactly one "General Category", of which the ones that can contain quote characters are the following:

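  • Ps "Punctuation, Open"
  • Pe "Punctuation, Close"
  • Pi "Punctuation, Initial quote (may behave like Ps or Pe depending on usage)"
  • Pf "Punctuation, Final quote (may behave like Ps or Pe depending on usage)"
  • Po "Punctuation, Other"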

(these pages are handy for checking that you didn't miss anything - there is also an index of categories)

It is sometimes useful to match these categories in a Unicode-enabled regex.
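
As a rough illustration of such a match, the sketch below uses PCRE's \p{..} escapes together with the /u modifier. The function name find_quote_like_chars is hypothetical, and limiting the character class to Pi and Pf is an assumption: Ps, Pe and Po also contain quote characters, but they mostly hold brackets and other punctuation, so they would need filtering by hand.

```php
<?php
// Hypothetical helper: list the distinct characters of a UTF-8 string that fall
// in the general categories Pi (initial quote) or Pf (final quote).
function find_quote_like_chars(string $utf8): array
{
    // The /u modifier makes PCRE treat both pattern and subject as UTF-8.
    preg_match_all('/[\p{Pi}\p{Pf}]/u', $utf8, $matches);

    // Keep each character once, in order of first appearance.
    return array_values(array_unique($matches[0]));
}

// "\u{...}" escapes need PHP 7.0 or later.
$found = find_quote_like_chars("\u{201C}Hi\u{201D} \u{00AB}x\u{00BB} 'plain'");
```

The ASCII apostrophe is not reported here because it belongs to Po, which is one reason a hand-written map is usually easier to control than a category match.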

Furthermore, Unicode characters have "properties", of which the one you are interested in is Quotation_Mark. Unfortunately, these are not accessible in a regex.

In Wikipedia you can find the group of characters with the Quotation_Mark property. The final reference is PropList.txt on unicode.org, but this is an ASCII textfile.

In case you need to translate CJK characters too, you only have to get their code points, decide their translation, and find their UTF-8 encoding, e.g., by looking it up in fileformat.info (e.g., for U+301E: http://www.fileformat.info/info/unicode/char/301e/index.htm).
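
On PHP 7.0 or later the lookup can also be done in code, since a "\u{...}" escape in a double-quoted string produces the UTF-8 bytes directly. The sketch below prints the encoding of U+301E and extends the idea to a small map. The name $cjk_map and the ASCII replacements chosen in it are assumptions for illustration only; the right translation depends on your text.

```php
<?php
// UTF-8 encoding of U+301E as upper-case hex, without consulting a lookup site.
$bytes = strtoupper(bin2hex("\u{301E}"));

// Hypothetical CJK additions; the ASCII replacements are only one possible choice.
$cjk_map = array(
    "\u{300C}" => "'", // U+300C left corner bracket
    "\u{300D}" => "'", // U+300D right corner bracket
    "\u{300E}" => '"', // U+300E left white corner bracket
    "\u{300F}" => '"', // U+300F right white corner bracket
    "\u{301D}" => '"', // U+301D reversed double prime quotation mark
    "\u{301E}" => '"', // U+301E double prime quotation mark
    "\u{301F}" => '"', // U+301F low double prime quotation mark
);

$out = strtr("\u{300C}abc\u{300D}", $cjk_map);
```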

Regarding Windows codepage 1252: Unicode defines the first 256 code points to represent exactly the same characters as ISO-8859-1, but ISO-8859-1 is often confused with Windows codepage 1252, so that all browsers render the range 0x80-0x9F, which is "empty" in ISO-8859-1 (more exactly: it contains control characters), as if it were Windows codepage 1252. The table in the Wikipedia page lists the Unicode equivalents.

Note: strtr() is often slower than str_replace(). Time it with your input and PHP version. If it is fast enough, you can use a map like my $chr_map directly.
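
A minimal sketch of the two call styles, using an abbreviated map ($small_map and $input are made up for the example). strtr() takes the map as it is and scans the string once, while str_replace() needs separate key and value arrays and applies the pairs one after another. For maps like this one, where the replacements are plain ASCII and cannot create new matches, both should give the same result.

```php
<?php
// Abbreviated map for illustration; in practice the full map would be used.
$small_map = array(
    "\xE2\x80\x98" => "'", // U+2018 left single quotation mark
    "\xE2\x80\x99" => "'", // U+2019 right single quotation mark
    "\xE2\x80\x9C" => '"', // U+201C left double quotation mark
    "\xE2\x80\x9D" => '"', // U+201D right double quotation mark
);

// Sample input: curly double quotes around a sentence with a curly apostrophe.
$input = "\xE2\x80\x9CIt\xE2\x80\x99s fine\xE2\x80\x9D";

// Map passed directly.
$via_strtr = strtr($input, $small_map);

// Map split into search and replacement arrays.
$via_str_replace = str_replace(
    array_keys($small_map),
    array_values($small_map),
    $input
);
```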

If you are not sure that your input is UTF-8 encoded, AND are willing to assume that if it's not, then it's ISO-8859-1 or Windows codepage 1252, then you can do this before anything else:

if ( !preg_match('/^\X*$/u', $str)) {
   $str = utf8_encode($str);
}
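
Note that utf8_encode() is deprecated as of PHP 8.2 and only ever converted from ISO-8859-1. A possible alternative, assuming the mbstring extension is available, is sketched below. The function name to_utf8 is hypothetical, and treating every non-UTF-8 input as Windows-1252 is the same assumption as above.

```php
<?php
// Hypothetical helper; requires the mbstring extension.
function to_utf8(string $str): string
{
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str; // already valid UTF-8, leave untouched
    }

    // Assumption: anything that is not valid UTF-8 is Windows-1252.
    return mb_convert_encoding($str, 'UTF-8', 'Windows-1252');
}

// CP-1252 bytes for: left double quote, "Grüße", right double quote.
$converted = to_utf8("\x93Gr\xFC\xDFe\x94");
```

Because the conversion starts from Windows-1252 rather than ISO-8859-1, bytes such as 0x93 come out as the real quotation mark code points (U+201C here) instead of the C1 control range.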

Warning: this regex can in very rare cases fail to detect a non-UTF-8 encoding, though. E.g.: "Gruß…"/*CP-1252*/=="Gru\xDF\x85" looks like UTF-8 to this regex (U+07C5 is the N'ko digit 5). This regex can be slightly enhanced, but unfortunately it can be shown that there exists NO completely foolproof solution to the problem of encoding detection.
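
The false positive can be reproduced with a small sketch. The wrapper name looks_like_utf8 is hypothetical; the pattern is the one used above.

```php
<?php
// Hypothetical wrapper around the check used above.
function looks_like_utf8(string $str): bool
{
    // With /u, preg_match() returns false when the subject is not valid UTF-8,
    // so only a return value of exactly 1 counts as "looks like UTF-8".
    return preg_match('/^\X*$/u', $str) === 1;
}

// "Gruß" in CP-1252: 0xDF is a UTF-8 lead byte with no continuation byte after it.
$latin1 = "Gru\xDF";

// "Gruß…" in CP-1252: 0xDF 0x85 happens to be the valid UTF-8 sequence for U+07C5.
$ambiguous = "Gru\xDF\x85";

$a = looks_like_utf8($latin1);
$b = looks_like_utf8($ambiguous);
```

Here $a is false, so the first string is correctly flagged as non-UTF-8, while $b is true even though the bytes were meant as CP-1252.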

If you want to normalize the range 0x80-0x9F that stems from Windows codepage 1252 to regular Unicode codepoints, you can do this (and remove the first part of the $chr_map above):

$normalization_map = array(
   "\xC2\x80" => "\xE2\x82\xAC", // U+20AC Euro sign
   "\xC2\x82" => "\xE2\x80\x9A", // U+201A single low-9 quotation mark
   "\xC2\x83" => "\xC6\x92",     // U+0192 latin small letter f with hook
   "\xC2\x84" => "\xE2\x80\x9E", // U+201E double low-9 quotation mark
   "\xC2\x85" => "\xE2\x80\xA6", // U+2026 horizontal ellipsis
   "\xC2\x86" => "\xE2\x80\xA0", // U+2020 dagger
   "\xC2\x87" => "\xE2\x80\xA1", // U+2021 double dagger
   "\xC2\x88" => "\xCB\x86",     // U+02C6 modifier letter circumflex accent
   "\xC2\x89" => "\xE2\x80\xB0", // U+2030 per mille sign
   "\xC2\x8A" => "\xC5\xA0",     // U+0160 latin capital letter s with caron
   "\xC2\x8B" => "\xE2\x80\xB9", // U+2039 single left-pointing angle quotation mark
   "\xC2\x8C" => "\xC5\x92",     // U+0152 latin capital ligature oe
   "\xC2\x8E" => "\xC5\xBD",     // U+017D latin capital letter z with caron
   "\xC2\x91" => "\xE2\x80\x98", // U+2018 left single quotation mark
   "\xC2\x92" => "\xE2\x80\x99", // U+2019 right single quotation mark
   "\xC2\x93" => "\xE2\x80\x9C", // U+201C left double quotation mark
   "\xC2\x94" => "\xE2\x80\x9D", // U+201D right double quotation mark
   "\xC2\x95" => "\xE2\x80\xA2", // U+2022 bullet
   "\xC2\x96" => "\xE2\x80\x93", // U+2013 en dash
   "\xC2\x97" => "\xE2\x80\x94", // U+2014 em dash
   "\xC2\x98" => "\xCB\x9C",     // U+02DC small tilde
   "\xC2\x99" => "\xE2\x84\xA2", // U+2122 trade mark sign
   "\xC2\x9A" => "\xC5\xA1",     // U+0161 latin small letter s with caron
   "\xC2\x9B" => "\xE2\x80\xBA", // U+203A single right-pointing angle quotation mark
   "\xC2\x9C" => "\xC5\x93",     // U+0153 latin small ligature oe
   "\xC2\x9E" => "\xC5\xBE",     // U+017E latin small letter z with caron
   "\xC2\x9F" => "\xC5\xB8",     // U+0178 latin capital letter y with diaeresis
);
$chr = array_keys  ($normalization_map); // but: for efficiency you should
$rpl = array_values($normalization_map); // pre-calculate these two arrays
$str = str_replace($chr, $rpl, $str);
