Decode numeric HTML entities via PHP

2021-12-28 00:00:00 utf-8 character-encoding php html

I have this code to decode numeric HTML entities to their UTF-8 equivalent characters.

I'm trying to convert this character:

&#146;

Which should output:

’

However, it just disappears (no output). (I've checked the source of the page, and it has the correct UTF-8 character set headers/meta tags.)

Does anyone know what is wrong with the code?

function entity_decode($string, $quote_style = ENT_COMPAT, $charset = "UTF-8") {
    // Decode named entities and whatever numeric entities PHP accepts.
    $string = html_entity_decode($string, $quote_style, $charset);

    // Decode remaining hexadecimal (&#x...;) and decimal (&#...;) entities by hand.
    // preg_replace_callback is used for both because the /e modifier no longer exists in PHP 7+.
    $string = preg_replace_callback('~&#x([0-9a-f]+);~i', "chr_utf8_callback", $string);
    $string = preg_replace_callback('~&#([0-9]+);~', function ($m) {
        return chr_utf8((int) $m[1]);
    }, $string);

    // this is another method, which also doesn't work..
    //$string = preg_replace_callback("/(&#[0-9]+;)/", "entity_decode_callback", $string);

    return $string;
}

function chr_utf8_callback($matches) {
    return chr_utf8(hexdec($matches[1]));
}

// Encode a Unicode code point as a UTF-8 byte sequence (1 to 4 bytes).
function chr_utf8($num) {
    if ($num < 128) return chr($num);
    if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
    if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    return '';
}

// Note: the "HTML-ENTITIES" encoding is deprecated as of PHP 8.2.
function entity_decode_callback($m) {
    return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}

echo '=' . entity_decode('&#146;');

Recommended answer

html_entity_decode already does what you're looking for:

$string = '&#146;';

echo html_entity_decode($string, ENT_COMPAT, 'UTF-8');

It will return the character:

’   binary hex: c292

Which is PRIVATE USE TWO (U+0092), a C1 control character with no visible glyph. Because of that it can look as if nothing was output, and depending on your PHP configuration/version/compile it might not be returned at all.
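To check what a decoded string actually contains, it can help to dump its bytes instead of echoing it. The following is only a minimal sketch, assuming PHP 7 or later (for the "\u{...}" escape); the variable names are illustrative and not from the original post.

```php
<?php
// U+0092 written directly as a UTF-8 string (PHP 7+ escape syntax).
$char = "\u{92}";

// Hex dump of the raw bytes; echoing the character itself shows nothing visible.
$hex = bin2hex($char);

// \p{Cc} matches Unicode control characters when the /u modifier is set.
$isControl = preg_match('/^\p{Cc}$/u', $char) === 1;

echo $hex, "\n";       // c292
var_dump($isControl);  // bool(true)
```

The two bytes c2 92 match the hex value quoted above, which suggests the character is present in the output but simply has nothing to draw.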

There are also some quirks:

But in HTML (other than XHTML, which uses XML rules), it's a long-standing browser quirk that character references in the range &#128; to &#159; are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.
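If the goal is to get the same character a browser would show (’ for &#146;), one option is to remap references in the 128 to 159 range to their cp1252 equivalents before decoding. The sketch below is one possible approach, not the method from the original answer: decode_entities_cp1252 is a hypothetical helper name, and the mapping table follows the Windows-1252 layout described above.

```php
<?php
// Hypothetical helper, not part of the original post.
function decode_entities_cp1252(string $string): string
{
    // Windows-1252 bytes 0x80-0x9F mapped to the Unicode code points they
    // represent. Bytes undefined in cp1252 (0x81, 0x8D, 0x8F, 0x90, 0x9D)
    // are left out, so those references are not remapped.
    $map = [
        0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
        0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
        0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
        0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
        0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
        0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
        0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
    ];

    // Rewrite decimal or hexadecimal references in the 128-159 range so they
    // point at the mapped code point, e.g. &#146; becomes &#8217;.
    $string = preg_replace_callback(
        '~&#(?:x([0-9a-f]{1,6})|([0-9]{1,7}));~i',
        function (array $m) use ($map): string {
            $code = $m[1] !== '' ? hexdec($m[1]) : (int) $m[2];
            return isset($map[$code]) ? '&#' . $map[$code] . ';' : $m[0];
        },
        $string
    ) ?? $string;

    // Let PHP decode everything else (named and remaining numeric entities).
    return html_entity_decode($string, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo bin2hex(decode_entities_cp1252('&#146;')), "\n"; // e28099 (U+2019)
```

An explicit table is used here rather than a charset conversion call so that the behaviour for the five undefined cp1252 bytes is spelled out instead of depending on the conversion library.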

See: "&#146; is getting converted as \u0092 by nokogiri in ruby on rails"
