Decoding and encoding JSON with Unicode characters in PHP

2021-12-26 00:00:00 unicode json character-encoding php

I have some json I need to decode, alter and then encode without messing up any characters.

If I have a Unicode character in a JSON string, it will not decode. I'm not sure why, since json.org says a string can contain: any-Unicode-character-except-"-or-\-or-control-character. But it doesn't work in Python either.

{"Tag":"Odómetro"}

I can use utf8_encode which will allow the string to be decoded with json_decode, however the character gets mangled into something else. This is the result from a print_r of the result array. Two characters.

[Tag] => OdÃ³metro

When I encode the array again, I get the character escaped to ASCII, which is correct according to the JSON spec:

"Tag"=>"Od\u00f3metro"

Is there some way I can un-escape this? json_encode gives no such option, utf8_encode does not seem to work either.

Edit: I see there is a JSON_UNESCAPED_UNICODE option for json_encode. However, it's not working as expected. Oh damn, it's only available in PHP 5.4. I will have to use some regex as I only have 5.3.

$json = json_encode($array, JSON_UNESCAPED_UNICODE);
Warning: json_encode() expects parameter 2 to be long, string ...

Recommended answer

Judging from everything you've said, it seems like the original Odómetro string you're dealing with is encoded with ISO 8859-1, not UTF-8.

Here's why I think so:

  • json_encode produced parseable output after you ran the input string through utf8_encode, which converts from ISO 8859-1 to UTF-8.
  • You did say that you got "mangled" output when using print_r after doing utf8_encode, but the mangled output you got is actually exactly what would happen by trying to parse UTF-8 text as ISO 8859-1 (ó is \xc3\xb3 in UTF-8, but that sequence is Ã³ in ISO 8859-1).
  • Your htmlentities hackaround solution worked. htmlentities needs to know the encoding of the input string to work correctly. If you don't specify one, it assumes ISO 8859-1. (html_entity_decode, confusingly, defaults to UTF-8, so your method had the effect of converting from ISO 8859-1 to UTF-8.)
  • You said you had the same problem in Python, which would seem to exclude PHP from being the issue.
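
The byte-level reasoning in these points can be checked with a minimal sketch. It assumes the mbstring extension is available and uses mb_convert_encoding in place of utf8_encode (same ISO 8859-1 to UTF-8 conversion); the variable names are illustrative only.

```php
<?php
// "ó" is the two bytes 0xC3 0xB3 in UTF-8, but the single byte 0xF3 in ISO 8859-1.
$utf8   = "Od\xC3\xB3metro";
$latin1 = "Od\xF3metro";

// json_decode only accepts UTF-8, so the ISO 8859-1 bytes are rejected.
$bad      = json_decode('{"Tag":"' . $latin1 . '"}', true);
$badError = json_last_error(); // JSON_ERROR_UTF8

// Converting ISO 8859-1 to UTF-8 first makes the same document decodable.
$converted = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$good      = json_decode('{"Tag":"' . $converted . '"}', true);

// Treating bytes that are already UTF-8 as ISO 8859-1 yields the "mangled"
// two-character form: 0xC3 is read as "Ã" and 0xB3 as "³".
$mangled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo $good['Tag'], "\n"; // Odómetro
echo $mangled, "\n";     // OdÃ³metro
```

The last conversion is also what you see when a terminal or browser displays correct UTF-8 output while assuming ISO 8859-1, so the decoded data may be fine even when print_r looks wrong.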

PHP will use the \uXXXX escaping, but as you noted, this is valid JSON.
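
As a small illustration (a sketch, assuming the input is already valid UTF-8), the escaped form round-trips without loss, so no un-escaping step is needed for correctness. JSON_UNESCAPED_UNICODE, available from PHP 5.4, only changes how the output looks.

```php
<?php
// The input must be valid UTF-8: "ó" is the bytes 0xC3 0xB3.
$data = ['Tag' => "Od\xC3\xB3metro"];

// Default behaviour: non-ASCII characters are written as \uXXXX escapes.
$escaped = json_encode($data);
echo $escaped, "\n"; // {"Tag":"Od\u00f3metro"}

// Decoding the escaped form gives back the original UTF-8 string.
$roundTrip = json_decode($escaped, true);

// PHP 5.4+ only: keep the character literal in the output.
$literal = json_encode($data, JSON_UNESCAPED_UNICODE);
echo $literal, "\n"; // {"Tag":"Odómetro"}
```

Any conforming JSON parser should treat the two outputs as the same document.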

So, it seems like you need to configure your connection to Postgres so that it will give you UTF-8 strings. The PHP manual indicates you'd do this by appending options='--client_encoding=UTF8' to the connection string. There's also the possibility that the data currently stored in the database is in the wrong encoding. (You could simply use utf8_encode, but this will only support characters that are part of ISO 8859-1).
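
A rough sketch of that connection setup follows. The host, database name, and user are hypothetical placeholders, and the pgsql extension is assumed; the connection is only attempted if that extension is loaded.

```php
<?php
// Hypothetical connection parameters; replace with your own.
$params = "host=localhost dbname=example user=example connect_timeout=2";

// Ask the server to send text to this client as UTF-8.
$connString = $params . " options='--client_encoding=UTF8'";

if (function_exists('pg_connect')) {
    // @ suppresses the warning if no server is reachable; pg_connect then returns false.
    $conn = @pg_connect($connString);
    if ($conn !== false) {
        // Reports the encoding actually negotiated for this connection.
        echo pg_client_encoding($conn), "\n";
    }
}
```

If the connection already exists, pg_set_client_encoding($conn, 'UTF8') is another way to request the same thing.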

Finally, as another answer noted, you do need to make sure that you're declaring the proper charset, with an HTTP header or otherwise (of course, this particular issue might have just been an artifact of the environment where you did your print_r testing).