PHP: Convert any string to UTF-8 without knowing the original character set, or at least try

2021-12-28 00:00:00 utf-8 character-encoding php

I have an application that deals with clients from all over the world, and, naturally, I want everything going into my databases to be UTF-8 encoded.

The main problem for me is that I don't know what encoding the source of any string is going to be - it could be from a text box (using <form accept-charset="utf-8"> is only useful if the user actually submits the form), or it could be from an uploaded text file, so I really have no control over the input.

What I need is a function or class that makes sure the stuff going into my database is, as far as is possible, UTF-8 encoded. I've tried iconv(mb_detect_encoding($text), "UTF-8", $text); but that has problems (if the input is 'fiancée' it returns 'fianc'). I've tried a lot of things =/

For file uploads, I like the idea of asking the end user to specify the encoding they use, and show them previews of what the output will look like, but this doesn't help against nasty hackers (in fact, it could make their life a little easier).

I've read the other SO questions on the subject, but they seem to all have subtle differences like "I need to parse RSS feeds" or "I scrape data from websites" (or, indeed, "You can't").

But there must be something that at least has a good try!

Recommended answer

What you're asking for is extremely hard. If possible, getting the user to specify the encoding is the best option. Preventing an attack shouldn't be much easier or harder that way.
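
If you do let the user name the encoding, a minimal sketch of that approach might look like the following. The function name convertFromDeclared and the contents of the allow-list are assumptions made up for illustration; mb_check_encoding() and mb_convert_encoding() are the real mbstring functions doing the work. Matching the declared name against a short allow-list means the user's choice never reaches the converter unchecked.

```php
<?php
// Hypothetical helper: convert $text to UTF-8 using an encoding the user declared.
// Returns null when the declared encoding is not allowed or does not fit the bytes.
function convertFromDeclared(string $text, string $declared): ?string
{
    // Assumed allow-list; extend it with the encodings your users actually need.
    $allowed = ['UTF-8', 'ISO-8859-1', 'Windows-1252'];

    foreach ($allowed as $encoding) {
        if (strcasecmp($encoding, trim($declared)) === 0) {
            // Reject input whose bytes are not valid in the declared encoding.
            if (!mb_check_encoding($text, $encoding)) {
                return null;
            }
            return mb_convert_encoding($text, 'UTF-8', $encoding);
        }
    }

    return null; // Declared encoding is not on the allow-list.
}
```

A null result is a cue to show the preview again or ask the user for a different encoding.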

However, you could try doing this:

iconv(mb_detect_encoding($text, mb_detect_order(), true), "UTF-8", $text);

Passing true as the third argument puts mb_detect_encoding() in strict mode, which might help you get a better result.
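
Building on the strict-detection idea, here is a rough best-effort sketch. The function name toUtf8, the candidate list and the ISO-8859-1 fallback are all assumptions for illustration, not something the question or answer prescribes. It leaves valid UTF-8 alone, and only otherwise tries strict detection before converting with mb_convert_encoding().

```php
<?php
// Hypothetical helper: best-effort conversion of $text to UTF-8.
function toUtf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    // Input that is already valid UTF-8 is returned untouched.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    // Strict detection over an explicit candidate list (an assumption; adjust to your data).
    $detected = mb_detect_encoding($text, ['UTF-8', 'ISO-8859-1'], true);

    // If detection fails, use the assumed fallback encoding instead.
    $source = $detected !== false ? $detected : $fallback;

    return mb_convert_encoding($text, 'UTF-8', $source);
}
```

With this sketch, toUtf8("fianc\xE9e") (the Latin-1 bytes for 'fiancée') comes back as the UTF-8 bytes "fianc\xC3\xA9e" rather than a truncated 'fianc'. Keep in mind that every byte sequence is valid ISO-8859-1, so this cannot tell Latin-1 apart from other single-byte encodings; it is still a guess.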
