UTF-8 BOM signature in PHP files

I was writing some commented PHP classes and stumbled upon a problem. My name (for the @author tag) ends with a ș (which is a non-ASCII character, ...and a strange name, I know).

Even though I save the file as UTF-8, some friends reported that the character shows up completely garbled (È™). The problem goes away when I add the BOM signature. That troubles me a bit, though, since I don't know much about the BOM beyond what I read on Wikipedia and in some similar questions here on SO.
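
A minimal Python sketch of how ș can turn into È™, assuming (this is a guess, the question does not say which encoding the friends' editors fell back to) that the UTF-8 bytes were decoded as Windows-1252:

```python
# The character from the @author tag.
original = "ș"

# Saved as UTF-8, it occupies two bytes on disk.
utf8_bytes = original.encode("utf-8")

# An editor that assumes Windows-1252 (an assumption for this sketch)
# decodes each byte as a separate character.
misread = utf8_bytes.decode("cp1252")

print(utf8_bytes)  # b'\xc8\x99'
print(misread)     # È™
```

The file itself is fine in this scenario; only the reader's choice of decoding differs.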

I know that it adds a few bytes at the beginning of the file, and from what I understand that is usually harmless, but I'm concerned because the only problematic scenarios I read about involved PHP files. Since I'm writing PHP classes to share, being 100% compatible is more important than having my name displayed correctly in the comments.

So I'm trying to understand the implications: should I use it without worrying, or are there cases where it might cause damage? If so, when?

Recommended answer

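Indeed, the BOM is actual data sent to the browser, because its bytes sit before the opening <?php tag and PHP treats them as output. The browser will happily ignore it, but once that output has started you can no longer send headers (unless output buffering happens to be enabled).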
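
To check whether a file carries the signature, something like the following Python sketch could be used. The helper names has_utf8_bom and strip_utf8_bom are made up for this example, and the PHP source is an in-memory sample instead of a real file:

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM, if there is one."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data


source = "<?php // name ending in ș\n"

# "utf-8-sig" writes the BOM in front of the text; "utf-8" does not.
plain = source.encode("utf-8")
with_bom = source.encode("utf-8-sig")

print(with_bom[:8])            # b'\xef\xbb\xbf<?php'
print(has_utf8_bom(with_bom))  # True
print(has_utf8_bom(plain))     # False
```

The first print shows the three BOM bytes sitting in front of the opening tag, which is exactly the output that reaches the browser before any header can be sent.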

I believe the real problem is the editor settings, yours and your friends'. Without a BOM, a friend's editor may not automatically recognize the file as UTF-8. They can configure the editor to expect UTF-8 files (if you use a real IDE such as NetBeans, this can even be a project setting that you ship along with the code).

An alternative is to try some tricks: some editors guess the encoding with heuristics based on the file's contents. You could try to start each file with

<?php //Úτƒ-8 encoded

and maybe the heuristic will get it. There's probably better stuff to put there; you can either google which encoding-detection heuristics are common, or just try some out :-)

All in all, I recommend just fixing the editor settings.

Oh wait, I misread the last part: for spreading the code anywhere, I guess you're safest either making all files contain only 7-bit characters, i.e. plain ASCII, or just accepting that some people with ancient editors will see your name written funny. There is no fail-safe way. The BOM is definitely bad because of the "headers already sent" problem. On the other hand, as long as you only put non-ASCII UTF-8 characters in comments and the like, the only impact of an editor misreading the encoding is weird characters. I'd go for spelling your name correctly and adding a comment targeted at heuristics so that most editors will get it, but there will always be people who see bogus characters instead.
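
If you go the plain-ASCII route, a small check can tell you which lines would need attention. This is only a sketch in Python: non_ascii_lines is a hypothetical helper, and the sample source is invented for the example.

```python
def non_ascii_lines(text: str) -> list:
    """Return the 1-based numbers of lines containing non-ASCII characters."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if not line.isascii()  # str.isascii() needs Python 3.7 or newer
    ]


# Invented sample: only the doc comment on line 2 contains ș.
sample = "<?php\n/** @author name ending in ș */\nclass Example {}\n"

flagged = non_ascii_lines(sample)
print(flagged)  # [2]
```

An empty result means the file is pure ASCII and will look the same in any editor, whatever encoding it assumes.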
