UTF-8 filenames in PHP and different Unicode encodings

2021-12-27 00:00:00 unicode filepath utf-8 encoding php

I have a file with Unicode characters in its name on a server running Linux. If I SSH into the server and use tab completion to navigate to the file or folder, I can access it without any problem. The problem arises when I try to access the file via PHP (the function I was using to access the file system was stat). If I output the path generated by the PHP script to the browser and paste it into the terminal, the file also does not seem to exist, even though the two paths look exactly the same in the terminal.

I set PHP to use UTF-8 as its default encoding via php.ini and also set mb_internal_encoding. I checked the encoding of the PHP file path string and it comes out as UTF-8, as it should. Poking around a bit more, I decided to hexdump the é character that the terminal's tab completion produces and compare it to the hexdump of the 'regular' é character created by the PHP script or typed manually on the keyboard (option+e, then e, on OS X). Here is the result:

echo -n é | hexdump
0000000 cc65 0081                              
0000003
echo -n é | hexdump
0000000 a9c3                                   
0000002
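
The two dumps can be reproduced without depending on how a terminal or editor happens to encode a pasted character. The following is a minimal sketch, assuming Python 3, that spells both variants with explicit escapes; the variable names are illustrative only. Note that hexdump's default output groups bytes into little-endian 16-bit words, so `cc65 0081` corresponds to the bytes 65 cc 81 and `a9c3` to the bytes c3 a9.

```python
import unicodedata

# Decomposed form: "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "e\u0301"
# Precomposed form: the single code point U+00E9.
precomposed = "\u00e9"

decomposed_bytes = decomposed.encode("utf-8")
precomposed_bytes = precomposed.encode("utf-8")

# Print the UTF-8 bytes in file order (not in hexdump's word order).
print(" ".join(f"{b:02x}" for b in decomposed_bytes))   # 65 cc 81
print(" ".join(f"{b:02x}" for b in precomposed_bytes))  # c3 a9

# Both render as "é", but they are different strings until normalized.
same_raw = decomposed == precomposed
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
print(same_raw, same_after_nfc)  # False True
```

Both sequences are valid UTF-8, so the difference is one of Unicode normalization rather than of character encoding.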

The é character that allows a correct file reference in the terminal is the 3-byte one. I'm not sure where to go from here. What encoding should I use in PHP? Should I be converting the path to another encoding via iconv or mb_convert_encoding?

Solution

Thanks to the tips given in the two answers, I was able to poke around and find some methods for normalizing the different Unicode decompositions of a given character. In my case, I was accessing files created by an OS X Carbon application. It is a fairly popular application, and its file names seemed to adhere to a specific Unicode decomposition.

PHP 5.3 introduced a new set of functions (the Normalizer class in the intl extension) that lets you normalize a Unicode string to a particular form. There are four normalization forms to choose from: NFC, NFD, NFKC and NFKD. Python has had Unicode normalization capabilities since version 2.3 via unicodedata.normalize. This article on Python's handling of Unicode strings was helpful in understanding encoding and string handling a bit better.
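
To make the four forms concrete, here is a small sketch, assuming Python 3, that lists the code points each form produces for a sample string. The helper name `normalization_table` and the sample string (é followed by the "fi" ligature U+FB01) are my own choices for illustration.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def normalization_table(text):
    """Map each normalization form to the code points it yields for text."""
    return {
        form: [f"U+{ord(ch):04X}" for ch in unicodedata.normalize(form, text)]
        for form in FORMS
    }

# Precomposed é plus the "fi" ligature, which only the K forms split apart.
table = normalization_table("\u00e9\ufb01")
for form in FORMS:
    print(form, table[form])
# NFC  ['U+00E9', 'U+FB01']
# NFD  ['U+0065', 'U+0301', 'U+FB01']
# NFKC ['U+00E9', 'U+0066', 'U+0069']
# NFKD ['U+0065', 'U+0301', 'U+0066', 'U+0069']
```

The K (compatibility) forms rewrite characters such as ligatures, so for file names the canonical forms NFC and NFD are usually the ones worth comparing.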

Here is a quick example of normalizing a Unicode file path in Python (filePath must be a Unicode string, not a byte string):

import unicodedata
filePath = unicodedata.normalize('NFD', filePath)

I found that the NFD form worked for all my purposes. I wonder whether it is the standard decomposition for Unicode filenames.
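
As far as I know, NFD is not a universal standard for file names: HFS+ on OS X stores names in a decomposed form, while typical Linux file systems keep whatever bytes they were given. If the origin of the files is unknown, one option is to try more than one form. The sketch below assumes Python 3; `candidate_paths` and `stat_any_form` are hypothetical helper names, not part of any library.

```python
import os
import unicodedata

def candidate_paths(path):
    """Yield the path as given, then its NFC and NFD forms, skipping repeats."""
    seen = set()
    for form in (None, "NFC", "NFD"):
        candidate = path if form is None else unicodedata.normalize(form, path)
        if candidate not in seen:
            seen.add(candidate)
            yield candidate

def stat_any_form(path):
    """Return (matching_path, stat_result) for the first form that exists, else None."""
    for candidate in candidate_paths(path):
        try:
            return candidate, os.stat(candidate)
        except FileNotFoundError:
            continue
    return None
```

For example, `stat_any_form("caf\u00e9.txt")` would first try the precomposed name and then fall back to the decomposed one. Other errors, such as permission problems, are deliberately left to propagate.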
