file.encoding has no effect, LC_ALL environment variable does

The following Java program runs on Linux under OpenJDK 1.6.0_22 and simply lists the contents of the directory passed as a command-line argument. The directory contains files whose names are encoded in UTF-8 (e.g. Hindi, Mandarin, German).

import java.io.File;

class ListDir {

    public static void main(String[] args) throws Exception {
        //System.setProperty("file.encoding", "en_US.UTF-8");
        System.out.println(System.getProperty("file.encoding"));
        File dir = new File(args[0]);
        // list() returns null if args[0] is not a readable directory
        String[] names = dir.list();
        if (names == null) {
            System.err.println("Cannot list " + dir);
            return;
        }
        for (String name : names) {
            // Build the child path from the parent directory and the entry name
            File cf = new File(dir, name);
            System.out.println(cf.getAbsolutePath() + " --> " + cf.exists());
        }
    }
}

If I set the LC_ALL variable to en_US.UTF-8, the results are printed fine. But if I set LC_ALL to POSIX and pass the file.encoding and sun.jnu.encoding properties as UTF-8 on the command line, I get garbage output and cf.exists() returns false.
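To compare the two runs, it helps to print what the JVM actually ended up with rather than what was requested. The sketch below is only a diagnostic aid; the class name EncodingProbe and its helper methods are made up for illustration, and sun.jnu.encoding is an internal, unsupported property that may be absent or behave differently on other JVMs.

```java
import java.nio.charset.Charset;

class EncodingProbe {
    // Returns the system property value, or "(unset)" when it is missing.
    static String prop(String name) {
        String value = System.getProperty(name);
        return value == null ? "(unset)" : value;
    }

    // Returns the environment variable value, or "(unset)" when it is missing.
    static String env(String name) {
        String value = System.getenv(name);
        return value == null ? "(unset)" : value;
    }

    public static void main(String[] args) {
        // Locale-related environment as seen by the JVM process
        System.out.println("LC_ALL           = " + env("LC_ALL"));
        System.out.println("LANG             = " + env("LANG"));
        // Encoding-related properties after JVM startup
        System.out.println("file.encoding    = " + prop("file.encoding"));
        System.out.println("sun.jnu.encoding = " + prop("sun.jnu.encoding"));
        // Charset used by default for byte/char conversions in Java code
        System.out.println("default charset  = " + Charset.defaultCharset().name());
    }
}
```

Running it once as `LC_ALL=en_US.UTF-8 java EncodingProbe` and once as `LC_ALL=POSIX java -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 EncodingProbe` shows the values in effect for each case side by side.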

Can you please explain this behavior? Many websites say that file.encoding is sufficient to read file names and use them for file operations. Here that property seems to have no effect at all.

Update 1: If I set file.encoding to something like GBK (Chinese) and LC_ALL to en_US.UTF-8, then cf.exists() returns true, but only '?' characters are printed instead of the file name. Surprise o_O.
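The '?' characters are what one would expect if file.encoding only affects how text is encoded on output. A minimal sketch of that effect, using US-ASCII instead of GBK because US-ASCII is guaranteed to exist on every JVM (the class name UnmappableDemo and the sample file name are made up for illustration):

```java
import java.nio.charset.Charset;

class UnmappableDemo {
    // Encodes text with the named charset. String.getBytes(Charset) replaces
    // every character the charset cannot represent with the charset's
    // replacement byte, which is '?' for US-ASCII.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Six Devanagari code points followed by ".txt"
        String name = "\u0939\u093f\u0928\u094d\u0926\u0940.txt";
        byte[] ascii = encode(name, "US-ASCII");
        // Decode the bytes again so they can be printed as text
        System.out.println(new String(ascii, Charset.forName("US-ASCII")));
        // prints "??????.txt"
    }
}
```

The String itself is still intact in memory; only the encoded copy loses information. That would fit the observation that cf.exists() keeps working while the printed name is unreadable.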

Update 2: After more investigation, it looks like this is not a Java issue. It looks like libc on Linux uses the locale settings to translate file name encodings, and those settings cause the file-not-found error/exception. "file.encoding" only controls how Java interprets file names.

Update 3: Now it looks like the problem is how Java interprets file names. The following simple C code works on Linux regardless of the file name encoding and the value of the LC_ALL environment variable (I am happy that this confirms the answer given here: https://unix.stackexchange.com/questions/39175/understanding-unix-file-name-encoding). But I am still not clear on how Java interprets the LC_ALL variable. I am now looking into the OpenJDK code for that.

Sample C code:

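#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <directory>\n", argv[0]);
        return 1;
    }
    char *argdir = argv[1];
    DIR *dp = opendir(argdir);
    if (dp == NULL) {
        perror("opendir");
        return 1;
    }
    struct dirent *de;
    while ((de = readdir(dp)) != NULL) {
        /* directory + '/' + entry name + terminating NUL */
        size_t len = strlen(argdir) + 1 + strlen(de->d_name) + 1;
        char *abspath = malloc(len);
        if (abspath == NULL) {
            perror("malloc");
            closedir(dp);
            return 1;
        }
        snprintf(abspath, len, "%s/%s", argdir, de->d_name);
        printf("%d %s ", de->d_type, abspath);
        FILE *fp = fopen(abspath, "r");
        if (fp) {
            printf("Success");
            fclose(fp); /* only close a stream that was actually opened */
        }
        putchar('\n');
        free(abspath);
    }
    closedir(dp);
    return 0;
}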

Recommended answer

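Note: I think I have finally nailed it down, but I cannot confirm that it is right. This is what I found through some code reading and testing, and I do not have extra time to look into it further. If anyone is interested, they can check it and tell me whether this answer is right or wrong. I would be glad :)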

The reference I used is this tarball, available on the OpenJDK site: openjdk-6-src-b25-01_may_2012.tar.gz

  1. Java natively converts every string to the platform's local encoding in this method: jdk/src/share/native/common/jni_util.c - JNU_GetStringPlatformChars(). The system property sun.jnu.encoding determines which encoding is treated as the platform's encoding.

  2. The value of sun.jnu.encoding is set in jdk/src/solaris/native/java/lang/java_props_md.c - GetJavaProperties(), using libc's setlocale() function. The LC_ALL environment variable is used to set the value of sun.jnu.encoding. A value passed on the command line with Java's -Dsun.jnu.encoding option is ignored.

  3. File.exists() is implemented in jdk/src/share/classes/java/io/File.java and returns:

return ((fs.getBooleanAttributes(this) & FileSystem.BA_EXISTS) != 0);

  4. getBooleanAttributes() is implemented natively (I am skipping several steps of code browsing through many files) in jdk/src/share/native/java/io/UnixFileSystem_md.c, in the function Java_java_io_UnixFileSystem_getBooleanAttributes0(). There the macro WITH_FIELD_PLATFORM_STRING(env, file, ids.path, path) converts the path string to the platform's encoding.

So a conversion to the wrong encoding sends the wrong C string (char array) to the subsequent stat() call, and stat() reports that the file cannot be found.
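The effect can be imitated in pure Java. The sketch below is only an analogy, not the JDK's native code path: it assumes that under LC_ALL=POSIX the platform encoding is plain ASCII, and it stands in for the native conversion with ordinary String/byte[] conversions. The class name RoundTripDemo and the sample file name are made up for illustration.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class RoundTripDemo {
    // Decodes raw file-name bytes with the named charset and encodes the
    // resulting String back with the same charset, imitating the trip from
    // the file system into a Java String and back out to stat().
    static byte[] roundTrip(byte[] raw, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(raw, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        // Bytes of a UTF-8 encoded file name as stored on disk
        byte[] onDisk = "gr\u00fc\u00df.txt".getBytes(Charset.forName("UTF-8"));

        // With UTF-8 as the platform encoding the bytes survive unchanged
        System.out.println("UTF-8 round trip intact:    "
                + Arrays.equals(onDisk, roundTrip(onDisk, "UTF-8")));

        // With US-ASCII every non-ASCII byte is replaced on the way in and
        // comes back as '?', so the name handed to stat() no longer matches
        System.out.println("US-ASCII round trip intact: "
                + Arrays.equals(onDisk, roundTrip(onDisk, "US-ASCII")));
    }
}
```

The first line reports true and the second false: once the name has passed through the wrong encoding, the bytes that reach the file system are not the bytes that readdir() returned.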

LESSON: LC_ALL is very important
