JDK 9学习笔记 - (2)能屈能伸的String

2019-07-04 00:00:00 学习笔记 jdk

背景

String作为JDK最核心的数据类型之一，非常有必要专门学习一下，重点关注这4个文件

jdk/src/java.base/share/native/libjava/String.c
jdk/src/java.base/share/classes/java/lang/String.java
jdk/src/java.base/share/classes/java/lang/StringLatin1.java
jdk/src/java.base/share/classes/java/lang/StringUTF16.java

存储

无论是何种语言的何种实现，String本质上都是字节序列，所有可能的字符加起来就构成了字符集，给字符集中每个字符一个序号就是字符编码，使用最广泛的就是Unicode了，它几乎支持地球上所有常见文字，Unicode有三种最主要的实现，UTF-8，UTF-16还有UTF-32，在web领域，UTF-8已经处于绝对垄断地位。Java 9的String，引入了类似Python str的压缩功能。原理很简单，如果String只包含Latin1字符，1字节存一个字符够用了，如果String含有中文，那么就换一种编码方式存储，用一个变量表示当前的字符集就行了。先看三个类。

public final class String
    implements java.io.Serializable, Comparable<String>, CharSequence {

    /**  * The value is used for character storage.  *  * @implNote This field is trusted by the VM, and is a subject to  * constant folding if String instance is constant. Overwriting this  * field after construction will cause problems.  *  * Additionally, it is marked with {@link Stable} to trust the contents  * of the array. No other facility in JDK provides this functionality (yet).  * {@link Stable} is safe here, because value is never null.  */
    @Stable
    private final byte[] value;

    /**  * The identifier of the encoding used to encode the bytes in  * {@code value}. The supported values in this implementation are  *  * LATIN1  * UTF16  *  * @implNote This field is trusted by the VM, and is a subject to  * constant folding if String instance is constant. Overwriting this  * field after construction will cause problems.  */
    private final byte coder;

    /** Cache the hash code for the string */
    private int hash; // Default to 0 }

final class StringLatin1 {
}

final class StringUTF16 {
}

字符序列存储在字节数组value中，然后用一个字节的coder表示编码，这是String的基本构成。然后StringLatin1提供一组静态方法，用来处理只含有Latin1字符时的情况。相应的StringUTF16提供另一组静态方法，处理包含Latin1以外字符时的情况。

注意类的定义，三个类都是final，且只有String有public修饰，所以我们作为JDK的用户，只能使用String，而不能使用StringLatin1或者StringUTF16，这两个类不属于API，属于实现细节，我们既不能使用，也不能依赖其内部实现。但是我们应该理解它，顺从它，避免做出违背它的事情来。

操作

创建

如果让我自己设计一个原始而全能的创建String的方法，我会如何设计这个方法的签名？我想应该传入一个byte[]表示二进制内容，一个offset表示从哪个字节开始解析，一个length表示解析多少个字节，一个charsetName表示这个byte[]的编码类型。在python里面，这个也叫decode，在decode的过程中，我们可以检测是否只有Latin1字符集内的字符，如果是，我们可以用一个字节表示一个字符，如果不是，我们就老老实实用回UTF-16。

charAt

用脚趾头想一下，大概也能猜到它的实现，如果是Latin1的编码，就调用StringLatin1的方法，如果是UTF-16的编码，那么就调用StringUTF16的方法。如下

    public char charAt(int index) {
        if (isLatin1()) {
            return StringLatin1.charAt(value, index);
        } else {
            return StringUTF16.charAt(value, index);
        }
    }

因为Latin1是一字节一个字符，所以必然是直接从value字节数组中直接定位到index字节

    // Latin1     public static char charAt(byte[] value, int index) {
        if (index < 0 || index >= value.length) {
            throw new StringIndexOutOfBoundsException(index);
        }
        return (char)(value[index] & 0xff);
    }

    // UTF-16     public static char charAt(byte[] value, int index) {
        checkIndex(index, value);
        return getChar(value, index);
    }

    static char getChar(byte[] val, int index) {
        assert index >= 0 && index < length(val) : "Trusted caller missed bounds check";
        index <<= 1;
        return (char)(((val[index++] & 0xff) << HI_BYTE_SHIFT) |
                      ((val[index]   & 0xff) << LO_BYTE_SHIFT));
    }

    private static native boolean isBigEndian();

    static final int HI_BYTE_SHIFT;
    static final int LO_BYTE_SHIFT;
    static {
        if (isBigEndian()) {
            HI_BYTE_SHIFT = 8;
            LO_BYTE_SHIFT = 0;
        } else {
            HI_BYTE_SHIFT = 0;
            LO_BYTE_SHIFT = 8;
        }
    }

如果是UTF-16编码，取一个char，把index乘以2，再把连续的两个字节拼在一起就行了，这里因为是两个字节拼成char，涉及到了大小端问题，所以做了特殊处理。

codePointAt

由于char类型只有2字节，而UTF-16最多可以有4字节，虽然很罕见，但是也是有可能存在的，这个时候就需要用int类型表示unicode点位数值了

    // String     public int codePointAt(int index) {
        if (isLatin1()) {
            checkIndex(index, value.length);
            return value[index] & 0xff;
        }
        int length = value.length >> 1;
        checkIndex(index, length);
        return StringUTF16.codePointAt(value, index, length);
    }

    // UTF-16     public static int codePointAt(byte[] value, int index, int end) {
       return codePointAt(value, index, end, false /* unchecked */);
    }

    // UTF-16     private static int codePointAt(byte[] value, int index, int end, boolean checked) {
        assert index < end;
        if (checked) {
            checkIndex(index, value);
        }
        char c1 = getChar(value, index);
        if (Character.isHighSurrogate(c1) && ++index < end) {
            if (checked) {
                checkIndex(index, value);
            }
            char c2 = getChar(value, index);
            if (Character.isLowSurrogate(c2)) {
               return Character.toCodePoint(c1, c2);
            }
        }
        return c1;
    }

Latin1编码时，仍然很简单，直接定位到这个字节即可。UTF-16时却要麻烦些许，需要取出那个字符，判断它是否需要另一半拼起来，需要的话还得再拼上2字节。

截取

substring这类操作，不用翻代码也知道，从某个字节开始，到某个字节结束，重新构造新String就可以了，构造的过程中检测一下是否有压缩的可能，比如本来是UTF-16编码的string，但是字串没有超出Latin1的字符，那么理所应当应该使用Latin1编码节省存储空间。

查找

可能我们都很熟悉KMP或者自动机等高级字符串查找算法，然而在Java的String里，我们并没有用到这种高级算法，JDK String选择的是最简易的暴力查找，现实中的字符串查找，很少出现大量前缀字符匹配但最后失配的情况。比如这篇博文，任意拿10个字符的字串出来查找，都不会遇到前9个匹配但第10个不匹配的情况。但是这种朴素算法，可能会遭受到精心构造的数据的攻击，显著增加计算量。从O(m)变成O(m*n)，m是完整的字符串长度，n是要查找的字串的长度。在极端情况下，我们需要防范这种攻击，但平时一般不用考虑。

替换

有了查找，自然也有了替换，替换 = 查找 + 拼接，单个字符的替换比较简单，但是需要考虑到Latin1变UTF-16的情况，也要考虑UTF-16压缩到Latin1的情况。

拘留

一般String用完就会释放，但是有一些特殊情况，我们用完之后不释放，而是把它放到常量池里，下回有同样的String的时候，返回同一个引用，可以减少内存占用，多数时候也能提高性能。Java语言规范里有一段：

A string literal is a reference to an instance of class String (§4.3.1, §4.3.3).
Moreover, a string literal always refers to the same instance of class String . This
is because string literals – or, more generally, strings that are the values of constant
expressions (§15.28) – are “interned” so as to share unique instances, using the
method String.intern .

字符串字面值，确保会是同一个引用，拘留(intern)是个native方法，这么实现的

JVM_ENTRY(jstring, JVM_InternString(JNIEnv *env, jstring str))
  JVMWrapper("JVM_InternString");
  JvmtiVMObjectAllocEventCollector oam;
  if (str == NULL) return NULL;
  oop string = JNIHandles::resolve_non_null(str);
  oop result = StringTable::intern(string, CHECK_NULL);
  return (jstring) JNIHandles::make_local(env, result);
JVM_END

oop StringTable::intern(oop string, TRAPS)
{
  if (string == NULL) return NULL;
  ResourceMark rm(THREAD);
  int length;
  Handle h_string (THREAD, string);
  jchar* chars = java_lang_String::as_unicode_string(string, length, CHECK_NULL);
  oop result = intern(h_string, chars, length, CHECK_NULL);
  return result;
}

oop StringTable::intern(Handle string_or_null, jchar* name,
                        int len, TRAPS) {
  // shared table always uses java_lang_String::hash_code   unsigned int hashValue = java_lang_String::hash_code(name, len);
  oop found_string = lookup_shared(name, len, hashValue);
  if (found_string != NULL) {
    return found_string;
  }
  if (use_alternate_hashcode()) {
    hashValue = alt_hash_string(name, len);
  }
  int index = the_table()->hash_to_index(hashValue);
  found_string = the_table()->lookup_in_main_table(index, name, len, hashValue);

  // Found   if (found_string != NULL) {
    if (found_string != string_or_null()) {
      ensure_string_alive(found_string);
    }
    return found_string;
  }

  debug_only(StableMemoryChecker smc(name, len * sizeof(name[0])));
  assert(!Universe::heap()->is_in_reserved(name),
         "proposed name of symbol must be stable");

  Handle string;
  // try to reuse the string if possible   if (!string_or_null.is_null()) {
    string = string_or_null;
  } else {
    string = java_lang_String::create_from_unicode(name, len, CHECK_NULL);
  }

#if INCLUDE_ALL_GCS   if (G1StringDedup::is_enabled()) {
    // Deduplicate the string before it is interned. Note that we should never     // deduplicate a string after it has been interned. Doing so will counteract     // compiler optimizations done on e.g. interned string literals.     G1StringDedup::deduplicate(string());
  }
#endif 
  // Grab the StringTable_lock before getting the_table() because it could   // change at safepoint.   oop added_or_found;
  {
    MutexLocker ml(StringTable_lock, THREAD);
    // Otherwise, add to symbol to table     added_or_found = the_table()->basic_add(index, string, name, len,
                                  hashValue, CHECK_NULL);
  }

  if (added_or_found != string()) {
    ensure_string_alive(added_or_found);
  }

  return added_or_found;
}

虽然代码挺长，原理却很简单，维护一个符号表，调用intern的时候，有则返回，无则新建并扔到符号表里去。

小结

JDK 9 String 的 compress 设计，在Latin1字符为主的程序里，可以把String占用的内存减少一半，小小的改进，体现到代码层面却是到处都要把这个放心上处理。天下没有免费午餐，这个特性在节省内存的同时引入了编码检测的开销，有时反而会更慢。如果确信关闭这个特性更好，JVM 9提供了参数可以关闭这个特性，+XX:-CompactStrings。

    原文作者：许怀远
    原文地址: https://zhuanlan.zhihu.com/p/30584322
    本文转自网络文章，转载此文章仅为分享知识，如有侵权，请联系博主进行删除。

相关文章