When will the new String() objects in memory be cleared after calling the intern() method?
List<String> list = new ArrayList<>();
for (int i = 0; i < 1000; i++) {
    StringBuilder sb = new StringBuilder();
    String string = sb.toString();
    string = string.intern();
    list.add(string);
}
In the above sample, after invoking the string.intern() method, when will the 1000 objects created in the heap (via sb.toString()) be cleared?
Edit 1: If there is no guarantee that these objects will be cleared, and assuming the GC hasn't run yet, is using string.intern() itself pointless (in terms of memory usage)?
Is there any way to reduce memory usage / object creation while using the intern() method?
Accepted Answer
Your example is a bit odd, as it creates 1000 empty strings. If you want to get such a list while consuming minimal memory, you should use
List<String> list = Collections.nCopies(1000, "");
instead.
If we assume that there is something more sophisticated going on, not creating the same string in every iteration, well, then there is no benefit in calling intern(). What will happen is implementation dependent. But when calling intern() on a string that is not in the pool, it will, in the best case, just be added to the pool; in the worst case, another copy will be made and added to the pool.
At this point, we have no savings yet, but have potentially created additional garbage.
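To make this concrete, here is a minimal illustration (a hypothetical snippet, not from the question) of what intern() returns and what it leaves behind:

String literal = "hello";                                          // a literal, hence already in the pool
String built = new StringBuilder("hel").append("lo").toString();   // a distinct heap instance
System.out.println(built == literal);                              // false: two different objects
System.out.println(built.intern() == literal);                     // true: intern() returns the pooled instance
// 'built' itself gains nothing from the call; it remains a separate object until collected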
Interning at this point can only save you some memory if there are duplicates somewhere. This implies that you construct the duplicate strings first, to look up their canonical instance via intern() afterwards, so having the duplicate string in memory until it is garbage collected is unavoidable. But that's not the real problem with interning:
- In older JVMs, there was special treatment of interned strings that could degrade garbage collection performance or even exhaust resources (i.e. the fixed-size "PermGen" space).
- In HotSpot, the string pool holding the interned strings is a fixed-size hash table, which yields hash collisions, and hence poor performance, when referencing significantly more strings than the table size. Before Java 7 update 40, the default size was about 1,000, not even sufficient to hold all string constants of any nontrivial application without hash collisions, let alone manually added strings. Later versions use a default size of about 60,000, which is better, but still a fixed size that should discourage you from adding an arbitrary number of strings.
- The string pool has to obey the inter-thread semantics mandated by the language specification (since it is used for string literals), hence it needs to perform thread-safe updates that can degrade performance.
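As an aside (not part of the original answer): on HotSpot, that fixed-size table can be inspected and tuned with JVM flags, for example

java -XX:StringTableSize=1000003 -XX:+PrintStringTableStatistics MyApp

where StringTableSize sets the number of buckets of the interned-string hash table, PrintStringTableStatistics prints the pool's bucket and entry statistics at VM exit, and MyApp is just a placeholder for your main class.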
Keep in mind that you pay the price of the disadvantages named above even in the cases where there are no duplicates, i.e. where there is no space saving. Also, the acquired reference to the canonical string has to have a much longer lifetime than the temporary object used to look it up, to have any positive effect on memory consumption.
The latter touches on your literal question. The temporary instances are reclaimed the next time the garbage collector runs, which will be when the memory is actually needed. There is no need to worry about when this will happen, but, well, yes, up to that point, acquiring a canonical reference has had no positive effect, not only because the memory hasn't been reused up to that point, but also because the memory was not actually needed until then.
This is the place to mention the new String Deduplication feature. This does not change string instances, i.e. the identity of these objects, as that would change the semantics of the program, but it changes identical strings to use the same char[] array. Since these character arrays are the biggest payload, this still may achieve great memory savings, without the performance disadvantages of using intern(). Since this deduplication is done by the garbage collector, it will only be applied to strings that have survived long enough to make a difference. Also, this implies that it will not waste CPU cycles when there is still plenty of free memory.
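For reference (an environment assumption not stated in the answer: String Deduplication requires the G1 collector and JDK 8u20 or later), it is enabled with JVM flags such as

java -XX:+UseG1GC -XX:+UseStringDeduplication MyApp

On JDK 8, adding -XX:+PrintStringDeduplicationStatistics makes the collector report how much memory the deduplication actually recovered; MyApp is again just a placeholder.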
However, there might be cases where manual canonicalization is justified. Imagine we're parsing a source code file or an XML file, or importing strings from an external source (a Reader or a database), where such canonicalization will not happen by default, but duplicates may occur with a certain likelihood. If we plan to keep the data for further processing for a longer time, we might want to get rid of duplicate string instances.
In this case, one of the best approaches is to use a local map, not subject to thread synchronization, and to drop it after the process, to avoid keeping references longer than necessary and without needing any special interaction with the garbage collector. This implies that occurrences of the same strings within different data sources are not canonicalized (though they are still subject to the JVM's String Deduplication), but it's a reasonable trade-off. By using an ordinary resizable HashMap, we also do not have the problems of the fixed-size intern table.
E.g.
static List<String> parse(CharSequence input) {
    List<String> result = new ArrayList<>();
    Matcher m = TOKEN_PATTERN.matcher(input);
    CharBuffer cb = CharBuffer.wrap(input);
    HashMap<CharSequence,String> cache = new HashMap<>();
    while(m.find()) {
        result.add(
            cache.computeIfAbsent(cb.subSequence(m.start(), m.end()), Object::toString));
    }
    return result;
}
Note the use of the CharBuffer here: it wraps the input sequence, and its subSequence method returns another wrapper with different start and end indices, implementing the right equals and hashCode methods for our HashMap, and computeIfAbsent will only invoke the toString method if the key was not present in the map before. So, unlike with intern(), no String instance will be created for already encountered strings, saving the most expensive aspect of it, the copying of the character arrays.
If we have a really high likelihood of duplicates, we may even save the creation of the wrapper instances:
static List<String> parse(CharSequence input) {
    List<String> result = new ArrayList<>();
    Matcher m = TOKEN_PATTERN.matcher(input);
    CharBuffer cb = CharBuffer.wrap(input);
    HashMap<CharSequence,String> cache = new HashMap<>();
    while(m.find()) {
        cb.limit(m.end()).position(m.start());
        String s = cache.get(cb);
        if(s == null) {
            s = cb.toString();
            cache.put(CharBuffer.wrap(s), s);
        }
        result.add(s);
    }
    return result;
}
This creates only one wrapper per unique string, but it also has to perform one additional hash lookup for each unique string when putting it into the map. Since the creation of a wrapper is quite cheap, you really need a significantly large number of duplicate strings, i.e. a small number of unique strings compared to the total number, to benefit from this trade-off.
As said, these approaches are very efficient, because they use a purely local cache that is simply dropped afterwards. With this, we neither have to deal with thread safety nor interact with the JVM or the garbage collector in any special way.