When can Hotspot allocate objects on the stack?

Since somewhere around Java 6, the Hotspot JVM can do escape analysis and allocate non-escaping objects on the stack instead of on the garbage collected heap. This results in a speedup of the generated code and reduces pressure on the garbage collector.

What are the rules for when Hotspot is able to stack allocate objects? In other words, when can I rely on it to do stack allocation?

edit: This question is a duplicate, but (IMO) the answer below is a better answer than what is available at the original question.

Recommended answer

I have done some experimentation in order to see when Hotspot is able to stack allocate. It turns out that its stack allocation is quite a bit more limited than what you might expect based on the available documentation. The referenced paper by Choi "Escape Analysis for Java" suggests that an object that is only ever assigned to local variables can always be stack allocated. But that is not true.

All of these are implementation details of the current Hotspot implementation, so they could change in future versions. This refers to my OpenJDK install, which is version 1.8.0_121 for X86-64.

The short summary, based on quite a bit of experimentation, seems to be:

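Hotspot can stack allocate an object instance if

  • all its uses are inlined
  • it is never assigned to any static or object fields, only to local variables
  • at each point in the program, which local variables contain references to the object is determinable at JIT time and does not depend on any unpredictable conditional control flow
  • if the object is an array, its size is known at JIT time and indexing into it uses JIT-time constants.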

To know when these conditions hold you need to know quite a bit about how Hotspot works. Relying on Hotspot to definitely do stack allocation in a certain situation can be risky, as a lot of non-local factors are involved. In particular, knowing whether everything is inlined can be difficult to predict.

Practically speaking, simple iterators will usually be stack allocatable if you just use them to iterate. For composite objects only the outer object can ever be stack allocated, so lists and other collections always cause heap allocation.

If you have a HashMap<Integer,Something> and you use it in myHashMap.get(42), the 42 may stack allocate in a test program, but it will not in a full application because you can be sure that there will be more than two types of key objects in HashMaps in the entire program, and therefore the hashCode and equals methods on the key won't inline.

Beyond that I don't see any generally applicable rules, and it will depend on the specifics of the code.

The first important thing to know is that escape analysis is performed after inlining. This means that Hotspot's escape analysis is in this respect more powerful than the description in the Choi paper, since an object returned from a method but local to the caller method can still be stack allocated. Because of this iterators can nearly always be stack allocated if you do e.g. for(Foo item : myList) {...} (and the implementation of myList.iterator() is simple enough, which they usually are.)
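A minimal sketch of the kind of loop meant here, assuming an ArrayList receiver and a hypothetical class name IteratorLoop (neither is from the original answer). Whether the iterator is actually scalar replaced still depends on the loop being hot and on iterator(), hasNext() and next() all being inlined.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the names are illustrative only.
class IteratorLoop {
    // The for-each loop compiles to list.iterator(), hasNext() and next().
    // The Iterator object is only ever held in a hidden local variable,
    // so it is a candidate for scalar replacement once those calls inline.
    static long sum(List<Integer> list) {
        long total = 0;
        for (int item : list) {
            total += item;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < 1000; i++) list.add(i);
        long result = 0;
        // Call often enough that the JIT is likely to consider sum() hot.
        for (int i = 0; i < 20_000; i++) result += sum(list);
        System.out.println("Result: " + result);
    }
}
```

Running this with java -verbose:gc IteratorLoop gives a rough indication, in the same way as the Scalarization test program further down.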

Hotspot only compiles optimized versions of methods once it determines the method is 'hot', so code that is not run a lot of times does not get optimized at all, in which case there is no stack allocation or inlining whatsoever. But for those methods you usually don't care.

Inlining decisions are based on profiling data that Hotspot collects first. The declared types do not matter so much, even if a method is virtual Hotspot can inline it based on the types of the objects it sees during profiling. Something similar holds for branches (i.e. if-statements and other control flow constructs): If during profiling Hotspot never sees a certain branch being taken, it will compile and optimize the code based on the assumption that the branch is never taken. In both cases, if Hotspot cannot prove that its assumptions will always be true, it will insert checks in the compiled code known as 'uncommon traps', and if such a trap is hit Hotspot will de-optimize and possibly re-optimize taking the new information into account.

Hotspot will profile which object types occur as receivers at which call sites. If Hotspot only sees a single type or only two distinct types occurring at a call site, it is able to inline the called method. If there are only one or two very common types and other types occur much less often, Hotspot should also still be able to inline the methods of the common types, including a check for which code it needs to take. (I'm not entirely sure about this last case with one or two common types and more uncommon types though.) If there are more than two common types, Hotspot will not inline the call at all but instead generate machine code for an indirect call.

'Type' here refers to the exact type of an object. Implemented interfaces or shared superclasses are not taken into account. Even if different receiver types occur at a call site but they all inherit the same implementation of a method (e.g. multiple classes that all inherit hashCode from Object), Hotspot will still generate an indirect call and not inline. (So i.m.o. hotspot is quite stupid in such cases. I hope future versions improve this.)

Hotspot will also only inline methods that are not too big. 'Not too big' is determined by the -XX:MaxInlineSize=n and -XX:FreqInlineSize=n options. Inlinable methods with a JVM bytecode size below MaxInlineSize are always inlined, methods with a JVM bytecode size below FreqInlineSize are inlined if the call is 'hot'. Larger methods are never inlined. By default MaxInlineSize is 35 and FreqInlineSize is platform dependent but for me it is 325. So make sure your methods are not too big if you want them inlined. It can sometimes help to split out the common path from a large method, so that it can be inlined into its callers.
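As a sketch of the "split out the common path" advice, assuming a hypothetical cache class (SplitHotPath, describe and describeSlow are made-up names): the frequently called method contains only the lookup and a null check, while the rarely executed work lives in a separate method. The values in effect on a given JVM can be listed with java -XX:+PrintFlagsFinal -version, and actual inlining decisions can be inspected with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache used only to illustrate the fast-path/slow-path split.
class SplitHotPath {
    private final Map<Integer, String> cache = new HashMap<>();

    // Small wrapper: only a lookup and a null check, so its bytecode stays
    // small and it is a better inlining candidate than one large method.
    String describe(int key) {
        String hit = cache.get(key);
        return hit != null ? hit : describeSlow(key);
    }

    // Rarely executed bulk of the work, kept out of the hot method.
    private String describeSlow(int key) {
        StringBuilder sb = new StringBuilder("value-");
        sb.append(key);
        String result = sb.toString();
        cache.put(key, result);
        return result;
    }
}
```

Whether describe ends up below the size limits is something to verify with the flags above rather than assume.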

One important thing to know about profiling is that profiling sites are based on the JVM bytecode, which itself is not inlined in any way. So if you have e.g. a static method

// Function here is a SAM interface with a single call method, not java.util.function.Function.
static <T,U> List<U> map(List<T> list, Function<T,U> func) {
    List<U> result = new ArrayList<>();
    for(T item : list) { result.add(func.call(item)); }
    return result;
}

that maps a SAM Function callable over a list and returns the transformed list, Hotspot will treat the call to func.call as a single program-wide call site. You might call this map function at several spots in your program, passing a different func in at each call site (but the same one for one call site). In that case you might expect that Hotspot is able to inline map, and then also the call to func.call since at every use of map there is only a single func type. If this were so, Hotspot would be able to optimize the loop down very tightly. Unfortunately Hotspot is not smart enough for that. It only keeps a single profile for the func.call call site, lumping all the func types that you pass to map together. You will probably use more than two different implementations of func, so Hotspot will not be able to inline the call to func.call. Link for more details, and archived link as the original appears to be gone.
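To make the situation concrete, here is a self-contained sketch, assuming a hand-written Function interface with a call method (the nested interface and the class name MapProfile are illustrative, not part of any library). Each use of map passes exactly one lambda, yet all of them flow through the one func.call site inside map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MapProfile {
    // Hypothetical SAM interface matching the func.call usage in the text.
    interface Function<T, U> { U call(T arg); }

    static <T, U> List<U> map(List<T> list, Function<T, U> func) {
        List<U> result = new ArrayList<>();
        // The single func.call call site in the bytecode of map().
        for (T item : list) { result.add(func.call(item)); }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3);
        // Three source-level uses of map, each with its own lambda class...
        List<Integer> doubled = map(input, x -> x * 2);
        List<Integer> squared = map(input, x -> x * x);
        List<String> labels = map(input, x -> "#" + x);
        // ...but all three receiver types are recorded against the same
        // func.call site, as described above.
        System.out.println(doubled + " " + squared + " " + labels);
    }
}
```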

(As an aside, in Kotlin the equivalent loop can be fully inlined as the Kotlin compiler can do inlining of calls at the bytecode level. So for some uses it could be significantly faster than Java.)

Another important thing to know is that Hotspot does not actually implement stack allocation of objects. Instead it implements scalar replacement, which means that an object is deconstructed into its constituent fields and those fields are stack allocated like normal local variables. This means that there is no object left at all. Scalar replacement only works if there is never a need to create a pointer to the stack-allocated object. Some forms of stack allocation in e.g. C++ or Go would be able to allocate full objects on the stack and then pass references or pointers to them to called functions, but in Hotspot this does not work. Therefore if there is ever a need to pass an object reference to a non-inlined method, even if the reference would not escape the called method, Hotspot will always heap-allocate such an object.
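Conceptually, the transformation looks like the following sketch. The Point class and both method names are hypothetical, and the second method is only a hand-written approximation of what the compiled code amounts to, not something Hotspot emits as Java.

```java
class ScalarReplacementSketch {
    // Hypothetical value class.
    static final class Point {
        final long x, y;
        Point(long x, long y) { this.x = x; this.y = y; }
        long lengthSquared() { return x * x + y * y; }
    }

    // As written: allocates a Point that never leaves the method.
    static long withObject(long a, long b) {
        Point p = new Point(a, b);
        return p.lengthSquared();
    }

    // Roughly what remains after inlining and scalar replacement:
    // the fields have become plain locals and no object exists.
    static long scalarized(long a, long b) {
        long px = a;
        long py = b;
        return px * px + py * py;
    }
}
```

Because no object exists in the second form, there is nothing whose address could be handed to a non-inlined method, which is why such a call forces the heap allocation.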

In principle Hotspot could be smarter about this, but right now it is not.

I used the following program and variations to see when Hotspot will do scalar replacement.

// Minimal example for which the JVM does not scalarize the allocation. If field is final, or the second allocation is unconditional, it will.

class Scalarization {

        int field = 0xbd;
        long foo(long i) { return i * field; }


        public static void main(String[] args) {
                long result = 0;
                for(long i=0; i<100; i++) {
                        result += test();
                }
                System.out.println("Result: "+result);
        }


        static long test() {
                long ctr = 0x5;
                for(long i=0; i<0x10000; i++) {
                        Scalarization s = new Scalarization();
                        ctr = s.foo(ctr);
                        if(i == 0) s = new Scalarization();
                        ctr = s.foo(ctr);
                }
                return ctr;
        }
}

If you compile and run this program with javac Scalarization.java; java -verbose:gc Scalarization you can tell whether scalar replacement worked from the number of garbage collections. When scalar replacement works, no garbage collection happens on my system; when it does not, I see a few garbage collections.
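As an alternative to counting garbage collections, the bytes allocated by the current thread can be sampled around the loop. This is only a sketch: the class name AllocationProbe is made up, it uses a simplified variant of the test loop, and it assumes a Hotspot-based JDK where com.sun.management.ThreadMXBean is available (getThreadAllocatedBytes returns -1 where the measurement is unsupported). Running it once normally and once with -XX:-DoEscapeAnalysis should make the difference visible.

```java
import java.lang.management.ManagementFactory;

class AllocationProbe {
    int field = 0xbd;
    long foo(long i) { return i * field; }

    // Simplified variant of the test loop: one unconditional allocation.
    static long test() {
        long ctr = 0x5;
        for (long i = 0; i < 0x10000; i++) {
            AllocationProbe s = new AllocationProbe();
            ctr = s.foo(ctr);
        }
        return ctr;
    }

    // Bytes allocated so far by the current thread (Hotspot-specific API).
    static long allocatedBytes() {
        com.sun.management.ThreadMXBean bean =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        return bean.getThreadAllocatedBytes(Thread.currentThread().getId());
    }

    public static void main(String[] args) {
        long result = 0;
        for (int round = 0; round < 10; round++) {
            long before = allocatedBytes();
            for (int i = 0; i < 100; i++) result += test();
            long after = allocatedBytes();
            // Printing happens outside the measured window.
            System.out.println("round " + round + ": " + (after - before) + " bytes");
        }
        System.out.println("Result: " + result);
    }
}
```

Early rounds run in the interpreter or lower compilation tiers and are expected to allocate; what matters is whether the per-round figure drops sharply in later rounds.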

Variants that Hotspot is able to scalarize run significantly faster than versions where it does not. I verified the generated machine code (instructions) to make sure Hotspot was not doing any unexpected optimizations. If Hotspot is able to scalar replace the allocations, it can then also do some additional optimizations on the loop, unrolling it a few iterations and then combining those iterations together. So in the scalarized versions the effective loop count is lower, with each iteration doing the work of multiple source code level iterations. The speed difference is therefore not only due to allocation and garbage collection overhead.

I tried a number of variations on the above program. One condition for scalar replacement is that the object must never be assigned to an object (or static) field, and presumably also not into an array. So in code like

Foo f = new Foo();
bar.field = f;

the Foo object cannot be scalar replaced. This holds even if bar itself is scalar replaced, and also if you never again use bar.field. So an object can only ever be assigned to local variables.

That alone is not enough, Hotspot must also be able to determine statically at JIT-time which object instance will be the target of a call. For example, using the following implementations of foo and test and removing field causes heap allocation:

long foo(long i) { return i * 0xbb; }

static long test() {
    long ctr = 0x5;
    for(long i=0; i<0x10000; i++) {
        Scalarization s = new Scalarization();
        ctr = s.foo(ctr);
        if(i == 50) s = new Scalarization();
        ctr = s.foo(ctr);
    }
    return ctr;
}

While if you then remove the conditional for the second assignment no more heap allocation occurs:

static long test() {
    long ctr = 0x5;
    for(long i=0; i<0x10000; i++) {
        Scalarization s = new Scalarization();
        ctr = s.foo(ctr);
        s = new Scalarization();
        ctr = s.foo(ctr);
    }
    return ctr;
}

In this case Hotspot can determine statically which instance is the target for each call to s.foo.

On the other hand, even if the second assignment to s is a subclass of Scalarization with a completely different implementation, as long as the assignment is unconditional Hotspot will still scalarize the allocations.

Hotspot does not appear to be able to move an object to the heap that was previously scalar replaced (at least not without deoptimizing). Scalar replacement is an all-or-nothing affair. So in the original test method both allocations of Scalarization always happen on the heap.

One important detail is that Hotspot will predict conditionals based on its profiling data. If a conditional assignment is never executed, Hotspot will compile code under that assumption, and then might be able to do scalar replacement. If at a later point in time the condition does get taken, Hotspot will need to recompile the code with this new assumption. The new code will not do scalar replacement since Hotspot can no longer determine the receiver instance of following calls statically.

For instance in this variant of test:

static long limit = 0;

static long test() {
    long ctr = 0x5;
    long i = limit;
    limit += 0x10000;
    for(; i<limit; i++) { // In this form if scalarization happens is nondeterministic: if the condition is hit before profiling starts scalarization happens, else not.

        Scalarization s = new Scalarization();
        ctr = s.foo(ctr);
        if(i == 0xf9a0) s = new Scalarization();
        ctr = s.foo(ctr);
    }
    return ctr;
}

the conditional assignment is only executed once during the lifetime of the program. If this assignment occurs early enough, before Hotspot starts full profiling of the test method, Hotspot never notices the conditional being taken and compiles code that does scalar replacement. If profiling has already started when the conditional is taken, Hotspot will not do scalar replacement. With the test value of 0xf9a0, whether scalar replacement happens is nondeterministic on my computer, since exactly when profiling starts can vary (e.g. because profiling and optimized code is compiled on background threads). So if I run the above variant it sometimes does a few garbage collections, and sometimes does not.

Hotspot's static code analysis is much more limited than what C/C++ and other static compilers can do, so Hotspot is not as smart in following the control flow in a method through several conditionals and other control structures to determine the instance that a variable refers to, even if it would be statically determinable for the programmer or a smarter compiler. In many cases the profiling information will make up for that, but it is something to be aware of.

Arrays can be stack allocated if their size is known at JIT time. However indexing into an array is not supported unless Hotspot can also statically determine the index value at JIT-time. So stack allocated arrays are pretty useless. Since most programs don't use arrays directly but use the standard collections this is not very relevant, as embedded objects such as the array containing the data within an ArrayList already need to be heap-allocated due to their embedded-ness. I suppose the reasoning for this restriction is that there exists no indexing operation on local variables so this would require additional code generation functionality for a pretty rare use case.
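The two cases can be sketched as follows. The class and method names are hypothetical, and the comments restate the rule above as observed on the author's JVM rather than a guarantee for other versions.

```java
class ArrayIndexing {
    // Fixed length and constant indices: per the rule above, a candidate
    // for scalar replacement once the method is compiled.
    static long constantIndex(long a, long b) {
        long[] pair = new long[2];
        pair[0] = a;
        pair[1] = b;
        return pair[0] + pair[1];
    }

    // Same array, but the index is only known at run time, which per the
    // rule above means the array has to be a real heap object.
    static long variableIndex(long a, long b, int which) {
        long[] pair = new long[2];
        pair[0] = a;
        pair[1] = b;
        return pair[which];
    }
}
```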
