Java garbage collector G1GC takes very long on "Object Copy" (Evacuation Pause)

2022-01-16 00:00:00 garbage-collection java


I'm not a Java newbie, but I know only a little about garbage collection. Now I would like to change that with some practical experience. My goal is a latency under 0.3 seconds, or 0.5 seconds in extreme cases.

I have an application running with -Xmx50g (-Xms50g) and the following additional GC options:

-XX:+UseG1GC -Xloggc:somewhere.gc.log -XX:+PrintGCDateStamps

But now I'm occasionally having long pauses over 5sec due to garbage collections and that although there seems to be enough memory available. One reason I found:

[GC pause (G1 Evacuation Pause) (young) 42G->40G(48G), 5.9409662 secs]

Why is the G1 GC still doing a "stop the world" pause here? (Or at least I can see that it stops my application at exactly this time.) And why does it do such an expensive cleanup when it doesn't seem necessary, given that over 12% of the RAM is still free? Also, I thought the default value for -XX:MaxGCPauseMillis is 200 milliseconds, so why is this value exceeded by a factor of 29 or even 50 (see below)?

Another reason for a delay was:

[GC pause (Metadata GC Threshold) (young) (initial-mark) 40G->39G(48G), 10.4667233 secs]

This can probably be solved as described in this answer, e.g. by just increasing the metaspace via -XX:MetaspaceSize=100M
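A minimal launch-line sketch showing where that flag would go, assuming the heap settings from above (`app.jar` is a hypothetical placeholder, not from the original post):

```shell
# -XX:MetaspaceSize sets the initial high-water mark at which the first
# "Metadata GC Threshold" collection is triggered; raising it pushes that
# early collection out of the startup phase. app.jar is a placeholder.
java -Xms50g -Xmx50g \
     -XX:+UseG1GC \
     -XX:MetaspaceSize=100M \
     -Xloggc:somewhere.gc.log -XX:+PrintGCDateStamps \
     -jar app.jar
```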

BTW: using Java SE 1.8.0_91-b14

Update: detailed GC log of such an event

2016-08-12T09:20:31.589+0200: 1178.312: [GC pause (G1 Evacuation Pause) (young) 1178.312: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 3159, predicted base time: 1.52 ms, remaining time: 198.48 ms, target pause time: 200.00 ms]
 1178.312: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 136 regions, survivors: 20 regions, predicted young region time: 1924.75 ms]
 1178.312: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 136 regions, survivors: 20 regions, old: 0 regions, predicted pause time: 1926.27 ms, target pause time: 200.00 ms]
 1185.330: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: recent GC overhead higher than threshold after GC, recent GC overhead: 21.83 %, threshold: 10.00 %, uncommitted: 0 bytes, calculated expansion amount: 0 bytes (20.00 %)]
 1185.330: [G1Ergonomics (Concurrent Cycles) do not request concurrent cycle initiation, reason: still doing mixed collections, occupancy: 42580574208 bytes, allocation request: 0 bytes, threshold: 23592960000 bytes (45.00 %), source: end of GC]
 1185.330: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 1 regions, reclaimable: 3381416 bytes (0.01 %), threshold: 5.00 %]
, 7.0181903 secs]
   [Parallel Time: 6991.8 ms, GC Workers: 10]
      [GC Worker Start (ms): Min: 1178312.6, Avg: 1178312.8, Max: 1178312.9, Diff: 0.2]
      [Ext Root Scanning (ms): Min: 1.1, Avg: 1.5, Max: 2.3, Diff: 1.2, Sum: 15.0]
      [Update RS (ms): Min: 0.0, Avg: 0.3, Max: 1.3, Diff: 1.3, Sum: 3.4]
         [Processed Buffers: Min: 0, Avg: 2.1, Max: 5, Diff: 5, Sum: 21]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.4]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.2, Max: 0.4, Diff: 0.4, Sum: 1.7]
      [Object Copy (ms): Min: 6964.1, Avg: 6973.0, Max: 6989.5, Diff: 25.3, Sum: 69730.4]
      [Termination (ms): Min: 0.0, Avg: 16.4, Max: 25.3, Diff: 25.3, Sum: 164.4]
         [Termination Attempts: Min: 1, Avg: 3.2, Max: 13, Diff: 12, Sum: 32]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.2]
      [GC Worker Total (ms): Min: 6991.5, Avg: 6991.6, Max: 6991.7, Diff: 0.2, Sum: 69915.5]
      [GC Worker End (ms): Min: 1185304.3, Avg: 1185304.3, Max: 1185304.3, Diff: 0.0]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.3 ms]
   [Other: 26.0 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 25.3 ms]
      [Ref Enq: 0.1 ms]
      [Redirty Cards: 0.1 ms]
      [Humongous Register: 0.2 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.2 ms]
   [Eden: 2176.0M(2176.0M)->0.0B(2176.0M) Survivors: 320.0M->320.0M Heap: 40.6G(48.8G)->40.0G(48.8G)]
 [Times: user=0.55 sys=46.58, real=7.02 secs] 

Read here about it: Copying (Stop the World Event) - These are the stop the world pauses to evacuate or copy live objects to new unused regions. This can be done with young generation regions which are logged as [GC pause (young)]. Or both young and old generation regions which are logged as [GC Pause (mixed)].

Solution

Why is the G1 GC still doing a "stop the world" pause for this?

Because G1 is not a pauseless collector, it is just a low-pause collector.

Also I thought that the default value for -XX:MaxGCPauseMillis is 200 milliseconds, why is this value violated by a factor of 29 or even 50 (see below)?

It is, but it's just a goal, not a guarantee. Many things can cause it to fail to meet that goal. You have a fairly large heap, which makes things more difficult, i.e. failures are easier to provoke. In fact, the log above shows G1 itself predicting a 1926 ms pause against the 200 ms target before the collection even started.

Anyway, the GC tuning journey starts with enabling verbose GC logging via

-Xloggc:<path to gc log file>
-XX:+PrintAdaptiveSizePolicy
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintGCDetails

Update: these options are for HotSpot on JDK 8; JDK 9 and later use unified logging, which has a different parameter format.
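For reference, a rough JDK 9+ unified-logging equivalent of the flags above (the tag-to-flag mapping is an approximation, and `app.jar` is a placeholder):

```shell
# gc*             ≈ -XX:+PrintGCDetails
# gc+ergo*=trace  ≈ -XX:+PrintAdaptiveSizePolicy
# time,uptime decorators ≈ -XX:+PrintGCDateStamps / -XX:+PrintGCTimeStamps
java '-Xlog:gc*,gc+ergo*=trace:file=gc.log:time,uptime' -jar app.jar
```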

and then running the resulting log through GCViewer to get a general overview, then going back to reading individual log entries (there are many answers/blog posts on this topic) to figure out what might be causing the worst behavior. Depending on the cause, various remedies can be tried.
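Before (or alongside) a GUI tool, the worst offenders can be pulled straight out of a classic-format log with a one-liner. A sketch over two sample lines from the question (the regex assumes the pre-JDK-9 `, N.NNNNNNN secs]` suffix at the end of each event):

```shell
# Two sample lines from the question's log (classic pre-JDK 9 format):
cat > somewhere.gc.log <<'EOF'
[GC pause (G1 Evacuation Pause) (young) 42G->40G(48G), 5.9409662 secs]
[GC pause (Metadata GC Threshold) (young) (initial-mark) 40G->39G(48G), 10.4667233 secs]
EOF

# Pull out each event's total pause time and flag anything above the
# 200 ms MaxGCPauseMillis goal.
grep -oE ', [0-9]+\.[0-9]+ secs\]' somewhere.gc.log \
  | awk '{ if ($2 > 0.2) print "pause exceeding goal:", $2, "s" }'
```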

Some general understanding of how tracing garbage collectors work in general, and how G1 works in particular, will be necessary to avoid cargo-culting.

My application has many allocations which could be easily called "humongous allocations".

If that actually is the cause then current VMs have some experimental options to reclaim them sooner.
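For instance (a sketch under stated assumptions, not a verified fix): "humongous" means larger than half a G1 region, so raising the region size reclassifies borderline allocations as ordinary ones, and later 8u releases can eagerly reclaim humongous regions at young collections (`app.jar` is a placeholder):

```shell
# G1 regions are a power of two between 1 MB and 32 MB; objects larger
# than half a region are "humongous". With 32 MB regions, only objects
# above 16 MB get the humongous treatment.
java -Xms50g -Xmx50g -XX:+UseG1GC \
     -XX:G1HeapRegionSize=32m \
     -XX:+UnlockExperimentalVMOptions \
     -XX:+G1EagerReclaimHumongousObjects \
     -jar app.jar   # app.jar is a placeholder
```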

 [Object Copy (ms): Min: 6964.1, Avg: 6973.0, Max: 6989.5, Diff: 25.3, Sum: 69730.4]
 [Times: user=0.55 sys=46.58, real=7.02 secs] 

With user=0.55 against sys=46.58, the GC workers spent almost all of their time in the kernel while doing something that should consist mostly of plain memory accesses, not system calls. So swap activity or transparent huge pages are the likely suspects.
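On Linux, both suspects can be checked directly (these are standard procfs/sysfs paths; the sysfs file may be absent on kernels built without THP):

```shell
# Transparent huge pages: the value in brackets is the active mode.
# "[always]" is a frequent cause of long compaction-induced stalls with
# multi-gigabyte heaps; "madvise" or "never" is usually safer for the JVM.
cat /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null

# Swap: if SwapFree shrinks while the JVM runs, part of the heap is being
# paged out, and object copying turns into disk I/O inside the kernel.
grep -E 'Swap(Total|Free)' /proc/meminfo
```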
