GC停顿问题分析

1. 由于promotion failed引起应用响应时间上升,超时

83

首先,从日志上来看,在Allocation failure之后,JVM同时进行了CMS,这似乎跟对CMS的理解不一致。

回归一下CMS的垃圾回收过程,大概分下面几个阶段:

  • initial mark
  • concurrent marking
  • remark
  • concurrent sweep
  • resetting

其中,concurrent的是跟应用进程同步进行的,也即stop-the-world的是initial mark与remark过程,来看一个正常一点的CMS流程,如:

2016-06-27T15:11:56.581+0800: 339380.151: [GC (Allocation Failure) 339380.152: [ParNew: 1247814K->18323K(1380160K), 0.0590120 secs] 7550641K->6325168K(8235264K), 0.0594820 secs] [Times: user=0.31 sys=0.01, real=0.06 secs]
2016-06-27T15:11:56.641+0800: 339380.211: Total time for which application threads were stopped: 0.0629010 seconds
2016-06-27T15:11:56.644+0800: 339380.214: [GC (CMS Initial Mark) [1 CMS-initial-mark: 6306845K(6855104K)] 6325177K(8235264K), 0.0098790 secs] [Times: user=0.05 sys=0.01, real=0.01 secs]
2016-06-27T15:11:56.654+0800: 339380.224: Total time for which application threads were stopped: 0.0130410 seconds
2016-06-27T15:11:56.654+0800: 339380.224: [CMS-concurrent-mark-start]
2016-06-27T15:11:56.766+0800: 339380.336: [CMS-concurrent-mark: 0.112/0.112 secs] [Times: user=0.63 sys=0.02, real=0.11 secs]
2016-06-27T15:11:56.766+0800: 339380.336: [CMS-concurrent-preclean-start]
2016-06-27T15:11:56.811+0800: 339380.381: [CMS-concurrent-preclean: 0.044/0.044 secs] [Times: user=0.05 sys=0.00, real=0.05 secs]
2016-06-27T15:11:56.811+0800: 339380.381: [CMS-concurrent-abortable-preclean-start]
2016-06-27T15:11:59.003+0800: 339382.573: Total time for which application threads were stopped: 0.0021950 seconds
2016-06-27T15:11:59.004+0800: 339382.574: Total time for which application threads were stopped: 0.0010940 seconds
2016-06-27T15:12:01.235+0800: 339384.805: Total time for which application threads were stopped: 0.0022090 seconds
2016-06-27T15:12:01.236+0800: 339384.806: Total time for which application threads were stopped: 0.0011300 seconds
 CMS: abort preclean due to time 2016-06-27T15:12:01.825+0800: 339385.396: [CMS-concurrent-abortable-preclean: 5.002/5.015 secs] [Times: user=6.21 sys=0.27, real=5.01 secs]
2016-06-27T15:12:01.828+0800: 339385.398: [GC (CMS Final Remark) [YG occupancy: 262689 K (1380160 K)]339385.398: [Rescan (parallel) , 0.0284940 secs]339385.427: [weak refs processing, 0.0581820 secs]339385.485: [class unloading, 0.1483130 secs]339385.633: [scrub symbol table, 0.0129170 secs]339385.646: [scrub string table, 0.0023140 secs][1 CMS-remark: 6306845K(6855104K)] 6569534K(8235264K), 0.3404820 secs] [Times: user=0.80 sys=0.01, real=0.34 secs]
2016-06-27T15:12:02.168+0800: 339385.739: Total time for which application threads were stopped: 0.3428850 seconds
2016-06-27T15:12:02.168+0800: 339385.739: [CMS-concurrent-sweep-start]
2016-06-27T15:12:02.185+0800: 339385.755: Total time for which application threads were stopped: 0.0063080 seconds
2016-06-27T15:12:02.196+0800: 339385.766: Total time for which application threads were stopped: 0.0086300 seconds
2016-06-27T15:12:02.206+0800: 339385.776: Total time for which application threads were stopped: 0.0075660 seconds
2016-06-27T15:12:02.213+0800: 339385.783: Total time for which application threads were stopped: 0.0070080 seconds
2016-06-27T15:12:02.216+0800: 339385.786: Total time for which application threads were stopped: 0.0027590 seconds
2016-06-27T15:12:02.218+0800: 339385.788: Total time for which application threads were stopped: 0.0022080 seconds
2016-06-27T15:12:03.224+0800: 339386.794: Total time for which application threads were stopped: 0.0030800 seconds
2016-06-27T15:12:04.228+0800: 339387.798: Total time for which application threads were stopped: 0.0020560 seconds
2016-06-27T15:12:05.232+0800: 339388.802: Total time for which application threads were stopped: 0.0020960 seconds
2016-06-27T15:12:06.236+0800: 339389.806: Total time for which application threads were stopped: 0.0020660 seconds
2016-06-27T15:12:06.472+0800: 339390.042: Total time for which application threads were stopped: 0.0037540 seconds
2016-06-27T15:12:06.533+0800: 339390.103: Total time for which application threads were stopped: 0.0036710 seconds
2016-06-27T15:12:06.535+0800: 339390.105: Total time for which application threads were stopped: 0.0013890 seconds
2016-06-27T15:12:06.537+0800: 339390.108: Total time for which application threads were stopped: 0.0023870 seconds
2016-06-27T15:12:06.728+0800: 339390.298: Total time for which application threads were stopped: 0.0019490 seconds
2016-06-27T15:12:06.729+0800: 339390.300: Total time for which application threads were stopped: 0.0010790 seconds
2016-06-27T15:12:07.734+0800: 339391.304: Total time for which application threads were stopped: 0.0029020 seconds
2016-06-27T15:12:07.780+0800: 339391.350: Total time for which application threads were stopped: 0.0021800 seconds
2016-06-27T15:12:07.783+0800: 339391.353: Total time for which application threads were stopped: 0.0017710 seconds
2016-06-27T15:12:09.789+0800: 339393.359: Total time for which application threads were stopped: 0.0030100 seconds
2016-06-27T15:12:10.012+0800: 339393.582: Total time for which application threads were stopped: 0.0027170 seconds
2016-06-27T15:12:10.014+0800: 339393.585: Total time for which application threads were stopped: 0.0019420 seconds
2016-06-27T15:12:10.652+0800: 339394.223: [CMS-concurrent-sweep: 8.410/8.484 secs] [Times: user=11.32 sys=0.53, real=8.48 secs]
2016-06-27T15:12:10.653+0800: 339394.223: [CMS-concurrent-reset-start]
2016-06-27T15:12:10.667+0800: 339394.238: [CMS-concurrent-reset: 0.015/0.015 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]

可以看到尽管几个阶段加起来耗时不少,但是由于其中耗时部分都是concurrent的,stop-the-world的时间很少,所以对应用并没有什么的影响。

但是,总有但是,在某些情况下,CMS会出现“类似”退化成Serial GC的情况,暂且称为stop-the-world CMS,典型的是:

  • promotion failed场景,YGC后存活的对象需要promote到survivor,但survivor空间不足以存储存活对象,需要到old generation,但是old generation此时也不够空间(不要忽略可能存在的内存碎片)存储存活对象,所以需要进行一次stop-the-world的CMS;
  • concurrent mode failure场景,在CMS进行的过程中,又发生GC(比如还是promotion failed),但内存还是不够,继续触发CMS,有可能触发Full GC;
  • allocation failed->直接触发Full GC(极少,可忽略);

补充一个concurrent mode failure的日志:

2016-06-28T11:38:46.227+0800: 263.823: Total time for which application threads were stopped: 0.0511794 seconds
2016-06-28T11:38:46.227+0800: 263.823: [Full GC (Allocation Failure) 263.824: [CMS2016-06-28T11:38:46.356+0800: 263.953: [CMS-concurrent-mark: 0.147/0.199 secs] [Times: user=0.27 sys=0.00, real=0.20 secs]
 (concurrent mode failure): 3053218K->3053213K(3145728K), 1.0337448 secs] 3715366K->3705741K(4089472K), [Metaspace: 77454K->77454K(1120256K)], 1.0339970 secs] [Times: user=1.04 sys=0.00, real=1.03 secs]
2016-06-28T11:38:47.261+0800: 264.858: [Full GC (Allocation Failure) 264.858: [CMS: 3053213K->183845K(3145728K), 0.4329523 secs] 3705741K->183845K(4089472K), [Metaspace: 77454K->77454K(1120256K)], 0.4336212 secs] [Times: user=0.43 sys=0.00, real=0.44 secs]
2016-06-28T11:38:47.695+0800: 265.291: Total time for which application threads were stopped: 1.4681578 seconds
2016-06-28T11:38:47.731+0800: 265.327: Total time for which application threads were stopped: 0.0102457 seconds
2016-06-28T11:38:48.090+0800: 265.686: Total time for which application threads were stopped: 0.0043965 seconds

怎么避免?

对于promotion failed.  可选的方式有:

  • 调整survivor空间,能优化应用性能,但对于这个场景估计不行,因为现在的应用一般都设置了很大的堆,几个G以上,这个时候old generation一般较大,出现old generation没空间存储的情况,一般survivor也存不下,不太可能将survivor调整太大。
  • 调整CMS触发时间,让old generation预留出足够的空间,便不会因为空间不够在promotion failed以后触发stop-the-word CMS。

对于第二种方式,我们知道JVM有后台线程会定时检查是否满足触发CMS的条件,其中一个重要的便是年老代已使用的比例,不同版本JVM默认大小是不一样的,可通过CMSInitiatingOccupancyFraction设置,对CMSInitiatingOccupancyFraction=-1的情况,根据CMSTriggerRatio,MinHeapFreeRatio计算,这边JDK8的默认值是92%。

鉴于此,可通过CMSInitiatingOccupancyFraction来调整触发CMS的时机(以防万一,如果UseCMSInitiatingOccupancyOnly没开启,记得加上,在UseCMSInitiatingOccupancyOnly为false时,还会有其它的判断条件)。

最后,对于concurrent mode failure,只有第二种方式可选。其实,对于堆大小比较充裕的情况,concurrent mode failure如果出现,一般说明有内存泄漏了。

 

 

 

 

如果未说明,本Blog中文章皆为原创文章,请尊重他人劳动,转载请注明: 转载自jmatrix

本文链接地址: GC停顿问题分析

(注:一般引用了,我都会添加引用,如果有侵权的请联系我)



This entry was posted in JVM and tagged , . Bookmark the permalink. Follow any comments here with the RSS feed for this post. Trackbacks are closed, but you can post a comment.