We have spoken, on and off, about switching OpenDaylight's JVM garbage collection (GC) from the Java 8 OpenJDK default ParallelGC to the newer G1GC ("-XX:+UseG1GC"), which is the default in Java 9+. Filing this new issue now to actually look into it: we should run scale tests with -XX:+UseG1GC to see what difference it makes, and then add it to our default start-up options (or close this issue with conclusions on why we did not). Bug 1577975 probably has to be resolved before we attempt this (UNLESS this change actually HELPS with Bug 1577975; my hunch, as of right now, is that that one is due to another, bigger issue). It could also be interesting to try G1's string deduplication via -XX:+UseStringDeduplication.
Here is a JFR from a scale run with the default garbage collector: http://file.rdu.redhat.com/~smalleni/jfr-1.tar.gz (controller-1 was the leader). I will update the BZ with a JFR from a scale run using -XX:+UseG1GC to compare.
And -XX:-UseAdaptiveSizePolicy is a (third...) new option also worth discussing here.
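To make the options under discussion concrete, here is a sketch of how they could be wired into OpenDaylight's Karaf JVM. The file path and the EXTRA_JAVA_OPTS variable are assumptions based on typical Karaf packaging, not verified against this deployment:

```shell
# Sketch only: the exact file varies by packaging
# (e.g. /opt/opendaylight/bin/setenv in a typical Karaf layout).
#
# -XX:+UseG1GC                 switch from the Java 8 default ParallelGC to G1
# -XX:+UseStringDeduplication  G1-only string deduplication (JDK 8u20+)
# -XX:-UseAdaptiveSizePolicy   disable adaptive generation sizing
export EXTRA_JAVA_OPTS="-XX:+UseG1GC -XX:+UseStringDeduplication -XX:-UseAdaptiveSizePolicy"
```

Note that -XX:+UseStringDeduplication only takes effect together with -XX:+UseG1GC, which is why the two are proposed as a pair here.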
Here is a JFR using G1GC: http://rdu-storage01.scalelab.redhat.com/sai/jfr-g1gc.tar.gz The interesting thing is that, when using G1GC and running scale tests, we observed a maximum CPU usage of only about 6 cores.
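For reference, a recording like the ones linked above can be captured from a running JVM with jcmd. This is a command sketch with a placeholder PID, and it assumes Flight Recorder is available in the JDK in use; on Java 8 builds where JFR is a commercial feature, the JVM must additionally be started with -XX:+UnlockCommercialFeatures -XX:+FlightRecorder (how these particular JFRs were produced is not stated in this bug):

```shell
# Command sketch; the pgrep pattern is an assumption about the ODL process name.
ODL_PID=$(pgrep -f karaf | head -n1)

# Start a named recording with the more detailed "profile" settings preset.
jcmd "$ODL_PID" JFR.start name=scale settings=profile

# ... run the scale test ...

# Dump and stop the recording.
jcmd "$ODL_PID" JFR.dump name=scale filename=/tmp/controller-scale.jfr
jcmd "$ODL_PID" JFR.stop name=scale
```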
Currently targeting OSP 15 to track this; if this proves useful, we can clone this bug for earlier versions.
I have done a re-run and can confirm that I have not seen any crazy CPU usage. The maximum we saw was around 9-10 cores.

Controller-2: https://snapshot.raintank.io/dashboard/snapshot/wyE4EkWQtNIdzaQvBzFS1nJ5X7yOsrWn
Controller-1: https://snapshot.raintank.io/dashboard/snapshot/HQONTtt3kMhpJoE00FkxrP49xpKsBUof
Controller-0: https://snapshot.raintank.io/dashboard/snapshot/VwUqSfW69EcJNST21HrJFR1wxJt2qbl5

At this point I am convinced that we need to move to G1GC ASAP, provided we don't see long GC pauses in the JFR. That should help to solve some of what we are seeing in https://bugzilla.redhat.com/show_bug.cgi?id=1577975

Link to the JFR from the re-run with G1GC: http://rdu-storage01.scalelab.redhat.com/sai/jfr-g1gc.tar.gz The JFR in this link is smaller. Please take a look, Michael.
In the above link, controller-1 was the leader.
> http://rdu-storage01.scalelab.redhat.com/sai/jfr-g1gc.tar.gz
> The JFR in this link is smaller. Please take a look Michael.

* controller-0-scale.jfr is 24.7 GB
* controller-1-scale.jfr is 4.8 GB
* controller-2-scale.jfr is 4.6 GB

I haven't been able to analyze the big controller-0 one yet (still need to set up a HUGE Dev VM...), and only half of the controller-1 one, but in that half I see a longest GC pause of only 41 ms, which is a huge difference from the 1.9 s in Bug 1577975. Before concluding this issue, I'd still like to have a look at all of them, though.

> In the above link controller-1 was leader.

Could I just double-check on this - are you sure that, despite controller-0 being 5x the size, it was controller-1 where the action was? That seems curious. (FYI, we're really only interested in the leader here, as the followers aren't "doing" much of real interest.)

> I have done a re-run and can confirm that I have not seen any crazy CPU usage.
> The maximum we saw was around 9-10 cores.

Is there an easy way to re-run flamegraphs like you did earlier and confirm that the CPU usage we do see now is no longer mostly GC, but "just" actual Java code running? Because then we could move on to [separately] actually profiling and optimizing that... ;-)
Controller-1 was the leader in this set of JFRs, http://rdu-storage01.scalelab.redhat.com/sai/jfr-3-g1gc.tar.gz, obtained on re-running. Please use this link, as the JFR for the leader is also much smaller. Controller-0 was the leader in the other run (http://rdu-storage01.scalelab.redhat.com/sai/jfr-g1gc.tar.gz).

TL;DR: please use http://rdu-storage01.scalelab.redhat.com/sai/jfr-3-g1gc.tar.gz, in which controller-1 was the leader. I can profile the JVM, but IMHO we should not block this bug on that.
> able to analyze the big controller-0 yet (still need to set up a HUGE Dev VM

After some... "hoops" (the RDO Cloud max. flavour only gives 16 GB RAM, which is not sufficient to analyse the 8.6 GB controller-1.jfr from jfr-3-g1gc.tar.gz, so I've had to create a 50 GB VM "elsewhere" in order to use -Xmx36g in jmc.ini), I've finally managed to dig into this JFR - it's actually very useful & interesting! We should do this more often, and I may open other linked bugs later (incl. for actual profiling), but let's focus only on the GC here in this issue:

> see a longest GC pause of only 41ms, which is huge diff from the 1.9s in

So the longest pause is actually 285 ms - which is still a clear improvement over 1.9 s.

> I have done a re-run and can confirm that I have not seen any crazy CPU usage.

Based on the GC data in the JFR, and Sai's confirmation, it seems to me that we can conclude that switching ODL's GC policy to G1 is a Good Idea.

> Bug 1577975 probably has to be resolved before we attempt this
> (UNLESS this change actually HELPS with Bug 1577975; but my hunch, as of
> right now, is that is due to another bigger issue).

So given that using G1 helps with Bug 1577975 (high CPU usage), let us flip this around - resolving this bug here blocks 1577975 (and not the other way around, as initially planned). IMHO we should in parallel continue efforts to reduce object allocation, as per the issues linked on Bug 1577975.
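As a lighter-weight cross-check when the JFR files are too large to open in JMC, the longest pause can also be pulled straight out of a GC log, assuming the JVM also runs with -XX:+PrintGCDetails -XX:+PrintGCTimeStamps. The log lines below are typical G1 samples written for illustration, not taken from this run:

```shell
# Find the longest GC pause (in seconds) in a -XX:+PrintGCDetails style log.
# Sample G1 log lines stand in for a real gc.log from a scale run.
cat > /tmp/gc-sample.log <<'EOF'
12.345: [GC pause (G1 Evacuation Pause) (young) 512M->128M(2048M), 0.0410000 secs]
98.765: [GC pause (G1 Evacuation Pause) (mixed) 900M->300M(2048M), 0.2850000 secs]
150.000: [GC pause (G1 Evacuation Pause) (young) 600M->200M(2048M), 0.0330000 secs]
EOF

# Extract each "<n> secs" pause time and keep the maximum.
grep -o '[0-9]*\.[0-9]* secs' /tmp/gc-sample.log \
  | awk '{ if ($1 > max) max = $1 } END { printf "longest pause: %s secs\n", max }'
# -> longest pause: 0.2850000 secs
```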
> Currently targeting to OSP 15 to track this,
> if this would prove useful we can clone this bug for earlier versions.

IMHO this should go in ASAP, for the stream currently being tested, not future OSP 15. Will you change the target?
(In reply to Michael Vorburger from comment #12)
> > Currently targeting to OSP 15 to track this,
> > if this would prove useful we can clone this bug for earlier versions.
>
> IMHO this should go in ASAP, for the stream being currently tested, not
> future OSP15. Will you change the target?

Thanks for your input. Since this is a change that could be potentially disruptive and hasn't undergone proper testing, I'd prefer to see it implemented in OSP 15, when we will have time to run with it; if we see that it's indeed promising and has no downsides, we can always backport it to earlier versions.
> Since this is a change that could be potentially disruptive, and hasn't
> undergone proper testing, I'd prefer to see it implemented in OSP 15 when we
> will have time to run with it, and if we see it's indeed promising and has
> no downsides we can always backport it to earlier versions.

IMHO this should go in ASAP; based on the data available in this issue (best to re-confirm in testing), we can go from a longest GC pause of 1.9 s down to 285 ms.
As per the deprecation notice [1], closing this bug. Please reopen if relevant for RHOSP 13, as this is the only version shipping ODL.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/release_notes/index#deprecated_functionality