We have spoken, on and off, about switching OpenDaylight's JVM garbage collection (GC) from the Java 8 OpenJDK default ParallelGC to the newer G1GC ("-XX:+UseG1GC"), which is the default in Java 9+. Filing this new issue now to actually look into it: we should run scale tests with -XX:+UseG1GC to see what difference it makes, and then add it to our default start-up options (or close this issue with conclusions on why we did not). Bug 1577975 probably has to be resolved before we attempt this (UNLESS this change actually HELPS with Bug 1577975; my hunch, as of right now, is that that one is due to another, bigger issue). It could also be interesting to try G1's string deduplication via -XX:+UseStringDeduplication.
Here is a JFR from a scale run with the default garbage collector: http://file.rdu.redhat.com/~smalleni/jfr-1.tar.gz (controller-1 was the leader). I will update the BZ with a JFR from a scale run using -XX:+UseG1GC to compare.
And -XX:-UseAdaptiveSizePolicy is a (third...) new option also worth discussing here.
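To make the options under discussion concrete, here is a sketch of how they could be wired into OpenDaylight's Karaf JVM. The file path and the EXTRA_JAVA_OPTS variable are assumptions based on typical Karaf packaging, not verified against this deployment:

```shell
# Sketch only: the exact file varies by packaging
# (e.g. /opt/opendaylight/bin/setenv in a typical Karaf layout).
#
# -XX:+UseG1GC                 switch from the Java 8 default ParallelGC to G1
# -XX:+UseStringDeduplication  G1-only string deduplication (JDK 8u20+)
# -XX:-UseAdaptiveSizePolicy   disable adaptive generation sizing
export EXTRA_JAVA_OPTS="-XX:+UseG1GC -XX:+UseStringDeduplication -XX:-UseAdaptiveSizePolicy"
```

Note that -XX:+UseStringDeduplication only takes effect together with -XX:+UseG1GC, which is why the two are proposed as a pair here.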
Here is a JFR using G1GC: http://rdu-storage01.scalelab.redhat.com/sai/jfr-g1gc.tar.gz The interesting thing is that, when using G1GC and running scale tests, we observed a maximum CPU usage of only about 6 cores.
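For reference, a recording like the ones linked above can be captured from a running JVM with jcmd. This is a command sketch with a placeholder PID, and it assumes Flight Recorder is available in the JDK in use; on Java 8 builds where JFR is a commercial feature, the JVM must additionally be started with -XX:+UnlockCommercialFeatures -XX:+FlightRecorder (how these particular JFRs were produced is not stated in this bug):

```shell
# Command sketch; the pgrep pattern is an assumption about the ODL process name.
ODL_PID=$(pgrep -f karaf | head -n1)

# Start a named recording with the more detailed "profile" settings preset.
jcmd "$ODL_PID" JFR.start name=scale settings=profile

# ... run the scale test ...

# Dump and stop the recording.
jcmd "$ODL_PID" JFR.dump name=scale filename=/tmp/controller-scale.jfr
jcmd "$ODL_PID" JFR.stop name=scale
```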
Currently targeting OSP 15 to track this; if this proves useful, we can clone this bug for earlier versions.
I have done a re-run and can confirm that I have not seen any crazy CPU usage. The maximum we saw was around 9-10 cores.

Controller-2: https://snapshot.raintank.io/dashboard/snapshot/wyE4EkWQtNIdzaQvBzFS1nJ5X7yOsrWn
Controller-1: https://snapshot.raintank.io/dashboard/snapshot/HQONTtt3kMhpJoE00FkxrP49xpKsBUof
Controller-0: https://snapshot.raintank.io/dashboard/snapshot/VwUqSfW69EcJNST21HrJFR1wxJt2qbl5

At this point I am convinced that we need to move to G1GC ASAP, provided we don't see long GC pauses in the JFR. That should help to solve some of what we are seeing in https://bugzilla.redhat.com/show_bug.cgi?id=1577975

Link to the JFR from the re-run with G1GC: http://rdu-storage01.scalelab.redhat.com/sai/jfr-g1gc.tar.gz The JFR in this link is smaller. Please take a look, Michael.
In the above link, controller-1 was the leader.
> http://rdu-storage01.scalelab.redhat.com/sai/jfr-g1gc.tar.gz
> The JFR in this link is smaller. Please take a look Michael.

* controller-0-scale.jfr is 24.7 GB
* controller-1-scale.jfr is 4.8 GB
* controller-2-scale.jfr is 4.6 GB

I haven't been able to analyze the big controller-0 one yet (still need to set up a HUGE Dev VM...), and only half of the controller-1 one, but in that half I see a longest GC pause of only 41 ms, which is a huge difference from the 1.9 s in Bug 1577975. Before concluding this issue, I'd still like to have a look at all of them, though.

> In the above link controller-1 was leader.

Could I just double-check on this - are you sure that, despite controller-0 being 5x the size, it was controller-1 where the action was? That seems curious. (FYI, we're really only interested in the leader here, as the followers aren't "doing" much of real interest.)

> I have done a re-run and can confirm that I have not seen any crazy CPU usage.
> The maximum we saw was around 9-10 cores.

Is there an easy way to re-run flamegraphs like you did earlier and confirm that the CPU usage we do see now is no longer mostly GC, but "just" actual Java code running? Because then we could move on to [separately] actually profiling and optimizing that... ;-)
Controller-1 was the leader in this set of JFRs, http://rdu-storage01.scalelab.redhat.com/sai/jfr-3-g1gc.tar.gz, obtained on re-running. Please use this link, as the JFR for the leader is also much smaller. Controller-0 was the leader in the other run (http://rdu-storage01.scalelab.redhat.com/sai/jfr-g1gc.tar.gz).

TL;DR: please use http://rdu-storage01.scalelab.redhat.com/sai/jfr-3-g1gc.tar.gz, in which controller-1 was the leader. I can profile the JVM, but IMHO we should not block this bug on that.
> able to analyze the big controller-0 yet (still need to set up a HUGE Dev VM

After some... "hoops" (the RDO Cloud max. flavour only gives 16 GB RAM, which is not sufficient to analyse the 8.6 GB controller-1.jfr from jfr-3-g1gc.tar.gz, so I've had to create a 50 GB VM "elsewhere" in order to use -Xmx36g in jmc.ini), I've finally managed to dig into this JFR - it's actually very useful & interesting! We should do this more often, and I may open other linked bugs later (incl. for actual profiling), but let's focus only on the GC here in this issue:

> see a longest GC pause of only 41ms, which is huge diff from the 1.9s in

So the longest pause is actually 285 ms - which is still a clear improvement over 1.9 s.

> I have done a re-run and can confirm that I have not seen any crazy CPU usage.

Based on the GC data in the JFR, and Sai's confirmation, it seems to me that we can conclude that switching ODL's GC policy to G1 is a Good Idea.

> Bug 1577975 probably has to be resolved before we attempt this
> (UNLESS this change actually HELPS with Bug 1577975; but my hunch, as of
> right now, is that is due to another bigger issue).

So given that using G1 helps with Bug 1577975 (high CPU usage), let us flip this around - resolving this bug here blocks 1577975 (and not the other way around, as initially planned). IMHO we should in parallel continue efforts to reduce object allocation, as per the issues linked on Bug 1577975.
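As a lighter-weight cross-check when the JFR files are too large to open in JMC, the longest pause can also be pulled straight out of a GC log, assuming the JVM also runs with -XX:+PrintGCDetails -XX:+PrintGCTimeStamps. The log lines below are typical G1 samples written for illustration, not taken from this run:

```shell
# Find the longest GC pause (in seconds) in a -XX:+PrintGCDetails style log.
# Sample G1 log lines stand in for a real gc.log from a scale run.
cat > /tmp/gc-sample.log <<'EOF'
12.345: [GC pause (G1 Evacuation Pause) (young) 512M->128M(2048M), 0.0410000 secs]
98.765: [GC pause (G1 Evacuation Pause) (mixed) 900M->300M(2048M), 0.2850000 secs]
150.000: [GC pause (G1 Evacuation Pause) (young) 600M->200M(2048M), 0.0330000 secs]
EOF

# Extract each "<n> secs" pause time and keep the maximum.
grep -o '[0-9]*\.[0-9]* secs' /tmp/gc-sample.log \
  | awk '{ if ($1 > max) max = $1 } END { printf "longest pause: %s secs\n", max }'
# -> longest pause: 0.2850000 secs
```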
> Currently targeting to OSP 15 to track this,
> if this would prove useful we can clone this bug for earlier versions.

IMHO this should go in ASAP, for the stream currently being tested, not future OSP 15. Will you change the target?
(In reply to Michael Vorburger from comment #12)
> > Currently targeting to OSP 15 to track this,
> > if this would prove useful we can clone this bug for earlier versions.
>
> IMHO this should go in ASAP, for the stream being currently tested, not
> future OSP15. Will you change the target?

Thanks for your input. Since this is a change that could be potentially disruptive and hasn't undergone proper testing, I'd prefer to see it implemented in OSP 15, when we will have time to run with it; if we see that it's indeed promising and has no downsides, we can always backport it to earlier versions.
> Since this is a change that could be potentially disruptive, and hasn't
> undergone proper testing, I'd prefer to see it implemented in OSP 15 when we
> will have time to run with it, and if we see it's indeed promising and has
> no downsides we can always backport it to earlier versions.

IMHO this should go in ASAP; based on the data available in this issue (best to re-confirm in testing), we can go from a longest GC pause of 1.9 s down to 285 ms.
As per the deprecation notice [1], closing this bug. Please reopen if relevant for RHOSP 13, as this is the only version shipping ODL.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/release_notes/index#deprecated_functionality