Bug 1577975

Summary: [Infra][Scale] High CPU Usage of OpenDaylight during Longevity/Scale testing
Product: Red Hat OpenStack Reporter: Sai Sindhur Malleni <smalleni>
Component: opendaylightAssignee: Michael Vorburger <vorburger>
Status: CLOSED WONTFIX QA Contact: Sai Sindhur Malleni <smalleni>
Severity: high Docs Contact:
Priority: medium    
Version: 13.0 (Queens) CC: aadam, anmiller, mkolesni, skitt, smalleni
Target Milestone: z5 Keywords: Triaged, ZStream
Target Release: 13.0 (Queens)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: Infra
Fixed In Version: Doc Type: Known Issue
Doc Text:
OpenDaylight may experience periods of very high CPU usage. This issue should not affect the functionality of OpenDaylight, although it could potentially impact other system services.
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-03-06 16:15:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1492279, 1493558, 1566143, 1585227, 1602187, 1602188, 1607959, 1607965, 1615709, 1649004, 1650576, 1660167    
Bug Blocks:    
Attachments:
Karaf logs on 3 controllers (flags: none)
Flame Graph which Sai sent by email from a situation when there is no problem, just as an example (flags: none)

Description Sai Sindhur Malleni 2018-05-14 14:23:17 UTC
Description of problem: When running longevity tests (creating 2 neutron resources at a time, 40 times, over and over again) and nova boot tests (10 VMs at a time, 200 times), we are seeing OpenDaylight consume too much CPU, on the order of 30 cores. This was not reproduced by normal scale testing using neutron resource creation, but was more specific to the longevity and nova boot tests. This bug is similar to https://bugzilla.redhat.com/show_bug.cgi?id=1492279, which was closed because the issue wasn't easily reproduced on OSP13. However, after a fairly extensive amount of testing, it does seem reproducible often.

After a brief IRC discussion with Michael Vorburger, it was decided to open a new bug to track this issue.

I need to profile the CPU to provide flame graphs like I did in https://bugzilla.redhat.com/show_bug.cgi?id=1492279. 

Version-Release number of selected component (if applicable):
OSP13
opendaylight-8.0.0-10.el7ost.noarch.rpm 

How reproducible:
Often during nova boot and longevity tests

Steps to Reproduce:
1. Deploy OSP13 with ODL
2. Run scale/longevity tests with neutron and nova resources

Actual results:
CPU consumption around 30 cores

Expected results:
Far less CPU consumption

Additional info:
Happened on controller-0 as can be seen here: https://snapshot.raintank.io/dashboard/snapshot/5xXv3VcHtg5mPtieqwJnQ6aIaKqdcR5H

Comment 1 Sai Sindhur Malleni 2018-05-14 14:25:12 UTC
Created attachment 1436215 [details]
Karaf logs on 3 controllers

High CPU usage is seen from 15:56 to 16:26 UTC on 05/13/2018 in logs.

Comment 2 Michael Vorburger 2018-05-15 10:19:30 UTC
Sai, yeah, that 3500% CPU spike on that graph looks worth looking into. Looking forward to getting to the bottom of this with you.

The logs don't immediately point to anything direct to fix, although it does look like the cluster was in trouble; note that, apart from numerous assorted NPEs, we hit a lot of these:

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data.0.18:2550/), Path(/user/shardmanager-config/member-1-shard-default-config/shard-default-member-0:datastore-config@0:64424_85629#-1245797625)]] after [5000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadData".

2018-05-13T16:08:50,569 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | PhiAccrualFailureDetector        | 46 - com.typesafe.akka.slf4j - 2.5.4 | heartbeat interval is growing too large: 3368 millis

The question here, I guess, is whether clustering is broken and causing the CPU overload, or whether something else in other code is causing the high CPU usage and perhaps that is the CAUSE of the clustering issues. So IMHO the first next step here would be to obtain one of those "flamegraphs" to see which area of the code we are, presumably, spin-looping and burning CPU cycles in... is that something you could obtain?

Comment 3 Michael Vorburger 2018-06-06 08:58:40 UTC
Created attachment 1448186 [details]
Flame Graph which Sai sent by email from a situation when there is no problem, just as an example

Comment 4 Sai Sindhur Malleni 2018-08-22 01:37:35 UTC
Here is a link to the CPU flamegraph with OSP13 and ODL opendaylight-8.3.0-3.el7ost.noarch.

http://file.rdu.redhat.com/~smalleni/flamegraph_30.svg

Looks like most of the time is being spent by GC in the SpinPause function.


Here are the thread dumps obtained by using jstack: http://file.rdu.redhat.com/~smalleni/jstack_541_100.txt
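
For reference, a minimal sketch of how such a thread dump can be captured (assuming ${PID} is the Karaf JVM's process ID; the -l flag adds lock/synchronizer detail):

  jstack -l ${PID} > jstack.txt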

Comment 6 Andrig T Miller 2018-08-24 15:42:09 UTC
Looking at this, it appears that most of your time is spent doing GC.

With a small heap like 2GB, and with the version of Java you are on, I would recommend the following settings for the JVM:

-XX:+UseParallelOldGC -XX:MaxHeapFreeRatio=20 -XX:MinHeapFreeRatio=10 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90

Also, set the min heap size to be 25% of the max heap size.  Currently, you are only setting the max heap size.

This turns on Java's footprint management, and it will attempt to keep the heap usage as low as possible, with a potential degradation in throughput, but since you are spending about 80% of your time in GC anyway, you shouldn't see any degradation.
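
For illustration, a minimal sketch of the full set of JVM arguments this would amount to with a 2GB max heap (the -Xms value of 512m is an assumption, 25% of the max as recommended above):

  -Xms512m -Xmx2g -XX:+UseParallelOldGC -XX:MaxHeapFreeRatio=20 -XX:MinHeapFreeRatio=10 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90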

If you want to use a larger heap size (one e-mail suggested that you sometimes test with an 8GB heap), the settings above would not be appropriate. With an 8GB heap, I would just set -XX:+UseParallelOldGC, without the free ratio, GC time ratio, and adaptive size policy settings at all.

Andy

Comment 7 Michael Vorburger 2018-08-27 08:49:08 UTC
Sai, one important point to clarify in this context is whether you saw these (bad) GC flames in a build with or without the various fixes for the TX-related OOMs which recently went in. We do know we had (now fixed) actual memory leaks - we need to make sure that we chase this problem with recent builds including those fixes, like bug 1615709, bug 1607959, bug 1602187, bug 1607965, bug 1566143, bug 1585227, bug 1493558 (and perhaps others; basically with trace:transaction not showing any remaining leaks).

Then, as Andy suggests, can we try to both bump our Xmx from our current default of 2 GB (right?) to 4 and/or 8 GB, and with vs without this UseParallelOldGC option, and see if that makes this problem go away, even under longevity stress testing?  It's totally possible that with 2 GB we've just underdimensioned our heap (even after the TX related memory fixes).  It's possible that in new major versions memory consumption in yangtools & Co. differs from before.

If with the TX fixes and giving it 4 and 8 instead of 2 GB heap, and this UseParallelOldGC, we still reproducibly burn CPU mostly in GC, then we would have to have a much closer look.

Comment 8 Michael Vorburger 2018-08-27 09:03:46 UTC
We also first need to make sure that the clustering breakdown noted earlier in #c2 is not the main issue here - it would be useful to see a new karaf.log that goes along with the latest flamegraph.

The http://file.rdu.redhat.com/~smalleni/jstack_541_100.txt from #c4 is actually worth staring at as well... on the closer look I've had at it now, that seems to show a ton of neutron and netvirt and openflowplugin threads all in TIMED_WAITING to obtain newWriteOnlyTransaction ... that doesn't feel right either, at all.  All the more reason to first make sure we run with all already known TX write leaks fixed and perhaps more heap, and then revisit this.

Comment 9 Stephen Kitt 2018-08-27 09:18:41 UTC
I think it would also be helpful to enable GC logs, so we can analyse its behaviour:

-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -Xloggc:/path/to/gc.log

(and then make sure gc.log is preserved). This logs GC details, with timestamps and tenuring information. This will allow us to determine what algorithms are being used, whether our heap is too small compared to our steady state, or whether we're spending too much time in object allocation, and also whether the different generations are sized correctly (although this is less of an issue now than it was in the past).

Comment 10 Stephen Kitt 2018-08-27 09:22:44 UTC
(In reply to Michael Vorburger from comment #8)
> The http://file.rdu.redhat.com/~smalleni/jstack_541_100.txt from #c4 is
> actually worth staring at as well... on the closer look I've had at it now,
> that seems to show a ton of neutron and netvirt and openflowplugin threads
> all in TIMED_WAITING to obtain newWriteOnlyTransaction ... that doesn't feel
> right either, at all.  All the more reason to first make sure we run with
> all already known TX write leaks fixed and perhaps more heap, and then
> revisit this.

That's the transaction rate limiter putting the brakes on; note that it applies to individual transaction operations, not just transactions as a whole, so fixing leaks or extending the heap won't help. The only way to reduce this is to reduce the number of MD-SAL operations AFAIK.

Comment 11 Stephen Kitt 2018-08-27 10:03:13 UTC
(In reply to Stephen Kitt from comment #10)
> That's the transaction rate limiter putting the brakes on; note that it
> applies to individual transaction operations, not just transactions as a
> whole, so fixing leaks or extending the heap won't help.

Not quite — it applies to subtransactions, i.e. per-store transactions. So one way to reduce this is to reduce the number of transactions in flight.

Comment 12 Stephen Kitt 2018-08-27 12:45:03 UTC
So it's clear for everyone: to specify these options when starting Karaf, we should use EXTRA_JAVA_OPTS, e.g.

EXTRA_JAVA_OPTS="-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -Xloggc:/tmp/gc.log" bin/karaf

Comment 13 Sai Sindhur Malleni 2018-08-27 14:19:41 UTC
To clarify, the testing was done with opendaylight-8.3.0-3.el7ost.noarch which was supposed to have the transaction leak fixes.

Comment 14 Michael Vorburger 2018-08-27 15:37:41 UTC
(In reply to Sai Sindhur Malleni from comment #13)

Sai on IRC also clarified that this was seen when already running with an 8 GB, not 2 GB, heap.

(In reply to Stephen Kitt from comment #10)
> (In reply to Michael Vorburger from comment #8)
> > The http://file.rdu.redhat.com/~smalleni/jstack_541_100.txt from #c4 is
> > actually worth staring at as well... on the closer look I've had at it now,
> 
> That's the transaction rate limiter putting the brakes on; note that it

Stephen and I discussed this point further, and concluded that this is likely neither a direct cause nor an effect of the GC spike looked at in this bug, but an orthogonal issue, so it should not be the focus here. (We'll try to do something about it separately; at least in Neutron, for example, we could reduce the currently separate read vs. write transactions into a common R/W transaction, probably relatively easily.)

Next step: With the current 8 GB heap, try (1) Andrig's suggested -XX:+UseParallelOldGC (only that flag; AFAIK the other options were only for smaller heaps), (2) Stephen wants to see the GC log (above), and (3) I would be interested in HPROF heap dump files when it's doing that crazy GC churning.  (Can be obtained with jstack or kill -3 even without it crashing with an OOM.)

Comment 15 Michael Vorburger 2018-08-27 15:58:56 UTC
> (Can be obtained with jstack or kill -3 even without it crashing with an OOM.)

duh, no, that's for a thread (stack) dump; Stephen suggests this for a heap dump:

  jmap -dump:format=b,file=heap.dump ${PID}

Comment 16 Michael Vorburger 2018-08-28 14:40:20 UTC
What would probably really help here to understand more would be "Java Flight Recording", produced as documented on https://wiki.opendaylight.org/view/HowToProfilePerformance.
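
For illustration only (the wiki page above is the authoritative procedure), a recording can typically be started against the running JVM with jcmd, e.g.:

  jcmd ${PID} VM.unlock_commercial_features
  jcmd ${PID} JFR.start name=odl duration=30m filename=/tmp/odl.jfr

The unlock step is only needed on Oracle JDK 8; ${PID}, the recording name, duration, and filename here are placeholder assumptions.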

Comment 17 Sai Sindhur Malleni 2018-10-01 14:27:32 UTC
I did rerun with -XX:+UseParallelOldGC but still observed the JVM taking around 35 cores. I did not have time to get all the data needed for debugging from the setup, but can do that next time when we have access to the hardware.

Andy, do you think moving to G1GC will help? In discussions with upstream, we found that most of them are using G1GC for ODL.

Comment 18 Michael Vorburger 2018-10-31 14:23:41 UTC
IMHO the next step here for the Nov scale lab round is still to obtain a JFR to see if we can understand which types of objects created from where cause this massive GC.

We can, in parallel, also try out -XX:+UseG1GC, but I want to caution against expectations that it will automagically help - if we have hotspots with excessive object creation, we'll need to see how those can be optimized, no matter which GC we use.  It would still certainly be very interesting to see what difference G1 makes, though.
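
If we do try G1, a minimal sketch, reusing the EXTRA_JAVA_OPTS mechanism from comment 12 and combining it with the GC logging flags from comment 9 (the log path is a placeholder):

  EXTRA_JAVA_OPTS="-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/gc-g1.log" bin/karaf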

Comment 19 Michael Vorburger 2018-10-31 14:26:27 UTC
We also need to keep an eye on those clustering problems while we do these tests... as long as we have [a lot of] AskTimeoutException, those should probably be looked at, first.  (The assumption here being that if the datastore went bust, everything else goes haywire and excessive object allocation in that case is an effect, not a cause; unless it's the other way around, but I doubt that.)

Comment 20 Michael Vorburger 2018-11-12 17:47:58 UTC
I've received a first JFR, and propose the following first actions:

* https://jira.opendaylight.org/browse/OVSDB-469 ovsdb Version toString
* https://jira.opendaylight.org/browse/ODLPARENT-175 -XX:+DisableExplicitGC
* https://jira.opendaylight.org/browse/OVSDB-470 isBridgeOnOvsdbNode
* https://jira.opendaylight.org/browse/OVSDB-471 Jackson JSON
* https://jira.opendaylight.org/browse/CONTROLLER-1870 Arrays.copyOf
* https://jira.opendaylight.org/browse/GENIUS-236 OvsdbNodeListener
* https://jira.opendaylight.org/browse/GENIUS-237 EntityOwnershipUtils

As this first JFR was from an incomplete test run, we should produce more.

Once we've made progress here, we should consider Bug 1649004.

Comment 21 Sai Sindhur Malleni 2018-11-13 01:28:31 UTC
JFR from scale run: http://file.rdu.redhat.com/~smalleni/jfr-1.tar.gz

Leader was controller-1

Comment 22 Michael Vorburger 2018-11-13 13:32:44 UTC
Digging more, I'm proposing the following second batch of GC reduction actions:

* https://jira.opendaylight.org/browse/MDSAL-387 HashMap.resize()
* https://jira.opendaylight.org/browse/CONTROLLER-1871 Optional

These are still based on the initial ("bad") JFR (and there was more, but let's get started with this set).  The following are from the new "big" JFR (from a "real" run):

* https://jira.opendaylight.org/browse/OPNFLWPLUG-1047 iterator()
* https://jira.opendaylight.org/browse/GENIUS-240 IdPoolListener equals()
* https://jira.opendaylight.org/browse/CONTROLLER-1872 Akka
* https://jira.opendaylight.org/browse/MDSAL-388 getAugmentation()
* https://jira.opendaylight.org/browse/NETVIRT-1502 getStackTraceElement()
* https://jira.opendaylight.org/browse/NETVIRT-1503 Uuid Matcher

FTR: The controller-1-scale.jfr shows a “longest GC pause” of 1.9s - that’s bad (IMHO), and hopefully going to be reduced with all of the issues above addressed; it will be interesting to re-test with fixes and see how low we can get it - and how much that will reduce the crazy CPU usage spike.

PS: Analyzing the 4.1 GB JFR required pushing -Xmx18g in jmc.ini ... causing my 20 GB laptop to swap like hell; barely usable. I need to look at how to set up and use a desktop VM with more RAM.
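
(For anyone repeating that analysis: jmc.ini is a standard Eclipse-style launcher file, so the heap bump presumably goes under its -vmargs section, e.g.:

  -vmargs
  -Xmx18g

This is an assumption based on the usual launcher .ini layout, not verified against a specific JMC version.)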

Comment 23 Sai Sindhur Malleni 2018-11-14 16:04:38 UTC
Here is a JFR using G1GC http://rdu-storage01.scalelab.redhat.com/sai/jfr-g1gc.tar.gz

The interesting thing is that, when using G1GC and running the scale tests, we observed a maximum CPU usage of only about 6 cores.

Comment 24 Sai Sindhur Malleni 2018-11-15 14:35:08 UTC
I have done a re-run and can confirm that I have not seen any crazy CPU usage. The maximum we saw was around 9-10 cores.

Controller-2:
https://snapshot.raintank.io/dashboard/snapshot/wyE4EkWQtNIdzaQvBzFS1nJ5X7yOsrWn

Controller-1:
https://snapshot.raintank.io/dashboard/snapshot/HQONTtt3kMhpJoE00FkxrP49xpKsBUof

Controller-0:
https://snapshot.raintank.io/dashboard/snapshot/VwUqSfW69EcJNST21HrJFR1wxJt2qbl5

I am convinced that we need to move to G1GC asap at this point if we don't see long GC pauses in the JFR.

Link to JFR on rerun with G1GC: http://rdu-storage01.scalelab.redhat.com/sai/jfr-g1gc.tar.gz

Comment 25 Sai Sindhur Malleni 2018-11-15 14:38:57 UTC
Just to note, in the JFRs from above link, controller-1 was leader.

Comment 26 Sai Sindhur Malleni 2018-11-18 23:57:14 UTC
The link in Comment 24 for the JFR is wrong. The correct link is http://rdu-storage01.scalelab.redhat.com/sai/jfr-3-g1gc.tar.gz

Comment 27 Michael Vorburger 2018-12-19 16:40:56 UTC
FTR: All linked upstream issues have meanwhile been addressed (upstream, at least in master for Neon, some already for Fluorine SP2).

Comment 29 Franck Baudin 2019-03-06 16:15:36 UTC
As per the deprecation notice [1], closing this bug. Please reopen if relevant for RHOSP13, as this is the only version shipping ODL.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/release_notes/index#deprecated_functionality

Comment 30 Franck Baudin 2019-03-06 16:17:26 UTC
As per the deprecation notice [1], closing this bug. Please reopen if relevant for RHOSP13, as this is the only version shipping ODL.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/release_notes/index#deprecated_functionality