Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1492279

Summary: [Scale][Infra] ODL JVM High CPU usage
Product: Red Hat OpenStack
Component: opendaylight
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Target Milestone: beta
Target Release: 13.0 (Queens)
Reporter: Sai Sindhur Malleni <smalleni>
Assignee: Michael Vorburger <vorburger>
QA Contact: Toni Freger <tfreger>
CC: aadam, mkolesni, nyechiel, sgaddam, smalleni, trozet, vorburger
Keywords: Triaged
Whiteboard: scale_lab, odl_infra
Last Closed: 2018-04-23 09:20:57 UTC
Type: Bug
Bug Blocks: 1577975    

Description Sai Sindhur Malleni 2017-09-16 02:27:31 UTC
Description of problem:
Based on the scale tests Sridhar and I ran, we have observed the ODL JVM take up a lot of CPU when creating neutron resources at scale (around 10 cores). In some cases this was in the range of 30-40 cores, which is quite high. We need to profile the JVM to root-cause the issue, either by using https://medium.com/netflix-techblog/java-in-flames-e763b3d32166 or some other suitable method.
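
For reference, a quick way to quantify this in terms of cores (the process-name pattern and sampling interval below are illustrative assumptions, not from our setup):

  # pidstat reports per-process %CPU; 1000% corresponds to roughly 10 cores.
  PID=$(pgrep -f karaf | head -n1)
  pidstat -p "$PID" 5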

This can be prioritized in the next iteration of scale testing.

Version-Release number of selected component (if applicable):
OSP 12
opendaylight-6.2.0-0.1.20170913snap58.el7.noarch

How reproducible:
Happens almost 100% of the time during scale testing

Steps to Reproduce:
1. Run scale tests using Browbeat+Rally on an OpenStack cloud with ODL as the networking backend
2.
3.

Actual results:

Very high CPU usage, usually around 10 cores, and in the range of 30-40 cores during certain intervals.

Expected results:
Lower CPU usage

Additional info:
Link to Grafana graph showing this https://snapshot.raintank.io/dashboard/snapshot/0DGX0973FPPQZzexsw9qwWlu5QlviACn?orgId=2

Comment 1 Nir Yechiel 2017-09-28 13:20:58 UTC
Hi Michael,

Can you take a look at this bug?

Thanks,
Nir

Comment 2 Michael Vorburger 2017-10-03 11:12:19 UTC
We cannot really "action" this bug yet with the information currently available here.

However, the next step I would suggest is that we (someone; me, if nobody else can) need to come up with a kind of "very easy step by step procedure" for how full CPU profiling details can be obtained during scale testing - basically very similar to what we had come up with re. chasing OOM. I would then see this issue take a similar route as the earlier OOM work: once such a procedure is established, it will have to be applied during the next scale lab (Nov?) by Sai, and then its results analysed (e.g. by me), likely leading to some corrective action/s related to this upstream. This may turn into a simple fix, or a bigger design issue.

Comment 3 Michael Vorburger 2017-10-03 11:13:37 UTC
> we (someone; me, if nobody else can) need to come up with a kind 
> of "very easy step by step procedure" for how full CPU profiling
> details can be obtained during scale testing 

https://medium.com/netflix-techblog/java-in-flames-e763b3d32166 has been recommended to us internally, and looks worth exploring further (i.e. try it out locally to learn, and perhaps document it better for ease of use by e.g. Sai and others?).
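
A minimal sketch of what that step by step procedure could look like, following the perf-based approach from that article (FLAMEGRAPH_DIR and PERF_MAP_AGENT_DIR are assumed checkout locations of the FlameGraph and perf-map-agent tools, not paths from our environment):

  PID=$(pgrep -f karaf | head -n1)
  # Sample on-CPU stacks at 99 Hz for 60 seconds (the JVM should be started with
  # -XX:+PreserveFramePointer so the Java frames are walkable).
  perf record -F 99 -p "$PID" -g -- sleep 60
  # Write /tmp/perf-$PID.map so JIT-compiled Java methods resolve to symbol names.
  "$PERF_MAP_AGENT_DIR/bin/create-java-perf-map.sh" "$PID"
  # Fold the stacks and render the flame graph.
  perf script | "$FLAMEGRAPH_DIR/stackcollapse-perf.pl" | "$FLAMEGRAPH_DIR/flamegraph.pl" > /tmp/odl-cpu-flamegraph.svg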

Comment 4 Michael Vorburger 2017-11-13 12:39:42 UTC
Sai, are you still seeing this one? Can you ideally provide flame graphs or a JMC JFR recording, or otherwise (less ideal) at least a series of jstack dumps taken every 5s when you get it into this state, to give us an idea of what it's doing?
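
For the jstack option, something as trivial as the following loop would do (the iteration count and output directory are arbitrary):

  PID=$(pgrep -f karaf | head -n1)
  mkdir -p /tmp/odl-jstacks
  # One thread dump every 5 seconds, 24 times = two minutes of samples.
  for i in $(seq 1 24); do
      jstack "$PID" > "/tmp/odl-jstacks/jstack_$(date +%s).txt"
      sleep 5
  done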

Comment 7 Sai Sindhur Malleni 2017-12-18 15:27:30 UTC
We came up with this script [1] to profile the JVM CPU usage and generate flame graphs only when the CPU usage goes above a certain threshold. The jstack output has also been captured using the script.

When the CPU usage is moderately high, around 3 cores, we see:

Flamegraph: http://file.rdu.redhat.com/~smalleni/flamegraph_4_100.svg
jstack: http://file.rdu.redhat.com/~smalleni/jstack_4_100.txt

When the CPU usage is very high, around 20 cores:

Flamegraph: http://file.rdu.redhat.com/~smalleni/flamegraph_65_100.svg
jstack: http://file.rdu.redhat.com/~smalleni/jstack_65_100.txt

Comment 8 Sai Sindhur Malleni 2017-12-18 15:27:56 UTC
Link to the CPU profiling script:

[1]- https://github.com/smalleni/openstack-testing/blob/master/cpu_profile.sh
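
Not the actual script, but a rough sketch of the same threshold-triggered idea (threshold, intervals, and FLAMEGRAPH_DIR are illustrative assumptions):

  THRESHOLD_PCT=800   # 800% of one core ~= 8 cores
  PID=$(pgrep -f karaf | head -n1)
  while true; do
      # %CPU is the third-from-last column of pidstat's "Average:" line.
      CPU=$(pidstat -p "$PID" 5 1 | awk '/Average:/ {print int($(NF-2))}')
      if [ "$CPU" -ge "$THRESHOLD_PCT" ]; then
          TS=$(date +%s)
          jstack "$PID" > "/tmp/jstack_${TS}.txt"
          perf record -F 99 -p "$PID" -g -o "/tmp/perf_${TS}.data" -- sleep 30
          perf script -i "/tmp/perf_${TS}.data" | "$FLAMEGRAPH_DIR/stackcollapse-perf.pl" | "$FLAMEGRAPH_DIR/flamegraph.pl" > "/tmp/flamegraph_${TS}.svg"
      fi
  done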

Comment 9 Michael Vorburger 2018-01-17 14:36:04 UTC
If I understand the flamegraph_65_100.svg correctly (and I'm not 100% sure that I do...), we're seeing here that we're spending a hell of a lot of time in genius' LockManager?  And in jstack_65_100.txt we see *A LOT* of threads related to LockManager... from netvirt's vpnmanager VpnOpStatusListener & Co.  That doesn't feel right, at that magnitude.  We will need to look more closely into that interaction (basically by staring at the code).
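
A rough way to put numbers on that, using the jstack dump from comment 7 ("LockManager" here is just a substring match, not an exact thread name):

  # Lines that mention LockManager anywhere in the dump.
  grep -c LockManager jstack_65_100.txt
  # Threads (blank-line-separated stanzas in jstack output) whose stack mentions LockManager.
  awk -v RS='' '/LockManager/' jstack_65_100.txt | grep -c '^"'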

We also see that yangtools' NormalizedNodeWriter figures prominently; but that MAY "just" be an impact of the above (reasoning: the LockManager RPC's YANG-modeled input/output). I suggest we try to understand and fix LockManager, and then in retesting see whether NormalizedNodeWriter goes away along with it, or remains. If it remains, then perhaps this old idea could help: https://git.opendaylight.org/gerrit/#/c/65296/

Comment 14 Sai Sindhur Malleni 2018-04-23 01:55:19 UTC
Analyzing the CPU usage of the karaf process on the OSP 13 cloud with ODL Oxygen during neutron scale tests, it was seen that the CPU usage was nowhere close to what was observed in previous releases - it only occasionally goes to around 5 cores, whereas in previous releases we have seen it go to around 30 cores at times.

ODL-0: https://snapshot.raintank.io/dashboard/snapshot/S5sDpg1wzUuuRlpcevFAQLcETxmJEMIT
ODL-1: https://snapshot.raintank.io/dashboard/snapshot/NgjixxYzVOEh1iNK3r0Y1gsHl0mMX574
ODL-2: https://snapshot.raintank.io/dashboard/snapshot/RrtcIZDKEe0pX7Svc8gCcl8Q3aJw4YC9

Comment 16 Michael Vorburger 2018-04-23 09:20:57 UTC
> it was seen that the CPU usage was nowhere close
> to what was observed in previous releases

great!

> it only occasionally goes to around 5 cores, whereas in previous
> releases we have seen it go to around 30 cores at times

We could profile this further some time, but I doubt it's a short-term blocking priority... specifically because if one zooms into the Grafana pictures, they really appear to be only occasional short spikes every hour of a few seconds or so (it's hard to be 100% sure how long, because it probably only polls every few seconds?), and, as Sai wrote, they likely "are mostly during the GC kicking in". Just FTR I don't have "proof" that this is GC (do you?); but it's a likely explanation; we would have to correlate GC logs to be 100% sure. To me, the way this looks now, compared to before, is no longer a major concern, and I am therefore closing this as WORKSFORME - anyone, please feel free to re-open in case of disagreement.
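
For completeness, one way to do that GC-log correlation (the flags are standard Java 8 options; how they get passed to karaf - e.g. via EXTRA_JAVA_OPTS or the setenv file - is deployment-specific and assumed here):

  # Enable GC logging for the karaf JVM before the next scale run.
  export EXTRA_JAVA_OPTS="-Xloggc:/var/log/karaf-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime"
  # Afterwards, compare stop-the-world pause timestamps against the CPU spikes in Grafana.
  grep 'Total time for which application threads were stopped' /var/log/karaf-gc.log | tail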

Comment 17 Sai Sindhur Malleni 2018-04-23 10:43:40 UTC
Michael, "works for me" was a bit too harsh :-) Thanks Nir!

Comment 18 Sai Sindhur Malleni 2018-04-23 10:45:11 UTC
Michael, yes, I don't have conclusive evidence to suspect the GC, but past experience has taught me that occasional short spikes in Java apps are more often than not due to the GC...

Comment 19 Michael Vorburger 2018-08-28 10:11:09 UTC
Just for the record, (something like) this is being looked at in Bug 1577975 again.