Bug 1492279
| Summary: | [Scale][Infra] ODL JVM High CPU usage | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Sai Sindhur Malleni <smalleni> |
| Component: | opendaylight | Assignee: | Michael Vorburger <vorburger> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Toni Freger <tfreger> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 12.0 (Pike) | CC: | aadam, mkolesni, nyechiel, sgaddam, smalleni, trozet, vorburger |
| Target Milestone: | beta | Keywords: | Triaged |
| Target Release: | 13.0 (Queens) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | scale_lab, odl_infra | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | N/A |
| Last Closed: | 2018-04-23 09:20:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1577975 | | |
Description (Sai Sindhur Malleni, 2017-09-16 02:27:31 UTC)
Hi Michael, can you take a look at this bug? Thanks, Nir

We cannot really "action" this bug yet with the information currently available here. However, the next step I would suggest is that we (someone; me, if nobody else can) need to come up with a kind of "very easy step by step procedure" for how full CPU profiling details can be obtained during scale testing - basically very similar to what we had come up with re. chasing OOM. I would then see this issue take a similar route as the earlier OOM work - once such a procedure is established, it will have to be applied during the next scale lab (Nov?) by Sai, and then its results analysed (e.g. by me), likely leading to some corrective action/s related to this upstream. This may turn into a simple fix, or a bigger design issue.

> we (someone; me, if nobody else can) need to come up with a kind
> of "very easy step by step procedure" for how full CPU profiling
> details can be obtained during scale testing

https://medium.com/netflix-techblog/java-in-flames-e763b3d32166 has been recommended to us internally, and looks worth exploring further (i.e. try out locally to learn, perhaps document better for ease of use by e.g. Sai and others?).

Sai, are you still seeing this one? Can you provide, ideally, flame graphs or a JMC JFR, or otherwise (less ideal) at least a series of jstacks taken every 5s when you get it into this state, to give us an idea of what it's doing?

We came up with this script[1] to profile the JVM CPU usage and generate flame graphs only when the JVM CPU usage goes above a certain threshold. The jstack output has also been captured using the script.

When the CPU usage is moderately high, around 3 cores, we see:
Flamegraph: http://file.rdu.redhat.com/~smalleni/flamegraph_4_100.svg
jstack: http://file.rdu.redhat.com/~smalleni/jstack_4_100.txt

When the CPU usage is very high, around 20 cores:
Flamegraph: http://file.rdu.redhat.com/~smalleni/flamegraph_65_100.svg
jstack: http://file.rdu.redhat.com/~smalleni/jstack_65_100.txt

Link to the CPU profiling script:
[1] https://github.com/smalleni/openstack-testing/blob/master/cpu_profile.sh

If I understand the flamegraph_65_100.svg correctly (and I'm not 100% sure that I do...), we're seeing here that we're spending a hell of a lot of time in genius' LockManager? And in jstack_65_100.txt we see *A LOT* of threads related to LockManager... from netvirt's vpnmanager VpnOpStatusListener & Co. That doesn't feel right, at that magnitude. We will need to look more closely into that interaction (basically by staring at the code). We also see that yangtools' NormalizedNodeWriter figures prominently; but that MAY "just" be an impact of the above (reasoning: the LockManager RPC YANG modeled input/output); I suggest we try to understand and fix LockManager and then, in retesting, see whether NormalizedNodeWriter goes with the wind through that, or remains. If it does, then perhaps this old idea could help: https://git.opendaylight.org/gerrit/#/c/65296/
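For readers who want to reproduce this kind of capture, the idea behind the threshold-triggered dumps described above can be sketched roughly as follows. This is only an illustrative approximation of cpu_profile.sh (linked above), not the actual script; the PID, threshold, and output file names are assumptions.

```python
#!/usr/bin/env python3
"""Rough sketch: capture jstack output when a JVM's CPU usage crosses a
threshold. Approximates the idea behind cpu_profile.sh; not the real script."""

import subprocess
import time
from datetime import datetime

KARAF_PID = 12345       # assumption: PID of the karaf/ODL JVM
CPU_THRESHOLD = 300.0   # assumption: trigger at ~3 busy cores (100% per core)
POLL_INTERVAL = 5       # seconds between samples

def cpu_percent(pid):
    """%CPU as reported by ps; note ps reports a lifetime average, so a real
    script would sample /proc/<pid>/stat deltas (or top) for instantaneous load."""
    out = subprocess.check_output(["ps", "-o", "%cpu=", "-p", str(pid)], text=True)
    return float(out.strip())

def dump_stacks(pid):
    """Capture a thread dump with jstack (must come from the same JDK as the target JVM)."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    with open(f"jstack-{pid}-{stamp}.txt", "wb") as out:
        subprocess.run(["jstack", str(pid)], stdout=out, check=False)

while True:
    usage = cpu_percent(KARAF_PID)
    if usage >= CPU_THRESHOLD:
        print(f"{datetime.now()}: {usage:.0f}% CPU, capturing thread dump")
        dump_stacks(KARAF_PID)
    time.sleep(POLL_INTERVAL)
```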
Analyzing CPU usage of the karaf process on the OSP13 with ODL Oxygen cloud during neutron scale tests, it was seen that the CPU usage was nowhere close to what was observed in the previous releases - it only occasionally goes to around 5 cores, but in previous releases we have seen it go to around 30 cores sometimes.
ODL-0: https://snapshot.raintank.io/dashboard/snapshot/S5sDpg1wzUuuRlpcevFAQLcETxmJEMIT
ODL-1: https://snapshot.raintank.io/dashboard/snapshot/NgjixxYzVOEh1iNK3r0Y1gsHl0mMX574
ODL-2: https://snapshot.raintank.io/dashboard/snapshot/RrtcIZDKEe0pX7Svc8gCcl8Q3aJw4YC9

Sorry about the wrong links, here are the right ones:
ODL-0: https://snapshot.raintank.io/dashboard/snapshot/AgXLYhg3yvx4ivPFABXa449DDMo5wWe7
ODL-1: https://snapshot.raintank.io/dashboard/snapshot/O5X8VcYVzKthTkWxmhM5z2YQ0O7oPo79
ODL-2: https://snapshot.raintank.io/dashboard/snapshot/80ah6tC62sJJPyeY45A4RjOrC4ZDdjaN

> it was seen that the CPU usage was nowhere close
> to what was observed in the previous releases

great!

> it only occasionally goes to around 5 cores but in previous
> releases we have seen it go to around 30 cores sometimes

We could further profile this some time, but I doubt it's a short term blocking priority... specifically because, if one zooms into the Grafana pictures, they really appear to be only occasional short spikes of a few seconds or so every hour (it's hard to be 100% sure how long, because it probably only polls every few seconds?), and, as Sai wrote, they likely "are mostly during the GC kicking in". Just FTR I don't have "proof" that this is GC (do you?); but it's a likely explanation; we would have to correlate GC logs to be 100% sure. To me, the way this looks now, compared to before, is not a major concern anymore, and I am therefore closing this as WORKSFORME - anyone, please feel free to re-open in case of disagreement.

Michael, "works for me" was a bit too harsh :-)

Thanks Nir!

Michael, yes, I don't have conclusive evidence to suspect the GC, but past experience has taught me that occasional short spikes in Java apps are more often than not due to the GC...

Just for the record, (something like) this is being looked at in Bug 1577975 again.
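As a minimal sketch of the "correlate GC logs" suggestion above: assuming the karaf JVM writes a GC log with date stamps (e.g. Java 8's -Xloggc:... with -XX:+PrintGCDetails -XX:+PrintGCDateStamps), a small helper along these lines could check whether observed CPU spikes line up with GC events. The log path, spike times, and window size below are placeholders, not values taken from this bug.

```python
#!/usr/bin/env python3
"""Sketch: check whether CPU spike times coincide with GC events, assuming GC
log lines start with a date stamp like 2018-04-20T14:03:12.345+0000."""

import re
from datetime import datetime, timedelta

GC_LOG = "/var/log/karaf-gc.log"      # assumption: path to the GC log
SPIKES = ["2018-04-20T14:03:10"]      # assumption: spike times read off Grafana
WINDOW = timedelta(seconds=15)        # how close a GC event must be to count

TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")

# Collect the timestamp of every GC log line that carries a date stamp.
gc_times = []
with open(GC_LOG) as f:
    for line in f:
        m = TS_RE.match(line)
        if m:
            gc_times.append(datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S"))

# For each CPU spike, report whether any GC event falls inside the window.
for spike in SPIKES:
    t = datetime.strptime(spike, "%Y-%m-%dT%H:%M:%S")
    near = [g for g in gc_times if abs(g - t) <= WINDOW]
    verdict = "coincides with GC" if near else "no nearby GC event"
    print(f"{spike}: {verdict} ({len(near)} GC lines within {WINDOW})")
```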