Description of problem:

On running low-stress longevity tests using Browbeat+Rally (creating 40 neutron resources 2 at a time and deleting them, over and over again) in a clustered ODL setup, ODL on controller-1 hits OOM about 42 hours into the test. ODL on controller-2 is functional at that point, but ODL on controller-0, although the process is running and ports are up, is non-functional (see BZ https://bugzilla.redhat.com/show_bug.cgi?id=1486060). When ODL on controller-0 is restarted to make it functional again at around 16:01 UTC 08/28/2017, ODL on controller-2 dies at around 16:04 UTC 08/28/2017.

Here the karaf process count on controller-2 can be seen going to 0 at around 16:04 UTC 08/28/2017: https://snapshot.raintank.io/dashboard/snapshot/chxdQkhAw3X8l9LS2HNNzCZGQHQvWubO?orgId=2

The heap is dumped before the process dies; however, it can be clearly seen that the 2G heap limit is not reached: https://snapshot.raintank.io/dashboard/snapshot/RMuDksXZ61ql2kMA47wqBHUXQeYWG05g?orgId=2 Max heap used is around 1.4G.

Version-Release number of selected component (if applicable):
OSP 12 puddle from 2017-8-18.3
ODL RPM from upstream: python-networking-odl-11.0.0-0.20170806093629.2e78dca.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Run low-stress longevity tests.
2. ODL OOMs on controller-1 of a clustered setup and ODL on controller-0 becomes non-functional, leaving ODL on controller-2 running.
3. Restart the non-functional ODL on controller-0; ODL on controller-2 dies.

Actual results:
Restarting ODL on controller-0 results in ODL dying on controller-2.

Expected results:
Restarting ODL on one node shouldn't lead to ODL dying on another node in a clustered setup.

Additional info:
ODL Controller-0 logs: http://8.43.86.1:8088/smalleni/karaf-controller-0.log.tar.gz
ODL Controller-1 logs: http://8.43.86.1:8088/smalleni/karaf-controller-1.log.tar.gz http://8.43.86.1:8088/smalleni/karaf-controller-1-rollover.log.tar.gz
ODL Controller-2 logs: http://8.43.86.1:8088/smalleni/karaf-controller-2.log.tar.gz
ODL RPM used was opendaylight-6.2.0-0.1.20170817rel1931.el7.noarch
Cc +skitt, who knows much more about ODL clustering than me... +need additional information from reporter, as per upstream Bug 9064: We have to "first better understand what it actually died of. For example, if it were an 'OutOfMemoryError: unable to create new native thread', then that would not be in the regular logs, but probably would have gone to STDOUT by the JVM, which means on an RPM-installed ODL started by a systemd service, it should appear e.g. in something like 'systemctl status opendaylight' (or a better command to consult the full systemd journal of that service?) - can you provide us with what that has?"
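For reference, checking the journal for such a message might look like the sketch below. The unit name `opendaylight` and the canned journal excerpt are assumptions for illustration; on a live node the excerpt would come from `journalctl -u opendaylight --no-pager`.

```shell
# Sketch: look for a native-thread OOM that never reaches karaf.log.
# On a live node, replace the heredoc with:
#   journalctl -u opendaylight --no-pager > odl-journal.txt
# The excerpt below is canned so the grep itself is demonstrable.
cat > odl-journal.txt <<'EOF'
Aug 28 16:04:01 controller-2 karaf[1234]: java.lang.OutOfMemoryError: unable to create new native thread
Aug 28 16:04:02 controller-2 systemd[1]: opendaylight.service: main process exited
EOF
grep -c 'OutOfMemoryError' odl-journal.txt
```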
A few points need clarifying. If controller-1 dumps heap, that means it did hit an OOM, even if the heap itself isn't used up to the Xmx. Could we get access to the dump to investigate? When it's non-responsive, what state is controller-0 in exactly? What does controller-2 say about the cluster's state? Given that controller-1 is gone, if controller-0 is non-responsive, controller-2 should switch to "isolated leader" and kill itself after ~10 min. As Michael says, we'd need to see Karaf's output; controller-2's logs don't give much indication about why it died. (It switched to follower just before dying, but that doesn't explain things.)
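One way to answer "what does controller-2 say about the cluster's state" is ODL's per-shard Jolokia MBean, which reports the Raft state of each shard. The sketch below parses a canned response; the `IsolatedLeader` value, credentials, and host are illustrative, not data from this bug.

```shell
# On a live node the shard state would come from Jolokia, e.g.:
#   curl -u admin:admin 'http://<node>:8181/jolokia/read/org.opendaylight.controller:type=DistributedOperationalDatastore,Category=Shards,name=member-2-shard-default-operational'
# Below, a canned response (illustrative values) stands in for the curl output.
response='{"value":{"RaftState":"IsolatedLeader","Leader":""}}'
printf '%s\n' "$response" | grep -o '"RaftState":"[^"]*"'
```

A healthy follower would report `Follower` with a non-empty `Leader`; `IsolatedLeader` is the state that precedes the ~10 min self-termination described above.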
We do not have access to the logs anymore since the environment has since been rebuilt. This would also be hard to reproduce now, given the several OOM fixes that went in recently. We will retry a longevity test over the weekend and let you know if we are able to reproduce this.
Closing this as the linked upstream ODL Bug 9064 has been closed (as dupe). Sai (reporter), please re-open if you disagree.
It was closed upstream as a dupe of https://bugs.opendaylight.org/show_bug.cgi?id=9034 which isn’t closed, so this bug should stay open IMO, just pointed at https://bugs.opendaylight.org/show_bug.cgi?id=9034 instead of https://bugs.opendaylight.org/show_bug.cgi?id=9064.
Michael, can you please check this bug? It looks like the fix for https://bugs.opendaylight.org/show_bug.cgi?id=9034 has been merged upstream. Do we need to backport it into Carbon or cherry-pick it to downstream?
This bug can be slightly confusing to read as it mixes a few things, so to summarize: this issue was specifically about "ODL on controller-2 dies" - after an OOM-related death of ANOTHER node, and a stuck-but-not-dead 3rd node. That OOM issue WAS fixed upstream in ODL BZ 9034 (which in the end turned out to actually be caused and fixed by upstream ODL BZ 9038). In https://bugs.opendaylight.org/show_bug.cgi?id=9064 the reporter of this issue (Sai, as well as Sridhar) confirmed that with the OOM fixed, they no longer saw other nodes unexpectedly dying.

> the fix for 9034 has been merged upstream. Do we need to backport it
> into Carbon or cherry-pick it to downstream?

Nir, as per https://git.opendaylight.org/gerrit/#/c/62674/ this actually *IS* fixed in (the latest SR of) Carbon ALREADY.

Based on the above, I'd therefore like to close this bug as FIXED, and am marking it as Status: POST (not 100% sure that is the right transition; please fix it if it's not).
Tim Rozet and I faced this in the scale lab today. On stopping one ODL to verify the cluster monitoring tool was working, the other two ODLs became unreachable. A restart of the ODL containers was required to get the other two nodes back into the cluster. It is likely there is an issue here. Here are the karaf logs from all three ODLs: http://file.rdu.redhat.com/~smalleni/cluster.tar.gz
Just to correct my previous comment, what exactly happened in Tim’s words: “I killed controller 1 first. Then brought it up and controller 0 and 2 died.”
This BZ is a bit confusing, because after 6 months of silence it has suddenly turned from a cluster issue raised last Sep/Oct, which was originally related to OOM and which was fixed (see ODL BZ 9034 and ODL BZ 9064), into an entirely new clustering-related problem (no more OOMs now), with new logs attached.

The latest attached logs do show entries "Caused by: org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard member-0-shard-default-operational currently has no leader. Try again later." but I am not clear yet what is causing that. I've not yet been able to properly correlate the logs from the different nodes and filter out clustering-unrelated events, to see what happened where and when, causing what.

Sai and Tim, before I come up with something to analyse the logs properly, could we clarify: Is this sporadic, or reliably reproducible? And what EXACTLY are the steps needed to reproduce it? You have a 3-node cluster running and you kill 1 of the nodes (is the one you kill the current leader, or a follower?) and then the 2 other ones die by themselves, always? Or the 2 others "became unreachable" - what does that mean? Or you restarted the killed node and THEN the other 2 died? I need timestamps of what you did when, to correlate with the logs.
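As a starting point for correlating the three nodes' logs, a merge by timestamp could look like the sketch below. The file names and two-line canned logs are placeholders; it assumes karaf's timestamp-first log line format, where lexical order equals chronological order (and that the nodes' clocks are in sync).

```shell
# Sketch: merge per-node karaf logs by their leading timestamps so cluster
# events can be read in one cluster-wide order. Canned logs stand in for the
# real karaf-controller-N.log files here.
printf '%s\n' '2018-03-07T16:01:00,100 | INFO  | controller-0 | node restarted' > karaf-controller-0.log
printf '%s\n' '2018-03-07T16:04:00,200 | ERROR | controller-2 | Shard member-0-shard-default-operational currently has no leader' > karaf-controller-2.log
# sort -m merges already-sorted inputs; ISO timestamps sort lexically
sort -m karaf-controller-0.log karaf-controller-2.log > merged.log
wc -l < merged.log
```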
I spent quite some time today trying to reproduce this. It seems to be fairly hard to reproduce and probably needs certain code paths to be hit. I tried the following scenarios and couldn't reproduce the bug:
1. Take down the leader in an idle cloud and bring it back up
2. Take down a follower in an idle cloud and bring it back up
3. Take down the leader in a cloud under load tests and bring it back up
4. Take down a follower in a cloud under load tests and bring it back up
Given all the mystery surrounding this bug and how it manifests itself in different ways at different times in different releases, I think it is safe to close this bug. We can reopen one later if needed.
Thanks for the check Sai, I'll move the bug back to VERIFIED then and let's open a new one if something pops up.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086