Bug 1486069 - [Scale][HA] On restarting ODL on one node, ODL on another node dies in a clustered setup
Summary: [Scale][HA] On restarting ODL on one node, ODL on another node dies in a clustered setup
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: opendaylight
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: beta
Target Release: 13.0 (Queens)
Assignee: Stephen Kitt
QA Contact: Tomas Jamrisko
URL:
Whiteboard: scale_lab, odl_ha
Depends On:
Blocks:
 
Reported: 2017-08-28 22:11 UTC by Sai Sindhur Malleni
Modified: 2018-10-18 07:25 UTC
8 users

Fixed In Version: opendaylight-8.0.0-1.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment: N/A
Last Closed: 2018-06-27 13:34:14 UTC
Target Upstream Version:


Attachments


Links
System ID Priority Status Summary Last Updated
OpenDaylight Bug 9034 None None None 2017-09-13 13:00:43 UTC
Red Hat Product Errata RHEA-2018:2086 None None None 2018-06-27 13:34:53 UTC

Description Sai Sindhur Malleni 2017-08-28 22:11:39 UTC
Description of problem:
On running low stress longevity tests using Browbeat+Rally (creating 40 neutron resources 2 at a time and deleting them, over and over again), in a clustered ODL setup, ODL on controller-1 hits OOM after about 42 hours into the test. ODL on controller-2 is functional at that point but ODL on controller-0 seems to be running and ports are up but is non-functional (see BZ https://bugzilla.redhat.com/show_bug.cgi?id=1486060). When ODL on controller-0 is restarted to make it functional again at around 16:01 UTC 08/28/2017, ODL on controller-2 dies at around 16_04 UTC 08/28/2017.

Here we can see the karaf process count going to 0 on controller-2 around 16:04 UTC 08/28/2017: https://snapshot.raintank.io/dashboard/snapshot/chxdQkhAw3X8l9LS2HNNzCZGQHQvWubO?orgId=2

The heap is dumped before the process dies; however, it can be clearly seen that the 2G heap limit is not reached: https://snapshot.raintank.io/dashboard/snapshot/RMuDksXZ61ql2kMA47wqBHUXQeYWG05g?orgId=2

Max heap used is around 1.4G.




Version-Release number of selected component (if applicable):

OSP 12 Puddle from 2017-8-18.3
ODL RPM from upstream: python-networking-odl-11.0.0-0.20170806093629.2e78dca.el7ost.noarch


How reproducible:
100%

Steps to Reproduce:
1. Run low stress longevity tests
2. ODL OOMs on controller-1 of the clustered setup and ODL on controller-0 becomes non-functional, leaving ODL on controller-2 running
3. Restart non-functional ODL on controller-0 and ODL on controller-2 dies

Actual results:

Restarting ODL on controller-0 results in ODL dying on controller-2

Expected results:

Restarting ODL on one node shouldn't lead to ODL dying on another node in a clustered setup

Additional info:
ODL Controller-0 Logs:
http://8.43.86.1:8088/smalleni/karaf-controller-0.log.tar.gz
ODL Controller-1 Logs:
http://8.43.86.1:8088/smalleni/karaf-controller-1.log.tar.gz
http://8.43.86.1:8088/smalleni/karaf-controller-1-rollover.log.tar.gz
ODL Controller-2 Logs:
http://8.43.86.1:8088/smalleni/karaf-controller-2.log.tar.gz

Comment 1 Sai Sindhur Malleni 2017-08-29 11:25:05 UTC
ODL RPM used was opendaylight-6.2.0-0.1.20170817rel1931.el7.noarch

Comment 2 Michael Vorburger 2017-09-07 11:55:48 UTC
Cc +skitt@redhat.com who knows much more about ODL clustering than me...

+need additional information from reporter, as per upstream Bug 9064: 

We have to "first better understand what it actually died of.  For example, if it were a "OutOfMemoryError: unable to create new native thread" then that would not be in regular logs, but probably would have gone to STDOUT by the JVM, which means on a RPM installed ODL started by a systemd service, it should appear e.g. in something like "systemctl status opendaylight" (or a better command to consult the full systemd journal of that service?) - can you provide us with what that has?"
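Michael's suggestion above can be sketched as a quick filter over the service's systemd journal, since a "unable to create new native thread" OOM goes to the JVM's STDOUT rather than karaf.log. The unit name `opendaylight` and the sample journal lines below are assumptions for illustration; on a real controller you would dump the journal with `journalctl -u opendaylight --no-pager` instead of the simulated `cat` step:

```shell
# On the affected controller, the real dump would be:
#   journalctl -u opendaylight --no-pager > journal-controller-2.log
# Simulated journal dump so the filter below is self-contained:
cat > journal-controller-2.log <<'EOF'
Apr 20 02:10:01 controller-2 karaf[1234]: normal startup line
Apr 20 02:11:45 controller-2 karaf[1234]: java.lang.OutOfMemoryError: unable to create new native thread
Apr 20 02:11:46 controller-2 systemd[1]: opendaylight.service: main process exited, code=killed, status=9/KILL
EOF

# Pull out JVM deaths and systemd kill notices that never reach karaf.log.
grep -Ei 'OutOfMemoryError|code=killed|status=' journal-controller-2.log
```

`systemctl status opendaylight` shows only the tail of this output, so grepping the full journal is the more reliable check.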

Comment 3 Stephen Kitt 2017-09-07 12:15:45 UTC
A few points need clarifying.

If controller-1 dumps heap, that means it did hit an OOM, even if the heap itself isn't used up to the Xmx. Could we get access to the dump to investigate?

When it's non-responsive, what state is controller-0 in exactly? What does controller-2 say about the cluster's state? Given that controller-1 is gone, if controller-0 is non-responsive, controller-2 should switch to "isolated leader" and kill itself after ~10 min.

As Michael says, we'd need to see Karaf's output; controller-2's logs don't give much indication about why it died. (It switched to follower just before dying, but that doesn't explain things.)

Comment 4 Sai Sindhur Malleni 2017-09-07 15:13:06 UTC
We do not have access to the logs anymore since the environment has since been rebuilt. However, this would be hard to reproduce given the several OOM fixes that went in recently. We will retry a longevity test over the weekend and let you know if we are able to reproduce this.

Comment 5 Michael Vorburger 2017-09-13 12:52:16 UTC
Closing this as the linked upstream ODL Bug 9064 has been closed (as a dupe).

Sai (reporter), please re-open if you disagree.

Comment 6 Stephen Kitt 2017-09-13 12:56:17 UTC
It was closed upstream as a dupe of https://bugs.opendaylight.org/show_bug.cgi?id=9034 which isn’t closed, so this bug should stay open IMO, just pointed at https://bugs.opendaylight.org/show_bug.cgi?id=9034 instead of https://bugs.opendaylight.org/show_bug.cgi?id=9064.

Comment 7 Nir Yechiel 2017-09-28 13:45:12 UTC
Michael, can you please check this bug? It looks like the fix for https://bugs.opendaylight.org//show_bug.cgi?id=9034 has been merged upstream. Do we need to backport it into Carbon or cherry-pick it to downstream?

Comment 8 Michael Vorburger 2017-10-23 11:54:57 UTC
This bug can be slightly confusing to read as it mixes a few things, so summary:

This issue was specifically about "ODL on controller-2 dies" - after an OOM-related death of ANOTHER node, and a stuck-but-not-dead 3rd NODE.

That OOM issue WAS fixed upstream in ODL BZ 9034 (which in the end turned out to actually be caused and fixed by upstream ODL BZ 9038).

In https://bugs.opendaylight.org/show_bug.cgi?id=9064 the reporter of this issue (Sai, as well as Sridhar) confirm that with the OOM fixed, they did not see other nodes unexpectedly dying, anymore.

> 9034 has been merged upstream. Do we need to backport it
> into Carbon or cherry-pick it to downstream?

Nir, as per https://git.opendaylight.org/gerrit/#/c/62674/ this actually *IS* fixed in (latest SR of) Carbon ALREADY.

I'd therefore like to close this bug as DONE/FIXED based on the above, and am marking it as Status: POST, hoping that is the correct transition (not 100% sure it is; please fix if it's not).

Comment 16 Sai Sindhur Malleni 2018-04-20 02:28:46 UTC
Tim Rozet and I faced this in the scale lab today. On stopping one ODL to verify the cluster monitoring tool was working, the other two ODLs became unreachable. A restart of the ODL containers was required to get the other two nodes back into the cluster. It is likely there is an issue here. Here are the karaf logs from all three ODLs.

http://file.rdu.redhat.com/~smalleni/cluster.tar.gz

Comment 17 Sai Sindhur Malleni 2018-04-20 03:12:34 UTC
Just to correct my previous comment, what exactly happened in Tim’s words:


“I killed controller 1 first. Then brought it up and controller 0 and 2 died.”

Comment 19 Michael Vorburger 2018-04-30 19:12:23 UTC
This BZ is a bit confusing: after six months of silence, it has suddenly been turned from a cluster issue raised last Sep/Oct, which was originally related to OOM and which was fixed (see ODL BZ 9034 and ODL BZ 9064), into an entirely new clustering-related problem (no more OOMs now), with new logs attached.

The latest attached logs do show entries like "Caused by: org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard member-0-shard-default-operational currently has no leader. Try again later." but I am not clear yet what is causing that. I've not yet been able to properly correlate the logs from the different nodes and filter out clustering-unrelated events, to see what happened where and when, and what caused what.

Sai and Tim, before I come up with something to analyse the logs properly, could we clarify: is this sporadic, or reliably reproducible? And what are the EXACT steps needed to reproduce it? You have a 3-node cluster running and you kill one of the nodes (is the one you kill the current leader, or a follower?) and then the 2 other ones die by themselves, always?

Or do the 2 others just "become unreachable" - and what does that mean exactly?

Or did you restart the killed node and THEN the other 2 died?

I need timestamps of what you did when to correlate with the logs.
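The cross-node correlation Michael asks for can be sketched as a simple timestamped merge of the three karaf logs. The file names and the sample log lines below are assumptions (in practice they would come from the unpacked cluster.tar.gz); the approach only relies on karaf's ISO-style timestamps sorting lexicographically:

```shell
# Simulated excerpts standing in for the real karaf.log files (names assumed).
cat > karaf-controller-0.log <<'EOF'
2018-04-19T23:02:10,101 | ERROR | Shard member-0-shard-default-operational currently has no leader
EOF
cat > karaf-controller-1.log <<'EOF'
2018-04-19T23:01:55,040 | INFO  | Starting member-1
EOF
cat > karaf-controller-2.log <<'EOF'
2018-04-19T23:02:09,900 | WARN  | Leader election timed out
EOF

# Tag each line with its node, then sort the union on the timestamp
# (field 2, after the node tag) for one chronological view of the cluster.
for n in 0 1 2; do
  sed "s/^/controller-$n /" "karaf-controller-$n.log"
done | sort -k2 > merged.log

cat merged.log
```

With the operator's action timestamps noted alongside, this merged view makes it possible to see which node reacted first to a kill or restart.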

Comment 21 Sai Sindhur Malleni 2018-05-08 20:50:13 UTC
I spent quite some time today trying to reproduce this. It seems to be fairly hard to reproduce and probably needs certain code paths to be hit to reproduce successfully. I tried the following scenarios and couldn't reproduce the bug.

1. Take down leader in idle cloud and bring it back up
2. Take down follower in idle cloud and bring it back up
3. Take down leader in cloud under load tests and bring it back up
4. Take down follower in cloud under load tests and bring it back up

Given all the mystery surrounding this bug and how it manifests itself in different ways at different times in different releases, I think it is safe to close this bug. We can open a new one later if needed.

Comment 22 Mike Kolesnik 2018-05-09 06:34:52 UTC
Thanks for checking, Sai. I'll move the bug back to VERIFIED then, and let's open a new one if something pops up.

Comment 24 errata-xmlrpc 2018-06-27 13:34:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

