1649431 – [Infra] ODL fails to join cluster on stopping and restarting

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	opendaylight
Sub Component:
Version:	13.0 (Queens)
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	z5
Target Release:	13.0 (Queens)
Assignee:	Stephen Kitt
QA Contact:	Noam Manos
Docs Contact:
URL:
Whiteboard:	Infra
Depends On:
Blocks:

Reported:	2018-11-13 15:41 UTC by Sai Sindhur Malleni
Modified:	2019-03-06 16:17 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-03-06 16:16:44 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
OpenDaylight Bug	CONTROLLER-1626	0	None	None	None	2018-11-14 13:35:44 UTC

Description Sai Sindhur Malleni 2018-11-13 15:41:52 UTC

Description of problem: In a clustered environment, we stopped each ODL instance (one by one) and brought it back up with Oracle JDK instead. Two of the three ODL instances came back normally, and the leader changed when the leader was brought down. However, the ODL instance on controller-0 never managed to rejoin the cluster (controller-1 is leader and controller-2 is follower after all the containers were brought back up). We keep seeing errors such as:
2018-11-13T14:15:10,141 | WARN  | ForkJoinPool.commonPool-worker-43 | AbstractShardBackendResolver     | 227 - org.opendaylight.controller.sal-distributed-datastore - 1.7.4.redhat-4 | Failed to resolve shard
java.util.concurrent.TimeoutException: Connection attempt failed
        at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.wrap(AbstractShardBackendResolver.java:129) ~[227:org.opendaylight.controller.sal-distributed-datastore:1.7.4.redhat-4]
        at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.onConnectResponse(AbstractShardBackendResolver.java:148) ~[227:org.opendaylight.controller.sal-distributed-datastore:1.7.4.redhat-4]
        at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.lambda$connectShard$2(AbstractShardBackendResolver.java:140) ~[227:org.opendaylight.controller.sal-distributed-datastore:1.7.4.redhat-4]
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) [?:?]
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) [?:?]
        at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:443) [?:?]
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) [?:?]
        at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) [?:?]
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) [?:?]
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) [?:?]
Caused by: org.opendaylight.controller.cluster.access.concepts.RetiredGenerationException: Originating generation 0 was superseded by 4
        at org.opendaylight.controller.cluster.datastore.Shard.findFrontend(Shard.java:486) ~[227:org.opendaylight.controller.sal-distributed-datastore:1.7.4.redhat-4]
        at org.opendaylight.controller.cluster.datastore.Shard.handleConnectClient(Shard.java:527) ~[227:org.opendaylight.controller.sal-distributed-datastore:1.7.4.redhat-4]
        at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:328) ~[227:org.opendaylight.controller.sal-distributed-datastore:1.7.4.redhat-4]
        at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270) ~[212:org.opendaylight.controller.sal-akka-raft:1.7.4.redhat-4]
        at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:44) ~[220:org.opendaylight.controller.sal-clustering-commons:1.7.4.redhat-4]
        at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:275) ~[45:com.typesafe.akka.persistence:2.5.11]
        at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104) ~[220:org.opendaylight.controller.sal-clustering-commons:1.7.4.redhat-4]
        at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:608) ~[42:com.typesafe.akka.actor:2.5.11]
        at akka.actor.Actor.aroundReceive(Actor.scala:517) ~[42:com.typesafe.akka.actor:2.5.11]
        at akka.actor.Actor.aroundReceive$(Actor.scala:515) ~[42:com.typesafe.akka.actor:2.5.11]
        at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:273) ~[45:com.typesafe.akka.persistence:2.5.11]
        at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:691) ~[45:com.typesafe.akka.persistence:2.5.11]
        at akka.persistence.Eventsourced.aroundReceive(Eventsourced.scala:192) ~[45:com.typesafe.akka.persistence:2.5.11]
        at akka.persistence.Eventsourced.aroundReceive$(Eventsourced.scala:191) ~[45:com.typesafe.akka.persistence:2.5.11]
        at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:273) ~[45:com.typesafe.akka.persistence:2.5.11]
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:590) ~[42:com.typesafe.akka.actor:2.5.11]
        at akka.actor.ActorCell.invoke(ActorCell.scala:559) ~[42:com.typesafe.akka.actor:2.5.11]
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) ~[42:com.typesafe.akka.actor:2.5.11]
        at akka.dispatch.Mailbox.run(Mailbox.scala:224) ~[42:com.typesafe.akka.actor:2.5.11]
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234) ~[42:com.typesafe.akka.actor:2.5.11]
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) ~[42:com.typesafe.akka.actor:2.5.11]
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) ~[42:com.typesafe.akka.actor:2.5.11]
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) ~[42:com.typesafe.akka.actor:2.5.11]
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) ~[42:com.typesafe.akka.actor:2.5.11]
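
For triage, the key line in the trace above is the "Caused by" entry, which carries both generation numbers. The following is a minimal sketch, assuming the message keeps the wording shown in this log, of how those two numbers could be pulled out of a log line. The class and method names are hypothetical and are not part of OpenDaylight.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for log triage; not part of OpenDaylight.
class RetiredGenerationLogSketch {

    // Matches the message text as it appears in the log excerpt above.
    private static final Pattern RETIRED = Pattern.compile(
            "RetiredGenerationException: Originating generation (\\d+) was superseded by (\\d+)");

    /** Returns {originating, superseding} if the line carries the message, otherwise null. */
    static long[] parse(String line) {
        Matcher m = RETIRED.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new long[] {Long.parseLong(m.group(1)), Long.parseLong(m.group(2))};
    }

    public static void main(String[] args) {
        String line = "Caused by: org.opendaylight.controller.cluster.access.concepts."
                + "RetiredGenerationException: Originating generation 0 was superseded by 4";
        long[] generations = parse(line);
        System.out.println("originating=" + generations[0] + " superseding=" + generations[1]);
    }
}
```

An originating generation lower than the superseding one means the connecting node presented an older generation than the one the shard already has on record.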

Version-Release number of selected component (if applicable):
OSP13

How reproducible:
Not sure

Steps to Reproduce:
1. Stop and bring back one ODL at a time in a clustered setup

Actual results:
ODL on controller-0 never joins cluster

Expected results:
ODL on controller-0 should be able to join the cluster

Additional info:

Comment 1 Sai Sindhur Malleni 2018-11-13 15:43:00 UTC

Logs: http://file.rdu.redhat.com/~smalleni/rhbz-1649431/

Comment 2 Stephen Kitt 2018-11-14 13:35:45 UTC

It turns out this is a known issue upstream: clearing the journal and snapshots on one node resets its generation to 0, and the node can then never rejoin a running cluster that has already recorded a later generation for it. There might be a manual workaround; I’m waiting for more information upstream.
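
To illustrate the failure mode, here is a minimal sketch of a generation check, assuming the backend simply remembers the highest generation it has seen per member and rejects anything older. This is a toy model and not the actual OpenDaylight code; the class names, the `connect` method, and the member name `member-0` are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a generation check; NOT the OpenDaylight implementation.
class GenerationCheckSketch {

    /** Thrown when a member connects with a generation older than the one on record. */
    static class RetiredGenerationSketchException extends Exception {
        RetiredGenerationSketchException(long originating, long current) {
            super("Originating generation " + originating + " was superseded by " + current);
        }
    }

    // Highest generation seen so far, keyed by (hypothetical) member name.
    private final Map<String, Long> knownGenerations = new HashMap<>();

    /** Accepts a connection whose generation is current or newer; rejects an older one. */
    void connect(String member, long generation) throws RetiredGenerationSketchException {
        Long known = knownGenerations.get(member);
        if (known != null && generation < known) {
            throw new RetiredGenerationSketchException(generation, known);
        }
        knownGenerations.put(member, generation);
    }

    public static void main(String[] args) {
        GenerationCheckSketch shard = new GenerationCheckSketch();
        try {
            shard.connect("member-0", 4); // the running cluster has seen generation 4
            shard.connect("member-0", 0); // the node returns after its state was cleared
        } catch (RetiredGenerationSketchException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the second call is rejected with the same message seen in the log, and retrying with generation 0 can never succeed, because the recorded generation only moves forward.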

Comment 3 Sai Sindhur Malleni 2018-11-15 13:51:49 UTC

Stephen,

But we cleared journal and snapshots on the other two nodes also and they joined the cluster without issues. Also, I repeated the same testing yesterday but all three nodes were able to join the cluster this time.

Comment 4 Stephen Kitt 2018-11-15 14:06:24 UTC

(In reply to Sai Sindhur Malleni from comment #3)
> But we cleared journal and snapshots on the other two nodes also and they
> joined the cluster without issues. Also, I repeated the same testing
> yesterday but all three nodes were able to join the cluster this time.

Yes, it doesn’t always happen.

Comment 7 Franck Baudin 2019-03-06 16:16:44 UTC

As per the deprecation notice [1], closing this bug. Please reopen if relevant for RHOSP13, as this is the only version shipping ODL.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/release_notes/index#deprecated_functionality

Comment 8 Franck Baudin 2019-03-06 16:17:49 UTC

As per the deprecation notice [1], closing this bug. Please reopen if relevant for RHOSP13, as this is the only version shipping ODL.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/release_notes/index#deprecated_functionality
