Description of problem:
In a clustered environment we tried to stop each ODL (one by one) and bring it back up with Oracle JDK instead. Two of the 3 ODLs come back normally, and the leader changes when the leader is brought down. However, the ODL on controller-0 never manages to rejoin the cluster (controller-1 is leader and controller-2 is follower after all the containers were brought back up). We keep seeing errors such as:

2018-11-13T14:15:10,141 | WARN | ForkJoinPool.commonPool-worker-43 | AbstractShardBackendResolver | 227 - org.opendaylight.controller.sal-distributed-datastore - 1.7.4.redhat-4 | Failed to resolve shard
java.util.concurrent.TimeoutException: Connection attempt failed
    at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.wrap(AbstractShardBackendResolver.java:129) ~[227:org.opendaylight.controller.sal-distributed-datastore:1.7.4.redhat-4]
    at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.onConnectResponse(AbstractShardBackendResolver.java:148) ~[227:org.opendaylight.controller.sal-distributed-datastore:1.7.4.redhat-4]
    at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.lambda$connectShard$2(AbstractShardBackendResolver.java:140) ~[227:org.opendaylight.controller.sal-distributed-datastore:1.7.4.redhat-4]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) [?:?]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) [?:?]
    at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:443) [?:?]
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) [?:?]
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) [?:?]
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) [?:?]
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) [?:?]
Caused by: org.opendaylight.controller.cluster.access.concepts.RetiredGenerationException: Originating generation 0 was superseded by 4
    at org.opendaylight.controller.cluster.datastore.Shard.findFrontend(Shard.java:486) ~[227:org.opendaylight.controller.sal-distributed-datastore:1.7.4.redhat-4]
    at org.opendaylight.controller.cluster.datastore.Shard.handleConnectClient(Shard.java:527) ~[227:org.opendaylight.controller.sal-distributed-datastore:1.7.4.redhat-4]
    at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:328) ~[227:org.opendaylight.controller.sal-distributed-datastore:1.7.4.redhat-4]
    at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270) ~[212:org.opendaylight.controller.sal-akka-raft:1.7.4.redhat-4]
    at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:44) ~[220:org.opendaylight.controller.sal-clustering-commons:1.7.4.redhat-4]
    at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:275) ~[45:com.typesafe.akka.persistence:2.5.11]
    at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104) ~[220:org.opendaylight.controller.sal-clustering-commons:1.7.4.redhat-4]
    at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:608) ~[42:com.typesafe.akka.actor:2.5.11]
    at akka.actor.Actor.aroundReceive(Actor.scala:517) ~[42:com.typesafe.akka.actor:2.5.11]
    at akka.actor.Actor.aroundReceive$(Actor.scala:515) ~[42:com.typesafe.akka.actor:2.5.11]
    at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:273) ~[45:com.typesafe.akka.persistence:2.5.11]
    at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:691) ~[45:com.typesafe.akka.persistence:2.5.11]
    at akka.persistence.Eventsourced.aroundReceive(Eventsourced.scala:192) ~[45:com.typesafe.akka.persistence:2.5.11]
    at akka.persistence.Eventsourced.aroundReceive$(Eventsourced.scala:191) ~[45:com.typesafe.akka.persistence:2.5.11]
    at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:273) ~[45:com.typesafe.akka.persistence:2.5.11]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:590) ~[42:com.typesafe.akka.actor:2.5.11]
    at akka.actor.ActorCell.invoke(ActorCell.scala:559) ~[42:com.typesafe.akka.actor:2.5.11]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) ~[42:com.typesafe.akka.actor:2.5.11]
    at akka.dispatch.Mailbox.run(Mailbox.scala:224) ~[42:com.typesafe.akka.actor:2.5.11]
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234) ~[42:com.typesafe.akka.actor:2.5.11]
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) ~[42:com.typesafe.akka.actor:2.5.11]
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) ~[42:com.typesafe.akka.actor:2.5.11]
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) ~[42:com.typesafe.akka.actor:2.5.11]
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) ~[42:com.typesafe.akka.actor:2.5.11]

Version-Release number of selected component (if applicable):
OSP13

How reproducible:
Not sure

Steps to Reproduce:
1. Stop and bring back one ODL at a time in a clustered setup

Actual results:
ODL on controller-0 never joins cluster

Expected results:
ODL on controller-0 should be able to join the cluster

Additional info:
Logs: http://file.rdu.redhat.com/~smalleni/rhbz-1649431/
It turns out this is a known issue upstream: clearing the journal and snapshots on one node resets that node's generation to 0, and the node can then never rejoin a running cluster that has already recorded a later generation for it. There might be a manual workaround; I’m waiting for more information from upstream.
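To illustrate the failure mode described above, here is a minimal sketch of a generation check. It is not the actual OpenDaylight code: the class `GenerationCheckSketch`, its `connect` method, the nested exception class, and the member name `member-0` are all hypothetical and only mirror the message seen in the log ("Originating generation 0 was superseded by 4"). The assumption is that the shard remembers the highest generation it has seen per member and rejects a connection that presents a lower one.

```java
import java.util.HashMap;
import java.util.Map;

public class GenerationCheckSketch {
    /** Hypothetical stand-in for the exception seen in the log; not the ODL class. */
    static class RetiredGenerationException extends IllegalStateException {
        RetiredGenerationException(long originating, long current) {
            super("Originating generation " + originating + " was superseded by " + current);
        }
    }

    // Assumption: the shard tracks the highest generation seen for each member.
    private final Map<String, Long> knownGenerations = new HashMap<>();

    /** Accepts a connection unless the member presents an older generation than recorded. */
    void connect(String member, long generation) {
        Long known = knownGenerations.get(member);
        if (known != null && generation < known) {
            throw new RetiredGenerationException(generation, known);
        }
        knownGenerations.put(member, generation);
    }

    public static void main(String[] args) {
        GenerationCheckSketch shard = new GenerationCheckSketch();
        shard.connect("member-0", 4); // the running cluster has already seen generation 4
        try {
            // After journal and snapshots are wiped, the node starts again at generation 0.
            shard.connect("member-0", 0);
        } catch (RetiredGenerationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under this assumption, a wiped node is rejected for as long as the rest of the cluster still holds the higher generation for it.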
Stephen, but we also cleared the journal and snapshots on the other two nodes, and they joined the cluster without issues. Also, I repeated the same test yesterday, and all three nodes were able to join the cluster this time.
(In reply to Sai Sindhur Malleni from comment #3)
> But we cleared journal and snapshots on the other two nodes also and they
> joined the cluster without issues. Also, I repeated the same testing
> yesterday but all three nodes were able to join the cluster this time.

Yes, it doesn’t always happen.
As per deprecation notice [1], closing this bug. Please reopen if relevant for RHOSP13, as this is the only version shipping ODL.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/release_notes/index#deprecated_functionality