Bug 1567509

Summary:	[Infra] Error writing to data store- node was deleted by another transaction
Product:	Red Hat OpenStack	Reporter:	Sai Sindhur Malleni <smalleni>
Component:	opendaylight	Assignee:	Michael Vorburger <vorburger>
Status:	CLOSED WONTFIX	QA Contact:	Noam Manos <nmanos>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	13.0 (Queens)	CC:	aadam, mkolesni, smalleni
Target Milestone:	---	Keywords:	Triaged, ZStream
Target Release:	14.0 (Rocky)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	Infra
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-03-06 16:15:14 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Sai Sindhur Malleni 2018-04-14 14:29:39 UTC

Description of problem:
When running a Browbeat+Rally use case in my downstream scale CI which does:
Create a network
Create a sbunet
Create a router
Attach router to subnet and oublic network
Boot VM with floating IP
Ping VM

with concurrency 8 and times set to 50 we see that some VMs remain unpingable even after 300 seconds and we see a lot of the following exceptions in karaf logs


2018-04-13T23:40:08,068 | ERROR | ForkJoinPool-1-worker-27 | ElanUtils                        | 351 - org.opendaylight.netvirt.elanmanager-impl - 0.6.0.redhat-5 | Error writing to datastore {}
java.util.concurrent.ExecutionException: OptimisticLockFailedException{message=Optimistic lock failed., errorList=[RpcError [message=Optimistic lock failed., severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was deleted by other transaction.]]}
        at org.opendaylight.mdsal.common.api.MappingCheckedFuture.wrapInExecutionException(MappingCheckedFuture.java:66) [296:org.opendaylight.mdsal.common-api:2.4.0.redhat-5]
        at org.opendaylight.mdsal.common.api.MappingCheckedFuture.get(MappingCheckedFuture.java:79) [296:org.opendaylight.mdsal.common-api:2.4.0.redhat-5]
        at org.opendaylight.netvirt.elan.utils.ElanUtils.waitForTransactionToComplete(ElanUtils.java:1381) [351:org.opendaylight.netvirt.elanmanager-impl:0.6.0.redhat-5]
        at org.opendaylight.netvirt.elan.internal.ElanInterfaceManager.removeElanInterface(ElanInterfaceManager.java:275) [351:org.opendaylight.netvirt.elanmanager-impl:0.6.0.redhat-5]
        at org.opendaylight.netvirt.elan.internal.InterfaceRemoveWorkerOnElan.call(InterfaceRemoveWorkerOnElan.java:52) [351:org.opendaylight.netvirt.elanmanager-impl:0.6.0.redhat-5]
        at org.opendaylight.netvirt.elan.internal.InterfaceRemoveWorkerOnElan.call(InterfaceRemoveWorkerOnElan.java:21) [351:org.opendaylight.netvirt.elanmanager-impl:0.6.0.redhat-5]
        at org.opendaylight.infrautils.jobcoordinator.internal.JobCoordinatorImpl$MainTask.runWithUncheckedExceptionLogging(JobCoordinatorImpl.java:416) [281:org.opendaylight.infrautils.jobcoordinator-impl:1.3.0.redhat-5]
        at org.opendaylight.infrautils.utils.concurrent.LoggingUncaughtThreadDeathContextRunnable.run(LoggingUncaughtThreadDeathContextRunnable.java:60) [287:org.opendaylight.infrautils.util:1.3.0.redhat-5]
        at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402) [?:?]
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) [?:?]
        at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) [?:?]
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) [?:?]
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) [?:?]
Caused by: org.opendaylight.controller.md.sal.common.api.data.OptimisticLockFailedException: Optimistic lock failed.
        at org.opendaylight.controller.cluster.datastore.ShardDataTree.lambda$processNextPendingTransaction$0(ShardDataTree.java:740) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextPending(ShardDataTree.java:778) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextPendingTransaction(ShardDataTree.java:725) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardDataTree.startCanCommit(ShardDataTree.java:808) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.canCommit(SimpleShardDataTreeCohort.java:84) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.CohortEntry.canCommit(CohortEntry.java:97) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleCanCommit(ShardCommitCoordinator.java:236) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleReadyLocalTransaction(ShardCommitCoordinator.java:200) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.Shard.handleReadyLocalTransaction(Shard.java:731) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:333) ~[?:?]
        at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270) ~[?:?]
        at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:44) ~[?:?]
        at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:274) ~[?:?]
        at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104) ~[?:?]
        at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:545) ~[?:?]
        at akka.actor.Actor.aroundReceive(Actor.scala:514) ~[?:?]
        at akka.actor.Actor.aroundReceive$(Actor.scala:512) ~[?:?]
 at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:272) ~[?:?]
        at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:663) ~[?:?]
        at akka.persistence.Eventsourced.aroundReceive(Eventsourced.scala:183) ~[?:?]
        at akka.persistence.Eventsourced.aroundReceive$(Eventsourced.scala:182) ~[?:?]
        at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:272) ~[?:?]
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:527) ~[?:?]
        at akka.actor.ActorCell.invoke(ActorCell.scala:496) ~[?:?]
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) ~[?:?]
        at akka.dispatch.Mailbox.run(Mailbox.scala:224) ~[?:?]
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234) ~[?:?]
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) ~[?:?]
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) ~[?:?]
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) ~[?:?]
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) ~[?:?]
Caused by: org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was deleted by other transaction.
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkTouchApplicable(AbstractNodeContainerModificationStrategy.java:287) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:132) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:307) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkTouchApplicable(AbstractNodeContainerModificationStrategy.java:290) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:132) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:307) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkTouchApplicable(AbstractNodeContainerModificationStrategy.java:290) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:132) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.StructuralContainerModificationStrategy.checkApplicable(StructuralContainerModificationStrategy.java:101) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:307) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkTouchApplicable(AbstractNodeContainerModificationStrategy.java:290) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:132) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.RootModificationApplyOperation.checkApplicable(RootModificationApplyOperation.java:72) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractDataTreeTip.validate(AbstractDataTreeTip.java:35) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardDataTree.lambda$processNextPendingTransaction$0(ShardDataTree.java:732) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextPending(ShardDataTree.java:778) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextPendingTransaction(ShardDataTree.java:725) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardDataTree.startCanCommit(ShardDataTree.java:808) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.canCommit(SimpleShardDataTreeCohort.java:84) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.CohortEntry.canCommit(CohortEntry.java:97) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleCanCommit(ShardCommitCoordinator.java:236) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleReadyLocalTransaction(ShardCommitCoordinator.java:200) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.Shard.handleReadyLocalTransaction(Shard.java:731) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:333) ~[?:?]
        at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270) ~[?:?]
        at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:44) ~[?:?]
        at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:274) ~[?:?]
        at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104) ~[?:?]
        at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:545) ~[?:?]
        at akka.actor.Actor.aroundReceive(Actor.scala:514) ~[?:?]
        at akka.actor.Actor.aroundReceive$(Actor.scala:512) ~[?:?]
        at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:272) ~[?:?]
        at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:663) ~[?:?]
        at akka.persistence.Eventsourced.aroundReceive(Eventsourced.scala:183) ~[?:?]
        at akka.persistence.Eventsourced.aroundReceive$(Eventsourced.scala:182) ~[?:?]
        at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:272) ~[?:?]
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:527) ~[?:?]
        at akka.actor.ActorCell.invoke(ActorCell.scala:496) ~[?:?]
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) ~[?:?]
        at akka.dispatch.Mailbox.run(Mailbox.scala:224) ~[?:?]
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234) ~[?:?]
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) ~[?:?]
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) ~[?:?]
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) ~[?:?]
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) ~[?:?]


Version-Release number of selected component (if applicable):
OSP 13
opendaylight-8.0.0-5.el7ost.noarch


How reproducible: 100%

Steps to Reproduce:
1. Install OSp13 + ODL
2. Run scale usecase mentioned above
3.

Actual results:
Exceptions in karaf logs and some vms remain unpingable

Expected results:
No exceptions and VMs should be pingable

Additional info:

Comment 1 Mike Kolesnik 2018-04-16 06:22:12 UTC

*** Bug 1567508 has been marked as a duplicate of this bug. ***

Comment 3 Michael Vorburger 2018-04-16 08:52:16 UTC

> make sure to clean the logs from the stack trace since this exception is expected

where was this determined, and how will that actually make the VM pingable?

Let's have a closer look in https://jira.opendaylight.org/browse/NETVIRT-1215 ...

Comment 4 Michael Vorburger 2018-04-16 09:24:49 UTC

Sai, as far as I can tell from looking at the netvrit code upstream (which I'm not an expert on), there should already be retries in the code, so this is indeed a "false" error log.. I've just raised two changes which should suppress this, but it won't actually "fix" anything - so if "we see that some VMs remain unpingable", that's actually another real issue probably (do you see any other errors in the log when this happens?) - would you like to open a new BZ about that for someone else, so that we can close this one when we'll have the logging cleaned up?

Comment 6 Mike Kolesnik 2018-04-23 10:06:40 UTC

AFAIK when an optimistic lock fails it is an expected behavior since that's how optimistic locking works.

Let me know if that's not the case..

Comment 7 Michael Vorburger 2018-04-23 11:19:50 UTC

> when an optimistic lock fails it is an expected behavior
> since that's how optimistic locking works.
> Let me know if that's not the case.

Well, no, it's not the case.  It's really a little bit more complicated:

IFF we have code for handling optimistic lock failures in ODL e.g. via retries, then we shold not show ERROR logs.   This is often done wrong in code I see and review.

But if we do not, then such ERROR logs do indicate the writes failed.  So to just blindly demand "to clean the logs from the stack trace since this exception is expected" is not the correct solution.  Instead, we must make sure we "handle" this in the respective code (e.g. with retries triggered by correct exception handling) - see NETVIRT-1215.

Comment 10 Franck Baudin 2019-03-06 16:15:14 UTC

As per depreciation notice [1], closing this bug. Please reopen if relevant for RHOSP13, as this is the only version shipping ODL.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/release_notes/index#deprecated_functionality

Comment 11 Franck Baudin 2019-03-06 16:17:21 UTC

As per depreciation notice [1], closing this bug. Please reopen if relevant for RHOSP13, as this is the only version shipping ODL.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/release_notes/index#deprecated_functionality