Replicated TX cache, nodes A, B, C:
0. A and B have topology 2, C already has topology 3
1. A sends prepare with topology 2 to B and C; both apply the prepare and respond
2. C forwards the prepare to B with topology 3
3. A sends commit with topology 2 to B and C; both commit and respond
4. again, C forwards the commit to B with topology 3
5. A and B receive the updated topology id
6. A executes another transaction on the same entry
7. the prepare and commit from the first transaction, with topology 3, arrive at B; B overwrites (or removes) the entry again
Result: B is left in an inconsistent state
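The replay in step 7 could be caught by remembering which transactions have already completed locally; a forwarded prepare/commit for a completed transaction would then be ignored instead of re-applied. A minimal sketch of that idea (the `CompletedTxRegistry` class and method names are hypothetical, not Infinispan's actual API):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: track transactions that have fully committed on this
// node, so a late forwarded prepare/commit for the same transaction is
// discarded rather than executed a second time.
class CompletedTxRegistry {
    private final Set<String> completed = ConcurrentHashMap.newKeySet();

    /** Record that the given global transaction has committed on this node. */
    void markCompleted(String gtx) {
        completed.add(gtx);
    }

    /** A forwarded command for an already-completed tx must not be replayed. */
    boolean shouldExecute(String gtx) {
        return !completed.contains(gtx);
    }
}
```

With such a registry, the forwarded commit in step 4/7 would find the transaction already completed on B and be dropped.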
Dan Berindei <dberinde> made a comment on jira ISPN-3745 [~rvansa] What's the cache configuration? The forwarding is always done synchronously, so node A couldn't receive the prepare response and send the commit until C finished its forwarding.
Radim Vansa <rvansa> made a comment on jira ISPN-3745 You're right, since I have a synchronous tx cache, the forwarding should be synchronous. Regrettably, I'm missing the logs from the forwarding node (they got truncated), but just to let you see what happened:
{code}
04:19:29,410 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-95,default,apex862-11617) Attempting to execute command: CommitCommand {gtx=GlobalTransaction:<apex861-22006>:164595: local, cacheName='testCache', topologyId=18} [sender=apex861-22006]
04:19:29,411 TRACE [org.infinispan.remoting.InboundInvocationHandlerImpl] (remote-thread-14) Calling perform() on CommitCommand {gtx=GlobalTransaction:<apex861-22006>:164595:remote, cacheName='testCache', topologyId=18}
04:19:29,412 TRACE [org.infinispan.remoting.InboundInvocationHandlerImpl] (remote-thread-14) About to send back response SuccessfulResponse{responseValue=null} for command CommitCommand {gtx=GlobalTransaction:<apex861-22006>:164595:remote, cacheName='testCache', topologyId=18}
04:19:31,301 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-78,default,apex862-11617) Attempting to execute command: PrepareCommand {modifications=[ ... ], onePhaseCommit=false, gtx=GlobalTransaction:<apex861-22006>:164595:local, cacheName='testCache', topologyId=19} [sender=apex863-20495]
{code}
Radim Vansa <rvansa> made a comment on jira ISPN-3745 Thinking about it once more, the broadcast optimization may be the villain here as well, because apex863 (the sender) had just joined. It received the prepare/commit because they were broadcast, but nobody waited for its response. It could then forward the commands to the old nodes, which executed them again.
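One way to see the hazard in this comment: if each node compared the command's topology id against its own installed topology before executing, a command carrying a stale topology would be rejected (forcing the originator to retry with fresh routing) instead of being silently forwarded by the joiner and executed twice. A minimal sketch of that check (the `TopologyGuard` class is hypothetical, not Infinispan's actual implementation):

```java
// Hypothetical sketch: guard command execution with a topology-id comparison.
// A command whose topologyId is older than the locally installed topology is
// rejected, so the originator must retry against the new topology rather than
// relying on the joiner to forward it.
class TopologyGuard {
    private volatile int currentTopologyId;

    TopologyGuard(int initialTopologyId) {
        this.currentTopologyId = initialTopologyId;
    }

    /** Install a newer topology; older ids are ignored. */
    void install(int newTopologyId) {
        if (newTopologyId > currentTopologyId) {
            currentTopologyId = newTopologyId;
        }
    }

    /** Accept only commands tagged with the current (or a newer) topology. */
    boolean accept(int commandTopologyId) {
        return commandTopologyId >= currentTopologyId;
    }
}
```

In the scenario above, once B installs topology 19, a replayed commit still tagged with topology 18 would fail this check and not be re-applied.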
Dan Berindei <dberinde> made a comment on jira ISPN-3745 Is topologyId=18 the id of the topology that contains the joiner, or of the topology before it? If it's the new topology and the command was initially invoked remotely with topology 17, then the command was forwarded; otherwise it was likely retransmitted by JGroups. I'm also inclined to think it's caused by JGroups retransmitting the message to the joiner and the originator not waiting for the response.
Radim Vansa <rvansa> made a comment on jira ISPN-3745 Topology 18 does not contain the joiner, 19 contains it.
Dan Berindei <dberinde> updated the status of jira ISPN-3745 to Resolved