Bug 1033087

Summary: Forwarded Prepare/Commit executed after transaction finished
Product: [JBoss] JBoss Data Grid 6
Reporter: Radim Vansa <rvansa>
Component: Infinispan
Assignee: Tristan Tarrant <ttarrant>
Status: VERIFIED
QA Contact: Martin Gencur <mgencur>
Severity: high
Priority: unspecified
Version: 6.2.0
CC: jdg-bugs
Target Milestone: CR1
Target Release: 6.2.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Bug Blocks: 1017190

Description Radim Vansa 2013-11-21 14:23:51 UTC
Replicated TX cache, nodes A, B, C

0. A and B have topology 2; C has already received topology 3
1. A sends prepare with topology 2 to B and C, both apply the prepare and respond
2. C forwards prepare to B with topology 3
3. A sends commit with topology 2 to B and C, both commit and respond
4. C forwards the commit to B with topology 3, as it did with the prepare
5. A and B get the updated topology id
6. A executes another transaction on the same entry
7. the prepare and commit of the first transaction, forwarded with topology 3, arrive at B, and B overwrites (or removes) the entry again

Result: B ends up in an inconsistent state.
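The race can be replayed as a minimal, single-threaded sketch of what B sees. The classes `NodeB` and `StaleForwardRace` are hypothetical and are not Infinispan code: a prepare only stages a value for a single key and a commit applies it. The `rejectCompletedTx` flag adds a guard that remembers committed transaction ids and drops late copies; it is shown only as one possible mitigation and is not necessarily how ISPN-3745 was resolved.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of node B; not Infinispan code. */
final class NodeB {
    private final Map<String, String> store = new HashMap<>();   // committed value of the single key "k"
    private final Map<String, String> pending = new HashMap<>(); // txId -> value staged by prepare
    private final Set<String> completed = new HashSet<>();       // txIds that were already committed
    private final boolean rejectCompletedTx;                     // hypothetical guard against late copies

    NodeB(boolean rejectCompletedTx) {
        this.rejectCompletedTx = rejectCompletedTx;
    }

    void prepare(String txId, String value) {
        if (rejectCompletedTx && completed.contains(txId)) {
            return; // late copy of an already committed transaction: ignore it
        }
        pending.put(txId, value);
    }

    void commit(String txId) {
        if (rejectCompletedTx && completed.contains(txId)) {
            return; // late copy of an already committed transaction: ignore it
        }
        String value = pending.remove(txId);
        if (value != null) {
            store.put("k", value);
        }
        completed.add(txId);
    }

    String value() {
        return store.get("k");
    }
}

final class StaleForwardRace {
    /** Replays the scenario from B's point of view and returns B's final value of "k". */
    static String replay(boolean rejectCompletedTx) {
        NodeB b = new NodeB(rejectCompletedTx);
        b.prepare("tx1", "v1"); // step 1: prepare from A, topology 2
        b.commit("tx1");        // step 3: commit from A, topology 2
        b.prepare("tx2", "v2"); // step 6: A's second transaction on the same entry
        b.commit("tx2");
        b.prepare("tx1", "v1"); // step 7: C's forwarded prepare (topology 3) arrives late
        b.commit("tx1");        //         ...followed by C's forwarded commit
        return b.value();
    }

    public static void main(String[] args) {
        System.out.println("unguarded B: " + replay(false));
        System.out.println("guarded B:   " + replay(true));
    }
}
```

The unguarded node ends with `v1`, i.e. the stale copy of the first transaction overwrites the second transaction's `v2`; the guarded node keeps `v2`.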

Comment 2 JBoss JIRA Server 2013-12-02 15:27:48 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-3745

[~rvansa] What's the cache configuration? The forwarding is always done synchronously, so node A couldn't receive the prepare response and send the commit until C finished its forwarding.
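Dan's argument can be illustrated with a small model in which every RPC is a plain method call and therefore blocks until the callee returns. `SyncNode` and `SyncForwardingDemo` are hypothetical names; the model covers only the prepare phase and the point at which A decides to commit.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical node model: an RPC is a direct method call, so it blocks until the callee returns. */
final class SyncNode {
    private final String name;
    private final int topologyId;
    private final List<String> log;
    private SyncNode forwardTarget; // only set on the node that forwards (C)

    SyncNode(String name, int topologyId, List<String> log) {
        this.name = name;
        this.topologyId = topologyId;
        this.log = log;
    }

    void setForwardTarget(SyncNode target) {
        this.forwardTarget = target;
    }

    void prepare(String sender, int commandTopologyId) {
        log.add(name + " applies prepare from " + sender + " (topology " + commandTopologyId + ")");
        // A node that already has a newer topology re-sends the command with its own
        // topology id and waits for the result before answering the original sender.
        if (forwardTarget != null && commandTopologyId < topologyId) {
            forwardTarget.prepare(name, topologyId);
        }
    }
}

final class SyncForwardingDemo {
    static List<String> run() {
        List<String> log = new ArrayList<>();
        SyncNode b = new SyncNode("B", 2, log);
        SyncNode c = new SyncNode("C", 3, log);
        c.setForwardTarget(b);
        // Originator A (topology 2) waits for every prepare response before committing.
        for (SyncNode target : List.of(b, c)) {
            target.prepare("A", 2);
        }
        log.add("A sends commit (topology 2)");
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Because C's forward to B happens inside C's handling of A's prepare, the log always shows B applying the forwarded prepare before A sends the commit. In this model the ordering can only break if some recipient's response is not awaited.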

Comment 3 JBoss JIRA Server 2013-12-03 08:07:37 UTC
Radim Vansa <rvansa> made a comment on jira ISPN-3745

You're right: since I have a synchronous TX cache, the forwarding should be synchronous. Unfortunately, I'm missing the logs from the forwarding node (they got truncated), but here is an excerpt that shows what happened:

{code}
04:19:29,410 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-95,default,apex862-11617) Attempting to execute command: CommitCommand {gtx=GlobalTransaction:<apex861-22006>:164595:local, cacheName='testCache', topologyId=18} [sender=apex861-22006]
04:19:29,411 TRACE [org.infinispan.remoting.InboundInvocationHandlerImpl] (remote-thread-14) Calling perform() on CommitCommand {gtx=GlobalTransaction:<apex861-22006>:164595:remote, cacheName='testCache', topologyId=18}
04:19:29,412 TRACE [org.infinispan.remoting.InboundInvocationHandlerImpl] (remote-thread-14) About to send back response SuccessfulResponse{responseValue=null}  for command CommitCommand {gtx=GlobalTransaction:<apex861-22006>:164595:remote, cacheName='testCache', topologyId=18}
04:19:31,301 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-78,default,apex862-11617) Attempting to execute command: PrepareCommand {modifications=[ ... ], onePhaseCommit=false, gtx=GlobalTransaction:<apex861-22006>:164595:local, cacheName='testCache', topologyId=19} [sender=apex863-20495]
{code}
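To pick such duplicates out of a longer TRACE log, the command name, originator, transaction number, topology id and sender can be extracted with a regular expression. The `TraceCommand` record is a hypothetical helper whose pattern is tuned only to the line format in the excerpt above; other Infinispan versions may log commands differently.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** One command extracted from a TRACE line (hypothetical helper, pattern tuned to the excerpt above). */
record TraceCommand(String command, String originator, long txId, int topologyId, String sender) {

    private static final Pattern CMD = Pattern.compile(
            "(\\w+Command) \\{.*?gtx=GlobalTransaction:<([^>]+)>:(\\d+):\\w+.*?topologyId=(\\d+)\\}"
            + "(?: \\[sender=([^\\]]+)\\])?");

    static Optional<TraceCommand> parse(String line) {
        Matcher m = CMD.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        // Group 5 is null when the line carries no "[sender=...]" suffix.
        return Optional.of(new TraceCommand(m.group(1), m.group(2), Long.parseLong(m.group(3)),
                Integer.parseInt(m.group(4)), m.group(5)));
    }

    /** True when the command was delivered by a node other than the transaction's originator. */
    boolean isForwarded() {
        return sender != null && !sender.equals(originator);
    }
}
```

Applied to the excerpt, the `CommitCommand` for transaction 164595 (topology 18) comes directly from the originator apex861-22006, while the `PrepareCommand` of the same transaction arrives about two seconds later with topology 19 and `isForwarded()` returning true, because its sender is apex863-20495.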

Comment 4 JBoss JIRA Server 2013-12-03 08:10:31 UTC
Radim Vansa <rvansa> made a comment on jira ISPN-3745

Thinking about it once more, the broadcast optimization may be the culprit here as well, because apex863 (the sender) had just joined. It received the prepare/commit because they were broadcast, but nobody waited for its response. It could then forward the commands to the old nodes, which executed them again.
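The gap described here is between the nodes that receive a broadcast and the nodes whose responses the originator waits for. A minimal sketch, assuming (as in the comment) that the broadcast reaches every node in the cluster view while the originator waits only for members of its own cache topology; `BroadcastWait` is a hypothetical helper and the member names are only illustrative.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class BroadcastWait {
    /**
     * Hypothetical helper: nodes that receive a broadcast (every member of the cluster
     * view) but whose responses the originator does not wait for, because they are not
     * yet members of the originator's cache topology.
     */
    static Set<String> unawaitedRecipients(Set<String> clusterView, Set<String> topologyMembers,
                                           String originator) {
        Set<String> result = new LinkedHashSet<>(clusterView);
        result.removeAll(topologyMembers); // responses from these members are awaited
        result.remove(originator);         // the originator does not wait for itself
        return result;
    }

    public static void main(String[] args) {
        Set<String> view = Set.of("apex861", "apex862", "apex863");
        Set<String> topologyWithoutJoiner = Set.of("apex861", "apex862");
        System.out.println(unawaitedRecipients(view, topologyWithoutJoiner, "apex861"));
    }
}
```

With a view of apex861, apex862 and apex863 and a topology that still lists only apex861 and apex862, the result is `[apex863]`: the joiner executes the command, but nothing waits for it, so whatever it forwards afterwards is not ordered with respect to the originator's next command.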

Comment 5 JBoss JIRA Server 2013-12-04 11:28:25 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-3745

Is topologyId = 18 the id of the topology that contains the joiner, or of the topology before it? If it's the new topology, and the command was initially invoked remotely with topology 17, then the command was forwarded; otherwise it was likely retransmitted by JGroups.

I'm inclined to think it's caused by JGroups retransmitting the message to the joiner and the originator not waiting for the response, too.

Comment 6 JBoss JIRA Server 2013-12-04 12:25:23 UTC
Radim Vansa <rvansa> made a comment on jira ISPN-3745

Topology 18 does not contain the joiner; 19 does.
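Dan's question reduces to comparing two topology ids: the one the originator invoked the command with and the one carried by the copy that arrived. The sketch below encodes that rule of thumb under the assumption that a forwarding node re-sends the command stamped with its own, newer topology id, whereas a JGroups retransmission resends the original message unchanged. `DuplicateOrigin` and `CopyOrigin` are hypothetical names.

```java
/** Hypothetical classification of a duplicate command, following the rule of thumb in the comments above. */
enum CopyOrigin { FORWARDED, LIKELY_RETRANSMITTED }

final class DuplicateOrigin {
    /**
     * @param invokedTopologyId  topology id the originator used when it sent the command
     * @param receivedTopologyId topology id carried by the copy that arrived
     */
    static CopyOrigin classify(int invokedTopologyId, int receivedTopologyId) {
        // Assumption: a forwarding node re-stamps the command with its own, newer topology id,
        // while a JGroups retransmission resends the original message, so the id is unchanged.
        return receivedTopologyId > invokedTopologyId
                ? CopyOrigin.FORWARDED
                : CopyOrigin.LIKELY_RETRANSMITTED;
    }

    public static void main(String[] args) {
        System.out.println(classify(18, 19)); // FORWARDED
        System.out.println(classify(18, 18)); // LIKELY_RETRANSMITTED
    }
}
```

Given Radim's answer, a copy received with topology 19 (the first one containing the joiner) of a command the originator had sent with topology 18 would be classified as `FORWARDED`.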

Comment 7 JBoss JIRA Server 2013-12-19 11:40:43 UTC
Dan Berindei <dberinde> updated the status of jira ISPN-3745 to Resolved