Bug 841889

Summary: Transaction leak caused by reordering between prepare and commit
Product: [JBoss] JBoss Data Grid 6 Reporter: Tristan Tarrant <ttarrant>
Component: LibraryAssignee: Tristan Tarrant <ttarrant>
Status: VERIFIED --- QA Contact: Martin Gencur <mgencur>
Severity: high Docs Contact:
Priority: high    
Version: 6.0.0CC: amanukya, jdg-bugs
Target Milestone: ER6   
Target Release: 6.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Known Issue
Doc Text:
This may cause locks to not be released and as a result, other transactions may not be able to write that data (though they are be able to read it). As a workaround, one can enable transaction recovery, then use the JMX recovery hooks in order to cleanup the pending lock.
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Tristan Tarrant 2012-07-20 13:46:27 UTC

Comment 1 mark yarborough 2012-08-22 13:52:19 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
CCFR from mmarkus

Comment 3 JBoss JIRA Server 2012-10-13 22:57:50 UTC
Mircea Markus <mmarkus> made a comment on jira ISPN-2081

My initial suggestion with waiting for all responses before sending the rollback is already in place as the ResponseMode is set to GET_ALL for prepare commands.
The only situation I can think of in which the problem can appear is when we get a replication timeout before the prepare message reaches all the participants. Then RpcManager throws a timeout exception and the rollback is sent async. Now this rollback might still get processed *before* the prepare on some nodes. 
I'll solve this as per a recommandation from Sebastiano Peluso:
- the rollback creates the RemoteTransaction and marks it as rollback
- when the prepare eventually arrives it checks to see if the transaction is not already marked for rollback. If so it removes it.
- if the prepare never arrives it means that the originator has crashed and the transaction should be cleaned by the remote lookup service

Comment 4 JBoss JIRA Server 2012-10-13 23:03:48 UTC
Mircea Markus <mmarkus> made a comment on jira ISPN-2081

This also require the following improvement, again suggested by Sebastiano Peluso:
- during the prepare, we first try to acquire a lock locally
- if this lock acquisition fails, we throw a timeout exception
- as response to the timeout exception the TransactionCoordinator triggers a rollback
- this rollback propagates to remote nodes. As the rollback also creates a transaction now (see previous comment) this transaction will never be cleaned
- what we basically need to do is not to send a remote rollback unless required, i.e. if this transaction went remotely

Comment 5 JBoss JIRA Server 2012-10-14 13:22:05 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-2081

I'm not sure about the first step, "try to acquire a lock locally". Say the local node (B) is a backup owner for key k1, the tx (tx1) acquires the lock locally, then it tries to go remote to the primary owner (A). The lock is held by another tx (tx2, originated from another node C) on the primary, but then A dies and B becomes the primary owner.

The tx1 caller will get a SuspectException, and the lock on B will be released. But the commit/tx completion command for tx2 might arrive on B just before that happens, and tx2 will try to unlock k1 while it is held by tx1 - resulting in an IllegalMonitorStateException.

I've been also thinking for some time that users shouldn't see SuspectExceptions and Infinispan should retry those commands instead. My idea was to retry automatically in the DistributionInterceptor, but if tx1 already held the lock on B it would be able to commit before tx2 - meaning we have to move the retry logic at a lower level.

Comment 6 JBoss JIRA Server 2012-10-14 13:39:48 UTC
Mircea Markus <mmarkus> made a comment on jira ISPN-2081

[~dan.berindei]"Say the local node (B) is a backup owner for key k1, the tx (tx1) acquires the lock locally" <-- in this situation the transaction would not acquire any lock locally, but just set the value of backup lock flag. The lock acquisition when run locally is no different from when run remotely. 
The situation I've described happens when the main owner of k1 is B and the the local tx cannot acquire a lock on it. In this case, with the current code, a remote rollback is sent even though the tx hasn't left the originator node. This is sub-optimal for now, but as we plan to create a remote transaction during rollback as well, this is not allowed to happen as that tx would leak.

Comment 7 JBoss JIRA Server 2012-10-14 14:10:08 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-2081

Ok, never mind then... I thought you wanted to change things so that the lock is always acquired on the local node, regardless of ownership.

Comment 8 JBoss JIRA Server 2012-10-17 10:26:04 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-2081

Integrated in master, but the fix can't be backported to 5.1.x - it would need a different approach.

Comment 9 mark yarborough 2012-11-14 14:42:23 UTC
ttarrant will add jira links as appropriate.

Comment 10 Anna Manukyan 2012-12-20 14:25:35 UTC
I've verified that all added tests are in place. They run on all environments and succeeded.