Bug 841889
Summary: | Transaction leak caused by reordering between prepare and commit | ||
---|---|---|---|
Product: | [JBoss] JBoss Data Grid 6 | Reporter: | Tristan Tarrant <ttarrant> |
Component: | Library | Assignee: | Tristan Tarrant <ttarrant> |
Status: | VERIFIED --- | QA Contact: | Martin Gencur <mgencur> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 6.0.0 | CC: | amanukya, jdg-bugs |
Target Milestone: | ER6 | ||
Target Release: | 6.1.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Known Issue | |
Doc Text: |
This may cause locks to not be released and as a result, other transactions may not be able to write that data (though they are be able to read it). As a workaround, one can enable transaction recovery, then use the JMX recovery hooks in order to cleanup the pending lock.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | Type: | Bug | |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Tristan Tarrant
2012-07-20 13:46:27 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: CCFR from mmarkus Mircea Markus <mmarkus> made a comment on jira ISPN-2081 My initial suggestion with waiting for all responses before sending the rollback is already in place as the ResponseMode is set to GET_ALL for prepare commands. The only situation I can think of in which the problem can appear is when we get a replication timeout before the prepare message reaches all the participants. Then RpcManager throws a timeout exception and the rollback is sent async. Now this rollback might still get processed *before* the prepare on some nodes. I'll solve this as per a recommandation from Sebastiano Peluso: - the rollback creates the RemoteTransaction and marks it as rollback - when the prepare eventually arrives it checks to see if the transaction is not already marked for rollback. If so it removes it. - if the prepare never arrives it means that the originator has crashed and the transaction should be cleaned by the remote lookup service Mircea Markus <mmarkus> made a comment on jira ISPN-2081 This also require the following improvement, again suggested by Sebastiano Peluso: - during the prepare, we first try to acquire a lock locally - if this lock acquisition fails, we throw a timeout exception - as response to the timeout exception the TransactionCoordinator triggers a rollback - this rollback propagates to remote nodes. As the rollback also creates a transaction now (see previous comment) this transaction will never be cleaned - what we basically need to do is not to send a remote rollback unless required, i.e. if this transaction went remotely Dan Berindei <dberinde> made a comment on jira ISPN-2081 I'm not sure about the first step, "try to acquire a lock locally". Say the local node (B) is a backup owner for key k1, the tx (tx1) acquires the lock locally, then it tries to go remote to the primary owner (A). The lock is held by another tx (tx2, originated from another node C) on the primary, but then A dies and B becomes the primary owner. The tx1 caller will get a SuspectException, and the lock on B will be released. But the commit/tx completion command for tx2 might arrive on B just before that happens, and tx2 will try to unlock k1 while it is held by tx1 - resulting in an IllegalMonitorStateException. I've been also thinking for some time that users shouldn't see SuspectExceptions and Infinispan should retry those commands instead. My idea was to retry automatically in the DistributionInterceptor, but if tx1 already held the lock on B it would be able to commit before tx2 - meaning we have to move the retry logic at a lower level. Mircea Markus <mmarkus> made a comment on jira ISPN-2081 [~dan.berindei]"Say the local node (B) is a backup owner for key k1, the tx (tx1) acquires the lock locally" <-- in this situation the transaction would not acquire any lock locally, but just set the value of backup lock flag. The lock acquisition when run locally is no different from when run remotely. The situation I've described happens when the main owner of k1 is B and the the local tx cannot acquire a lock on it. In this case, with the current code, a remote rollback is sent even though the tx hasn't left the originator node. This is sub-optimal for now, but as we plan to create a remote transaction during rollback as well, this is not allowed to happen as that tx would leak. Dan Berindei <dberinde> made a comment on jira ISPN-2081 Ok, never mind then... I thought you wanted to change things so that the lock is always acquired on the local node, regardless of ownership. Dan Berindei <dberinde> made a comment on jira ISPN-2081 Integrated in master, but the fix can't be backported to 5.1.x - it would need a different approach. ttarrant will add jira links as appropriate. I've verified that all added tests are in place. They run on all environments and succeeded. |