Bug 1310603 - [GSS](6.4.z) XA-Transaction commit inconsistent when slowing down process
Status: CLOSED CURRENTRELEASE
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: Transaction Manager
Version: 6.4.6
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: CR1
Target Release: EAP 6.4.9
Assigned To: jboss-set
QA Contact: Ondrej Chaloupka
Depends On:
Blocks: 1365876 eap649-payload 1325725
Reported: 2016-02-22 05:39 EST by Carsten Lichy-Bittendorf
Modified: 2017-03-16 11:46 EDT

Doc Type: Bug Fix
Type: Bug

Attachments
explains the test conditions and architecture (101.49 KB, application/pdf)
2016-02-22 05:39 EST, Carsten Lichy-Bittendorf
sequence diagram (5.57 MB, image/bmp)
2016-02-22 05:41 EST, Carsten Lichy-Bittendorf
byteman script for reproducer (2.37 KB, text/plain)
2016-02-22 05:42 EST, Carsten Lichy-Bittendorf


External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
JBoss Issue Tracker | JBEAP-3575 | Critical | Verified | (7.0.z) (EAP) Talk to the local transaction manager to determine if a transaction containing XAResources is still in-fli... | 2017-12-04 07:29 EST
JBoss Issue Tracker | JBTM-2583 | Major | Closed | Use the local ActionStatusService to determine if a transaction containing XAResources is still in-flight before relying... | 2017-12-04 07:29 EST
Red Hat Knowledge Base (Solution) | 2173391 | None | None | None | 2016-02-22 05:42 EST

Description Carsten Lichy-Bittendorf 2016-02-22 05:39:25 EST
Created attachment 1129225
explains the test conditions and architecture

Description of problem:

The customer hit an inconsistent outcome within an XA transaction involving
- a local JMS queue,
- a remote JMS queue,
- and an Oracle database.
The two JMS XAResources were committed, but the Oracle XAResource could not be committed because the recovery thread had rolled back the transaction at the same time.
It happens under heavy load, but the customer has identified a scenario using Byteman that reproduces the case systematically (please see the attached files "customize.btm" and "JMS Reliability.bmp").

Version-Release number of selected component (if applicable):

all EAP 6.x.x

How reproducible:

See the attached PDF "use cases tests hornetq.pdf", which explains the test conditions and architecture.
RH QE has those test cases in place.

Steps to Reproduce:
1. Run test case SEND_DB_66_CPU while the attached Byteman scripts are active and you'll see:
- two-phase commit is initiated
- topLevelPrepare is done on the 3 XAResources (local queue, remote queue and Oracle)
- ShadowNoFileLockStore is asked to write the transaction log
- [BYTEMAN artificial load - Sleep 7min]

BUT at the same time:
- The recovery thread launches a recovery pass on the transaction
- it asks the different orphan filters to vote on each XAResource, starting with the Oracle XAResource:
  o JTATransactionLogXAResourceOrphanFilter asks ShadowNoFileLockStore for the transaction status
  o ShadowNoFileLockStore looks for a transaction log, but it has not been written to disk yet
  o JTATransactionLogXAResourceOrphanFilter abstains from voting
  o JTANodeNameXAResourceOrphanFilter decides to roll back
  o [BYTEMAN artificial load - Sleep 3min]

The other thread wakes up:
- ShadowNoFileLockStore writes the transaction log
- the doCommit method is called
- [BYTEMAN artificial load - Sleep 1min]

The recovery thread wakes up:
- The Oracle transaction is rolled back
- handle orphan is called successively on the remote queue XAResource and the local queue XAResource
- as the transaction log now exists, JTATransactionLogXAResourceOrphanFilter returns LEAVE_ALONE
- neither of these two transactions is rolled back

The other thread wakes up:
- topLevelCommit on local queue => SUCCESS
- topLevelCommit on remote queue => SUCCESS
- topLevelCommit on oracle => FAILURE (ORA-24756: the transaction does not exist anymore, as it has been rolled back)
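
The interleaving above can be replayed in a small single-threaded model. This is only a sketch, not Narayana code: the classes LogStore and Resource and the vote-combination rule (any LEAVE_ALONE wins, otherwise any ROLLBACK wins) are assumptions made for illustration. The two filter methods merely imitate the behaviour this report attributes to JTATransactionLogXAResourceOrphanFilter and JTANodeNameXAResourceOrphanFilter.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified, single-threaded replay of the reported race; not Narayana code. */
final class OrphanRaceModel {

    /** The three votes an orphan filter can cast in this model. */
    enum Vote { ABSTAIN, ROLLBACK, LEAVE_ALONE }

    /** Hypothetical stand-in for the object store holding the transaction log. */
    static final class LogStore {
        private boolean logWritten;
        void writeLog() { logWritten = true; }
        boolean hasLog() { return logWritten; }
    }

    /** Hypothetical stand-in for a prepared XA resource. */
    static final class Resource {
        boolean rolledBack;
        boolean committed;
    }

    // Imitates the transaction-log filter: abstains while no log is on disk.
    static Vote transactionLogFilter(LogStore store) {
        return store.hasLog() ? Vote.LEAVE_ALONE : Vote.ABSTAIN;
    }

    // Imitates the node-name filter: the branch belongs to this node, so roll back.
    static Vote nodeNameFilter() {
        return Vote.ROLLBACK;
    }

    // Assumed combination rule: LEAVE_ALONE beats ROLLBACK, ABSTAIN is ignored.
    static boolean recoveryWantsRollback(LogStore store) {
        List<Vote> votes = Arrays.asList(transactionLogFilter(store), nodeNameFilter());
        return !votes.contains(Vote.LEAVE_ALONE) && votes.contains(Vote.ROLLBACK);
    }

    /** Replays the steps in the order given in the report and returns the outcome. */
    static String replay() {
        LogStore store = new LogStore();
        Resource local = new Resource();
        Resource remote = new Resource();
        Resource oracle = new Resource();

        // Recovery thread: decides on Oracle while the log write is still stalled.
        boolean rollBackOracle = recoveryWantsRollback(store);

        // Commit thread: the transaction log is finally written.
        store.writeLog();

        // Recovery thread: acts on the stale Oracle decision, then re-checks the queues.
        oracle.rolledBack = rollBackOracle;
        remote.rolledBack = recoveryWantsRollback(store);
        local.rolledBack = recoveryWantsRollback(store);

        // Commit thread: phase two succeeds only on branches that still exist.
        for (Resource r : Arrays.asList(local, remote, oracle)) {
            r.committed = !r.rolledBack;
        }
        boolean all = local.committed && remote.committed && oracle.committed;
        boolean none = !local.committed && !remote.committed && !oracle.committed;
        return all ? "COMMITTED" : none ? "ROLLED_BACK" : "HEURISTIC_MIXED";
    }

    public static void main(String[] args) {
        System.out.println(replay()); // prints HEURISTIC_MIXED
    }
}
```

The key point the model shows is that the rollback decision for Oracle is taken before the log exists, but carried out after it has been written.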

Actual results:

- The transaction ends in state "heuristic" because the JMS resources are committed while the DB is rolled back.

Expected results:

- Avoid the "heuristic" state: all resources should be either committed or rolled back.


Additional info:

The customer is expecting an enhancement in EAP 6 to prevent this. In all of the customer's environments the TM and RM are co-located, so there should be a synchronization mechanism to prevent the issue.

The initially proposed solution, setting orphanSafetyInterval to 10 minutes, does not satisfy the customer's expectations. They can easily reproduce the problem by increasing the artificial load and sleep time.
Moreover, the customer's production team has no such guarantee on reaction time (<10 minutes) when a problem occurs.
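
Why a fixed interval only narrows the window can be shown with a toy calculation. The sketch below assumes the interval acts as a grace period: a prepared branch without a transaction log becomes a rollback candidate once the stall exceeds the interval. The method name rollbackCandidate and the 12-minute figure are hypothetical; only the 7-minute sleep and the 10-minute interval come from this report.

```java
import java.time.Duration;

/** Toy model of a fixed grace period; not the Narayana implementation. */
final class OrphanSafetyIntervalModel {

    /**
     * Assumed rule: a prepared branch with no transaction log becomes a
     * rollback candidate once it has been stalled longer than the interval.
     */
    static boolean rollbackCandidate(Duration stall, Duration safetyInterval) {
        return stall.compareTo(safetyInterval) > 0;
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofMinutes(10);
        // The 7-minute stall from the reproducer is covered by the interval.
        System.out.println(rollbackCandidate(Duration.ofMinutes(7), interval));  // false
        // A longer, hypothetical 12-minute stall is not: the race is back.
        System.out.println(rollbackCandidate(Duration.ofMinutes(12), interval)); // true
    }
}
```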

A feasible solution might be based on https://issues.jboss.org/browse/JBTM-2583, as it might cover the scenario described above.
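
A rough sketch of that idea, under the assumption that the transaction manager and the recovery thread share a JVM: an extra filter asks an in-memory registry whether the transaction is still being processed and votes LEAVE_ALONE if so. InFlightRegistry, its methods and the string transaction ids are hypothetical names chosen for this example; they are not the API used by the actual fix.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of an in-flight check for co-located TM and recovery; names are hypothetical. */
final class InFlightAwareFilterSketch {

    enum Vote { ABSTAIN, ROLLBACK, LEAVE_ALONE }

    /** Hypothetical registry of transactions the local TM is still processing. */
    static final class InFlightRegistry {
        private final Set<String> active = ConcurrentHashMap.newKeySet();

        void begin(String txId) { active.add(txId); }
        void end(String txId) { active.remove(txId); }
        boolean isInFlight(String txId) { return active.contains(txId); }
    }

    /**
     * Votes LEAVE_ALONE while the owning transaction is still in flight,
     * so a missing transaction log alone no longer leads to a rollback.
     * Otherwise it abstains and leaves the decision to the other filters.
     */
    static Vote check(InFlightRegistry registry, String txId) {
        return registry.isInFlight(txId) ? Vote.LEAVE_ALONE : Vote.ABSTAIN;
    }

    public static void main(String[] args) {
        InFlightRegistry registry = new InFlightRegistry();
        registry.begin("tx-1");
        System.out.println(check(registry, "tx-1")); // LEAVE_ALONE: commit still running
        registry.end("tx-1");
        System.out.println(check(registry, "tx-1")); // ABSTAIN: other filters decide
    }
}
```

Unlike a time-based interval, this check does not depend on how long the commit thread is stalled, which is what the customer asked for.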
Comment 1 Carsten Lichy-Bittendorf 2016-02-22 05:41 EST
Created attachment 1129226
sequence diagram
Comment 2 Carsten Lichy-Bittendorf 2016-02-22 05:42 EST
Created attachment 1129227
byteman script for reproducer
Comment 10 JBoss JIRA Server 2016-02-29 09:33:53 EST
Tom Jenkinson <tom.jenkinson@redhat.com> updated the status of jira JBEAP-3575 to Coding In Progress
Comment 14 JBoss JIRA Server 2016-05-16 06:22:34 EDT
Bartosz Baranowski <bbaranow@redhat.com> updated the status of jira JBEAP-3575 to Coding In Progress
Comment 15 JBoss JIRA Server 2016-05-18 03:30:20 EDT
Tom Jenkinson <tom.jenkinson@redhat.com> updated the status of jira JBTM-2583 to Reopened
Comment 16 JBoss JIRA Server 2016-05-18 03:30:33 EDT
Tom Jenkinson <tom.jenkinson@redhat.com> updated the status of jira JBTM-2583 to Closed
Comment 17 JBoss JIRA Server 2016-05-18 03:31:03 EDT
Tom Jenkinson <tom.jenkinson@redhat.com> updated the status of jira JBTM-2583 to Reopened
Comment 18 JBoss JIRA Server 2016-05-18 03:31:12 EDT
Tom Jenkinson <tom.jenkinson@redhat.com> updated the status of jira JBTM-2583 to Closed
Comment 20 JBoss JIRA Server 2016-06-01 16:05:03 EDT
Tom Jenkinson <tom.jenkinson@redhat.com> updated the status of jira JBTM-2583 to Reopened
Comment 21 JBoss JIRA Server 2016-06-01 16:06:44 EDT
Tom Jenkinson <tom.jenkinson@redhat.com> updated the status of jira JBTM-2583 to Closed
Comment 23 Jiří Bílek 2016-07-01 09:53:35 EDT
Verified with EAP 6.4.9.CP.CR2
Comment 24 JBoss JIRA Server 2016-08-04 05:31:31 EDT
Bartosz Baranowski <bbaranow@redhat.com> updated the status of jira JBEAP-3575 to Resolved
Comment 25 JBoss JIRA Server 2016-08-25 10:05:39 EDT
Jiří Bílek <jbilek@redhat.com> updated the status of jira JBEAP-3575 to Reopened
Comment 26 JBoss JIRA Server 2016-08-29 09:18:02 EDT
Bartosz Baranowski <bbaranow@redhat.com> updated the status of jira JBEAP-3575 to Resolved
Comment 27 JBoss JIRA Server 2016-09-07 07:01:46 EDT
Jiří Bílek <jbilek@redhat.com> updated the status of jira JBEAP-3575 to Reopened
Comment 28 JBoss JIRA Server 2016-09-19 08:50:32 EDT
Jiří Bílek <jbilek@redhat.com> updated the status of jira JBEAP-3575 to Coding In Progress
Comment 29 JBoss JIRA Server 2016-09-19 08:50:37 EDT
Jiří Bílek <jbilek@redhat.com> updated the status of jira JBEAP-3575 to Open
Comment 30 JBoss JIRA Server 2016-09-19 09:16:01 EDT
Jiří Bílek <jbilek@redhat.com> updated the status of jira JBEAP-3575 to Resolved
Comment 31 Petr Penicka 2017-01-17 07:59:11 EST
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.
Comment 32 Petr Penicka 2017-01-17 07:59:15 EST
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.
