Bug 1310603 - [GSS](6.4.z) XA-Transaction commit inconsistent when slowing down process
Summary: [GSS](6.4.z) XA-Transaction commit inconsistent when slowing down process
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: Transaction Manager
Version: 6.4.6
Hardware: All
OS: All
Severity: high
Priority: high
Target Milestone: CR1
Target Release: EAP 6.4.9
Assignee: jboss-set
QA Contact: Ondrej Chaloupka
URL:
Whiteboard:
Depends On:
Blocks: eap649-payload 1325725 1365876
 
Reported: 2016-02-22 10:39 UTC by Carsten Lichy-Bittendorf
Modified: 2020-01-17 15:40 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Embargoed:


Attachments
explains the test conditions and architecture (101.49 KB, application/pdf)
2016-02-22 10:39 UTC, Carsten Lichy-Bittendorf
sequence diagram (5.57 MB, image/bmp)
2016-02-22 10:41 UTC, Carsten Lichy-Bittendorf
byteman script for reproducer (2.37 KB, text/plain)
2016-02-22 10:42 UTC, Carsten Lichy-Bittendorf


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker JBEAP-3575 0 Critical Verified (7.0.z) (EAP) Talk to the local transaction manager to determine if a transaction containing XAResources is still in-fli... 2019-03-26 13:28:45 UTC
Red Hat Issue Tracker JBTM-2583 0 Major Closed Use the local ActionStatusService to determine if a transaction containing XAResources is still in-flight before relying... 2019-03-26 13:28:45 UTC
Red Hat Knowledge Base (Solution) 2173391 0 None None None 2016-02-22 10:42:38 UTC

Description Carsten Lichy-Bittendorf 2016-02-22 10:39:25 UTC
Created attachment 1129225
explains the test conditions and architecture

Description of problem:

The customer experienced a failure within an XA transaction involving
- a local JMS queue,
- a remote JMS queue,
- and an Oracle database.
The two JMS XAResources were committed, but the Oracle XAResource could not be committed because the recovery thread had rolled back its transaction branch at the same time.
It happens under heavy load, but the customer has identified a scenario with Byteman that reproduces the case systematically (please see the attached files "customize.btm" and "JMS Reliability.bmp").

Version-Release number of selected component (if applicable):

all EAP 6.x.x

How reproducible:

The attached PDF "use cases tests hornetq.pdf" explains the test conditions and architecture.
RH QE has those test cases in place.

Steps to Reproduce:
1. Run test case SEND_DB_66_CPU while the attached Byteman scripts are active and you'll see:
- the 2-phase commit is initiated
- topLevelPrepare is done on the 3 XAResources (local queue, remote queue and Oracle)
- ShadowNoFileLockStore is asked to write the transaction log
- [BYTEMAN artificial load - Sleep 7min]

BUT at the same time:
- The recovery thread launches a recovery pass on the transaction
- It asks the different orphan filters to vote on each XAResource, starting with the Oracle XAResource:
  o JTATransactionLogXAResourceOrphanFilter asks ShadowNoFileLockStore for the transaction status
  o ShadowNoFileLockStore looks for a transaction log, but it has not been written to disk yet
  o JTATransactionLogXAResourceOrphanFilter abstains from voting
  o JTANodeNameXAResourceOrphanFilter votes to roll back
  o [BYTEMAN artificial load - Sleep 3min]

The other thread wakes up:
- ShadowNoFileLockStore writes the transaction log
- the doCommit method is called
- [BYTEMAN artificial load - Sleep 1min]

The recovery thread wakes up:
- The Oracle transaction branch is rolled back
- The orphan handling is called in turn on the remote queue XAResource and the local queue XAResource
- As the transaction log now exists, JTATransactionLogXAResourceOrphanFilter returns LEAVE_ALONE
- Neither JMS branch is rolled back

The other thread wakes up:
- topLevelCommit on local queue => SUCCESS
- topLevelCommit on remote queue => SUCCESS
- topLevelCommit on Oracle => FAILURE (ORA-24756: the transaction does not exist anymore, as it has been rolled back)
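
The interleaving above can be replayed in a small, single-threaded model. This is only a sketch: the class name, the LogStore stand-in and the vote-combination rule (any LEAVE_ALONE wins, otherwise any ROLLBACK wins) are assumptions chosen to match the behaviour described in these steps, not Narayana's actual code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified model of the reported interleaving. Hypothetical names, not Narayana code. */
final class OrphanRaceSketch {

    enum Vote { ABSTAIN, ROLLBACK, LEAVE_ALONE }

    /** Stand-in for the object store: remembers which transaction logs are on disk. */
    static final class LogStore {
        private final Set<String> written = new HashSet<>();
        void write(String txId) { written.add(txId); }
        boolean hasLog(String txId) { return written.contains(txId); }
    }

    /** Models the log-based filter: it can only protect a branch once the log exists. */
    static Vote logFilter(LogStore store, String txId) {
        return store.hasLog(txId) ? Vote.LEAVE_ALONE : Vote.ABSTAIN;
    }

    /** Models the node-name filter: the branch belongs to this node, so it votes rollback. */
    static Vote nodeNameFilter() {
        return Vote.ROLLBACK;
    }

    /** Assumed combination rule: any LEAVE_ALONE wins, otherwise any ROLLBACK wins. */
    static Vote combine(Vote... votes) {
        boolean rollback = false;
        for (Vote v : votes) {
            if (v == Vote.LEAVE_ALONE) return Vote.LEAVE_ALONE;
            if (v == Vote.ROLLBACK) rollback = true;
        }
        return rollback ? Vote.ROLLBACK : Vote.ABSTAIN;
    }

    /** Replays the interleaving and returns the recovery decision per resource, in order. */
    static List<String> replay() {
        LogStore store = new LogStore();
        String tx = "tx-1";
        List<String> decisions = new ArrayList<>();
        // The recovery thread inspects the Oracle branch before the log is written.
        decisions.add("oracle=" + combine(logFilter(store, tx), nodeNameFilter()));
        // The committing thread wakes up and writes the transaction log.
        store.write(tx);
        // The recovery thread then inspects the two JMS branches.
        decisions.add("remoteQueue=" + combine(logFilter(store, tx), nodeNameFilter()));
        decisions.add("localQueue=" + combine(logFilter(store, tx), nodeNameFilter()));
        return decisions;
    }

    public static void main(String[] args) {
        replay().forEach(System.out::println);
    }
}
```

In this model the Oracle branch is decided as ROLLBACK while both JMS branches are decided as LEAVE_ALONE, which is the mixed outcome reported here.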

Actual results:

- The transaction ends in a heuristic state: the JMS resources are committed while the DB is rolled back.

Expected results:

- Avoid the heuristic state: all resources should be either committed or rolled back.


Additional info:

The customer is expecting an enhancement in EAP 6 to prevent this. In all of the customer's environments the TM and RM are co-located, so there should be a synchronization mechanism to prevent the issue.

The initially proposed solution, setting orphanSafetyInterval to 10 minutes, does not satisfy the customer's expectations. They can easily reproduce the problem by increasing the artificial load and sleep time.
Moreover, the customer's production team cannot guarantee a reaction time of under 10 minutes when a problem occurs.
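
To illustrate why a purely time-based guard cannot close the race, here is a minimal sketch. It assumes, as a simplification, that a branch may be treated as an orphan once it has been observed in doubt for at least the configured interval; the class and method names are hypothetical and do not reflect how Narayana applies orphanSafetyInterval internally.

```java
import java.time.Duration;

/** Simplified model of a time-based orphan guard. Hypothetical names, not Narayana code. */
final class SafetyIntervalSketch {

    /**
     * Returns true if recovery may treat the branch as an orphan, i.e. the branch
     * has been in doubt for at least the safety interval (assumed semantics).
     */
    static boolean mayTreatAsOrphan(Duration inDoubtFor, Duration safetyInterval) {
        return inDoubtFor.compareTo(safetyInterval) >= 0;
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofMinutes(10);
        // A 7-minute stall stays inside the interval, so the branch is protected.
        System.out.println(mayTreatAsOrphan(Duration.ofMinutes(7), interval));
        // A stall longer than the interval is no longer protected.
        System.out.println(mayTreatAsOrphan(Duration.ofMinutes(11), interval));
    }
}
```

Whatever value is chosen, a stall that outlasts it reopens the window, which matches the customer's observation that increasing the sleep time reproduces the problem again.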

A feasible solution might be based on https://issues.jboss.org/browse/JBTM-2583, as it might cover the scenario described above.
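
As a rough illustration of that idea for the co-located TM and RM case: an orphan filter could first ask the local transaction manager whether the transaction is still in flight and, if so, vote LEAVE_ALONE regardless of whether the log has reached disk. The sketch below is an assumption-laden model only. InFlightRegistry is a hypothetical stand-in for the local status lookup and is not the API introduced by JBTM-2583.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified model of an in-flight check before orphan voting. Hypothetical names. */
final class InFlightFilterSketch {

    enum Vote { ABSTAIN, ROLLBACK, LEAVE_ALONE }

    /** Hypothetical in-memory view of the transactions the local TM is still driving. */
    static final class InFlightRegistry {
        private final Set<String> active = ConcurrentHashMap.newKeySet();
        void begin(String txId) { active.add(txId); }
        void end(String txId) { active.remove(txId); }
        boolean isInFlight(String txId) { return active.contains(txId); }
    }

    /**
     * Votes LEAVE_ALONE while the owning transaction is still in flight locally;
     * otherwise abstains and leaves the decision to the other filters.
     */
    static Vote checkBranch(InFlightRegistry registry, String txId) {
        return registry.isInFlight(txId) ? Vote.LEAVE_ALONE : Vote.ABSTAIN;
    }

    public static void main(String[] args) {
        InFlightRegistry registry = new InFlightRegistry();
        registry.begin("tx-1");
        // Prepared but log not yet on disk: the in-memory check still protects the branch.
        System.out.println(checkBranch(registry, "tx-1"));
        registry.end("tx-1");
        // Once the transaction has finished, the filter no longer objects.
        System.out.println(checkBranch(registry, "tx-1"));
    }
}
```

Because the check relies on in-memory state rather than on the transaction log, it does not depend on how long the log write is delayed.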

Comment 1 Carsten Lichy-Bittendorf 2016-02-22 10:41:29 UTC
Created attachment 1129226
sequence diagram

Comment 2 Carsten Lichy-Bittendorf 2016-02-22 10:42:07 UTC
Created attachment 1129227
byteman script for reproducer

Comment 10 JBoss JIRA Server 2016-02-29 14:33:53 UTC
Tom Jenkinson <tom.jenkinson> updated the status of jira JBEAP-3575 to Coding In Progress

Comment 14 JBoss JIRA Server 2016-05-16 10:22:34 UTC
Bartosz Baranowski <bbaranow> updated the status of jira JBEAP-3575 to Coding In Progress

Comment 15 JBoss JIRA Server 2016-05-18 07:30:20 UTC
Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Reopened

Comment 16 JBoss JIRA Server 2016-05-18 07:30:33 UTC
Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Closed

Comment 17 JBoss JIRA Server 2016-05-18 07:31:03 UTC
Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Reopened

Comment 18 JBoss JIRA Server 2016-05-18 07:31:12 UTC
Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Closed

Comment 20 JBoss JIRA Server 2016-06-01 20:05:03 UTC
Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Reopened

Comment 21 JBoss JIRA Server 2016-06-01 20:06:44 UTC
Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Closed

Comment 23 Jiří Bílek 2016-07-01 13:53:35 UTC
Verified with EAP 6.4.9.CP.CR2

Comment 24 JBoss JIRA Server 2016-08-04 09:31:31 UTC
Bartosz Baranowski <bbaranow> updated the status of jira JBEAP-3575 to Resolved

Comment 25 JBoss JIRA Server 2016-08-25 14:05:39 UTC
Jiří Bílek <jbilek> updated the status of jira JBEAP-3575 to Reopened

Comment 26 JBoss JIRA Server 2016-08-29 13:18:02 UTC
Bartosz Baranowski <bbaranow> updated the status of jira JBEAP-3575 to Resolved

Comment 27 JBoss JIRA Server 2016-09-07 11:01:46 UTC
Jiří Bílek <jbilek> updated the status of jira JBEAP-3575 to Reopened

Comment 28 JBoss JIRA Server 2016-09-19 12:50:32 UTC
Jiří Bílek <jbilek> updated the status of jira JBEAP-3575 to Coding In Progress

Comment 29 JBoss JIRA Server 2016-09-19 12:50:37 UTC
Jiří Bílek <jbilek> updated the status of jira JBEAP-3575 to Open

Comment 30 JBoss JIRA Server 2016-09-19 13:16:01 UTC
Jiří Bílek <jbilek> updated the status of jira JBEAP-3575 to Resolved

Comment 31 Petr Penicka 2017-01-17 12:59:11 UTC
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.

Comment 32 Petr Penicka 2017-01-17 12:59:15 UTC
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.
