Bug 1310603

Summary: [GSS](6.4.z) XA-Transaction commit inconsistent when slowing down process
Product: [JBoss] JBoss Enterprise Application Platform 6
Component: Transaction Manager
Reporter: Carsten Lichy-Bittendorf <clichybi>
Assignee: jboss-set
QA Contact: Ondrej Chaloupka <ochaloup>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Version: 6.4.6
CC: bbaranow, bmaxwell, cdewolf, dtikhomi, jbilek, jdoyle, jtruhlar, ochaloup, rnetuka, tom.jenkinson, vtunka
Target Milestone: CR1
Target Release: EAP 6.4.9
Hardware: All
OS: All
Doc Type: Bug Fix
Type: Bug
Bug Blocks: 1324262, 1325725, 1365876
Attachments:
- explains the test conditions and architecture (flags: none)
- sequence diagram (flags: none)
- byteman script for reproducer (flags: none)

Description Carsten Lichy-Bittendorf 2016-02-22 10:39:25 UTC
Created attachment 1129225 [details]
explains the test conditions and architecture

Description of problem:

The customer experienced an inconsistent outcome within an XA transaction involving
- a local JMS queue,
- a remote JMS queue,
- and an Oracle database.
The two JMS XAResources are committed, but the Oracle XAResource cannot be committed because the recovery thread has rolled back its branch of the transaction at the same time.
It happens under heavy load, but the customer has identified a scenario with Byteman that reproduces the case systematically (please see the attached files "customize.btm" and "JMS Reliability.bmp").

Version-Release number of selected component (if applicable):

all EAP 6.x.x

How reproducible:

See the attached PDF "use cases tests hornetq.pdf", which explains the test conditions and architecture.
RH QE has those test cases in place.

Steps to Reproduce:
1. Run test case SEND_DB_66_CPU while the attached Byteman scripts are active, and you'll see:
- 2phasecommit is initiated
- topLevelPrepare is done on 3 XAResources (local queue, remote queue and Oracle)
- ShadowNoFileLockStore is asked to write the transaction log
- [BYTEMAN artificial load - Sleep 7min]

BUT at the same time:
- The recovery thread launches a recovery pass on the transaction
- It asks the different orphan filters to vote on each XAResource, starting with the Oracle XAResource:
  o JTATransactionLogXAResourceOrphanFilter asks ShadowNoFileLockStore for the transaction status
  o ShadowNoFileLockStore looks for a transaction log, but it is not yet written to disk
  o JTATransactionLogXAResourceOrphanFilter abstains from voting
  o JTANodeNameXAResourceOrphanFilter decides to roll back
  o [BYTEMAN artificial load - Sleep 3min]

The other thread wakes up:
- ShadowNoFileLockStore writes the transaction log
- doCommit method is called
- [BYTEMAN artificial load - Sleep 1min]

The recovery thread wakes up:
- The Oracle transaction is rolled back
- handle orphan is called successively on the remote queue XAResource and the local queue XAResource
- as the transaction log now exists, JTATransactionLogXAResourceOrphanFilter returns LEAVE_ALONE
- neither of these two transactions is rolled back

The other thread wakes up:
- topLevelCommit on local queue => SUCCESS
- topLevelCommit on remote queue => SUCCESS
- topLevelCommit on oracle => FAILURE (ORA-24756: the transaction does not exist anymore, as it has been rolled back)
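
The interleaving above can be replayed as a small, single-threaded Java sketch. It is only a simplified model, not Narayana code: the class, the `txLog` set and the vote-combining rule (LEAVE_ALONE wins, otherwise any ROLLBACK vote triggers a rollback) are assumptions made for illustration. The two filter methods merely mimic the behaviour this report attributes to JTATransactionLogXAResourceOrphanFilter and JTANodeNameXAResourceOrphanFilter.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified model of the reported interleaving; not Narayana code. */
class OrphanRaceSketch {
    enum Vote { ABSTAIN, ROLLBACK, LEAVE_ALONE }

    /** Stand-in for the object store: ids of transactions whose log is on disk. */
    static final Set<String> txLog = new HashSet<>();

    /** Mimics the log-based filter: it can protect a branch only once the log exists. */
    static Vote logFilter(String txId) {
        return txLog.contains(txId) ? Vote.LEAVE_ALONE : Vote.ABSTAIN;
    }

    /** Mimics the node-name filter: a branch created by this node is presumed orphaned. */
    static Vote nodeNameFilter(boolean createdByThisNode) {
        return createdByThisNode ? Vote.ROLLBACK : Vote.ABSTAIN;
    }

    /** Assumed combining rule: LEAVE_ALONE wins, otherwise any ROLLBACK vote wins. */
    static Vote decide(String txId) {
        Vote fromLog = logFilter(txId);
        Vote fromNode = nodeNameFilter(true);
        if (fromLog == Vote.LEAVE_ALONE || fromNode == Vote.LEAVE_ALONE) return Vote.LEAVE_ALONE;
        if (fromLog == Vote.ROLLBACK || fromNode == Vote.ROLLBACK) return Vote.ROLLBACK;
        return Vote.ABSTAIN;
    }

    /** Replays the steps in the order described in this report. */
    static String replay() {
        txLog.clear();
        String tx = "tx-1";
        Vote oracle = decide(tx);  // recovery inspects Oracle before the log is written
        txLog.add(tx);             // committing thread finally writes the log
        Vote remote = decide(tx);  // recovery inspects the remote queue afterwards
        Vote local = decide(tx);   // recovery inspects the local queue afterwards
        return oracle + "," + remote + "," + local;
    }

    public static void main(String[] args) {
        System.out.println(replay()); // ROLLBACK,LEAVE_ALONE,LEAVE_ALONE
    }
}
```

The mixed result (one branch rolled back, two left alone and later committed) is the heuristic outcome reported here.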

Actual results:

- The transaction ends in a heuristic state: the JMS resources are committed, while the DB is rolled back.

Expected results:

- Avoid the heuristic state: all resources should be either committed or rolled back.


Additional info:

The customer is expecting an enhancement in EAP 6 to prevent this. In all of the customer's environments the TM and RM are co-located, so there should be a synchronization mechanism to prevent the issue.

The initially proposed solution, setting orphanSafetyInterval to 10 minutes, does not satisfy the customer's expectations. They can easily reproduce the problem by increasing the artificial load and sleep time.
Moreover, the customer's production team is not bound to react within 10 minutes when a problem occurs.
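
To illustrate why a fixed interval only moves the threshold, here is a minimal Java sketch. It assumes that orphanSafetyInterval acts as a minimum age before recovery may treat an in-doubt branch as an orphan, and that the transaction log is not written during the stall. The class and method names are hypothetical.

```java
import java.time.Duration;

/** Hypothetical arithmetic sketch; not the Narayana implementation. */
class SafetyIntervalSketch {
    /**
     * True when the committing thread stalls (before writing its log) for at
     * least as long as the safety interval, so recovery is no longer held back.
     */
    static boolean recoveryMayRollBack(Duration stallBeforeLogWrite, Duration orphanSafetyInterval) {
        return stallBeforeLogWrite.compareTo(orphanSafetyInterval) >= 0;
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofMinutes(10);
        // The 7-minute sleep from the reproducer is covered by a 10-minute interval.
        System.out.println(recoveryMayRollBack(Duration.ofMinutes(7), interval));  // false
        // A longer artificial sleep defeats the same setting again.
        System.out.println(recoveryMayRollBack(Duration.ofMinutes(11), interval)); // true
    }
}
```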

A feasible solution might be based on https://issues.jboss.org/browse/JBTM-2583, as it might cover the scenario described above.
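
One way to picture the synchronization requested for the co-located case is an extra orphan filter that consults the transaction manager's in-memory view of in-flight transactions, so a branch is protected even before its log reaches disk. The Java sketch below is purely illustrative: every name in it is hypothetical, and this report does not state whether JBTM-2583 takes this shape.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of an in-flight-aware orphan filter; not Narayana code. */
class InFlightAwareFilterSketch {
    enum Vote { ABSTAIN, ROLLBACK, LEAVE_ALONE }

    /** Transactions the co-located TM is still driving, tracked in memory. */
    static final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    /** Called by the TM when a transaction starts. */
    static void begin(String txId) { inFlight.add(txId); }

    /** Called by the TM once the outcome is final (commit or rollback finished). */
    static void end(String txId) { inFlight.remove(txId); }

    /** Protects any branch whose transaction is still live in this JVM. */
    static Vote inFlightFilter(String txId) {
        return inFlight.contains(txId) ? Vote.LEAVE_ALONE : Vote.ABSTAIN;
    }

    public static void main(String[] args) {
        begin("tx-1");
        System.out.println(inFlightFilter("tx-1")); // LEAVE_ALONE, even with no log on disk
        end("tx-1");
        System.out.println(inFlightFilter("tx-1")); // ABSTAIN, other filters decide
    }
}
```

Such a check only helps when the recovery thread and the transaction manager share a JVM, which matches the co-located setup described above.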

Comment 1 Carsten Lichy-Bittendorf 2016-02-22 10:41:29 UTC
Created attachment 1129226 [details]
sequence diagram

Comment 2 Carsten Lichy-Bittendorf 2016-02-22 10:42:07 UTC
Created attachment 1129227 [details]
byteman script for reproducer

Comment 10 JBoss JIRA Server 2016-02-29 14:33:53 UTC
Tom Jenkinson <tom.jenkinson> updated the status of jira JBEAP-3575 to Coding In Progress

Comment 14 JBoss JIRA Server 2016-05-16 10:22:34 UTC
Bartosz Baranowski <bbaranow> updated the status of jira JBEAP-3575 to Coding In Progress

Comment 15 JBoss JIRA Server 2016-05-18 07:30:20 UTC
Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Reopened

Comment 16 JBoss JIRA Server 2016-05-18 07:30:33 UTC
Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Closed

Comment 17 JBoss JIRA Server 2016-05-18 07:31:03 UTC
Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Reopened

Comment 18 JBoss JIRA Server 2016-05-18 07:31:12 UTC
Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Closed

Comment 20 JBoss JIRA Server 2016-06-01 20:05:03 UTC
Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Reopened

Comment 21 JBoss JIRA Server 2016-06-01 20:06:44 UTC
Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Closed

Comment 23 Jiří Bílek 2016-07-01 13:53:35 UTC
Verified with EAP 6.4.9.CP.CR2

Comment 24 JBoss JIRA Server 2016-08-04 09:31:31 UTC
Bartosz Baranowski <bbaranow> updated the status of jira JBEAP-3575 to Resolved

Comment 25 JBoss JIRA Server 2016-08-25 14:05:39 UTC
Jiří Bílek <jbilek> updated the status of jira JBEAP-3575 to Reopened

Comment 26 JBoss JIRA Server 2016-08-29 13:18:02 UTC
Bartosz Baranowski <bbaranow> updated the status of jira JBEAP-3575 to Resolved

Comment 27 JBoss JIRA Server 2016-09-07 11:01:46 UTC
Jiří Bílek <jbilek> updated the status of jira JBEAP-3575 to Reopened

Comment 28 JBoss JIRA Server 2016-09-19 12:50:32 UTC
Jiří Bílek <jbilek> updated the status of jira JBEAP-3575 to Coding In Progress

Comment 29 JBoss JIRA Server 2016-09-19 12:50:37 UTC
Jiří Bílek <jbilek> updated the status of jira JBEAP-3575 to Open

Comment 30 JBoss JIRA Server 2016-09-19 13:16:01 UTC
Jiří Bílek <jbilek> updated the status of jira JBEAP-3575 to Resolved

Comment 31 Petr Penicka 2017-01-17 12:59:11 UTC
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.

Comment 32 Petr Penicka 2017-01-17 12:59:15 UTC
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.