Bug 1310603

Summary:

[GSS](6.4.z) XA-Transaction commit inconsistent when slowing down process

Product:

[JBoss] JBoss Enterprise Application Platform 6

Reporter:

Carsten Lichy-Bittendorf <clichybi>

Component:

Transaction Manager

Assignee:

jboss-set

Status:

CLOSED CURRENTRELEASE

QA Contact:

Ondrej Chaloupka <ochaloup>

Severity:

high

Docs Contact:

Priority:

high

Version:

6.4.6

CC:

bbaranow, bmaxwell, cdewolf, dtikhomi, jbilek, jdoyle, jtruhlar, ochaloup, rnetuka, tom.jenkinson, vtunka

Target Milestone:

CR1

Target Release:

EAP 6.4.9

Hardware:

All

OS:

All

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

Type:

Bug

Regression:

---

Bug Blocks:

1324262, 1325725, 1365876

Attachments:

Description	Flags
explains the test conditions and architecture	none
sequence diagram	none
byteman script for reproducer	none

Description Carsten Lichy-Bittendorf 2016-02-22 10:39:25 UTC

Created attachment 1129225 [details]
explains the test conditions and architecture

Description of problem:

The customer hit a case within an XA transaction involving
- a local JMS queue,
- a remote JMS queue,
- and an Oracle database.
The two JMS XAResources are committed, but the Oracle XAResource cannot be committed because the recovery thread has rolled back its transaction in the meantime.
It happens under heavy load, but the customer has identified a scenario with Byteman that reproduces the case systematically (please see the attached files "customize.btm" and "JMS Reliability.bmp").

Version-Release number of selected component (if applicable):

all EAP 6.x.x

How reproducible:

See the attached PDF "use cases tests hornetq.pdf", which explains the test conditions and architecture.
RH QE has those test cases in place.

Steps to Reproduce:
1. Run test case SEND_DB_66_CPU while the attached Byteman scripts are active and you will see:
- the two-phase commit is initiated
- topLevelPrepare is done on the 3 XAResources (local queue, remote queue and Oracle)
- ShadowNoFileLockStore is asked to write the transaction log
- [BYTEMAN artificial load - Sleep 7min]

BUT at the same time:
- the recovery thread launches a recovery pass on the transaction
- it asks the different orphan filters to vote on each XAResource, starting with the Oracle XAResource:
  o JTATransactionLogXAResourceOrphanFilter asks ShadowNoFileLockStore for the transaction status
  o ShadowNoFileLockStore looks for a transaction log, but it is not yet written to disk
  o JTATransactionLogXAResourceOrphanFilter abstains from voting
  o JTANodeNameXAResourceOrphanFilter votes to roll back
  o [BYTEMAN artificial load - Sleep 3min]

The other thread wakes up:
- ShadowNoFileLockStore writes the transaction log
- the doCommit method is called
- [BYTEMAN artificial load - Sleep 1min]

The recovery thread wakes up:
- the Oracle transaction is rolled back
- handle orphan is called in turn on the remote queue XAResource and the local queue XAResource
- as the transaction log now exists, JTATransactionLogXAResourceOrphanFilter returns LEAVE_ALONE
- neither of the two JMS transactions is rolled back

The other thread wakes up:
- topLevelCommit on local queue => SUCCESS
- topLevelCommit on remote queue => SUCCESS
- topLevelCommit on oracle => FAILURE (ORA-24756: the transaction no longer exists, as it has been rolled back)
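
The interleaving above can be replayed deterministically in a small, single-threaded model. This is only a sketch to make the ordering problem visible: the class, its methods and the vote-combining rule (any LEAVE_ALONE wins, otherwise a ROLLBACK vote causes a rollback) are assumptions for illustration and are not the Narayana implementation. The two vote methods stand in for JTATransactionLogXAResourceOrphanFilter and JTANodeNameXAResourceOrphanFilter as described in the steps.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the race; none of these types are Narayana classes.
public class OrphanRaceSketch {
    public enum Vote { ABSTAIN, ROLLBACK, LEAVE_ALONE }
    public enum State { PREPARED, COMMITTED, ROLLED_BACK }

    // Stand-in for the object store: true once the transaction log is on disk.
    private boolean logWritten = false;
    // Stand-in for the three resource managers and the state of their branches.
    private final Map<String, State> branches = new LinkedHashMap<>();

    // Models the log-based filter: it only has an opinion when a log exists.
    Vote logFilterVote() {
        return logWritten ? Vote.LEAVE_ALONE : Vote.ABSTAIN;
    }

    // Models the node-name filter: the branch belongs to this node, so it votes rollback.
    Vote nodeNameFilterVote() {
        return Vote.ROLLBACK;
    }

    // Assumed combining rule: LEAVE_ALONE wins, otherwise ROLLBACK causes a rollback.
    boolean shouldRollBack() {
        Vote a = logFilterVote();
        Vote b = nodeNameFilterVote();
        if (a == Vote.LEAVE_ALONE || b == Vote.LEAVE_ALONE) {
            return false;
        }
        return a == Vote.ROLLBACK || b == Vote.ROLLBACK;
    }

    // A branch can only be committed while it is still prepared.
    boolean commit(String name) {
        if (branches.get(name) != State.PREPARED) {
            return false;
        }
        branches.put(name, State.COMMITTED);
        return true;
    }

    // Replays the steps from the report in the order the Byteman sleeps enforce.
    public Map<String, State> replay() {
        for (String n : new String[] {"localQueue", "remoteQueue", "oracle"}) {
            branches.put(n, State.PREPARED);
        }
        // Recovery thread: decides on the Oracle branch before the log is written.
        boolean rollBackOracle = shouldRollBack();
        // Committing thread: the log is written only now.
        logWritten = true;
        // Recovery thread: acts on its stale decision, then checks the JMS branches.
        if (rollBackOracle) {
            branches.put("oracle", State.ROLLED_BACK);
        }
        for (String n : new String[] {"remoteQueue", "localQueue"}) {
            if (shouldRollBack()) {
                branches.put(n, State.ROLLED_BACK);
            }
        }
        // Committing thread: second phase.
        commit("localQueue");
        commit("remoteQueue");
        commit("oracle");
        return branches;
    }

    public static void main(String[] args) {
        System.out.println(new OrphanRaceSketch().replay());
    }
}
```

In this model the two queue branches end up committed and the Oracle branch rolled back, which matches the mixed outcome reported under "Actual results". The point of the model is that the rollback decision for Oracle is taken before the log is written but executed after it.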

Actual results:

- The transaction ends in a heuristic state: the JMS resources are committed, while the DB is rolled back.

Expected results:

- Avoid the heuristic state: all resources should be either committed or rolled back.


Additional info:

The customer is expecting an enhancement in EAP 6 to prevent this. In all of the customer's environments the TM and RMs are co-located, so there should be a synchronization mechanism to prevent the issue.

The initially proposed solution, setting orphanSafetyInterval to 10 minutes, does not satisfy the customer's expectations. They can easily reproduce the problem by increasing the artificial load and sleep time.
Moreover, according to the customer, their production team cannot guarantee a reaction time of under 10 minutes when a problem occurs.
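
Why a fixed interval only moves the problem can be shown with a few lines of arithmetic. The sketch below assumes a simplified model in which recovery leaves an in-doubt branch alone until it has been observed for at least the safety interval; the class and method names are hypothetical and do not reflect how Narayana evaluates orphanSafetyInterval internally.

```java
import java.time.Duration;

// Hypothetical model: a time-based guard only helps while the stall is shorter than the guard.
public class SafetyIntervalSketch {

    // True when recovery is still holding off at the moment the log is finally written.
    public static boolean protectedBy(Duration safetyInterval, Duration stallBeforeLogWrite) {
        return stallBeforeLogWrite.compareTo(safetyInterval) < 0;
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofMinutes(10);
        // The 7-minute sleep from the reproducer is covered by a 10-minute interval.
        System.out.println(protectedBy(interval, Duration.ofMinutes(7)));
        // A longer stall, for example 11 minutes, is not.
        System.out.println(protectedBy(interval, Duration.ofMinutes(11)));
    }
}
```

Under this model any stall longer than the configured interval reopens the window, which is consistent with the customer reproducing the problem simply by increasing the sleep time.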

A feasible solution might be based on https://issues.jboss.org/browse/JBTM-2583, as it might cover the scenario described above.
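
To illustrate the kind of synchronization a co-located TM makes possible, here is a minimal sketch of an orphan check that consults the set of transactions still in flight in the same JVM. It is an assumption about the general approach, not a description of what JBTM-2583 implements, and every name in it is hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: recovery leaves a branch alone while its transaction is still running locally.
public class InFlightFilterSketch {
    public enum Vote { ABSTAIN, LEAVE_ALONE }

    // Thread-safe set of transaction ids currently being processed in this JVM.
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    // Called by the committing thread before prepare starts.
    public void begin(String txId) {
        inFlight.add(txId);
    }

    // Called by the committing thread once the outcome is final.
    public void end(String txId) {
        inFlight.remove(txId);
    }

    // Called by the recovery thread: an in-flight transaction is never treated as an orphan.
    public Vote vote(String txId) {
        return inFlight.contains(txId) ? Vote.LEAVE_ALONE : Vote.ABSTAIN;
    }

    public static void main(String[] args) {
        InFlightFilterSketch filter = new InFlightFilterSketch();
        filter.begin("tx-1");
        System.out.println(filter.vote("tx-1")); // still running, recovery must not touch it
        filter.end("tx-1");
        System.out.println(filter.vote("tx-1")); // finished, other filters decide
    }
}
```

Unlike a time-based guard, a check of this kind does not depend on how long the committing thread is stalled between prepare and writing the log, but it only works when the TM and the recovery thread share a JVM.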

Comment 1 Carsten Lichy-Bittendorf 2016-02-22 10:41:29 UTC

Created attachment 1129226 [details]
sequence diagram

Comment 2 Carsten Lichy-Bittendorf 2016-02-22 10:42:07 UTC

Created attachment 1129227 [details]
byteman script for reproducer

Comment 10 JBoss JIRA Server 2016-02-29 14:33:53 UTC

Tom Jenkinson <tom.jenkinson> updated the status of jira JBEAP-3575 to Coding In Progress

Comment 14 JBoss JIRA Server 2016-05-16 10:22:34 UTC

Bartosz Baranowski <bbaranow> updated the status of jira JBEAP-3575 to Coding In Progress

Comment 15 JBoss JIRA Server 2016-05-18 07:30:20 UTC

Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Reopened

Comment 16 JBoss JIRA Server 2016-05-18 07:30:33 UTC

Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Closed

Comment 17 JBoss JIRA Server 2016-05-18 07:31:03 UTC

Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Reopened

Comment 18 JBoss JIRA Server 2016-05-18 07:31:12 UTC

Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Closed

Comment 20 JBoss JIRA Server 2016-06-01 20:05:03 UTC

Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Reopened

Comment 21 JBoss JIRA Server 2016-06-01 20:06:44 UTC

Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Closed

Comment 23 Jiří Bílek 2016-07-01 13:53:35 UTC

Verified with EAP 6.4.9.CP.CR2

Comment 24 JBoss JIRA Server 2016-08-04 09:31:31 UTC

Bartosz Baranowski <bbaranow> updated the status of jira JBEAP-3575 to Resolved

Comment 25 JBoss JIRA Server 2016-08-25 14:05:39 UTC

Jiří Bílek <jbilek> updated the status of jira JBEAP-3575 to Reopened

Comment 26 JBoss JIRA Server 2016-08-29 13:18:02 UTC

Bartosz Baranowski <bbaranow> updated the status of jira JBEAP-3575 to Resolved

Comment 27 JBoss JIRA Server 2016-09-07 11:01:46 UTC

Jiří Bílek <jbilek> updated the status of jira JBEAP-3575 to Reopened

Comment 28 JBoss JIRA Server 2016-09-19 12:50:32 UTC

Jiří Bílek <jbilek> updated the status of jira JBEAP-3575 to Coding In Progress

Comment 29 JBoss JIRA Server 2016-09-19 12:50:37 UTC

Jiří Bílek <jbilek> updated the status of jira JBEAP-3575 to Open

Comment 30 JBoss JIRA Server 2016-09-19 13:16:01 UTC

Jiří Bílek <jbilek> updated the status of jira JBEAP-3575 to Resolved

Comment 31 Petr Penicka 2017-01-17 12:59:11 UTC

Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.

Comment 32 Petr Penicka 2017-01-17 12:59:15 UTC

Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.