Red Hat Bugzilla – Bug 1310603
[GSS](6.4.z) XA-Transaction commit inconsistent when slowing down process
Last modified: 2017-03-16 11:46:31 EDT
Created attachment 1129225 [details]
explains the test conditions and architecture
Description of problem:
The customer experienced a failure case within an XA transaction involving
- a local JMS queue,
- a remote JMS queue,
- and an Oracle database.
The two JMS XAResources are committed, but the Oracle XAResource cannot be committed because, at the same time, the recovery thread has rolled back the transaction.
It happens under heavy load, but the customer has identified a scenario with Byteman that reproduces the case systematically (please see the attached files "customize.btm" and "JMS Reliability.bmp").
Version-Release number of selected component (if applicable):
all EAP 6.x.x
See the attached PDF "use cases tests hornetq.pdf", which explains the test conditions and architecture.
RH QE has those test cases in place.
Steps to Reproduce:
1. Run test case SEND_DB_66_CPU while the attached Byteman scripts are active and you'll see:
- two-phase commit is initiated
- topLevelPrepare is done on 3 XAResources (local queue, remote queue and Oracle)
- ShadowNoFileLockStore is asked to write transaction log
- [BYTEMAN artificial load - Sleep 7min]
BUT at the same time:
- The recovery thread launches a recovery pass on the transaction
- It asks the different orphan filters to vote on each XAResource, starting with the Oracle XAResource:
o JTATransactionLogXAResourceOrphanFilter asks ShadowNoFileLockStore for the transaction status
o ShadowNoFileLockStore looks for a transaction log, but it has not been written to disk yet
o JTATransactionLogXAResourceOrphanFilter abstains from voting
o JTANodeNameXAResourceOrphanFilter decides to roll back
o [BYTEMAN artificial load - Sleep 3min]
The other thread wakes up:
- ShadowNoFileLockStore writes the transaction log
- the doCommit method is called
- [BYTEMAN artificial load - Sleep 1min]
The recovery thread wakes up:
- The Oracle transaction is rolled back
- handle orphan is called successively on the remote queue XAResource and the local queue XAResource
- as the transaction log now exists, JTATransactionLogXAResourceOrphanFilter returns LEAVE_ALONE
- neither of these two transactions is rolled back
The other thread wakes up:
- topLevelCommit on local queue => SUCCESS
- topLevelCommit on remote queue => SUCCESS
- topLevelCommit on oracle => FAILURE (ORA-24756: the transaction does not exist anymore, as it has been rolled back)
- the tx ends in state "heuristic" because the JMS resources are committed while the DB is rolled back.
- The "heuristic" state should be avoided: all resources should be either committed or rolled back.
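The interleaving above can be replayed deterministically in a small Java sketch. This is a simplified model, not Narayana code: the class OrphanRaceSketch, the LogStore stand-in, and the vote combination rule (LEAVE_ALONE wins, otherwise any ROLLBACK vote rolls back) are assumptions made for illustration. Only the filter behaviour (abstain when no log is found, LEAVE_ALONE once the log exists, rollback from the node-name filter) is taken from the steps listed above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class OrphanRaceSketch {
    enum Vote { ABSTAIN, ROLLBACK, LEAVE_ALONE }

    /** Hypothetical stand-in for the object store: ids of transactions whose log is on disk. */
    static final class LogStore {
        private final Set<String> written = new HashSet<>();
        void write(String txId) { written.add(txId); }
        boolean hasLog(String txId) { return written.contains(txId); }
    }

    /** Models the log-based filter: it protects a branch only once the log exists. */
    static Vote logFilter(LogStore store, String txId) {
        return store.hasLog(txId) ? Vote.LEAVE_ALONE : Vote.ABSTAIN;
    }

    /** Models the node-name filter: the branch belongs to this node, so it votes rollback. */
    static Vote nodeNameFilter() {
        return Vote.ROLLBACK;
    }

    /** Assumed combination rule: LEAVE_ALONE wins, otherwise any ROLLBACK vote rolls back. */
    static boolean shouldRollback(Vote... votes) {
        boolean rollback = false;
        for (Vote v : votes) {
            if (v == Vote.LEAVE_ALONE) return false;
            if (v == Vote.ROLLBACK) rollback = true;
        }
        return rollback;
    }

    /** Replays the reported interleaving step by step and returns the outcome per resource. */
    static List<String> replay() {
        LogStore store = new LogStore();
        String tx = "tx-1";

        // Commit thread: prepare is done on all three resources, log write is stalled.
        // Recovery thread: evaluates the Oracle branch while no log exists yet.
        boolean oracleRolledBack = shouldRollback(logFilter(store, tx), nodeNameFilter());

        // Commit thread resumes and writes the transaction log.
        store.write(tx);

        // Recovery thread resumes: it acts on its earlier Oracle decision,
        // then evaluates the two queue branches, which are now protected by the log.
        boolean remoteRolledBack = shouldRollback(logFilter(store, tx), nodeNameFilter());
        boolean localRolledBack = shouldRollback(logFilter(store, tx), nodeNameFilter());

        // Commit thread: phase two succeeds only on branches that still exist.
        List<String> outcome = new ArrayList<>();
        outcome.add("local queue: " + (localRolledBack ? "ROLLED_BACK" : "COMMITTED"));
        outcome.add("remote queue: " + (remoteRolledBack ? "ROLLED_BACK" : "COMMITTED"));
        outcome.add("oracle: " + (oracleRolledBack ? "ROLLED_BACK" : "COMMITTED"));
        return outcome;
    }

    public static void main(String[] args) {
        replay().forEach(System.out::println);
    }
}
```

The mixed result comes from the recovery thread deciding on the Oracle branch before the log exists and on the queue branches after it exists, with nothing tying the two decisions together.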
The customer is expecting an enhancement in EAP 6 to prevent this. In all of the customer's environments the TM and RM are co-located, so there should be a synchronization mechanism to prevent the issue.
The initially proposed solution of setting orphanSafetyInterval to 10 minutes does not satisfy the customer's expectations. They can easily reproduce the problem by increasing the artificial load and sleep time.
Moreover, our production team is not bound to react within such a time limit (< 10 minutes) when a problem occurs.
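To illustrate why a fixed interval only moves the problem, here is a minimal Java model of a safety window. It assumes the interval works as a grace period counted from the moment recovery first sees a branch; the class and method names are hypothetical and this is not how Narayana necessarily implements orphanSafetyInterval.

```java
import java.util.concurrent.TimeUnit;

class SafetyIntervalSketch {
    /** Assumed rule: a branch may be treated as an orphan once it has been visible for the whole interval. */
    static boolean eligibleForRollback(long firstSeenMillis, long nowMillis, long safetyIntervalMillis) {
        return nowMillis - firstSeenMillis >= safetyIntervalMillis;
    }

    /** The window protects a prepared branch only if the committing thread stalls for less than the interval. */
    static boolean stallIsCovered(long stallMillis, long safetyIntervalMillis) {
        return stallMillis < safetyIntervalMillis;
    }

    public static void main(String[] args) {
        long interval = TimeUnit.MINUTES.toMillis(10);
        // The 7 minute Byteman sleep from the reproducer fits inside a 10 minute window...
        System.out.println(stallIsCovered(TimeUnit.MINUTES.toMillis(7), interval));
        // ...but a longer artificial sleep does not.
        System.out.println(stallIsCovered(TimeUnit.MINUTES.toMillis(11), interval));
    }
}
```

Under this model any stall longer than the configured interval reopens the race, which matches the customer's observation that increasing the sleep time reproduces the problem.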
A feasible solution might be based on https://issues.jboss.org/browse/JBTM-2583, as it might cover the scenario described above.
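One possible shape for the synchronization the customer asks for, when TM and RM share a JVM, is an extra filter that consults an in-memory table of transactions still being completed. The Java sketch below is only an illustration of that idea and is not claimed to be what JBTM-2583 implements; InFlightRegistrySketch and all of its methods are hypothetical names.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class InFlightRegistrySketch {
    enum Vote { ABSTAIN, ROLLBACK, LEAVE_ALONE }

    // Hypothetical in-memory set of transactions currently being completed in this JVM.
    private static final Set<String> IN_FLIGHT = ConcurrentHashMap.newKeySet();

    /** Called by the committing thread before prepare starts. */
    static void begin(String txId) {
        IN_FLIGHT.add(txId);
    }

    /** Called by the committing thread after phase two has finished. */
    static void end(String txId) {
        IN_FLIGHT.remove(txId);
    }

    /** Votes LEAVE_ALONE while the owning thread is still working, whether or not a log exists yet. */
    static Vote inFlightFilter(String txId) {
        return IN_FLIGHT.contains(txId) ? Vote.LEAVE_ALONE : Vote.ABSTAIN;
    }
}
```

Because the entry is registered before prepare and removed only after commit, the vote does not depend on how long the log write is delayed. This only helps when the recovery pass runs in the same JVM as the transaction.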
Created attachment 1129226 [details]
Created attachment 1129227 [details]
byteman script for reproducer
Tom Jenkinson <email@example.com> updated the status of jira JBEAP-3575 to Coding In Progress
Bartosz Baranowski <firstname.lastname@example.org> updated the status of jira JBEAP-3575 to Coding In Progress
Tom Jenkinson <email@example.com> updated the status of jira JBTM-2583 to Reopened
Tom Jenkinson <firstname.lastname@example.org> updated the status of jira JBTM-2583 to Closed
Verified with EAP 6.4.9.CP.CR2
Bartosz Baranowski <email@example.com> updated the status of jira JBEAP-3575 to Resolved
Jiří Bílek <firstname.lastname@example.org> updated the status of jira JBEAP-3575 to Reopened
Jiří Bílek <email@example.com> updated the status of jira JBEAP-3575 to Coding In Progress
Jiří Bílek <firstname.lastname@example.org> updated the status of jira JBEAP-3575 to Open
Jiří Bílek <email@example.com> updated the status of jira JBEAP-3575 to Resolved
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.