1009981 – Periodic recovery on committed transactions

Bug 1009981 - Periodic recovery on committed transactions

Summary: Periodic recovery on committed transactions

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	JBoss Enterprise Application Platform 6
Classification:	JBoss
Component:	Transaction Manager
Sub Component:
Version:	6.2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	ER6
Target Release:	EAP 6.2.0
Assignee:	tom.jenkinson
QA Contact:	Ondrej Chaloupka
Docs Contact:
URL:
Whiteboard:
Depends On:	1012986
Blocks:	1040521
TreeView+	depends on / blocked

Reported:	2013-09-19 15:58 UTC by Miroslav Novak
Modified:	2017-10-10 00:11 UTC (History)
CC List:	1 user (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2013-12-15 16:19:41 UTC
Type:	Bug
Embargoed:

Attachments	(Terms of Use)
sources_of_mdb (3.62 KB, text/x-java) 2013-09-19 16:00 UTC, Miroslav Novak	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1164184	unspecified	CLOSED	[Doc Bug Fix] Release notes - Known issue - Periodic recovery on committed transactions	2023-09-14 02:50:52 UTC
Red Hat Issue Tracker	JBTM-1943	Major	Closed	Allow the safety interval for RecoveryXids to be configurable	2020-10-07 09:59:32 UTC
Red Hat Issue Tracker	JBTM-2583	Major	Closed	Use the local ActionStatusService to determine if a transaction containing XAResources is still in-flight before relying...	2020-10-07 09:59:32 UTC

Internal Links: 1164184

Description Miroslav Novak 2013-09-19 15:58:52 UTC

Description of problem:

Test scenario:
1. Start EAP 6.2.0.ER1 server with queues InQueue and OutQueue
2. Send 10000 messages to InQueue
3. Deploy MDB which reads messages from InQueue and sends to OutQueue
4. Wait until MDB processes all messages and start consumer which reads all messages from OutQueue

Result:
During step 3. there are errors in log:
17:20:00,441 ERROR [stderr] (Periodic Recovery) javax.transaction.xa.XAException
17:20:00,441 ERROR [stderr] (Periodic Recovery)         at org.hornetq.core.client.impl.ClientSessionImpl.rollback(ClientSessionImpl.java:1692)
17:20:00,441 ERROR [stderr] (Periodic Recovery)         at org.hornetq.core.client.impl.DelegatingSession.rollback(DelegatingSession.java:494)
17:20:00,441 ERROR [stderr] (Periodic Recovery)         at org.hornetq.jms.server.recovery.HornetQXAResourceWrapper.rollback(HornetQXAResourceWrapper.java:126)
17:20:00,441 ERROR [stderr] (Periodic Recovery)         at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.handleOrphan(XARecoveryModule.java:738)
17:20:00,441 ERROR [stderr] (Periodic Recovery)         at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.xaRecoverySecondPass(XARecoveryModule.java:644)
17:20:00,441 ERROR [stderr] (Periodic Recovery)         at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.bottomUpRecovery(XARecoveryModule.java:417)
17:20:00,441 ERROR [stderr] (Periodic Recovery)         at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.periodicWorkSecondPass(XARecoveryModule.java:194)
17:20:00,441 ERROR [stderr] (Periodic Recovery)         at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doWorkInternal(PeriodicRecovery.java:789)
17:20:00,441 ERROR [stderr] (Periodic Recovery)         at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.run(PeriodicRecovery.java:371)

Error above is caused by calling rollback on XID during periodic recovery which HornetQ RA does not know and throw XAException to TM which then e.printStackTrace() to log:

17:20:00,445 ERROR [org.hornetq.core.client] (Periodic Recovery) mnovak: Now this is the problem!!!!!!!!!! Calling rollback on transaction with xid:: xid=XidImpl (1237619510 bq:0.0.0.0.0.0.0.0.0.0.-1.-1.127.0.0.1.-60.28.-71.-69.82.59.21.-109.0.0.21.-50.0.0.0.0.0.0.0.0 formatID:131077 gtxid:0.0.0.0.0.0.0.0.0.0.-1.-1.127.0.0.1.-60.28.-71.-69.82.59.21.-109.0.0.21.-51.53.50.52.51.53.50 base64:AAAAAAAAAAAAAP__fwAAAcQcubtSOxWTAAAVzgAAAAAAAAAAAAAAAAAAAAAAAP__fwAAAcQcubtSOxWTAAAVzTUyNDM1MgcCAgA=,clientXID=< formatId=131077, gtrid_length=34, bqual_length=36, tx_uid=0:ffff7f000001:-3be34645:523b1593:15cd, node_name=524352, branch_uid=0:ffff7f000001:-3be34645:523b1593:15ce, subordinatenodename=null, eis_name=unknown eis name > clientXID=< formatId=131077, gtrid_length=34, bqual_length=36, tx_uid=0:ffff7f000001:-3be34645:523b1593:15cd, node_name=524352, branch_uid=0:ffff7f000001:-3be34645:523b1593:15ce, subordinatenodename=null, eis_name=unknown eis name >. Response from this call is: PACKET(SessionXAResponseMessage)[type=55, channelID=10, packetObject=SessionXAResponseMessage] with error code: -4

Problem is that tx_uid: 0:ffff7f000001:-3be34645:523b1593:15cd was successfully committed and for some reason periodic recovery finds it (when it should be already removed) and tries to commit it.

Comment 2 Miroslav Novak 2013-09-19 16:00:46 UTC

Created attachment 800020 [details]
sources_of_mdb

Comment 3 Miroslav Novak 2013-09-19 16:01:55 UTC

Note:
Search for "mnovak" to find problem tx_uid.

Comment 4 tom.jenkinson 2013-09-19 16:12:34 UTC

was there any crash?

Comment 5 Miroslav Novak 2013-09-20 07:33:09 UTC

There is no crash. It's positive scenario. 

If you want I can give you access to prepared environment where is possible to run and debug the test.

Comment 6 tom.jenkinson 2013-09-20 14:29:05 UTC

What can happen is this:

0. Transaction::commit begins
1. HornetQ is xa_prepare(xid1)'d
2. Recovery manager calls xa_recover() on HQ which returns xid1 from above
3. Transaction calls xa_commit(xid1) on HornetQ
3. System under load so takes a more than the RecoveryManager safety interval  period (10 seconds) for the recovery thread to resume
4. Periodic recovery thread can find no trace of the transaction so it tries to rollback the XID

This will not cause data integrity issues but will have spurious rollbacks in the logs

Comment 7 tom.jenkinson 2013-09-20 14:43:17 UTC

Note that there is a field on RecoveryXids that is not configurable that defines how long the interval is between to decide if the Xid is indoubt. It is set to 10 seconds:
https://github.com/jbosstm/narayana/blob/4.17/ArjunaJTA/jta/classes/com/arjuna/ats/internal/jta/recovery/arjunacore/RecoveryXids.java#L220

Comment 8 tom.jenkinson 2013-09-20 15:01:14 UTC

The commit over here shows the scenario:
https://github.com/tomjenkinson/narayana/commit/cdf4785c65a31609c83019a448573cede545455d

Comment 9 Miroslav Novak 2013-09-25 14:08:38 UTC

Hi Tom,

thanks for your feedback. Do you think there is a way to get rid of those rollbacks?

Our customers are watching server logs very closely and error messages when no crash happened is very confusing. 

Thanks,

Mirek

Comment 10 tom.jenkinson 2013-09-25 14:12:25 UTC

No problem Mirek,

I suggest an enhancement request to make it configurable. If you raise one on Jira I will get right on it: https://issues.jboss.org/browse/JBTM

Thanks,
Tom

Comment 11 Miroslav Novak 2013-09-25 14:45:11 UTC

Thanks Tom and sorry. For EAP 6 I have to track this issue on bz. Do you think you can get the enhancement to EAP 6.2.0.ER5?

Comment 12 tom.jenkinson 2013-09-25 14:53:24 UTC

When is component freeze for EAP 6.2.0.ER5?

Comment 13 tom.jenkinson 2013-09-25 14:59:43 UTC

I created an enhancement request over on Jira for you for the JBTM part. There would need to be a WFLY issue to expose the new config via the AS for it to be formally accepted as supportable (and QE would need to test it)

Comment 14 tom.jenkinson 2013-09-25 15:34:41 UTC

So, like I mentioned above. I have raised a PR on JBTM with a change to allow the value to be configurable, however exposing this to via standalone.xml config would require a change to the app server.

That said, with the PR merged, the config can still be enabled in an unsupported manner via a system property:
com.arjuna.ats.jta.JTAEnvironmentBean.orphanSafetyInterval (default is 10000)

Comment 16 tom.jenkinson 2013-09-27 14:12:26 UTC

The component upgrade linked to provides the ability to set the value you need via a system property, it does not expose the value to the CLI

Comment 18 tom.jenkinson 2013-10-01 11:15:56 UTC

https://github.com/jbossas/jboss-eap/pull/478

Comment 19 JBoss JIRA Server 2013-10-01 16:07:44 UTC

Michael Musgrove <mmusgrov> updated the status of jira JBTM-1943 to Closed

Comment 23 Ondrej Chaloupka 2013-10-29 10:07:03 UTC

Hi Tom,

I help to Mirek with verification of this issue.
The correct name of attribute that is needed to be set is
com.arjuna.ats.jta.orphanSafetyInterval (default is 10000)

Mirek ran his test with several values of the attribute. When we set the attribute to be 20s (20000) we still get the XAExceptions time to time. When we redefine the time being 40s (40000) the exception disappeared from the log.

I think that I can verify this issue just don't you think that it would be better to set the default interval time a bit bigger?
For not to get the error as default behavior. But I'm not sure how would it influence the whole recovery process. It's true that the problem occurs on faster machines under the load of receiving a bigger amount of messages.

Ondra

Comment 24 Ondrej Chaloupka 2013-10-29 10:21:47 UTC

Hi Tom,

a one more request. I've added a doc text to this bug but I don't think that my text is well written. Could you please check it and polish it?

Thank you
Ondra

Comment 25 tom.jenkinson 2013-10-29 11:19:15 UTC

Hi Ondra,

The text was broadly correct to me, I reworded the initial few sentences:
"When a machine is under heavy load it may happen that the server log contains XAExceptions caused by the Recovery Manager attempting to rollback a resource that has already committed.  The transaction is in fact not rolled back at all, the commit is done but meanwhile the Periodic Recovery process comes into play and potentially sees now stale resources"

If you want the default time changing that is easy, but we should raise a different BZ I think. 

Tom

Comment 26 Ondrej Chaloupka 2013-10-29 14:33:10 UTC

Hi Tom,

thank you. I agree with raising the different BZ for that. Just I will need to discussed it with Miroslav when he will come back to work after holidays. I think that it won't be possible for 6.2.0 either.

Thanks
Ondra

Comment 27 Russell Dickenson 2013-11-26 04:33:15 UTC

I have drafted a release notes entry. Please review what I have written and provide feedback.

Comment 28 Ondrej Chaloupka 2013-11-26 13:35:46 UTC

I've slightly changed the doc text to say that the increasing timeout does not solve the problem in general but just the probability to see the warnings is lower.

Comment 29 Russell Dickenson 2013-11-26 21:37:17 UTC

Minor grammatical editing to the Release Notes text.

Comment 30 tom.jenkinson 2013-11-28 08:09:08 UTC

Its worth pointing out in the release notes that there is _no_ implication on data integrity. The transaction has already been committed prior to the attempt to rollback the resource manager.

Comment 32 JBoss JIRA Server 2016-03-02 15:03:30 UTC

Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Closed

Comment 33 JBoss JIRA Server 2016-05-18 07:30:20 UTC

Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Reopened

Comment 34 JBoss JIRA Server 2016-05-18 07:30:33 UTC

Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Closed

Comment 35 JBoss JIRA Server 2016-05-18 07:31:03 UTC

Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Reopened

Comment 36 JBoss JIRA Server 2016-05-18 07:31:12 UTC

Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Closed

Comment 37 JBoss JIRA Server 2016-06-01 20:05:03 UTC

Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Reopened

Comment 38 JBoss JIRA Server 2016-06-01 20:06:44 UTC

Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Closed

Note You need to log in before you can comment on or make changes to this bug.