Bug 1009981
Summary: | Periodic recovery on committed transactions | ||||||
---|---|---|---|---|---|---|---|
Product: | [JBoss] JBoss Enterprise Application Platform 6 | Reporter: | Miroslav Novak <mnovak> | ||||
Component: | Transaction Manager | Assignee: | tom.jenkinson | ||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Ondrej Chaloupka <ochaloup> | ||||
Severity: | medium | Docs Contact: | |||||
Priority: | unspecified | ||||||
Version: | 6.2.0 | CC: | stephan.vollmer | ||||
Target Milestone: | ER6 | ||||||
Target Release: | EAP 6.2.0 | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | Known Issue | |||||
Doc Text: |
When the server is under an excessive load, the server's log may contain messages like those below, indicating that a transaction rollback had occurred.
----
17:20:00,441 ERROR [stderr] (Periodic Recovery) javax.transaction.xa.XAException
17:20:00,441 ERROR [stderr] (Periodic Recovery) at org.hornetq.core.client.impl.ClientSessionImpl.rollback(ClientSessionImpl.java:1692)
17:20:00,441 ERROR [stderr] (Periodic Recovery) at org.hornetq.core.client.impl.DelegatingSession.rollback(DelegatingSession.java:494)
17:20:00,441 ERROR [stderr] (Periodic Recovery) at org.hornetq.jms.server.recovery.HornetQXAResourceWrapper.rollback(HornetQXAResourceWrapper.java:126)
17:20:00,441 ERROR [stderr] (Periodic Recovery) at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.handleOrphan(XARecoveryModule.java:738)
17:20:00,441 ERROR [stderr] (Periodic Recovery) at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.xaRecoverySecondPass(XARecoveryModule.java:644)
17:20:00,441 ERROR [stderr] (Periodic Recovery) at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.bottomUpRecovery(XARecoveryModule.java:417)
17:20:00,441 ERROR [stderr] (Periodic Recovery) at com.arjuna.ats.internal.jta.recovery.arjunacore.XARecoveryModule.periodicWorkSecondPass(XARecoveryModule.java:194)
17:20:00,441 ERROR [stderr] (Periodic Recovery) at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doWorkInternal(PeriodicRecovery.java:789)
17:20:00,441 ERROR [stderr] (Periodic Recovery) at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.run(PeriodicRecovery.java:371)
----
Under heavy load, the processing time taken by a transaction can overlap with the timing of the periodic recovery process's activity. The periodic recovery process detects the transaction still in progress and attempts to initiate a rollback but in fact the transaction continues to completion. At the time the periodic recovery attempts but fails the rollback, it records the rollback failure in the server log. The underlying cause of this issue will be addressed in a future release, but in the meantime a workaround is available.
Increase the interval between the two phases of the recovery process by setting the property `com.arjuna.ats.jta.orphanSafetyInterval`
to higher than the default of 10000 milliseconds. A value of 40000 milliseconds is recommended.
Please note that this does not solve the issue, instead it decreases the probability that it will occur and that the warning message will be shown in the log.
|
Story Points: | --- | ||||
Clone Of: | Environment: | ||||||
Last Closed: | 2013-12-15 16:19:41 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | 1012986 | ||||||
Bug Blocks: | 1040521 | ||||||
Attachments: |
|
Description
Miroslav Novak
2013-09-19 15:58:52 UTC
Created attachment 800020 [details]
sources_of_mdb
Note: Search for "mnovak" to find problem tx_uid. was there any crash? There is no crash. It's positive scenario. If you want I can give you access to prepared environment where is possible to run and debug the test. What can happen is this: 0. Transaction::commit begins 1. HornetQ is xa_prepare(xid1)'d 2. Recovery manager calls xa_recover() on HQ which returns xid1 from above 3. Transaction calls xa_commit(xid1) on HornetQ 3. System under load so takes a more than the RecoveryManager safety interval period (10 seconds) for the recovery thread to resume 4. Periodic recovery thread can find no trace of the transaction so it tries to rollback the XID This will not cause data integrity issues but will have spurious rollbacks in the logs Note that there is a field on RecoveryXids that is not configurable that defines how long the interval is between to decide if the Xid is indoubt. It is set to 10 seconds: https://github.com/jbosstm/narayana/blob/4.17/ArjunaJTA/jta/classes/com/arjuna/ats/internal/jta/recovery/arjunacore/RecoveryXids.java#L220 The commit over here shows the scenario: https://github.com/tomjenkinson/narayana/commit/cdf4785c65a31609c83019a448573cede545455d Hi Tom, thanks for your feedback. Do you think there is a way to get rid of those rollbacks? Our customers are watching server logs very closely and error messages when no crash happened is very confusing. Thanks, Mirek No problem Mirek, I suggest an enhancement request to make it configurable. If you raise one on Jira I will get right on it: https://issues.jboss.org/browse/JBTM Thanks, Tom Thanks Tom and sorry. For EAP 6 I have to track this issue on bz. Do you think you can get the enhancement to EAP 6.2.0.ER5? When is component freeze for EAP 6.2.0.ER5? I created an enhancement request over on Jira for you for the JBTM part. There would need to be a WFLY issue to expose the new config via the AS for it to be formally accepted as supportable (and QE would need to test it) So, like I mentioned above. I have raised a PR on JBTM with a change to allow the value to be configurable, however exposing this to via standalone.xml config would require a change to the app server. That said, with the PR merged, the config can still be enabled in an unsupported manner via a system property: com.arjuna.ats.jta.JTAEnvironmentBean.orphanSafetyInterval (default is 10000) The component upgrade linked to provides the ability to set the value you need via a system property, it does not expose the value to the CLI Michael Musgrove <mmusgrov> updated the status of jira JBTM-1943 to Closed Hi Tom, I help to Mirek with verification of this issue. The correct name of attribute that is needed to be set is com.arjuna.ats.jta.orphanSafetyInterval (default is 10000) Mirek ran his test with several values of the attribute. When we set the attribute to be 20s (20000) we still get the XAExceptions time to time. When we redefine the time being 40s (40000) the exception disappeared from the log. I think that I can verify this issue just don't you think that it would be better to set the default interval time a bit bigger? For not to get the error as default behavior. But I'm not sure how would it influence the whole recovery process. It's true that the problem occurs on faster machines under the load of receiving a bigger amount of messages. Ondra Hi Tom, a one more request. I've added a doc text to this bug but I don't think that my text is well written. Could you please check it and polish it? Thank you Ondra Hi Ondra, The text was broadly correct to me, I reworded the initial few sentences: "When a machine is under heavy load it may happen that the server log contains XAExceptions caused by the Recovery Manager attempting to rollback a resource that has already committed. The transaction is in fact not rolled back at all, the commit is done but meanwhile the Periodic Recovery process comes into play and potentially sees now stale resources" If you want the default time changing that is easy, but we should raise a different BZ I think. Tom Hi Tom, thank you. I agree with raising the different BZ for that. Just I will need to discussed it with Miroslav when he will come back to work after holidays. I think that it won't be possible for 6.2.0 either. Thanks Ondra I have drafted a release notes entry. Please review what I have written and provide feedback. I've slightly changed the doc text to say that the increasing timeout does not solve the problem in general but just the probability to see the warnings is lower. Minor grammatical editing to the Release Notes text. Its worth pointing out in the release notes that there is _no_ implication on data integrity. The transaction has already been committed prior to the attempt to rollback the resource manager. Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Closed Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Reopened Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Closed Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Reopened Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Closed Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Reopened Tom Jenkinson <tom.jenkinson> updated the status of jira JBTM-2583 to Closed |