Bug 1077216 - Some "uid" left in tx log after crash recovery. JTS only.
Summary: Some "uid" left in tx log after crash recovery. JTS only.
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: Transaction Manager
Version: TBD EAP 6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: EAP 6.4.0
Assignee: Gytis Trikleris
QA Contact: Hayk Hovsepyan
Docs Contact: Russell Dickenson
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-03-17 13:40 UTC by Hayk Hovsepyan
Modified: 2014-08-14 15:28 UTC
CC List: 1 user

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2014-08-14 15:28:18 UTC
Type: Bug
Embargoed:


Attachments
JMSCrashRec (3.32 MB, text/x-log)
2014-03-17 13:40 UTC, Hayk Hovsepyan

Description Hayk Hovsepyan 2014-03-17 13:40:44 UTC
Created attachment 875493
JMSCrashRec

Description of problem:
In some cases, a "uid" remains in the TX log after a crash recovery scenario.
This happens only with JTS. With JTA the test passes consistently.

Version-Release number of selected component (if applicable):
EAP 6.2.0, EAP 6.3.0 DR2

How reproducible:
Intermittent; not JDK-related.

Steps to Reproduce:
1. Call a stateless session bean (StlessSB) on server 1, which sends a message to its own queue.
2. On the server, enlist a mock test XA resource in the transaction before sending the message.
3. Crash server 1 when the transaction enters the "commit" method of the test XA resource.
4. Restart the server and run recovery on it.
5. Check that the message sent to the server queue has been committed.
6. Check that the server has no remaining "uid" in the TX log. This is where the test fails: the server still has a remaining "uid".

Actual results:
The server has a remaining "uid".

Expected results:
The server should not have any remaining "uid".

Additional info:

Please find the log file attached.

The project Git repository is http://git.app.eng.bos.redhat.com/git/jbossqe/eap-tests-transactions.git, on the 'master' branch.

To run the scenario locally, change to the directory "..../eap-tests-transactions/integration/jbossts" and run "mvn clean verify -Dtest=JMSMdbCrashRecoveryTestCase#commitHaltRev -Djbossts.hqobjectstore -Djboss.dist=${eap-6.3-home}"

Comment 1 tom.jenkinson 2014-03-17 16:22:16 UTC
Does this fail only with hqstore?

Comment 2 Hayk Hovsepyan 2014-03-17 16:31:59 UTC
It fails with the standard store as well.

Comment 3 tom.jenkinson 2014-03-19 15:00:41 UTC
Hi Hayk,

Sorry for the delay. I can explain what is happening.

With JTS we have what are known as top-down and bottom-up recovery. When a resource calls replay_completion on the coordinator, the return value tells it whether to commit or not. At the same time, the coordinator takes the opportunity to complete the entire transaction. Therefore there is a small race between the (threaded) coordinator and the resource's recovery manager to complete the resource.

If the coordinator completes the resource, it knows the outcome and automatically cleans up its transaction log.

If the resource completes itself, the coordinator receives a warning status when it then tries to complete the resource, so it leaves the transaction in the store.
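
To make the race concrete, here is a minimal, hypothetical Java model of it; it is not Narayana code, and the names RecoveryRaceSketch, Completer, raceOnce and logCleaned are invented for illustration. Two threads (standing in for top-down and bottom-up recovery) are released at the same moment and try to claim the resource with a compare-and-set; only the winner completes it, and the log entry is assumed to be removed immediately only when the coordinator wins.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical model of the top-down vs bottom-up race; not Narayana code. */
public class RecoveryRaceSketch {
    enum Completer { COORDINATOR, RESOURCE }

    /** Releases both sides at once and returns whichever completed the resource first. */
    static Completer raceOnce() {
        AtomicReference<Completer> winner = new AtomicReference<>();
        CountDownLatch start = new CountDownLatch(1);
        Thread topDown = new Thread(() -> claim(start, winner, Completer.COORDINATOR));
        Thread bottomUp = new Thread(() -> claim(start, winner, Completer.RESOURCE));
        topDown.start();
        bottomUp.start();
        start.countDown(); // release both sides together
        joinQuietly(topDown);
        joinQuietly(bottomUp);
        return winner.get();
    }

    private static void claim(CountDownLatch start, AtomicReference<Completer> winner, Completer who) {
        try {
            start.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        // Only the first side to arrive gets to complete the resource.
        winner.compareAndSet(null, who);
    }

    private static void joinQuietly(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Top-down win: the coordinator knows the outcome and removes its log entry. */
    static boolean logCleaned(Completer winner) {
        // Bottom-up win: the coordinator later gets a warning status and keeps the entry.
        return winner == Completer.COORDINATOR;
    }

    public static void main(String[] args) {
        int runs = 1000, cleaned = 0;
        for (int i = 0; i < runs; i++) {
            if (logCleaned(raceOnce())) cleaned++;
        }
        System.out.println("entry removed immediately in " + cleaned + " of " + runs + " runs");
    }
}
```

The count printed by main varies from run to run, which mirrors the intermittent failure reported above.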

After 3 attempts to commit the transaction that each return OBJECT_NOT_EXIST, the transaction is assumed to have fully committed its resources.

In the debugger it looks like, depending on timing, this counter can easily fail to reach 3, so the entries stay in the object store. The counter is reset each time a branch completes, and in total you only have 3 recovery scans, so by default it should be impossible for recovery to remove the entry when the resources completed bottom-up. The test only passes when top-down recovery wins the race.
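
The counter behaviour can be sketched as a small simulation. This is a hypothetical model, not the transaction manager's actual logic: it assumes the counter resets to 0 in any scan where a branch completes bottom-up, increments in any scan where every branch answers OBJECT_NOT_EXIST, and the entry is removed once it reaches 3. The class and method names are invented for illustration.

```java
import java.util.Set;

/** Hypothetical model of the OBJECT_NOT_EXIST counter; not the real recovery code. */
public class ObjectNotExistCounterSketch {
    static final int THRESHOLD = 3; // "after 3 attempts", per the comment above

    /**
     * Returns the 1-based scan in which the log entry is removed, or -1 if it
     * survives all scans. branchCompletionScans lists the scans in which some
     * branch completed bottom-up (assumed to reset the counter).
     */
    static int removalScan(int totalScans, Set<Integer> branchCompletionScans) {
        int counter = 0;
        for (int scan = 1; scan <= totalScans; scan++) {
            if (branchCompletionScans.contains(scan)) {
                counter = 0; // a branch completed: start counting again
            } else {
                counter++; // every branch answered OBJECT_NOT_EXIST
                if (counter >= THRESHOLD) {
                    return scan;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Only 3 scans and a branch completes in scan 1: the counter tops out at 2.
        System.out.println(removalScan(3, Set.of(1))); // -1
        // One more scan lets the counter reach 3, so the entry goes in scan 4.
        System.out.println(removalScan(4, Set.of(1))); // 4
    }
}
```

Under these assumptions, any bottom-up completion during the 3 available scans leaves the counter short of the threshold, which matches the observation that the entry can only be removed when top-down recovery wins.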

Tom

Comment 4 Hayk Hovsepyan 2014-03-20 16:10:04 UTC
Hi Tom,

Thanks for the detailed description.

So what would be the solution or workaround to avoid leaving any uid in the log?
I tried calling "recovery" 3 times, assuming that after 3 attempts the transaction would be considered fully committed and the log emptied, but the uid is still there.

/Hayk

Comment 5 tom.jenkinson 2014-03-20 16:19:14 UTC
Hi Hayk,

_After_ it has recovered the two HQ resources and the TestXAResource, three recovery calls should be enough. I don't think you need the one-minute wait between recovery scans if you are triggering them yourself.
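
On the test side, the suggested ordering could look like the following sketch. The RecoveryDriver interface is hypothetical, a stand-in for however the test framework triggers a recovery scan and checks that the HQ and TestXAResource branches are done; it is not an existing JBoss or Narayana API.

```java
/** Hypothetical test-side helper; RecoveryDriver is not a real JBoss/Narayana API. */
public class RecoveryWorkaroundSketch {
    /** Stand-in for whatever the test framework uses to drive recovery. */
    interface RecoveryDriver {
        void scan();                    // trigger one recovery scan
        boolean allBranchesRecovered(); // e.g. both HQ resources and TestXAResource done
    }

    /** Scans until every branch is recovered, then runs three more scans. */
    static int recoverAndClean(RecoveryDriver driver, int maxScans) {
        int scans = 0;
        while (!driver.allBranchesRecovered() && scans < maxScans) {
            driver.scan();
            scans++;
        }
        if (!driver.allBranchesRecovered()) {
            throw new IllegalStateException("branches not recovered after " + scans + " scans");
        }
        // Only scans after the last branch completed can advance the counter to 3.
        for (int i = 0; i < 3; i++) {
            driver.scan();
            scans++;
        }
        return scans;
    }

    public static void main(String[] args) {
        int[] count = {0};
        // Fake driver: the branches count as recovered once two scans have run.
        RecoveryDriver fake = new RecoveryDriver() {
            public void scan() { count[0]++; }
            public boolean allBranchesRecovered() { return count[0] >= 2; }
        };
        System.out.println(recoverAndClean(fake, 10)); // 5
    }
}
```

The point of the design is that the three extra scans are counted from the moment all branches are recovered, not from the restart.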

Tom

Comment 6 tom.jenkinson 2014-05-07 15:13:07 UTC
Did you try three recovery calls?

Comment 7 Hayk Hovsepyan 2014-05-07 15:20:04 UTC
Yes, it calls recovery 3 times, and the problem still exists.

Comment 8 Hayk Hovsepyan 2014-08-14 15:28:18 UTC
The problem was in the test framework.
Thanks, Gytis, for researching this.

