Bug 1077216

Summary:

Some "uid" left in tx log after crash recovery. JTS only.

Product:

[JBoss] JBoss Enterprise Application Platform 6

Reporter:

Hayk Hovsepyan <hhovsepy>

Component:

Transaction Manager

Assignee:

Gytis Trikleris <gtrikler>

Status:

CLOSED NOTABUG

QA Contact:

Hayk Hovsepyan <hhovsepy>

Severity:

low

Docs Contact:

Russell Dickenson <rdickens>

Priority:

unspecified

Version:

TBD EAP 6

CC:

ochaloup

Target Milestone:

---

Target Release:

EAP 6.4.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2014-08-14 15:28:18 UTC

Type:

Bug

Attachments:

Description	Flags
JMSCrashRec	none

Description Hayk Hovsepyan 2014-03-17 13:40:44 UTC

Created attachment 875493
JMSCrashRec

Description of problem:
In some crash recovery scenarios, a "uid" remains in the TX log after recovery.
This happens only with JTS. With JTA the test passes consistently.

Version-Release number of selected component (if applicable):
EAP 6.2.0, EAP 6.3.0 DR2

How reproducible:
Intermittent; not JDK related.

Steps to Reproduce:
1. Call StlessSB on server 1, which sends a message to its own queue.
2. On the server, add a mock test XA resource to the transaction before sending the message.
3. Crash server 1 when the transaction enters the "commit" method of the test XA resource.
4. Reboot the server and call recovery for it.
5. Check that the message sent to the server queue is committed.
6. Check that the server has no remaining "uid" in the TX log. This is where the test fails: the server still has a remaining "uid".
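
Steps 2 and 3 could be approximated with a mock resource along the following lines. This is only a sketch: `CrashingXAResource` and its `crashAction` parameter are hypothetical names, not the class used in the eap-tests-transactions suite, and a real test resource would also need to persist the prepared Xid so that `recover()` can return it after the reboot.

```java
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

/** Hypothetical mock XA resource that "crashes" on entry to commit(). */
class CrashingXAResource implements XAResource {
    // What to do when commit() is entered; injectable so it can be exercised without killing the JVM.
    private final Runnable crashAction;

    CrashingXAResource(Runnable crashAction) {
        this.crashAction = crashAction;
    }

    /** Variant that halts the JVM immediately, without running shutdown hooks. */
    static CrashingXAResource haltingJvm() {
        return new CrashingXAResource(() -> Runtime.getRuntime().halt(1));
    }

    @Override public int prepare(Xid xid) { return XA_OK; }  // vote to commit
    @Override public void commit(Xid xid, boolean onePhase) { crashAction.run(); }
    @Override public void rollback(Xid xid) { }
    @Override public void start(Xid xid, int flags) { }
    @Override public void end(Xid xid, int flags) { }
    @Override public void forget(Xid xid) { }
    // Assumption: nothing is persisted in this sketch, so no in-doubt Xids are reported.
    @Override public Xid[] recover(int flag) { return new Xid[0]; }
    @Override public boolean isSameRM(XAResource other) { return other == this; }
    @Override public int getTransactionTimeout() { return 0; }
    @Override public boolean setTransactionTimeout(int seconds) { return false; }
}
```

Enlisting `CrashingXAResource.haltingJvm()` in the transaction before the message is sent would leave the transaction prepared but not committed when the server dies.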

Actual results:
Server has remaining "uid".

Expected results:
Server should not have remaining "uid".

Additional info:

Please find the log file attached.

The project git repo is http://git.app.eng.bos.redhat.com/git/jbossqe/eap-tests-transactions.git (under 'master').

To run the scenario locally, change to the directory "..../eap-tests-transactions/integration/jbossts" and run "mvn clean verify -Dtest=JMSMdbCrashRecoveryTestCase#commitHaltRev -Djbossts.hqobjectstore -Djboss.dist=${eap-6.3-home}"

Comment 1 tom.jenkinson 2014-03-17 16:22:16 UTC

Does this only fail on hqstore?

Comment 2 Hayk Hovsepyan 2014-03-17 16:31:59 UTC

It fails for the standard store as well.

Comment 3 tom.jenkinson 2014-03-19 15:00:41 UTC

Hi Hayk,

Sorry for the delay. I can explain what is happening.

With JTS we have what is known as top-down and bottom-up recovery. When a resource calls replay completion on the coordinator, the return value tells it whether to commit or not. Simultaneously, the coordinator takes the opportunity to complete the entire transaction. Therefore there is a small race between the (threaded) coordinator and the resource's recovery manager to complete the resource.

If the coordinator completes the resource, it will know the outcome and automatically clean up its transaction log.

If the resource completes itself, the coordinator, when it tries to complete the resource, receives a warning status, so it leaves the transaction in the store.

After 3 attempts to commit the transaction that each get OBJECT_NOT_EXIST, the transaction is assumed to have fully committed its resources.

In the debugger it looks like, depending on timing, it is easy for this counter not to reach 3, so the entries will still be in the object store. Each time a branch completes, the counter is reset, and in total you only have 3 recovery scans, so by default it should be impossible for recovery (of bottom-up completed resources) to remove the entry. It only passes when top-down recovery wins the race.

Tom
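
The counter behaviour described in the comment above can be modelled roughly as follows. This is a simplified illustration, not Narayana code: the class, enum, and method names are hypothetical, and the only facts taken from the comment are the threshold of 3 OBJECT_NOT_EXIST results and the reset whenever a branch completes.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified model of the bottom-up case; all names here are hypothetical. */
class RecoveryCounterSketch {
    /** What one coordinator-side recovery scan observes for the transaction. */
    enum ScanResult { BRANCH_COMPLETED, OBJECT_NOT_EXIST }

    // From the comment: 3 attempts that get OBJECT_NOT_EXIST.
    static final int THRESHOLD = 3;

    /** Returns true if the log entry is still in the object store after the given scans. */
    static boolean entryRemains(List<ScanResult> scans) {
        int notExistCount = 0;
        for (ScanResult result : scans) {
            if (result == ScanResult.BRANCH_COMPLETED) {
                notExistCount = 0;   // counter is reset each time a branch completes
            } else if (++notExistCount >= THRESHOLD) {
                return false;        // assumed fully committed, entry removed
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Three scans, the first of which sees a branch complete: counter only reaches 2.
        System.out.println(entryRemains(Arrays.asList(
                ScanResult.BRANCH_COMPLETED,
                ScanResult.OBJECT_NOT_EXIST,
                ScanResult.OBJECT_NOT_EXIST)));
        // Three further scans after all branches have completed: counter reaches 3.
        System.out.println(entryRemains(Arrays.asList(
                ScanResult.BRANCH_COMPLETED,
                ScanResult.OBJECT_NOT_EXIST,
                ScanResult.OBJECT_NOT_EXIST,
                ScanResult.OBJECT_NOT_EXIST)));
    }
}
```

In this model the first call reports that the entry remains and the second reports that it was removed, which matches the point that three scans in total are not enough once any of them resets the counter.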

Comment 4 Hayk Hovsepyan 2014-03-20 16:10:04 UTC

Hi Tom,

Thanks for the detailed description.

So what would be the solution or workaround here so that no uid is left in the log?
I tried calling "recovery" 3 times, assuming that after 3 attempts the transaction would be considered fully committed and the log would be emptied, but the uid is still there.

/Hayk

Comment 5 tom.jenkinson 2014-03-20 16:19:14 UTC

Hi Hayk,

_After_ it has recovered the HQ x2 and TestXAResource, three recovery calls should be enough. I wouldn't think you need the minute wait between recovery scans if you are calling it yourself.

Tom

Comment 6 tom.jenkinson 2014-05-07 15:13:07 UTC

Did you try three recovery calls?

Comment 7 Hayk Hovsepyan 2014-05-07 15:20:04 UTC

Yes, it calls recovery 3 times, and the problem still exists.

Comment 8 Hayk Hovsepyan 2014-08-14 15:28:18 UTC

The problem was in the test framework.
Thanks, Gytis, for researching this.