Bug 1001145

Summary: Crash recovery for tx started in MDB does not fully clean the transaction log
Product: [JBoss] JBoss Enterprise Application Platform 6 Reporter: Ondrej Chaloupka <ochaloup>
Component: Transaction ManagerAssignee: tom.jenkinson
Status: CLOSED NOTABUG QA Contact: Ondrej Chaloupka <ochaloup>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.1.1   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-09-18 14:03:52 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Ondrej Chaloupka 2013-08-26 14:46:30 UTC
Hi Tom,

I think that I found an issue in recovery when it's run for transaction started by MDB.

In case of running "commitHaltRev" test (halting server before commit was done on JMS XA resource) I'm getting an assertion that after the recovery a record is not fully removed from the transaction log.

I was trying to find the reason but I'm not able to get it.
The messages are correctly recovered. What in this stage means that they were commited.

I can find in the log information like:
 FileSystemStore.closeAndUnlock(/home/ochaloup/Transactions/eap-tests-transactions/integration/jbossts/target/jbossas-jbossts/standalone/data/tx-object-store/Shado     wNoFileLockStore/defaultStore/EISNAME/0_ffff7f000001_2d5ac39_521b665e_1e, java.io.FileInputStream@3abe8a64, null)
1207 16:30:29,492 INFO  [org.jboss.as.test.jbossts.common.TestXAResourceRecoveryHelper] (ServerService Thread Pool -- 4) Recovered [< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff7f000001:2d5ac39:521b665e:12, node_name=1, br     anch_uid=0:ffff7f000001:2d5ac39:521b665e:19, subordinatenodename=null, eis_name=unknown eis name >] Xids
...
TransactionReaper::remove ( BasicAction: 0:ffff7f000001:31b4cc0b:521b6681:e status: ActionStatus.COMMITTED )

but when the test finishes the transaction log still contains information about that transaction.

Comment 1 Ondrej Chaloupka 2013-08-26 14:46:53 UTC
The test is created in the way that an MDB is deployed to server and starts to receive message. The JVM is halted before on entry of the commit method for the JMS XA resource.
The deployment with MDB is removed from the stopped server and new deployment without the MDB is deployed to the server. That deployment contains beans to check the status of transactions etc.
It seems that the recovery process is triggered before the deployment is fully deployed. It correctly commits the messages (they were not commited before).
In case you can check the HornetQ journal like this:
cd $JBOSS_HOME/standalone/data
java -cp $JBOSS_HOME/bin/client/jboss-client.jar:$JBOSS_HOME/modules/system/layers/base/org/hornetq/main/* org.hornetq.core.persistence.impl.journal.PrintData messagingbindings messagingjournal > journal.txt

Comment 3 tom.jenkinson 2013-09-02 16:05:37 UTC
Hi Ondra,

This is what is happening.

Phase 1 of your test:
enlist MDB XAR
enlist TestResource
enlist QueueSender XAR (isSameRM returns false for the MDB XAR so it is different)
commit MDB XAR
commit TestResource - this is when the VM fails

Phase 2 of your test
first recovery: can't find any XA records as no helpers are registered yet
second recovery: can't find MDB XAR - because this was committed before, commit indoubt TestResource record, commit indoubt QueueSender XAR (via HornetQXAResourceWrapper)
As the recovery system doesn't know what has happened with the MDB XAR, it is unable to complete recovery as it is waiting for th MDB XAR to pass it an indoubt branch back (which it clearly never will).

This is all normal, you can configure the system so that eventually it would move the transaction to assumed completed. This link describes the problem:
http://docs.jboss.org/jbosstm/5.0.0.M1/guides/failure_recovery_guide/ch02.html#d0e561

Note that setting assumed complete is erroneous in WildFly as the recovery manager can run before all the recovery modules are set. As you can see above, after the first recovery scan we would have assumed that all the resources had completed and cleaned up the transaction.

Hope that explains the problem. The solution for this is to vote for: https://issues.jboss.org/browse/JBTM-860

Tom

Comment 4 Ondrej Chaloupka 2013-09-03 15:26:51 UTC
Hi Tom,

I've checked it and I think that there could be still a problem somewhere.
The recovery tests contains two types of the tests which halt server at commit phase.
 - commitHalt
 - commitHaltRev

The first one halts the jvm after commit was invoked on JPA/JMS XA resource. I think that it is case that you were describing. Ivo mentioned that parameter for JTA (-DJTAEnvironmentBean.xaAssumeRecoveryComplete=true) could help but still it is supposed to be problem for JTS.

The case of the "commitHaltRev" test is different (the case that I was trying to describe). The server is halt before commit was done on JPA/JMS XA resource. So the workflow is like:
Phase 1
enlist TestResource
enlist MDB XAR
enlist QueueSender XAR
halt server at the enter of the TestResource commit method

Phase 2
recovery for both resources and commiting the changes (as they were already prepared) - the changes are correctly committed (at least the destination queue receives the message)

Don't you think that this could be an issue?

Thanks
Ondra

Comment 5 tom.jenkinson 2013-09-04 21:18:34 UTC
Hi Ondra,

The situation I was describing was what I observed running the following command:

mvn clean verify -Dtest=JMSMdbCrashRecoveryTestCase#commitHaltRev -Djboss.dist=$JBOSS_HOME -Djbossts.noJTS -Djpda -Dmaven.surefire.debug="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8000 -Xnoagent -Djava.compiler=NONE"

As you can see, I am executing commitHaltRev.

When I run it in the debugger I can see that it isn't the TestResource that is enlisted first, regardless of what the test intends (it is probably a consequence of JCA).

If you want to check it put a breakpoint on line 248 of JMSMdbCrashRecoveryTestCase (to stop it killing the rest of the test when you have your next breakpoint). Then put a breakpoint in ArjunaJTA/jta/.../arjunacore/TransactionImple on 581 and you will see that the first resource is a JMS one.

As I mentioned initially, I would steer clear of xaAssumeRecoveryComplete as it will break your tests for JTA too as the resource recovery helpers are not registered in time for the first cycle of recovery.

Hope that helps,
Tom

Comment 6 tom.jenkinson 2013-09-12 09:02:56 UTC
Hi Ondra,

Have you had chance to consider my response above yet?

Thanks,
Tom

Comment 7 Ondrej Chaloupka 2013-09-12 10:46:57 UTC
Hi Tom,

I'm sorry that I've not responded yet but I would like to go through it a bit more and better debug it to understand. But I haven't found time for that. 

I see your point and I need to check the enlistment order.

But still when I was debugging the test before I got to the conclusion that the JMS Resource is not committed before the server halted. I was talking about the order of the resources because the byteman rule is bound to the method called on the TestResource. Because of it it's important whether commit is called first on TestResource or on the JMS XA Resource. The server crashes each time (commitHalt, commitHaltRev) at the start of the commit method on the TestResource.

But back to my doubt. I stopped the test at time when the server crashed. I checked the transaction log and the log of the HornetQ. The both logs contained information that the transaction is in prepared state. The transaction (or at least hq log talked about it) was not committed yet. The transaction was committed when server started second time (after the crash recovery run). I could see that in the HQ log - the log contained information about commit. But in contrary the TM log didn't clear the log.

It's possible that I missed something.

Comment 8 Ondrej Chaloupka 2013-09-18 14:03:52 UTC
Hi Tom,

you are absolutely right. The problem was that the test didn't count with MDB Inbound  queue that was enlisted first (the test contains 3 XA resources - In Queue, Outbound queue and Test XA resource). And the XA Inbound queue resource was committed before the server crash. The HornetQ forgot the info about that resource as it was committed and the recovery manager was not able to find it afterwards.

Thank you for your input and sorry for inconveniences
Ondra