Bug 952746 - [GSS](6.4.z) Fix transaction recovery failures involving remote EJB resource
Status: ASSIGNED
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: EJB
Version: 6.1.0
Hardware: All
OS: All
Priority: unspecified
Severity: high
Target Milestone: ER7
Target Release: EAP 6.4.0
Assigned To: Fedor Gavrilov
QA Contact: Ondrej Chaloupka
Depends On:
Blocks:
Reported: 2013-04-16 11:25 EDT by Jaikiran Pai
Modified: 2018-03-06 15:36 EST

See Also:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
In this release of JBoss EAP 6, transaction recovery operations can fail if they involve remote EJB resources that may have crashed. The issue occurs because when the connection between the server and the client breaks down (specifically, when the client crashes and is restarted), the server and the client will not automatically communicate with each other. In these scenarios, the server has no knowledge that the client has started again, which effectively means that the EJB tx recovery process does not know which EJB nodes to communicate with. This issue is under investigation and a solution is being developed.
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug

External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
JBoss Issue Tracker | AS7-6029 | Major | Closed | Recovery not fully triggered when distributed transaction falls down at prepare phase of 2PC | 2018-11-05 05:17 EST
JBoss Issue Tracker | JBEAP-3314 | Critical | Verified | Fix transaction recovery failures involving remote EJB resource | 2018-11-05 05:17 EST

Description Jaikiran Pai 2013-04-16 11:25:31 EDT
The QA team has certain test cases (in an internal git repo) which exposed that remote EJB transaction recovery was not functioning:

https://issues.jboss.org/browse/AS7-6029
https://issues.jboss.org/browse/AS7-6030

Investigating these failures led to enhancements in the EJBCLIENT project and a bug fix in the Narayana project. With these fixes/patches, the tests are now passing. The Narayana project is going to be released tomorrow (Tom is waiting for one other fix to be completed before doing the release, which he expects to be done by tomorrow). Once that bug fix is released, I'll send a PR to the EAP repo to bring in the new versions of the Narayana and EJB client projects and do the necessary upgrades to integrate them.
Comment 1 JBoss JIRA Server 2013-04-19 07:23:35 EDT
jaikiran pai <jpai@redhat.com> made a comment on jira AS7-6029

Pull request sent
Comment 3 Jaikiran Pai 2013-05-06 05:46:15 EDT
FYI - I believe this needs to be tested against ER7 instead of ER6 since there was a change that was required in EJB client project (as well as server side from what I remember) to fix one of the recovery tests.
Comment 4 Ondrej Chaloupka 2013-05-13 12:11:18 EDT
Hi,

I've retested the issue on ER8 and the test is still failing. The problematic test is commitHaltRevServer, where a commit is expected after recovery but a rollback happens instead.

I'm using your test fix.

I made several changes to be sure, but I'm still getting the same failure.

It should be possible to reproduce it along these lines:
git clone git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-transactions.git
wget http://download.devel.redhat.com/devel/candidates/JBEAP/JBEAP-6.1.0-ER8/jboss-eap-6.1.0.ER8.zip
unzip jboss-eap-6.1.0.ER8.zip
export JBOSS_HOME=$PWD/jboss-eap-6.1
cd eap-tests-transactions/integration/jbossts
mvn clean verify -Djboss.dist=$JBOSS_HOME -Dtest=TxPropagationCrashRecoveryTestCase#commitHaltRevServer -Djbossts.noJTS 

Would you be so kind as to check this?
Comment 5 Jaikiran Pai 2013-05-14 00:13:20 EDT
Ondra, have you applied the patch(es) to the test setup that I sent across the other day?
Comment 6 Ondrej Chaloupka 2013-05-14 04:14:49 EDT
Yep, I've applied the patch. It adds a callDoNothing call.
You can check it in the git repo:
http://git.app.eng.bos.redhat.com/?p=jbossqe/eap-tests-transactions.git;a=blob;f=integration/jbossts/src/test/java/org/jboss/as/test/jbossts/crashrec/txpropagation/TxPropagationCrashRecoveryTestBase.java;h=0ba2e8392b9d54df9e24742c2ebaab7d79f14f82;hb=911910ff12a9f30a5fdef562427dda447f2b6886#l458

There is a WARN message which seems to be related to the issue:
14:52:03,960 WARN  [com.arjuna.ats.jta] (EJB default - 4) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff7f000001:-72f4ed54:5190e173:11, node_name=1, branch_uid=0:ffff7f000001:27821ad6:5190e17a:1a, subordinatenodename=2, eis_name=java:/JmsXA >

And the "client" server is still showing the exception message.
14:50:43,137 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016036: commit on < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff7f000001:-72f4ed54:5190e173:11, node_name=1, branch_uid=0:ffff7f000001:-72f4ed54:5190e173:1e, subordinatenodename=null, eis_name=unknown eis name > (RecoveryOnlySerializedEJBXAResource{ejbReceiverNodeName='jbossts2'}) failed with exception $XAException.XA_RETRY: javax.transaction.xa.XAException
 at org.jboss.ejb.client.RecoveryOnlySerializedEJBXAResource.commit(RecoveryOnlySerializedEJBXAResource.java:51)
 at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelCommit(XAResourceRecord.java:451) [jbossjts-jacorb-4.17.4.Final-redhat-2.jar:4.17.4.Final-redhat-2]
 at com.arjuna.ats.arjuna.coordinator.BasicAction.doCommit(BasicAction.java:2732) [jbossjts-jacorb-4.17.4.Final-redhat-2.jar:4.17.4.Final-redhat-2]
 at com.arjuna.ats.arjuna.coordinator.BasicAction.doCommit(BasicAction.java:2648) [jbossjts-jacorb-4.17.4.Final-redhat-2.jar:4.17.4.Final-redhat-2]
 at com.arjuna.ats.arjuna.coordinator.BasicAction.phase2Commit(BasicAction.java:1813) [jbossjts-jacorb-4.17.4.Final-redhat-2.jar:4.17.4.Final-redhat-2]
 at com.arjuna.ats.arjuna.recovery.RecoverAtomicAction.replayPhase2(RecoverAtomicAction.java:71) [jbossjts-jacorb-4.17.4.Final-redhat-2.jar:4.17.4.Final-redhat-2]
 at com.arjuna.ats.internal.arjuna.recovery.AtomicActionRecoveryModule.doRecoverTransaction(AtomicActionRecoveryModule.java:152) [jbossjts-jacorb-4.17.4.Final-redhat-2.jar:4.17.4.Final-redhat-2]
 at com.arjuna.ats.internal.arjuna.recovery.AtomicActionRecoveryModule.processTransactionsStatus(AtomicActionRecoveryModule.java:251) [jbossjts-jacorb-4.17.4.Final-redhat-2.jar:4.17.4.Final-redhat-2]
 at com.arjuna.ats.internal.arjuna.recovery.AtomicActionRecoveryModule.periodicWorkSecondPass(AtomicActionRecoveryModule.java:109) [jbossjts-jacorb-4.17.4.Final-redhat-2.jar:4.17.4.Final-redhat-2]
 at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doWorkInternal(PeriodicRecovery.java:789) [jbossjts-jacorb-4.17.4.Final-redhat-2.jar:4.17.4.Final-redhat-2]
 at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.run(PeriodicRecovery.java:371) [jbossjts-jacorb-4.17.4.Final-redhat-2.jar:4.17.4.Final-redhat-2]


But...
When I moved the callDoNothing call to immediately after the server reboot (line 481), the recovery started to work. But there is still the warning message on the "client" server.
I just wonder whether this is how it should work. Without an immediate remote call, the recovery fails...?
Comment 7 Ondrej Chaloupka 2013-05-14 04:36:08 EDT
Reaction from Jaikiran (see jira WFLY-88):
That is intentional. Notice the XA_RETRY. The EJB client resource recovery throws this XA_RETRY exception if it doesn't yet have any connected servers to communicate to. This lets the Recovery Manager service know that the recovery of this XAResource has to be retried. Once the client->server communication is established (for example, via the callDoNothing()) a subsequent recovery attempt will pass.
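
The XA_RETRY behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions: RecoveryOnlyResourceSketch, XaRetryDemo and the connectedNodes set are hypothetical names, not the real org.jboss.ejb.client.RecoveryOnlySerializedEJBXAResource; only javax.transaction.xa.XAException and its XA_RETRY code are taken from the standard API.

```java
import javax.transaction.xa.XAException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a recovery-only EJB resource (not the real class).
class RecoveryOnlyResourceSketch {
    private final String receiverNodeName;
    private final Set<String> connectedNodes; // nodes with a live connection

    RecoveryOnlyResourceSketch(String receiverNodeName, Set<String> connectedNodes) {
        this.receiverNodeName = receiverNodeName;
        this.connectedNodes = connectedNodes;
    }

    // Commit during recovery: signal "retry later" if the node is not connected yet.
    void commit() throws XAException {
        if (!connectedNodes.contains(receiverNodeName)) {
            throw new XAException(XAException.XA_RETRY);
        }
        // A real resource would forward the commit to the remote node here.
    }
}

class XaRetryDemo {
    // Returns true if the commit went through, false if it has to be retried.
    static boolean tryCommit(RecoveryOnlyResourceSketch resource) {
        try {
            resource.commit();
            return true;
        } catch (XAException e) {
            if (e.errorCode == XAException.XA_RETRY) {
                return false; // the recovery pass will try again later
            }
            throw new IllegalStateException("unexpected XA error " + e.errorCode, e);
        }
    }

    public static void main(String[] args) {
        Set<String> connected = ConcurrentHashMap.newKeySet();
        RecoveryOnlyResourceSketch resource =
                new RecoveryOnlyResourceSketch("jbossts2", connected);
        System.out.println("first pass committed: " + tryCommit(resource));
        connected.add("jbossts2"); // e.g. after a remote call re-established the connection
        System.out.println("second pass committed: " + tryCommit(resource));
    }
}
```

The first pass reports a retry because no node is connected; once the node name is registered, the second pass commits.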

Then I assume that this issue is verified.
Comment 8 Jaikiran Pai 2013-05-14 04:51:23 EDT
>> Then I assume that this issue is verified.

Ondra, give me a few more minutes. Although that WARN is fine, what you currently had in the testcase should have worked. You shouldn't have had to move that callDoNothing() call from where it was earlier. I am taking a look at why it was failing in that scenario.
Comment 9 Jaikiran Pai 2013-05-14 08:44:28 EDT
The test is failing after a fix we did for the Xid decoding issue https://github.com/jbossas/jboss-ejb-client/commit/6401f45e45a36a4c9e19a755d0da281d15107ba1. The fix looks right to me so I'll have to dig into why this fails the way it does with HornetQ XA resource handling no longer identifying its XAResource. I think I know why moving the callDoNothing "solves" this, but that's a workaround. 

So to summarize, we have a specific case where the recovery can potentially fail. I'll update this bugzilla once I understand what's going on.
Comment 10 Ondrej Chaloupka 2013-05-15 10:29:15 EDT
I'm moving this to MODIFIED because Jaikiran will be checking it.

From a QA point of view this is not a blocker for the release. The functionality for crash recovery over EJB remoting was added and there is just one specific case which fails.
We expect that customers will keep using JTS, as they have so far, which works fine.
Comment 11 Ondrej Chaloupka 2013-08-16 07:47:10 EDT
Hi Jaikiran,

could you please tell me what the status of this bug is? Is it already fixed in 6.1.1?
If this issue can be verified, please change the status of the bz to ON_QA so that I can verify it.

Thank you
Ondra
Comment 14 Rostislav Svoboda 2013-08-28 12:39:09 EDT
Jaikiran, do you have results?
Comment 15 Jaikiran Pai 2013-08-28 12:56:30 EDT
Hi Rostislav,

I do have the results and one specific test is failing out of a bunch of tests. I see what's going on but I don't yet know why. Progress has been a bit slow because the tests take a long time to run, which slows down each cycle of reproducing, investigating and retrying. So far, I haven't figured out the fix or an area/project to fix this. So I don't really have any real update yet, but am looking into it.
Comment 16 Scott Mumford 2013-08-28 23:28:47 EDT
Marking for exclusion from the 6.1.1 Release Notes document as an entry for this bug could not be completed or verified in time.
Comment 17 Rostislav Svoboda 2013-08-29 09:46:11 EDT
Jaikiran, thank you for the update.
Comment 20 Dimitris Andreadis 2013-10-24 14:28:43 EDT
Assigning jpai@redhat.com EJB issues to david.lloyd@redhat.com. Please re-assign to Cheng or others as needed.
Comment 21 David M. Lloyd 2013-10-25 09:27:27 EDT
Per agreement with Ondrej, I'm marking this as "not a blocker".
Comment 22 Ondrej Chaloupka 2013-10-25 09:36:14 EDT
Agreed that this is not a blocker. Customers will use JTS transactions for distributed cases, as that is the supported way.

Distributed JTA transactions are not ready until this has been validated and fixed.
Comment 23 mark yarborough 2013-10-25 09:48:10 EDT
Triage: QE and Dev agree not a blocker for 6.2.
Comment 24 Ondrej Chaloupka 2013-11-01 09:08:59 EDT
Hi David,

I've checked the current state of the issue (it has been a while since I last looked at it) and I can say that there is still a problem with waking up the EJB remote connection when the remote server (the server which is called from the client server via an outbound connection) crashes and then comes up again. The client server (which started the tx) then does not know that the remote server is up and that the recovery can be done.

This happens only for distributed JTA transactions. JTS transactions manage the distributed communication between nodes themselves, and the recovery starts without problems.

The workaround for the recovery is to call a remote method from the client server to the remote server after the remote server comes back to life. Then the crash recovery will start.
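
A minimal sketch of that workaround, assuming a cheap no-op remote method comparable to callDoNothing() from the test case. RemoteNoOp and ReconnectSketch are hypothetical names introduced for illustration; a real client would look the bean up over the outbound connection instead of using the fake shown in main.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical view of a remote bean exposing a cheap no-op method.
interface RemoteNoOp {
    void callDoNothing() throws Exception;
}

class ReconnectSketch {
    // Invokes the no-op until it succeeds or the attempts run out.
    // Returns the number of attempts used, or -1 if the node never answered.
    static int reconnect(RemoteNoOp bean, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                bean.callDoNothing();
                return attempt; // connection is back, recovery can reach the node
            } catch (Exception e) {
                // Remote node is still down; wait before the next attempt.
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Fake remote bean: unreachable for the first two calls.
        RemoteNoOp fake = () -> {
            if (calls.incrementAndGet() < 3) {
                throw new IOException("node down");
            }
        };
        System.out.println("attempts: " + reconnect(fake, 5, 10)); // prints "attempts: 3"
    }
}
```

Running such a loop right after the remote server restarts would avoid waiting for the next business call, at the cost of extra remote traffic.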

The test scenario in which this problem occurs looks like this:
 - a transaction is started on the client server
 - the client server calls the remote server via an outbound connection (the tx context is propagated to the remote server)
 - the remote server sends a message to a queue (simulating some action done during the transaction)
 - the remote call and the bean method finish
 - the transaction starts 2PC. The prepare phase is done and the commit phase is started. The remote server crashes at the entry to the commit method
 - the client server is still alive
 - the remote server comes back to life
 - the crash recovery should proceed with the commit, as all the participants agreed on it

Here is the explanation from Jaikiran:
When a connection breaks down between the server and the client, specifically when the client goes down and comes back up again, then the server and the client will not auto communicate with each other. 
In other words, the server will have no knowledge (in EJB resource sense) that the client has come back up again. That effectively means that the EJB tx recovery process will have no clue of the EJB nodes to communicate with.
To deal with that, there should be some communication from the client (which is now up) to the server to reestablish that connection. 
In a real application, it would be the first invocation from the client to the server. 


I've checked that the call from the client server to the remote one really establishes the connection and the recovery starts.
But the next call from the client to the server could take some time, and meanwhile the transaction could be rolled back because of the timeout.

What do you think about this?
I think that the current behavior is not correct. Jaikiran and I agreed on that earlier as well, but he hasn't had time to fix it (https://bugzilla.redhat.com/show_bug.cgi?id=952746#c15).

Thanks
Ondra
Comment 25 David M. Lloyd 2014-04-10 09:50:59 EDT
Correct me if I'm wrong but the test scenario has two parts:

- The connection to the remote server does not come up during recovery
- The remote server should not need to be contacted, as that server already agreed on an outcome for the txn in question

If that is correct, then I would say the first part is an EJB client bug (but I may not be able to fix it).  The second part *may* be an Arjuna/Narayana bug but only someone from that team can say for sure - there might be a very good reason to need to talk to the partially-committed server (perhaps to report some heuristic outcome if the commit ultimately failed).
Comment 26 tom.jenkinson 2014-04-11 06:24:19 EDT
Hi David,

Narayana needs to be able to call XAResource::commit on the remote resource (i.e. server). Although it returned XA_OK out of prepare, we still need to tell it to commit.

As I understand it from the test case, Narayana has called XAR::commit but the call fails so we try it again as we don't know what happened.

i.e. (where resource2 is the remote server):

1. TransactionManager->resource1::prepare()
2. TransactionManager->resource2::prepare()
3. TransactionManager->resource1::commit()
4. TransactionManager->resource2::commit() -> resource2 throws XAException(XA_RETRY)
5. TransactionManager->client(allOK) - basically because the intention is to commit so we tell them it committed
6. time elapses
7. RecoveryManager->resource2::commit() - if OK then good, else log the error
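
The sequence above can be sketched as a toy coordinator. This is a simplified illustration, not Narayana code: Participant, FlakyParticipant and TwoPhaseSketch are hypothetical names, and Participant is a much smaller stand-in for javax.transaction.xa.XAResource. Only XAException and XA_RETRY come from the standard API.

```java
import javax.transaction.xa.XAException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, much smaller stand-in for javax.transaction.xa.XAResource.
interface Participant {
    void prepare() throws XAException;
    void commit() throws XAException;
}

// Participant whose commit fails with XA_RETRY until markReachable() is called.
class FlakyParticipant implements Participant {
    private boolean reachable;
    boolean committed;

    FlakyParticipant(boolean reachable) { this.reachable = reachable; }

    void markReachable() { reachable = true; }

    public void prepare() { /* always votes OK in this sketch */ }

    public void commit() throws XAException {
        if (!reachable) {
            throw new XAException(XAException.XA_RETRY);
        }
        committed = true;
    }
}

class TwoPhaseSketch {
    // Participants whose commit has to be replayed by recovery.
    final List<Participant> pending = new ArrayList<>();

    // Steps 1-5: prepare all, then commit all; XA_RETRY defers the commit to recovery.
    // Returns true because the decision to commit has already been made.
    boolean commitTransaction(List<? extends Participant> participants) {
        try {
            for (Participant p : participants) {
                p.prepare();
            }
        } catch (XAException e) {
            throw new IllegalStateException("prepare failed; rollback is not shown here", e);
        }
        for (Participant p : participants) {
            try {
                p.commit();
            } catch (XAException e) {
                if (e.errorCode != XAException.XA_RETRY) {
                    throw new IllegalStateException("unexpected XA error " + e.errorCode, e);
                }
                pending.add(p);
            }
        }
        return true;
    }

    // Step 7: one recovery pass; returns how many participants are still pending.
    int recoveryPass() {
        pending.removeIf(p -> {
            try {
                p.commit();
                return true;
            } catch (XAException e) {
                return false; // stays pending; a real manager would log the error
            }
        });
        return pending.size();
    }

    public static void main(String[] args) {
        FlakyParticipant resource1 = new FlakyParticipant(true);
        FlakyParticipant resource2 = new FlakyParticipant(false); // remote server is down
        TwoPhaseSketch tm = new TwoPhaseSketch();
        tm.commitTransaction(Arrays.asList(resource1, resource2));
        System.out.println("pending after commit: " + tm.pending.size());
        resource2.markReachable(); // remote server is reachable again
        System.out.println("pending after recovery: " + tm.recoveryPass());
    }
}
```

In this sketch the client is told the transaction committed even though resource2 is still pending, which mirrors step 5; the open question in this bug is what makes resource2 reachable again for step 7.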

Let me know if I misunderstood when the crash happens.

Tom
Comment 28 tom.jenkinson 2014-09-02 06:11:22 EDT
Hi,

Is there any update on this issue? It does not appear to be a transaction manager issue so I removed the TM component. It sounds like quite a large issue in the automated transaction recovery protocol for EJB remoting though so if it is still an issue on EAP 6.4 I would increase the severity (if we do support that transport for distributed transactions).

Tom
Comment 32 dstephan 2015-12-08 18:13:38 EST
Hi,

This still seems to be an issue on EAP 6.4. Still getting XA_RETRY when RecoveryOnlySerializedEJBXAResource.commit is called on periodic recovery after all server instances are back up.

Dave