Bug 952746 - [GSS](6.4.z) Fix transaction recovery failures involving remote EJB resource
Summary: [GSS](6.4.z) Fix transaction recovery failures involving remote EJB resource
Keywords:
Status: CLOSED EOL
Alias: None
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: EJB
Version: 6.1.0
Hardware: All
OS: All
Priority: unspecified
Severity: high
Target Milestone: ER7
Target Release: EAP 6.4.0
Assignee: jboss-set
QA Contact: Ondrej Chaloupka
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2013-04-16 15:25 UTC by Jaikiran Pai
Modified: 2019-08-19 12:47 UTC (History)

Fixed In Version:
Doc Type: Known Issue
Doc Text:
In this release of JBoss EAP 6, transaction recovery operations can fail if they involve remote EJB resources that may have crashed. The issue occurs because, when a connection between the server and the client breaks down (specifically when the client crashes and is restarted), the server and the client do not automatically re-establish communication. In these scenarios, the server has no knowledge that the client has started again, so the EJB tx recovery process does not know which EJB nodes to communicate with. This issue is under investigation and a solution is being developed.
Clone Of:
Environment:
Last Closed: 2019-08-19 12:47:25 UTC
Type: Bug


Links
System ID Priority Status Summary Last Updated
JBoss Issue Tracker AS7-6029 Major Closed Recovery not fully triggered when distributed transaction falls down at prepare phase of 2PC 2019-03-08 14:59:21 UTC
JBoss Issue Tracker JBEAP-3314 Critical Verified Fix transaction recovery failures involving remote EJB resource 2019-03-08 14:59:21 UTC

Description Jaikiran Pai 2013-04-16 15:25:31 UTC
The QA team has certain testcases (in an internal git repo) which exposed that remote EJB transaction recovery was not working:

https://issues.jboss.org/browse/AS7-6029
https://issues.jboss.org/browse/AS7-6030

Investigating these failures led to enhancements in the EJBCLIENT project and a bug fix in the Narayana project. With these fixes/patches, the tests are now passing. The Narayana project is going to be released tomorrow (Tom is waiting for one other fix to be completed before doing the release, which he expects to be done by tomorrow). Once that bug fix is released, I'll send a PR to the EAP repo to bring in the new versions of the Narayana and EJB client projects and do the necessary upgrades to integrate them.

Comment 1 JBoss JIRA Server 2013-04-19 11:23:35 UTC
jaikiran pai <jpai@redhat.com> made a comment on jira AS7-6029

Pull request sent

Comment 3 Jaikiran Pai 2013-05-06 09:46:15 UTC
FYI - I believe this needs to be tested against ER7 instead of ER6 since there was a change that was required in EJB client project (as well as server side from what I remember) to fix one of the recovery tests.

Comment 4 Ondrej Chaloupka 2013-05-13 16:11:18 UTC
Hi,

I've retested the issue on ER8 and the test is still failing. The problematic test is commitHaltRevServer, where a commit is expected after recovery but a rollback occurs.

I'm using your test fix.

I made several changes to be sure, but I'm still getting the same failure.

It should be possible to reproduce this with something like:
git clone git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-transactions.git
wget http://download.devel.redhat.com/devel/candidates/JBEAP/JBEAP-6.1.0-ER8/jboss-eap-6.1.0.ER8.zip
unzip jboss-eap-6.1.0.ER8.zip
export JBOSS_HOME=$PWD/jboss-eap-6.1
cd eap-tests-transactions/integration/jbossts
mvn clean verify -Djboss.dist=$JBOSS_HOME -Dtest=TxPropagationCrashRecoveryTestCase#commitHaltRevServer -Djbossts.noJTS 

Would you be so kind as to check this?

Comment 5 Jaikiran Pai 2013-05-14 04:13:20 UTC
Ondra, have you applied the patch(es) to the test setup that I sent across the other day?

Comment 6 Ondrej Chaloupka 2013-05-14 08:14:49 UTC
Yep, I've applied the patch. It adds a callDoNothing call.
You can check it in the git repo:
http://git.app.eng.bos.redhat.com/?p=jbossqe/eap-tests-transactions.git;a=blob;f=integration/jbossts/src/test/java/org/jboss/as/test/jbossts/crashrec/txpropagation/TxPropagationCrashRecoveryTestBase.java;h=0ba2e8392b9d54df9e24742c2ebaab7d79f14f82;hb=911910ff12a9f30a5fdef562427dda447f2b6886#l458

There is a warn message which seems to be related to the issue. 
14:52:03,960 WARN  [com.arjuna.ats.jta] (EJB default - 4) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff7f000001:-72f4ed54:5190e173:11, node_name=1, branch_uid=0:ffff7f000001:27821ad6:5190e17a:1a, subordinatenodename=2, eis_name=java:/JmsXA >

And the "client" server is still showing the exception message.
14:50:43,137 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016036: commit on < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff7f000001:-72f4ed54:5190e173:11, node_name=1, branch_uid=0:ffff7f000001:-72f4ed54:5190e173:1e, subordinatenodename=null, eis_name=unknown eis name > (RecoveryOnlySerializedEJBXAResource{ejbReceiverNodeName='jbossts2'}) failed with exception $XAException.XA_RETRY: javax.transaction.xa.XAException
    at org.jboss.ejb.client.RecoveryOnlySerializedEJBXAResource.commit(RecoveryOnlySerializedEJBXAResource.java:51)
    at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelCommit(XAResourceRecord.java:451) [jbossjts-jacorb-4.17.4.Final-redhat-2.jar:4.17.4.Final-redhat-2]
    at com.arjuna.ats.arjuna.coordinator.BasicAction.doCommit(BasicAction.java:2732) [jbossjts-jacorb-4.17.4.Final-redhat-2.jar:4.17.4.Final-redhat-2]
    at com.arjuna.ats.arjuna.coordinator.BasicAction.doCommit(BasicAction.java:2648) [jbossjts-jacorb-4.17.4.Final-redhat-2.jar:4.17.4.Final-redhat-2]
    at com.arjuna.ats.arjuna.coordinator.BasicAction.phase2Commit(BasicAction.java:1813) [jbossjts-jacorb-4.17.4.Final-redhat-2.jar:4.17.4.Final-redhat-2]
    at com.arjuna.ats.arjuna.recovery.RecoverAtomicAction.replayPhase2(RecoverAtomicAction.java:71) [jbossjts-jacorb-4.17.4.Final-redhat-2.jar:4.17.4.Final-redhat-2]
    at com.arjuna.ats.internal.arjuna.recovery.AtomicActionRecoveryModule.doRecoverTransaction(AtomicActionRecoveryModule.java:152) [jbossjts-jacorb-4.17.4.Final-redhat-2.jar:4.17.4.Final-redhat-2]
    at com.arjuna.ats.internal.arjuna.recovery.AtomicActionRecoveryModule.processTransactionsStatus(AtomicActionRecoveryModule.java:251) [jbossjts-jacorb-4.17.4.Final-redhat-2.jar:4.17.4.Final-redhat-2]
    at com.arjuna.ats.internal.arjuna.recovery.AtomicActionRecoveryModule.periodicWorkSecondPass(AtomicActionRecoveryModule.java:109) [jbossjts-jacorb-4.17.4.Final-redhat-2.jar:4.17.4.Final-redhat-2]
    at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doWorkInternal(PeriodicRecovery.java:789) [jbossjts-jacorb-4.17.4.Final-redhat-2.jar:4.17.4.Final-redhat-2]
    at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.run(PeriodicRecovery.java:371) [jbossjts-jacorb-4.17.4.Final-redhat-2.jar:4.17.4.Final-redhat-2]


But...
When I put the callDoNothing call immediately after the server reboot (line 481), the recovery started to work. But there is still the warning message on the "client" server.
I just wonder whether this is how it should work. Without an immediate remote call the recovery fails...?

Comment 7 Ondrej Chaloupka 2013-05-14 08:36:08 UTC
Reaction from Jaikiran (see jira WFLY-88):
That is intentional. Notice the XA_RETRY. The EJB client resource recovery throws this XA_RETRY exception if it doesn't yet have any connected servers to communicate to. This lets the Recovery Manager service know that the recovery of this XAResource has to be retried. Once the client->server communication is established (for example, via the callDoNothing()) a subsequent recovery attempt will pass.
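
To illustrate the XA_RETRY behaviour described above, here is a minimal sketch. It is not the actual RecoveryOnlySerializedEJBXAResource code: the class XaRetrySketch, its connectedNodes set and the nodeConnected() hook are hypothetical names used only for illustration. It assumes the resource knows only the target node name and can commit once a connection to that node exists.

```java
import javax.transaction.xa.XAException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class XaRetrySketch {
    // Hypothetical stand-in for the nodes the EJB client currently has a live connection to.
    private final Set<String> connectedNodes = ConcurrentHashMap.newKeySet();
    private final String targetNode;

    public XaRetrySketch(String targetNode) {
        this.targetNode = targetNode;
    }

    /** Called when the client (re)connects to a node, e.g. after a first remote invocation. */
    public void nodeConnected(String nodeName) {
        connectedNodes.add(nodeName);
    }

    /** Recovery-time commit: asks to be retried while no connection to the target node exists. */
    public void commit() throws XAException {
        if (!connectedNodes.contains(targetNode)) {
            throw new XAException(XAException.XA_RETRY);
        }
        // A real resource would send the commit to the remote node here.
    }

    /** One recovery attempt: true if the commit went through, false if it must be retried later. */
    public static boolean recoveryPass(XaRetrySketch resource) {
        try {
            resource.commit();
            return true;
        } catch (XAException e) {
            if (e.errorCode == XAException.XA_RETRY) {
                return false; // keep the record and try again on the next periodic pass
            }
            throw new IllegalStateException("unexpected XA error " + e.errorCode, e);
        }
    }
}
```

With this sketch, a recovery pass made before nodeConnected("jbossts2") reports "retry", and a pass made after it succeeds, which matches the effect of calling callDoNothing() before the next recovery attempt.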

Then I assume that this issue is verified.

Comment 8 Jaikiran Pai 2013-05-14 08:51:23 UTC
>> Then I assume that this issue is verified.

Ondra, give me a few more minutes. Although, that WARN is fine, what you currently had in the testcase should have worked. You shouldn't have had to move that callDoNothing() call from where it was earlier. I am taking a look as to why it was failing in that scenario.

Comment 9 Jaikiran Pai 2013-05-14 12:44:28 UTC
The test is failing after a fix we did for the Xid decoding issue https://github.com/jbossas/jboss-ejb-client/commit/6401f45e45a36a4c9e19a755d0da281d15107ba1. The fix looks right to me so I'll have to dig into why this fails the way it does with HornetQ XA resource handling no longer identifying its XAResource. I think I know why moving the callDoNothing "solves" this, but that's a workaround. 

So to summarize, we have a specific case where the recovery can potentially fail. I'll update this bugzilla once I understand what's going on.

Comment 10 Ondrej Chaloupka 2013-05-15 14:29:15 UTC
I'm moving this to MODIFIED because Jaikiran will be checking it.

From the QA point of view this is not a blocker for the release. The functionality for crash recovery over EJB remoting was added and there is just one specific case which fails.
Customers are expected to keep using JTS, as they have so far, which works fine.

Comment 11 Ondrej Chaloupka 2013-08-16 11:47:10 UTC
Hi Jaikiran,

could you please tell me what the status of this bug is? Is it already fixed in 6.1.1?
If this issue can be verified, please change the status of the bz to ON_QA so that I can verify it.

Thank you
Ondra

Comment 14 Rostislav Svoboda 2013-08-28 16:39:09 UTC
Jaikiran, do you have results ?

Comment 15 Jaikiran Pai 2013-08-28 16:56:30 UTC
Hi Rostislav,

I do have the results and one specific test is failing out of a bunch of tests. I see what's going on but I don't yet know why. Progress has been a bit slow because the tests are long running to reproduce, investigate and retry again. So far, I haven't figured out the fix or an area/project to fix this. So I don't really have any real update yet, but am looking into it.

Comment 16 Scott Mumford 2013-08-29 03:28:47 UTC
Marking for exclusion from the 6.1.1 Release Notes document as an entry for this bug could not be completed or verified in time.

Comment 17 Rostislav Svoboda 2013-08-29 13:46:11 UTC
Jaikiran, thank you for update.

Comment 20 Dimitris Andreadis 2013-10-24 18:28:43 UTC
Assigning jpai@redhat.com EJB issues to david.lloyd@redhat.com. Please re-assign to Cheng or others as needed.

Comment 21 David M. Lloyd 2013-10-25 13:27:27 UTC
Per agreement with Ondrej, I'm marking this as "not a blocker".

Comment 22 Ondrej Chaloupka 2013-10-25 13:36:14 UTC
Agreed that this is not a blocker. Customers will use JTS transactions for distributed cases, as that is the supported way.

Distributed JTA transactions are not ready until this is validated and fixed.

Comment 23 mark yarborough 2013-10-25 13:48:10 UTC
Triage: QE and Dev agree not a blocker for 6.2.

Comment 24 Ondrej Chaloupka 2013-11-01 13:08:59 UTC
Hi David,

I've checked the current state of the issue (it has been a while since I last looked at it), and there is still a problem with re-establishing the EJB remote connection when the remote server (the server called from the client server via an outbound connection) crashes and then comes back up. The client server (which started the tx) does not know that the remote server is up again and that recovery can proceed.

This happens only for distributed JTA transactions. JTS transactions manage the distributed communication between nodes themselves, and recovery starts without problems.

The workaround for the recovery is to call a remote method from the client server to the remote server after the remote server comes back to life. Then the crash recovery will start.
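
One way an application could automate this workaround is sketched below. This is only an illustration, not something shipped in EAP or jboss-ejb-client: ReconnectPingSketch is a hypothetical class, and remoteNoOp is a hypothetical stand-in for any cheap remote EJB call (such as callDoNothing() in the test) whose only purpose is to re-establish the client -> server connection.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ReconnectPingSketch implements AutoCloseable {
    // Single daemon thread so the pinger never keeps the JVM alive on its own.
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "reconnect-ping");
                t.setDaemon(true);
                return t;
            });

    // Hypothetical no-op remote invocation supplied by the application.
    private final Runnable remoteNoOp;

    public ReconnectPingSketch(Runnable remoteNoOp) {
        this.remoteNoOp = remoteNoOp;
    }

    /** Tries the no-op call once; false means the remote server is still unreachable. */
    public boolean pingOnce() {
        try {
            remoteNoOp.run();
            return true;
        } catch (RuntimeException e) {
            return false; // remote side still down; the next scheduled ping tries again
        }
    }

    /** Repeats the ping so the connection is back before the next recovery pass. */
    public void start(long periodSeconds) {
        scheduler.scheduleWithFixedDelay(this::pingOnce, 0, periodSeconds, TimeUnit.SECONDS);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

The ping period would have to be shorter than the transaction timeout for this to help; it does not remove the underlying problem that recovery depends on an unrelated invocation.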

The test scenario in which this problem occurs looks like this:
 - a transaction is started on the client server
 - the client server calls the remote server via an outbound connection (the tx context is propagated to the remote server)
 - the remote server sends a message to a queue (simulating some work done during the transaction)
 - the remote call and the bean method finish
 - the transaction starts 2PC. The prepare phase completes and the commit phase starts. The remote server crashes on entry to the commit method
 - the client server is still alive
 - the remote server comes back to life
 - crash recovery should complete the commit, as all the participants agreed on it

I would put here the explanation from Jaikiran:
When a connection breaks down between the server and the client, specifically when the client goes down and comes back up again, then the server and the client will not auto communicate with each other. 
In other words, the server will have no knowledge (in EJB resource sense) that the client has come back up again. That effectively means that the EJB tx recovery process will have no clue of the EJB nodes to communicate with.
To deal with that, there should be some communication from the client (which is now up) to the server to reestablish that connection. 
In a real application, it would be the first invocation from the client to the server. 


I've checked that a call from the client server to the remote one really does establish the connection, and recovery then starts.
But the next call from the client to the server could take some time, and meanwhile the transaction could be rolled back because of the timeout.

What do you think about this?
I think that the current behavior is not correct. Jaikiran and I agreed on that before as well, but he hasn't had time to fix it (https://bugzilla.redhat.com/show_bug.cgi?id=952746#c15).

Thanks
Ondra

Comment 25 David M. Lloyd 2014-04-10 13:50:59 UTC
Correct me if I'm wrong but the test scenario has two parts:

- The connection to the remote server does not come up during recovery
- The remote server should not need to be contacted, as that server already agreed on an outcome for the txn in question

If that is correct, then I would say, the first part is an EJB client bug (but I may not be able to fix it).  The second part *may* be an Arjuna/Narayana bug but only someone from that team can say for sure - there might be a very good reason to need to talk to the partially-committed server (maybe to report some heuristic outcome if the commit ultimately failed perhaps).

Comment 26 tom.jenkinson 2014-04-11 10:24:19 UTC
Hi David,

Narayana needs to be able to call XAResource::commit on the remote resource (i.e. server). Although it returned XA_OK out of prepare, we still need to tell it to commit.

As I understand it from the test case, Narayana has called XAR::commit but the call fails so we try it again as we don't know what happened.

i.e. (where resource2 is the remote server):

1. TransactionManager->resource1::prepare()
2. TransactionManager->resource2::prepare()
3. TransactionManager->resource1::commit()
4. TransactionManager->resource2::commit() -> resource2 throws XAException(XA_RETRY)
5. TransactionManager->client(allOK) - basically because the intention is to commit so we tell them it committed
6. time elapses
7. RecoverManager->resource2::commit() - if OK then good, else log the error
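
The sequence above can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not Narayana code: TwoPhaseRetrySketch, Participant, commitTransaction() and recoveryPass() are hypothetical names, a real coordinator would work with javax.transaction.xa.XAResource and Xids, and rollback after a failed prepare is omitted.

```java
import javax.transaction.xa.XAException;
import java.util.ArrayList;
import java.util.List;

public class TwoPhaseRetrySketch {
    /** Hypothetical minimal participant; stands in for an XAResource. */
    public interface Participant {
        void prepare() throws XAException;
        void commit() throws XAException;
    }

    // Participants that voted to commit but have not yet confirmed the commit.
    private final List<Participant> pendingCommit = new ArrayList<>();

    /** Steps 1-5: prepare everyone, then commit; XA_RETRY is deferred to recovery. */
    public boolean commitTransaction(List<Participant> participants) {
        for (Participant p : participants) {
            try {
                p.prepare();
            } catch (XAException e) {
                return false; // a real coordinator would roll back here (omitted)
            }
        }
        for (Participant p : participants) {
            try {
                p.commit();
            } catch (XAException e) {
                if (e.errorCode != XAException.XA_RETRY) {
                    throw new IllegalStateException("unexpected XA error " + e.errorCode, e);
                }
                pendingCommit.add(p); // outcome unknown, so it is retried later
            }
        }
        return true; // the decision was commit, so the caller is told it committed
    }

    /** Step 7: periodic recovery retries the commit; returns how many are still pending. */
    public int recoveryPass() {
        pendingCommit.removeIf(p -> {
            try {
                p.commit();
                return true;
            } catch (XAException e) {
                return false; // still failing; keep it for the next pass
            }
        });
        return pendingCommit.size();
    }
}
```

In this sketch resource2 stays in the pending list for as long as its commit() keeps throwing, which corresponds to the repeated XA_RETRY warnings seen on the "client" server.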

Let me know if I misunderstood when the crash happens.

Tom

Comment 28 tom.jenkinson 2014-09-02 10:11:22 UTC
Hi,

Is there any update on this issue? It does not appear to be a transaction manager issue so I removed the TM component. It sounds like quite a large issue in the automated transaction recovery protocol for EJB remoting though so if it is still an issue on EAP 6.4 I would increase the severity (if we do support that transport for distributed transactions).

Tom

Comment 32 dstephan 2015-12-08 23:13:38 UTC
Hi,

This still seems to be an issue on EAP 6.4. Still getting XA_RETRY when RecoveryOnlySerializedEJBXAResource.commit is called on periodic recovery after all server instances are back up.

Dave