Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 901132 (JBPAPP6-1183)

Summary: Transaction manager is not accepting uniqueID on its configuration, causing issues with recovery in clustering
Product: [JBoss] JBoss Enterprise Application Platform 6
Reporter: Miroslav Novak <mnovak>
Component: Transaction Manager
Assignee: Clebert Suconic <clebert.suconic>
Status: CLOSED NOTABUG
QA Contact:
Severity: urgent
Docs Contact:
Priority: urgent
Version: 6.0.0
CC: clebert.suconic, fnasser, mnovak, ochaloup, tom.jenkinson
Target Milestone: ---
Target Release: EAP 6.0.1
Hardware: Unspecified
OS: Unspecified
URL: http://jira.jboss.org/jira/browse/JBPAPP6-1183
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-11-08 15:23:51 UTC
Type: Bug
Attachments:
logs.zip
reproducer.zip
SoakMdbWithRemoteOutQueueToContaniner1.java
client-project.zip
mdb1.jar
mdbToRemoteQueue-maven-project.zip
reproducer2.zip
server1consoleAfterServer2Killed.txt

Description Miroslav Novak 2012-10-30 16:22:14 UTC
project_key: JBPAPP6

Test replication scenario:

1. Start two EAP 6 servers - mdb and jms server. mdb server is configured to connect through remote JCA to jms server.
2. Deploy "InQueue" and "OutQueue" to jms server
3. Send 500 messages to "InQueue"
4. Deploy MDB to mdb server. MDB reads messages from InQueue and sends them to OutQueue.
5. During processing of messages kill mdb server and restart it.
6. Wait until MDB processes all the messages 
7. Receive messages from "OutQueue"
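The MDB in steps 4-6 can be sketched roughly as follows. This is only an illustration of the pattern, not the actual MDB (which is in the attached mdb1.jar); the class name and lookup strings other than "java:/JmsXA" and the queue names are illustrative:

{code}
import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.naming.InitialContext;

// Sketch: reads from InQueue (delivered by the RA from the remote jms server)
// and forwards each message to OutQueue inside the same container-managed
// XA transaction.
@MessageDriven(activationConfig = {
    @ActivationConfigProperty(propertyName = "destinationType", propertyValue = "javax.jms.Queue"),
    @ActivationConfigProperty(propertyName = "destination", propertyValue = "jms/queue/InQueue")
})
public class ForwardingMdb implements MessageListener {

    @Override
    public void onMessage(Message message) {
        Connection connection = null;
        try {
            InitialContext ctx = new InitialContext();
            // Pooled XA connection factory from the local JNDI tree,
            // configured to connect to the remote jms server.
            ConnectionFactory cf = (ConnectionFactory) ctx.lookup("java:/JmsXA");
            Queue outQueue = (Queue) ctx.lookup("jms/queue/OutQueue");
            connection = cf.createConnection();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(outQueue);
            // The send is enlisted in the XA transaction that also covers the delivery.
            producer.send(message);
        } catch (Exception e) {
            throw new RuntimeException("Forwarding failed, transaction will roll back", e);
        } finally {
            try { if (connection != null) connection.close(); } catch (Exception ignored) { }
        }
    }
}
{code}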

Result:
All messages are received. There are no lost or duplicated messages, but there are unfinished arjuna transactions on the mdb server which are periodically (every 2 min) written to the log:
{code}
17:04:28,713 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016037: Could not find new XAResource to use for recovering non-serializable XAResource XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47d, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:480, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@2cec4462 >
17:04:28,714 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47d, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:480, subordinatenodename=null, eis_name=unknown eis name >
17:04:28,773 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016037: Could not find new XAResource to use for recovering non-serializable XAResource XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47c, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47f, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@4b7d94f8 >
17:04:28,774 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47c, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47f, subordinatenodename=null, eis_name=unknown eis name >
17:06:38,829 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016037: Could not find new XAResource to use for recovering non-serializable XAResource XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47d, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:480, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@29f9cb2c >
17:06:38,830 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47d, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:480, subordinatenodename=null, eis_name=unknown eis name >
17:06:38,881 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016037: Could not find new XAResource to use for recovering non-serializable XAResource XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47c, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47f, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@456a5c7c >
17:06:38,881 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47c, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47f, subordinatenodename=null, eis_name=unknown eis name >
17:08:48,928 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016037: Could not find new XAResource to use for recovering non-serializable XAResource XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47d, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:480, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@adea215 >
17:08:48,929 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47d, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:480, subordinatenodename=null, eis_name=unknown eis name >
17:08:48,988 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016037: Could not find new XAResource to use for recovering non-serializable XAResource XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47c, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47f, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@1e631646 >
17:08:48,988 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47c, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47f, subordinatenodename=null, eis_name=unknown eis name >
{code}

Messages on the jms server are paged to disk. max-size-bytes and page-size-bytes are set to very low values.
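For reference, paging of that kind is driven by an address-settings entry in the messaging subsystem; a sketch with illustrative values (the reproducer's actual values may differ):

{code}
<address-setting match="#">
    <max-size-bytes>10240</max-size-bytes>
    <page-size-bytes>1024</page-size-bytes>
    <address-full-policy>PAGE</address-full-policy>
</address-setting>
{code}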

Attached reproducer.zip:
1. Download and unzip reproducer.zip
2. Prepare servers - "sh prepare.sh"
3. Start jms server - "sh start-server1.sh localhost"
4. Start producer and send 500 messages to InQueue - "sh start-producer.sh localhost 500"
5. Start mdb server - "sh start-server2.sh <some_other_ip>"
6. Kill (kill -9 ...) mdb server and start again by "sh start-server2.sh <some_other_ip>"
7. Receive all messages from OutQueue and check warnings in the mdb server's log

Comment 1 Miroslav Novak 2012-10-30 16:22:40 UTC
Attachment: Added: logs.zip
Attachment: Added: reproducer.zip


Comment 2 Miroslav Novak 2012-10-30 16:29:25 UTC
Link: Added: This issue is related to JBPAPP-10067


Comment 3 Miroslav Novak 2012-10-30 16:31:18 UTC
Attached SoakMdbWithRemoteOutQueueToContaniner1.java

Comment 4 Miroslav Novak 2012-10-30 16:31:18 UTC
Attachment: Added: SoakMdbWithRemoteOutQueueToContaniner1.java


Comment 5 Clebert Suconic 2012-10-30 17:40:20 UTC
Where is the source for producer and consumer?

Comment 6 Clebert Suconic 2012-10-30 17:47:26 UTC
I'm probably going to reject this issue.. the testcase is wrong:


You are looking up the remote ConnectionFactory here:

{code}
final Properties env = new Properties();
env.put(Context.INITIAL_CONTEXT_FACTORY, "org.jboss.naming.remote.client.InitialContextFactory");
env.put(Context.PROVIDER_URL, "remote://" + hostname + ":4447");
ctxRemote = new InitialContext(env);
queue = (Queue) ctxRemote.lookup("jms/queue/OutQueue");
ctx = new InitialContext();
// i want connection factory configured here
cf = (ConnectionFactory) ctx.lookup("java:/JmsXA");
{code}



You should register a RemoteJmsXA... I'm actually surprised how the java:/JmsXA worked.. we should talk before I reject this.

Comment 7 Clebert Suconic 2012-10-30 20:57:59 UTC
[~mnovak] When you add these reproducers, I would need a version where I can build from scratch... so I can debug, understand and eventually find issues....


From the little I could see there is something wrong with the test, but I don't want to de-compile anything to debug around this.

I've tried to debug and fix the test, but it's hard to do with just binaries....  I usually always need compilable versions.



Anyway: we will talk tomorrow.... but so far:

- This is definitely not related to the patch, so please just let the patch proceed.
- This seems a reject based on what I saw

Comment 8 Miroslav Novak 2012-10-31 08:22:23 UTC
Attachment: Added: client-project.zip


Comment 9 Miroslav Novak 2012-10-31 10:14:17 UTC
[~clebertsuconic]
The only remote JNDI lookup in the MDB is for "jms/queue/OutQueue".

Connection factory "java:/JmsXA" is from the local JNDI tree (on the mdb server) and is configured to create connections to the remote jms server (localhost:5445):
{code}
ctx = new InitialContext();
// i want connection factory configured here
cf = (ConnectionFactory) ctx.lookup("java:/JmsXA");
{code}
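That local java:/JmsXA binding is a pooled connection factory in the messaging subsystem pointed at a remote connector; a sketch of the usual EAP 6 configuration (the connector and socket-binding names "netty-remote" and "messaging-remote" are illustrative):

{code}
<pooled-connection-factory name="hornetq-ra">
    <transaction mode="xa"/>
    <connectors>
        <connector-ref connector-name="netty-remote"/>
    </connectors>
    <entries>
        <entry name="java:/JmsXA"/>
    </entries>
</pooled-connection-factory>

<!-- connector pointing at the remote jms server -->
<netty-connector name="netty-remote" socket-binding="messaging-remote"/>

<!-- in the socket-binding-group -->
<outbound-socket-binding name="messaging-remote">
    <remote-destination host="localhost" port="5445"/>
</outbound-socket-binding>
{code}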

Anyway, I've tried to clean up the MDB for you - attached mdbToRemoteQueue-maven-project.zip. Make any further modifications as you need.
A new mdb1.jar is also attached so you don't need to compile.

If client sources are really needed - client-project.zip

Is there anything else you need?

Comment 10 Miroslav Novak 2012-10-31 10:14:39 UTC
Attachment: Added: mdb1.jar
Attachment: Added: mdbToRemoteQueue-maven-project.zip


Comment 11 Clebert Suconic 2012-10-31 20:52:32 UTC
[~mnovak] I'm rejecting this issue:


Your test is broken: you don't specify the node-id for recovery. Each Arjuna instance needs its own node-identifier....
Server1 does the periodic recovery, finds XIDs that would belong to it, and rolls them back right away... the second server then doesn't find the XIDs, and you get those pending transactions because of the missing node-identifier configuration.



Now: I tried to fix it by setting node-identifier on jboss:domain:transactions:1.2:

- I first tried to add <node-identifier>something</node-identifier> under core-environment and the XML didn't parse it
- then I tried adding it to the process-id; no more errors, but then nothing worked:

<!-- this is how the CLI will set it -->
<process-id node-identifier="jms"/>


I even tried to guess by setting a node-identifier property on the recovery-environment.


So, it's uncertain how to set this on AS7. It seems there's a bug about that.

Tom Ross pointed me to a known issue, which is actually being discussed now on the jboss-support-eap6 list, in a thread called "XA recovery in AP 6".



As for this bug, this is a reject.. as for how to configure the node-identifier, you either have to find the proper way or fix it in the arjuna subsystem on AS7.

Comment 12 Clebert Suconic 2012-10-31 20:53:06 UTC
my comment makes it clear why it's a reject...

I could see recovery working properly

Comment 13 Clebert Suconic 2012-10-31 21:03:24 UTC
Apparently (from what Tom Ross told me) the proper way to set it is by adding a property on core-environment:


<core-environment node-identifier="jms">

and

<core-environment node-identifier="mdb">


in your examples
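A sketch of setting the same attribute through the management CLI instead of editing the XML by hand (assuming a running server; the value is illustrative):

{code}
/subsystem=transactions:write-attribute(name=node-identifier, value=jms)
:reload
{code}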

Comment 14 Clebert Suconic 2012-10-31 21:11:13 UTC
well.. I couldn't make it work.. you guys will have to talk to arjuna guys

Comment 15 Miroslav Novak 2012-11-01 10:44:16 UTC
You might be right that this is just a configuration issue, but setting node-identifier on both of the servers did not fix the issue:
{code}
<subsystem xmlns="urn:jboss:domain:transactions:1.2">
            <core-environment node-identifier="10">
                <process-id>
                    <uuid/>
                </process-id>
            </core-environment>
            <recovery-environment socket-binding="txn-recovery-environment" status-socket-binding="txn-status-manager"/>
            <coordinator-environment default-timeout="300"/>
</subsystem>
{code}

Re-opening since problem still exists.

Comment 16 Miroslav Novak 2012-11-01 10:51:48 UTC
This is a customer use case. Increasing priority.

Comment 17 tom.jenkinson 2012-11-01 13:22:25 UTC
Hi Miroslav,

Can you post the standalone.xml for both servers? I am guessing (hoping) they aren't both node-identifier="10" (I tried to check this but the config in reproducer.zip is stale, according to your notes above).

Is there an internal git repo you can post your code to so we can collaborate on it more easily?

Tom

Comment 18 Miroslav Novak 2012-11-01 13:34:16 UTC
Attachment: Added: reproducer2.zip


Comment 19 Miroslav Novak 2012-11-01 13:36:14 UTC
Hi Tom,

I've attached a new reproducer2.zip with all modifications. There is an updated mdb, and a different node-identifier attribute is set on each of the servers.

Thanks,

Mirek

Comment 20 tom.jenkinson 2012-11-01 22:29:10 UTC
Hi,

I can see in the debugger that the unique ID is being passed to the transaction manager. Also I can see the following in the output:
22:26:34,531 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=30, bqual_length=36, tx_uid=0:ffff7f000001:3c3932b9:5092f554:7ac, node_name=20, branch_uid=0:ffff7f000001:3c3932b9:5092f554:7c1, subordinatenodename=null, eis_name=java:/JmsXA >

As you can see node_name=20 so the node_name was passed to the transaction manager and it utilized it.

I am not convinced there is an issue with the transaction manager here. Still investigating though.

Tom

Comment 21 tom.jenkinson 2012-11-01 23:00:26 UTC
The crux of this is whether server2 is scanning an XAResource that can contact server1 for Xids. I am just not too sure where to start looking, as I don't know how to configure HornetQ.

Also, the reproducer is nice and so is the text accompanying it, but I feel it is somewhat incomplete, sorry. From what I can see:
1. a producer (non-transactionally) sends messages to inqueue on server 1
2. server 2 is booted up with an mdb that:
  starts an XA transaction
  dequeues a single message from inqueue on server1
  enqueues the same message to outqueue on server1
  commits the transaction
3. Where you are falling short is expanding on "Receive all messages from OutQueue and check warnings in mdb servers log". From what you are suggesting, I should be able to fire up some other application which can dequeue from OutQueue, and from what I think I am understanding it should be possible to dequeue all 500 messages? Please can you revise the description with these steps as appropriate?

Anyway, it sounds like the transaction manager is running on server 2, the resource manager (HQ) itself is running on server 1 (there is an HQ on both servers, but the queues are all on server 1). What I can't confirm is whether the HQ configuration is set up to allow the transaction manager on server 2 to scan the hornetq RA on server 1 for XIDs. The error message implies that there are branches in the transaction log that we can't locate when scanning using XAR::recover on the RA in server 2, in which case I would say some misconfiguration isn't allowing the TM on server 2 to see the indoubt XID from the HQ RA on server1.

Let's put that a different way: the transaction manager on server 2 is scanning the HQ XAResource from the RA deployed on server 2. I suspect that server is meant to somehow aggregate/proxy the call to server 1, but I don't think this can be happening, otherwise the TM would be able to find these branches and call commit/rollback.
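The scan-and-filter step described here can be sketched in plain Java. Note this is a simplification: the Xid layout below (node name as a prefix of the branch qualifier, parsed by a made-up `ownedBy` helper) is NOT JBTM's real encoding, it only illustrates why a TM with a unique node identifier ignores branches belonging to another node:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.transaction.xa.Xid;

public class RecoveryScanSketch {

    // Minimal Xid carrying the node name at the start of the branch qualifier
    // (simplified stand-in for JBTM's real encoding).
    static Xid xidFor(String nodeName, int seq) {
        final byte[] gtrid = ("tx-" + seq).getBytes(StandardCharsets.UTF_8);
        final byte[] bqual = (nodeName + ":" + seq).getBytes(StandardCharsets.UTF_8);
        return new Xid() {
            public int getFormatId() { return 131077; } // JBTM's format id, per the logs
            public byte[] getGlobalTransactionId() { return gtrid; }
            public byte[] getBranchQualifier() { return bqual; }
        };
    }

    // Keep only the Xids owned by thisNode. A TM configured with
    // xaRecoveryNode="*" would skip this filter and claim everything,
    // which is the misconfiguration discussed above.
    static List<Xid> ownedBy(Xid[] recovered, String thisNode) {
        List<Xid> mine = new ArrayList<>();
        for (Xid xid : recovered) {
            String bqual = new String(xid.getBranchQualifier(), StandardCharsets.UTF_8);
            if (bqual.startsWith(thisNode + ":")) {
                mine.add(xid);
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        Xid[] recovered = { xidFor("10", 1), xidFor("20", 2), xidFor("20", 3) };
        // A TM with node identifier "20" claims only its own branches.
        System.out.println("node 20 owns " + ownedBy(recovered, "20").size() + " xids");
    }
}
```

The point of the sketch: both TMs may see the same in-doubt Xids from the RA's XAResource::recover scan, but only the TM whose node identifier matches the branch acts on it.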

Hope that helps,
Tom

Comment 22 tom.jenkinson 2012-11-02 08:32:43 UTC
Hi Clebert,

As much as I don't like doing "assignee tag", please can you confirm that the user's config is set up in a manner that would allow the RA on server2 to also return the list of indoubt branches on server1. That is the missing information I am unable to answer here, as it is HQ specific. Please feel free to assign back if that is not the issue.

Tom

Comment 23 tom.jenkinson 2012-11-02 08:39:30 UTC
Also, I have confirmed that when the servers start they have the right values for XARecoveryNodes. If this wasn't correct (e.g. the default of "*" was used), the scenario that Clebert described, where tm1 would roll back the indoubt XIDs, would happen.

To be clear there are two common configuration issues that I have verified are not happening with reproducer2.zip:

1. Both nodes have unique node identifiers (confirmed in the debugger that these are passed to JBTM). If they had the same node id the issue described can happen.
2. Both nodes have the correct value for com.arjuna.ats.jta.xaRecoveryNode (again, confirmed in the debugger). If they had the default value ("*") the issue described can happen.

I therefore suspect a misconfiguration of the RA where the RA in server2 is not proxying the call to XAResource::recover for server1 so TM in server2 can't match up the indoubt XIDs.

Please confirm that the scenario I describe is catered for in your configuration (I would be interested to know how this is configured anyway in case I get something similar in future). If this is not the issue, please do consider assigning it back to me.

Comment 24 Miroslav Novak 2012-11-02 14:11:54 UTC
Link: Added: This issue is related to JBPAPP-10338


Comment 25 Clebert Suconic 2012-11-02 20:03:05 UTC
No, what I'm seeing is that Node1 is rolling back XIDs that belong to Node2.

Node1 has an InVM connection, and Node2 has a remote connection towards Node1.

Node2 will insert XIDs into Node1, and Node1 is rolling back the XIDs that belong to Node2.

Comment 26 Clebert Suconic 2012-11-02 22:44:52 UTC
At least with the proper config the test is doing the rollback at the right node...

it wasn't clear where to place the nodeID config (it wasn't clear to me, and I couldn't find any information from some support guys either).


At the end of the test I still see "can't find XIDs" warnings (the can't-serialize-information kind of warn)... but I see the XIDs properly configured now


[~mnovak] did you retest with the proper configuration?


Comment 27 Misty Stanley-Jones 2012-11-05 04:40:08 UTC
Release Notes Docs Status: Added: Not Required


Comment 28 Miroslav Novak 2012-11-05 13:13:35 UTC
[~clebert.suconic]
I did a review of configuration in reproducer2.zip and could not find any issues in it. This feature is not documented. Could you specify the changes which should be made?

Comment 29 Clebert Suconic 2012-11-05 21:05:00 UTC
[~mnovak] Those are changes required by the Transaction Manager.. AFAIK it is well documented.. you should talk to the Transactions team about how to make each node have its own unique ID when using a cluster.


There's nothing to be done in HornetQ's docs here. We can maybe add a paragraph referring to the TM's docs, but it's not in our scope.

Comment 30 Clebert Suconic 2012-11-05 21:06:02 UTC
[~mnovak] if you're not hitting issues with the proper config, then just close this JIRA. 

If you look at the tests on EAP5, they are also using unique IDs on the Transaction Manager (I know you guys changed this before).

Comment 31 Clebert Suconic 2012-11-05 21:06:48 UTC
I'm assigning this back to you Miro... we can talk about it tomorrow if you like. but so far I see no issues.

Comment 32 Miroslav Novak 2012-11-06 13:59:18 UTC
[~clebert.suconic]
From Tom's comment above it seems that a unique node identifier for the transaction manager was set on both of the servers. Different node-identifiers are set for the transaction manager on the mdb and jms servers in reproducer2.zip.



Comment 33 Miroslav Novak 2012-11-06 17:57:14 UTC
[~clebertsuconic]
You seem to be right that the HornetQ recovery module on node 1 rolls back xids which belong to node 2. There is a periodic scan for unfinished transactions on node 1, which are rolled back and removed from the journal.
The problem is that when the recovery manager on node 2 starts periodic recovery and provides an xid which it wants to commit, there is no XAResource which knows about this xid, because it was already removed by node 1. Is that correct?
How can the HornetQ recovery module on node 1 be configured not to take node 2's xids into account during this periodic recovery scan?

Comment 34 Clebert Suconic 2012-11-06 20:24:24 UTC
[~mnovak] That's the job of the Transaction Manager. We deliver all the XIDs, which we don't know how to deserialize... and the TM will then ignore the ones that don't belong to that server.

Comment 35 Clebert Suconic 2012-11-06 20:25:02 UTC
I'm not sure why you assigned this back to me.. I don't see any issues in HornetQ now. These issues are on the TM.

Comment 36 Clebert Suconic 2012-11-06 20:28:03 UTC
I have debugged this, and there are no duplicate resources on the transaction manager.. there's only one... 

if the TM is rolling back at the wrong node it's an issue on the transaction manager.

Comment 37 tom.jenkinson 2012-11-06 20:29:38 UTC
Can you clarify (for the bug report) "Receive all messages from OutQueue and check warnings in mdb servers log": are all messages actually received?

Comment 38 Clebert Suconic 2012-11-06 20:31:43 UTC
[~tomjenkinson] there are eventual losses caused by these XID failures. (rollback from the wrong node)

Comment 39 tom.jenkinson 2012-11-06 20:51:51 UTC
I am going to take another look and see if I can work out what is going on. I know the correct TM config appeared to be used, but I can't understand why node 1 would roll back XIDs propagated from node 2, as the orphan detection shouldn't touch them.

Comment 40 tom.jenkinson 2012-11-06 21:13:36 UTC
Can someone clarify why we think that server 1 is rolling anything back? Are we seeing it in the logs somewhere?

Comment 41 tom.jenkinson 2012-11-06 21:42:19 UTC
I just put this in the debugger and can see TM1 scanning the XIDs from HQ in server 1, but we are definitely doing the node name detection correctly and not rolling them back. That said, I do have some output in the console showing the transaction has timed out; could this be your issue? I have attached the console output post server 2 crash.

Comment 42 tom.jenkinson 2012-11-06 21:42:47 UTC
Attachment: Added: server1consoleAfterServer2Killed.txt


Comment 43 tom.jenkinson 2012-11-06 21:45:54 UTC
To be clear, what I have done is:

1. Start server1 in the debugger
2. run the producer
3. start server2 not in the debugger
4. crash server2
5. monitor what is happening in server1 to make sure it isn't rolling back any of the XIDs HQ is providing to it.

I can say for sure from what I am seeing that server1 is not rolling back the server2 XIDs as orphans. It could be something to do with the transaction timing out, but I haven't looked at your code, nor do I know how HQ reacts regarding delivery when a transaction times out. I suspect it drops the message, which would be fine in my view and would account for some lost messages depending on when you restart the server; don't forget it can take several minutes for recovery to kick in!

Comment 44 Miroslav Novak 2012-11-07 09:39:13 UTC
Thanks Tom for feedback. I also see this on server 1 log:
{code}
...
18:39:09,050 WARN  [org.hornetq.core.transaction.impl.ResourceManagerImpl] (Thread-2 (HornetQ-scheduled-threads-1410328673)) transaction with xid XidImpl (1677246734 bq:0.0.0.0.0.0.0.0.0.0.-1.-1.10.34.3.-65.-118.112.-104.-95.80.-103.74.8.0.0.4.-27.0.0.0.0.0.0.0.0 formatID:131077 gtxid:0.0.0.0.0.0.0.0.0.0.-1.-1.10.34.3.-65.-118.112.-104.-95.80.-103.74.8.0.0.4.-28.50.48 base64:AAAAAAAAAAAAAP__CiIDv4pwmKFQmUoIAAAE5QAAAAAAAAAAAAAAAAAAAAAAAP__CiIDv4pwmKFQmUoIAAAE5DIwBwICAA== timed out
...
{code}

I'll try to increase transaction-timeout on jms server to see what happens. 

Comment 45 Miroslav Novak 2012-11-07 09:54:41 UTC
It doesn't seem to have an effect.

Comment 46 tom.jenkinson 2012-11-07 10:27:35 UTC
I think it is the actual XA transaction that is timing out; the default for AS7 is 300. You can change it in: <coordinator-environment default-timeout="300"/>

That would be changed on server2.

Comment 47 Miroslav Novak 2012-11-07 13:22:35 UTC
HornetQ has its own timeouts for transactions. I tried to increase the transaction-timeout attribute in the jms server configuration. This didn't have an effect: the transaction timed out in 300s instead of the 600s which I set.
I tried to set the transactionTimeout activation property in the mdb:
{code}
@ActivationConfigProperty(propertyName = "transactionTimeout", propertyValue = "600")
{code} 

This really seems to have an effect. The transaction timed out on server 1 after 10 min, but I can still see warnings from arjuna on the mdb server even when those transactions were not timed out on node 1.

Comment 48 Miroslav Novak 2012-11-07 13:24:07 UTC
I'm wondering how the hornetq recovery module is registered with the recovery manager on node 2.

Comment 49 Clebert Suconic 2012-11-07 13:26:04 UTC
[~mnovak] The transaction timeout that you saw in hornetq was for incomplete transactions. That is... not prepared, not committed..

What happens is that you killed server2, so the commit will never arrive. HornetQ will roll back any transaction that has not been prepared within HornetQ's timeout.

HornetQ's timeout must be greater than the TM's timeout. So: I don't think the timeout should be changed. There shouldn't be any transaction taking more than 5 minutes to complete. (Everything takes a couple of milliseconds between start and prepare in this example.)

Comment 50 tom.jenkinson 2012-11-07 13:49:14 UTC
I think the timeout could be because the transaction is created on server 2, which is not going to complete the transaction until it recovers, which may be over 5 minutes depending upon when server2 is recovered. I think it is going to be at least 2 minutes, as that is the scan interval.

Comment 51 Fernando Nasser 2012-11-07 14:26:50 UTC
Could HQ check whether the destination is down when the first timeout expires and, if that is the case, wait longer to give that node a chance to be restored?

Comment 52 tom.jenkinson 2012-11-07 14:47:17 UTC
To be clear, from my interpretation, the transaction timeout is somewhat unrelated to the timetolive on a message. A message may have a time-to-live value which is less than the transaction timeout.

It would explain supposedly "dropped" messages, as HQ will presumably drop a message as soon as the TX commits if the ttl has expired (it may even drop it before; so long as it doesn't drop the corresponding XID we will be fine).

I don't think the JMS spec mandates that TTL > XA tx timeout, even if it is a little confusing/contradictory when it happens

Comment 53 Fernando Nasser 2012-11-07 14:54:16 UTC
Tom, what is your suggested way to avoid (fix) the problem described in this JIRA?

Comment 54 tom.jenkinson 2012-11-07 15:09:27 UTC
Hi Fernando,

Unfortunately I haven't sanity checked the JMS configuration but:
1. It seems to me that the TTL on the JMS messages should be greater than the TX timeout (default 300s)
2. It seems strange that server2 is not getting the corresponding XID it is expecting out of the RM on server2. From what I am reading, I am thinking that if HQ TTL-expires messages in an XA TX, it is unilaterally forgetting the branch (hence the messages about no XAResource to recover). I would suggest this is a bug in HQ if that is the case.

All I have confirmed is that server1 is not rolling back the indoubt XIDs for server2 and that "Transaction manager is not accepting uniqueID on its configuration, causing issues with recovery in clustering" is not a true assertion. I think the Jira needs renaming.

Tom

Comment 55 Fernando Nasser 2012-11-07 15:13:26 UTC
Clebert can you check HQ handling of the timeout (see above) please?

Comment 56 Fernando Nasser 2012-11-07 15:15:37 UTC
I guess we should wait for Clebert's investigation to see what the Subject should say...
Either something like "timeout too soon" or "unilateral dropping of XA TX branch"(?)

Comment 57 Clebert Suconic 2012-11-07 16:17:52 UTC
[~tomjenkinson] 


1. It seems to me that the TTL on the JMS messages should be greater that the TX timeout (default 300s)

The only TXs that are being timed out are the ones left after the server was killed.. they are never going to be committed, and they will never be delivered from the Resource Manager.

2. It seems strange that server2 is not getting the corresponding XID out of the RM on server2 it is expecting. From what I am reading I am thinking that if HQ TTL expires messages an XA TX it is unilaterally forgetting the branch (hence the messages about NoXAResource to recover) I would suggest this is a bug in HQ if that is the case.

I don't see that happening. If this were a bug in HornetQ, it would also be failing on EAP 5.1, which is basically the same code.


Comment 58 tom.jenkinson 2012-11-07 18:32:52 UTC
How do I perform step 7? I want to see how many messages are lost. Is it in reproducer2.zip somewhere?

Comment 59 Miroslav Novak 2012-11-07 18:38:56 UTC
The producer does not set any expiration time or ttl on messages. I believe that point 1 is not the problem.

The trace log for the unfinished xa transaction shows that, before server 2 was killed, the xa transaction got to the state where it was calling commit on the "inbound" transaction and failed while calling commit for the "jmsXA" transaction. Node 2 didn't get a response that this call was successful; I don't see "finished commit on xid=XidImpl..." for it. I suspect that this call was actually successful on node 1 and that it removed the record about this xid from the journal. But I did not manage to debug the jms server.

Comment 60 Miroslav Novak 2012-11-07 18:44:46 UTC
It's possible to receive them with (it's also in reproducer2.zip, I just forgot to add it to the instructions):
{code}sh start-consumer.sh localhost{code}

This command is very useful for reading hornetq journal:
{code}
export JBOSS_HOME=$PWD/server1/jboss-eap-6.0
java -cp $JBOSS_HOME/modules/org/hornetq/main/hornetq-core-2.2.22.Final-redhat-1.jar:$JBOSS_HOME/modules/org/jboss/netty/main/netty-3.2.6.Final-redhat-2.jar org.hornetq.core.persistence.impl.journal.PrintData $JBOSS_HOME/standalone/data/messagingbindings $JBOSS_HOME/standalone/data/messagingjournal >  log1.txt
{code}



Comment 61 tom.jenkinson 2012-11-07 18:55:57 UTC
I have run this in the debugger, this time from the perspective of server2.

(I can confirm server1 kept rattling around two XIDs and kept detecting that they didn't belong to it, so it did nothing.)

server2
1. the first scan detects 4 xids; two are part of a transaction, so it completes that transaction

2. the second scan detects 2 orphan XIDs (this happened because the crash had prepared the RM but no log was written - a normal failure scenario), and we rolled them back:
18:50:07,842 INFO  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016013: Rolling back < formatId=131077, gtrid_length=30, bqual_length=36, tx_uid=0:ffff7f000001:14e87dd0:509aa9a7:a2b, node_name=20, branch_uid=0:ffff7f000001:14e87dd0:509aa9a7:a2c, subordinatenodename=null, eis_name=unknown eis name >
18:50:21,763 INFO  [org.jboss.qa.hornetq.apps.mdb.MdbWithRemoteOutQueueToContaniner1] (Thread-38 (HornetQ-client-global-threads-1502765380))  Start of message:ID:454db0fc-2909-11e2-a876-2d2016835aec, count:943
18:50:21,867 INFO  [org.jboss.qa.hornetq.apps.mdb.MdbWithRemoteOutQueueToContaniner1] (Thread-38 (HornetQ-client-global-threads-1502765380))  End of ID:454db0fc-2909-11e2-a876-2d2016835aec, count:943 in 104 ms
18:51:16,505 INFO  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016013: Rolling back < formatId=131077, gtrid_length=30, bqual_length=36, tx_uid=0:ffff7f000001:14e87dd0:509aa9a7:a2b, node_name=20, branch_uid=0:ffff7f000001:14e87dd0:509aa9a7:a98, subordinatenodename=null, eis_name=java:/JmsXA >

3. The third scan: no XIDs returned from HornetQ.

I didn't get the error messages:
17:04:28,713 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016037: Could not find new XAResource to use for recovering non-serializable XAResource XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47d, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:480, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@2cec4462 >
17:04:28,714 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47d, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:480, subordinatenodename=null, eis_name=unknown eis name >

Perhaps the server needs to crash at a specific point to get this behavior?

Comment 62 tom.jenkinson 2012-11-07 18:59:25 UTC
Also, using Miroslav's updated notes:
5000. got message id = ID:fbb1f910-290b-11e2-90da-01d39391c4f6 message counter: null
Receiver ended - number of messages 5000

I am sending 5000 because I am not quick enough to identify the process and kill it when I only do 500 ;)

Comment 63 tom.jenkinson 2012-11-07 21:03:34 UTC
"I'm suspecting that this call was actually successful on node 1 and it removed record about this xid from journal. But I did not manage to debug the jms server."

I agree that this might cause the error messages and would be considered expected behavior for JBTM, though it would not cause any message loss.

According to Clebert: "Tom Jenkinson there are eventual losses caused by these XID failures. (rollback from the wrong node)" - Miroslav, please confirm what you are seeing, because my reading of your report is that there are no message losses. If there is no message loss, then I don't think there is any problem here.

To be clear:

xar1.prepare()
xar2.prepare()
tm.log(xar1, xar2)
xar1.commit()
xar2.commit()
--- CRASH BEFORE WE CAN DELETE tm log record

When we recover, we scan the XAResources for XIDs and try to match them to the entries in the log; since the branches are already committed, matches will never be found. I will have to check my code in the morning (I'm not on my work laptop), but that would plausibly print out that error message. I think that after a number of scans the records are moved to assumedcomplete, and that is all there is to it.
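The crash window described above can be sketched as a toy simulation (all names are hypothetical, not actual JBTM code) showing why the recovery scan keeps finding a log record that no resource manager can match:

```java
import java.util.HashSet;
import java.util.Set;

// Toy simulation (hypothetical names, not actual JBTM code) of the window
// Tom describes: both resource managers commit, but the TM crashes before
// deleting its log record, leaving an orphan log entry for recovery to find.
public class TwoPhaseCrashSketch {
    static Set<String> tmLog = new HashSet<>();          // TM's transaction log
    static Set<String> rmPendingXids = new HashSet<>();  // XIDs the RMs report on a recovery scan

    static void commitWithCrash() {
        rmPendingXids.add("xid-1");    // xar1.prepare()
        rmPendingXids.add("xid-2");    // xar2.prepare()
        tmLog.add("tx-1");             // tm.log(xar1, xar2)
        rmPendingXids.remove("xid-1"); // xar1.commit()
        rmPendingXids.remove("xid-2"); // xar2.commit()
        // --- CRASH: delete(tm.log(xar1, xar2)) never runs ---
    }

    public static void main(String[] args) {
        commitWithCrash();
        // Recovery scan: the log still references the transaction, but no RM
        // reports a matching XID, so the record can never be matched and the
        // ARJUNA016037/016038 warnings repeat until it is moved to assumedcomplete.
        System.out.println("TM log records after crash: " + tmLog.size());
        System.out.println("XIDs reported by RM scan: " + rmPendingXids.size());
    }
}
```

No data is lost in this scenario: both branches committed; only the TM's bookkeeping is left behind.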

Miroslav, please confirm whether you have ever had data integrity issues here; Clebert's comment suggests there are. If so, we should try to isolate under what circumstances your test can result in data integrity issues, and I will try to explain them.

Comment 64 Miroslav Novak 2012-11-08 08:51:55 UTC
Thanks Tom for looking at this issue. In the test scenario described here there is no message loss, just warnings. But I linked another JIRA where message loss does occur - JBPAPP-10338. It uses the same reproducer (reproducer2.zip), but there the JMS server is killed and restarted rather than the MDB server; otherwise the test scenario is the same.

Today we started EAP 5.2.0.ER4 testing and I'll have to move my attention to it, so there won't be much time to investigate this issue. Anyway, I'll try to help as much as I can.

Comment 65 tom.jenkinson 2012-11-08 10:47:51 UTC
I think it sounds like this scenario is documented here: http://docs.jboss.org/jbosstm/5.0.0.M1/guides/failure_recovery_guide/ch02.html#d0e561

I would be tempted to close this bug.

Also note, I would be tempted to use Byteman to crash the server rather than "kill -9", as then you can describe the failure scenario in a cleaner manner, e.g. "RM committed, TM did not remove log" (which is what seems to have happened here). "kill -9" chooses a failure scenario at random :)
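A Byteman rule for that could look roughly like the sketch below. The class and method names are guesses and would need to be checked against the actual JBTM sources before use; `killJVM()` is a Byteman built-in helper action.

```
# Hedged sketch of a Byteman rule: kill the JVM right after phase-2 commit
# completes, before the TM log record can be removed. CLASS/METHOD are
# assumptions and must be verified against the JBTM version in use.
RULE crash before removing committed tx log
CLASS com.arjuna.ats.arjuna.coordinator.BasicAction
METHOD phase2Commit
AT EXIT
IF TRUE
DO killJVM()
ENDRULE
```

This pins the failure to one deterministic point in the protocol instead of whatever instant "kill -9" happens to hit.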

Comment 66 Miroslav Novak 2012-11-08 11:41:26 UTC
As I mentioned before, I'm not sure that this is really happening - I'm just suspecting it. I have no evidence that it is.

Yes, Byteman could narrow down the issue, but because of the EAP 5.2.0.ER4 testing I'm not sure when I will get to it. For now the provided reproducer2.zip will have to be sufficient.

Comment 67 tom.jenkinson 2012-11-08 13:55:16 UTC
To be clear, I don't think there is a bug here, just components working as normal. I checked this by:

1. run server1 in the debugger
2. allow HQ RM to commit
3. kill server2 (before writing out the log)
4. recover server2 and you get these messages:
13:44:35,310 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016037: Could not find new XAResource to use for recovering non-serializable XAResource XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=30, bqual_length=36, tx_uid=0:ffff7f000001:-4332148:509bb6a2:15, node_name=20, branch_uid=0:ffff7f000001:-4332148:509bb6a2:16, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@6fad79ed >
13:44:44,238 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016037: Could not find new XAResource to use for recovering non-serializable XAResource XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=30, bqual_length=36, tx_uid=0:ffff7f000001:-4332148:509bb6a2:15, node_name=20, branch_uid=0:ffff7f000001:-4332148:509bb6a2:49, subordinatenodename=null, eis_name=java:/JmsXA >, heuristic: TwoPhaseOutcome.FINISH_OK, product: HornetQ/2.0, jndiName: java:/JmsXA com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@647a609c >
13:44:50,796 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=30, bqual_length=36, tx_uid=0:ffff7f000001:-4332148:509bb6a2:15, node_name=20, branch_uid=0:ffff7f000001:-4332148:509bb6a2:16, subordinatenodename=null, eis_name=unknown eis name >
13:44:56,030 WARN  [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=30, bqual_length=36, tx_uid=0:ffff7f000001:-4332148:509bb6a2:15, node_name=20, branch_uid=0:ffff7f000001:-4332148:509bb6a2:49, subordinatenodename=null, eis_name=java:/JmsXA >

But also, there is no message loss (verified by "sh start-consumer.sh localhost" and dequeuing "Receiver ended - number of messages 15500"). I therefore think you should close the bug with something like "not a bug".

Comment 68 Clebert Suconic 2012-11-08 14:20:43 UTC
There is a message loss, but it is not related to this: https://issues.jboss.org/browse/JBPAPP-10296

The only issue here is the WARNs appearing *forever*, which I also see with EAP 5.

Comment 69 Clebert Suconic 2012-11-08 14:21:28 UTC
Wrong link before... https://issues.jboss.org/browse/JBPAPP-10338

Comment 70 Miroslav Novak 2012-11-08 14:29:45 UTC
Great work, Tom! This would explain why there are those warnings on server 2.

But there is one more thing, probably not related to the warnings: on the JMS server there are timed-out XIDs:
{code}
...
18:39:09,050 WARN  [org.hornetq.core.transaction.impl.ResourceManagerImpl] (Thread-2 (HornetQ-scheduled-threads-1410328673)) transaction with xid XidImpl (1677246734 bq:0.0.0.0.0.0.0.0.0.0.-1.-1.10.34.3.-65.-118.112.-104.-95.80.-103.74.8.0.0.4.-27.0.0.0.0.0.0.0.0 formatID:131077 gtxid:0.0.0.0.0.0.0.0.0.0.-1.-1.10.34.3.-65.-118.112.-104.-95.80.-103.74.8.0.0.4.-28.50.48 base64:AAAAAAAAAAAAAP__CiIDv4pwmKFQmUoIAAAE5QAAAAAAAAAAAAAAAAAAAAAAAP__CiIDv4pwmKFQmUoIAAAE5DIwBwICAA== timed out
...
{code}

Clebert mentioned that HornetQ times out transactions that are not prepared/committed. I increased that timeout so the recovery manager on server 2 had a few chances to roll them back, but it seems it is not doing so.
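For reference, the timeout in question is configurable in HornetQ. A sketch, with attribute names taken from HornetQ's standalone core configuration (hornetq-configuration.xml) and illustrative values; whether and how the EAP 6 messaging subsystem exposes the same attributes should be verified against its schema:

```xml
<!-- Hedged sketch: raise the timeout for non-prepared transactions
     (HornetQ's documented defaults are 300000 ms for the timeout and
     1000 ms for the scan period). -->
<configuration xmlns="urn:hornetq">
   <transaction-timeout>600000</transaction-timeout>
   <transaction-timeout-scan-period>1000</transaction-timeout-scan-period>
</configuration>
```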

Comment 71 tom.jenkinson 2012-11-08 14:32:52 UTC
Hi Clebert, I assumed from your comment: https://issues.jboss.org/browse/JBPAPP-10305?focusedCommentId=12732283&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12732283 that there was message loss here?

Anyway, that link I sent earlier: http://docs.jboss.org/jbosstm/5.0.0.M1/guides/failure_recovery_guide/ch02.html#d0e561 it explains the problem and how you can address it.

Comment 72 Clebert Suconic 2012-11-08 14:35:02 UTC
sorry guys.. I edited the wrong issue :(... rolling back my changes

Comment 73 Clebert Suconic 2012-11-08 14:37:58 UTC
So there is a message loss, but it is not related to transactions at all; there was a bug in paging.

If these WARNs are expected (I don't see any message loss caused by them), we should just close this issue as not a bug.

Comment 74 Clebert Suconic 2012-11-08 15:23:51 UTC
These WARNs are not an issue.


This is what happens (from IRC):

tomjenkinson: xar1.prepare
tomjenkinson: xar2.prepare
tomjenkinson: tm.log(xar1, xar2)
tomjenkinson: xar1.commit
tomjenkinson: xar2.commit
tomjenkinson: CRASH
tomjenkinson: now when we recover, TM has a log which references xar1 and 2, do you see that?
clebert: what would happen after the crash.. if a crash didn't happen?
clebert: tm.unlog(xar1, xar2)?
tomjenkinson: delete(tm.log(xar1,xar2))


This condition is normal, and you can configure the TM to ignore these WARNs.

They are there to prevent cases where the user didn't configure recovery properly.
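One way to silence them is sketched below. The property name is derived from the JTAEnvironmentBean `xaAssumeRecoveryComplete` setting and should be verified against the failure recovery guide linked earlier before use:

```xml
<!-- Hedged sketch (property name is an assumption; verify against the
     JBossTS failure recovery guide): tell recovery to assume that a branch
     with no matching XID at any RM is complete, instead of warning forever. -->
<system-properties>
    <property name="com.arjuna.ats.jta.xaAssumeRecoveryComplete" value="true"/>
</system-properties>
```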

Comment 75 Miroslav Novak 2012-11-08 16:15:29 UTC
[~clebertsuconic]
There is one more thing: there are timed-out XIDs on the JMS server. Should they be recovered by server 2 rather than timed out? The last comment mentions that the warnings are there to prevent cases where the user didn't configure recovery properly - could you specify what should be changed?

Comment 76 Clebert Suconic 2012-11-08 16:21:26 UTC
No, they should be timed out... anything non-prepared will be timed out.

The TM hasn't sent a prepare yet, so it doesn't know about the transaction (it didn't record the prepare anywhere); it is therefore HornetQ's responsibility to time it out.

Comment 77 Anne-Louise Tangring 2012-11-13 20:43:49 UTC
Release Notes Docs Status: Removed: Not Required 
Docs QE Status: Removed: NEW