Bug 901132 (JBPAPP6-1183)
| Summary: | Transaction manager is not accepting uniqueID on its configuration, causing issues with recovery in clustering |
|---|---|
| Product: | [JBoss] JBoss Enterprise Application Platform 6 |
| Reporter: | Miroslav Novak <mnovak> |
| Component: | Transaction Manager |
| Assignee: | Clebert Suconic <clebert.suconic> |
| Status: | CLOSED NOTABUG |
| Severity: | urgent |
| Priority: | urgent |
| Version: | 6.0.0 |
| CC: | clebert.suconic, fnasser, mnovak, ochaloup, tom.jenkinson |
| Target Release: | EAP 6.0.1 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| URL: | http://jira.jboss.org/jira/browse/JBPAPP6-1183 |
| Doc Type: | Bug Fix |
| Type: | Bug |
| Last Closed: | 2012-11-08 15:23:51 UTC |
Attachment: Added: logs.zip
Attachment: Added: reproducer.zip
Link: Added: This issue is related to JBPAPP-10067

Attached SoakMdbWithRemoteOutQueueToContaniner1.java
Attachment: Added: SoakMdbWithRemoteOutQueueToContaniner1.java

Where is the source for the producer and consumer? I'm probably going to reject this issue... the testcase is wrong:
You are looking up the remote ConnectionFactory here:
{code}
final Properties env = new Properties();
env.put(Context.INITIAL_CONTEXT_FACTORY, "org.jboss.naming.remote.client.InitialContextFactory");
env.put(Context.PROVIDER_URL, "remote://" + hostname + ":4447");
ctxRemote = new InitialContext(env);
queue = (Queue) ctxRemote.lookup("jms/queue/OutQueue");

ctx = new InitialContext();
// i want connection factory configured here
cf = (ConnectionFactory) ctx.lookup("java:/JmsXA");
{code}
You should register a RemoteJmsXA... I'm actually surprised that the java:/JmsXA worked... we should talk before I reject this.
[~mnovak] When you add these reproducers, I would need a version I can build from scratch, so I can debug, understand and eventually find issues. From the little I could see there is something wrong with the test, but I don't want to decompile anything to debug around this. I've tried to debug and fix the test, but it's hard to do with just binaries; I always need compilable versions. Anyway, we will talk tomorrow, but so far:
- This is definitely not related to the patch, so please just let the patch proceed.
- This seems a reject based on what I saw.

Attachment: Added: client-project.zip

[~clebertsuconic]
The only remote JNDI lookup in the MDB is for "jms/queue/OutQueue".
The connection factory "java:/JmsXA" comes from the local JNDI tree (on the MDB server) and is configured to create connections to the remote JMS server (localhost:5445):
{code}
ctx = new InitialContext();
// i want connection factory configured here
cf = (ConnectionFactory) ctx.lookup("java:/JmsXA");
{code}
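The distinction between the two lookups can be sketched side by side (a minimal sketch built from the constants in the snippets above; the hostname is a placeholder and no actual lookup or server connection is attempted here):

```java
import java.util.Properties;
import javax.naming.Context;

// Contrast the two JNDI environments from the test: a remote context for the
// destination lookup vs. the default local context for the pooled
// connection factory (java:/JmsXA). No lookup is performed in this sketch.
class JndiEnvs {

    public static Properties remoteEnv(String hostname) {
        Properties env = new Properties();
        env.put(Context.INITIAL_CONTEXT_FACTORY,
                "org.jboss.naming.remote.client.InitialContextFactory");
        env.put(Context.PROVIDER_URL, "remote://" + hostname + ":4447"); // EAP 6 remoting port
        return env;
    }

    // The local lookup uses new InitialContext() with no environment at all;
    // "java:/JmsXA" resolves in the server's own JNDI tree.
    public static String localJmsXaName() {
        return "java:/JmsXA";
    }
}
```

The point of the contrast: the destination lives on the remote JMS server, while the connection factory is the local pooled one that the RA configures to point at that remote server.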
Anyway, I've tried to clean up the MDB for you - attached mdbToRemoteQueue-maven-project.zip. Make any further modifications you need.
Also a new mdb1.jar is attached so you don't need to compile.
If client sources are really needed - client-project.zip.
Is there anything else you need?
Attachment: Added: mdb1.jar
Attachment: Added: mdbToRemoteQueue-maven-project.zip

[~mnovak] I'm rejecting this issue: your test is broken. You don't specify the node-id for the Recovery. Each Arjuna instance needs its own node-identifier. Server1 is doing the periodical discovery, finding XIDs that would belong to it, and it would roll them back right away; the second server wouldn't find the XIDs and you would get those pending transactions because of the missing node-identifier configuration.

Now: I tried to fix it by setting node-identifier on jboss:domain:transactions:1.2:
- I first tried to add <node-identifier>something</node-identifier> under core-environment and the XML didn't parse it
- then I tried adding it to the process-id; no more errors, but then nothing worked: <!-- this is how the CLI will set it --> <process-id node-idenetifier="jms"/>
- I even tried to guess by setting a node-identifier property on the recovery-environment.

So, it's uncertain how to set this on AS7. It seems there's a bug about that. Tom Ross pointed me to a known issue, which is actually being discussed now on the jboss-support-eap6 list, in a thread called "XA recovery in AP 6".

As for this bug, this is a reject; as for how to configure the node-identifier, you either have to find the proper way or fix it in the arjuna subsystem on AS7. My comment makes it clear why it's a reject: I could see recovery working properly.

Apparently (from what Tom Ross told me) the proper way to set it is by adding a property on core-environment: <core-environment node-identifier="jms"> and <core-environment node-identifier="mdb"> in your examples.

Well.. I couldn't make it work.. you guys will have to talk to the arjuna guys.

You might be right that this is just a configuration issue, but setting node-identifier on both of the servers did not fix the issue:
{code}
<subsystem xmlns="urn:jboss:domain:transactions:1.2">
<core-environment node-identifier="10">
<process-id>
<uuid/>
</process-id>
</core-environment>
<recovery-environment socket-binding="txn-recovery-environment" status-socket-binding="txn-status-manager"/>
<coordinator-environment default-timeout="300"/>
</subsystem>
{code}
Re-opening since the problem still exists.
This is a customer's use case. Increasing priority.

Hi Miroslav, Can you post the standalone.xml for both servers? I am guessing (hoping) they aren't both node-identifier="10" (I tried to check this but the config in reproducer.zip is stale, according to your notes above). Is there an internal git repo you can post your code to so we can collaborate on it more easily? Tom

Attachment: Added: reproducer2.zip

Hi Tom, I've attached a new reproducer2.zip with all modifications. There is an updated mdb, and different node-identifier attributes are set on both of the servers. Thanks, Mirek

Hi, I can see in the debugger that the unique ID is being passed to the transaction manager. Also I can see the following in the output:

22:26:34,531 WARN [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=30, bqual_length=36, tx_uid=0:ffff7f000001:3c3932b9:5092f554:7ac, node_name=20, branch_uid=0:ffff7f000001:3c3932b9:5092f554:7c1, subordinatenodename=null, eis_name=java:/JmsXA >

As you can see, node_name=20, so the node_name was passed to the transaction manager and it utilized it. I am not convinced there is an issue with the transaction manager here. Still investigating though. Tom

Whether server2 is scanning an XAResource that can contact server1 for Xids is the crux of this. I am just not too sure where to start looking at this, as I don't know how to configure HornetQ. Also, the reproducer is nice and so is the text accompanying it, but I feel it is somewhat incomplete, sorry. From what I can see:
1. a producer (non-transactionally) sends messages to inqueue on server 1
2. server 2 is booted up with an mdb that: starts an XA transaction, dequeues a single message from inqueue on server1, enqueues the same message to outqueue on server1, commits the transaction
3. "Receive all messages from OutQueue and check warnings in mdb servers log"

Where you are falling short is to expand on step 3. From what you are suggesting, I should be able to fire up some other application which can dequeue from OutQueue, and from what I think I am understanding it should be possible to dequeue all 500 messages? Please can you revise the description with these steps as appropriate?

Anyway, it sounds like the transaction manager is running on server 2, and the resource manager (HQ) itself is running on server 1 (there is an HQ on both servers, but the queues are all on server 1). What I can't confirm is whether the HQ configuration is set up to allow the transaction manager on server 2 to scan the hornetq RA on server 1 for XIDs. The error message implies that there are branches in the transaction log that we can't locate when scanning using XAR::recover on the RA in server 2, in which case I would say some misconfiguration isn't allowing the TM on server 2 to see the in-doubt XID from the HQ RA on server1. Let's put that a different way: the transaction manager on server 2 is scanning the HQ XAResource from the RA deployed on server 2. I suspect that server is meant to somehow aggregate/proxy the call to server 1, but I don't think this can be happening, otherwise the TM would be able to find these branches and call commit/rollback. Hope that helps, Tom

Hi Clebert, As much as I don't like doing "assignee tag", please can you confirm that the user's config is set up in a manner that would allow the RA on server2 to return the list of in-doubt branches on server1 also. That is the missing information I am unable to answer here, as it is HQ specific. Please feel free to assign it back if that is not the issue. Tom

Also, I have confirmed that when the servers start they have the right values for XARecoveryNodes. If this wasn't correct (e.g. the default of "*" was used), the scenario that Clebert described, where tm1 would roll back the in-doubt XIDs, would happen.
To be clear, there are two common configuration issues that I have verified are not happening with reproducer2.zip:
1. Both nodes have unique node identifiers (confirmed in the debugger that these are passed to JBTM). If they had the same node id the issue described can happen.
2. Both nodes have the correct value for com.arjuna.ats.jta.xaRecoveryNode (again, confirmed in the debugger). If they had the default value ("*") the issue described can happen.
I therefore suspect a misconfiguration of the RA where the RA in server2 is not proxying the call to XAResource::recover for server1 so TM in server2 can't match up the indoubt XIDs.
Please confirm that the scenario I describe is catered for in your configuration (I would be interested to know how this is configured anyway in case I get something similar in future). If this is not the issue, please do consider assigning it back to me.
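The two checks above come down to per-node XID filtering during periodic recovery. This can be sketched in a few lines (a simplified model, not JBTM's actual API; the class and method names are hypothetical): a scan only claims in-doubt XIDs whose node name appears in this server's xaRecoveryNode list, so two servers with distinct node identifiers never roll back each other's branches, while the default value "*" claims everything.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of per-node XID filtering during periodic recovery.
// Real JBTM encodes the node name inside the Xid's global transaction id;
// here it is carried as a plain field for clarity (hypothetical types).
class RecoveryFilter {

    public static class InDoubtXid {
        public final String nodeName;
        public final String txUid;
        public InDoubtXid(String nodeName, String txUid) {
            this.nodeName = nodeName;
            this.txUid = txUid;
        }
    }

    // Return only the XIDs this node is allowed to recover, i.e. those whose
    // node name appears in com.arjuna.ats.jta.xaRecoveryNode. A value of "*"
    // (the default) claims every XID -- the misconfiguration discussed above.
    public static List<InDoubtXid> claimable(List<InDoubtXid> scanned,
                                             List<String> xaRecoveryNodes) {
        List<InDoubtXid> mine = new ArrayList<>();
        for (InDoubtXid xid : scanned) {
            if (xaRecoveryNodes.contains("*") || xaRecoveryNodes.contains(xid.nodeName)) {
                mine.add(xid);
            }
        }
        return mine;
    }
}
```

With distinct node identifiers ("10" and "20") and matching xaRecoveryNode lists, each server's scan claims only its own branches, which is what the debugger confirmed.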
Link: Added: This issue is related to JBPAPP-10338

No, what I'm seeing is that Node1 is rolling back XIDs that belong to Node2. Node1 has an InVM connection, and Node2 has a remote connection towards Node1. Node2 will insert XIDs into Node1, and Node1 is rolling back the XIDs that belong to Node2.

At least with the proper config the test is doing the rollback at the right node... it wasn't clear where to place the nodeID config (it wasn't to me, and I couldn't find any information from some support guys as well). At the end of the test I still see "Can't find XIDs" (can't serialize information, that kind of warn)... but I see the XIDs properly configured now. [~mnovak] did you retest with the proper configuration?

Release Notes Docs Status: Added: Not Required

[~clebert.suconic] I did a review of the configuration in reproducer2.zip and could not find any issues in it. This feature is not documented. Could you specify the changes which should be made?

[~mnovak] Those are changes required by the Transaction Manager. AFAIK it is well documented; you should talk to the Transactions team on how to make each node have its own unique ID when using a cluster. There's nothing to be done in HornetQ's doc here. We can maybe add a paragraph referring to the TM's doc, but it's not in our scope.

[~mnovak] if you're not hitting issues with the proper config, then just close this JIRA. If you look at the tests on EAP5 they are also using unique IDs on the Transaction Manager (I know you guys changed this before). I'm assigning this back to you Miro... we can talk about it tomorrow if you like, but so far I see no issues.

[~clebert.suconic] From Tom's comment above it seems that a unique node identifier for the transaction manager was set for both of the servers. There are different node-identifiers set for the transaction manager on the mdb and jms servers in reproducer2.zip.

[~clebertsuconic] You seem to be right that the HornetQ recovery module on node 1 rolls back xids which belong to node 2.
There is a periodic scan for unfinished transactions on node 1, which are rolled back and removed from the journal. The problem is that when the recovery manager on node 2 starts periodic recovery and provides an xid which it wants to commit, there is no XAResource which would know about this xid, because it was already removed by node 1. Is that correct? How can the HornetQ recovery module on node 1 be configured not to take xids of node 2 into account during this periodic recovery scan?

[~mnovak] That's the job of the Transaction Manager. We deliver all the XIDs, which we don't know how to deserialize, and the TM will then ignore the ones that don't belong to that server. I'm not sure why you assigned this back to me; I don't see any issues now on HornetQ. These issues are on the TM. I have debugged this, and there are no duplicate resources on the transaction manager; there's only one. If the TM is rolling back at the wrong node, it's an issue on the transaction manager.

Can you clarify (for the bug report) "Receive all messages from OutQueue and check warnings in mdb servers log": are all messages actually received?

[~tomjenkinson] there are eventual losses caused by these XID failures (rollback from the wrong node).

I am going to take another look and see if I can work out what is going on again. I know the correct TM config appeared to be used, but I can't understand why node 1 would roll back XIDs propagated from node 2, as the orphan detection shouldn't touch them. Can someone clarify why we think that server 1 is rolling anything back? Are we seeing it in the logs somewhere? I just put this in the debugger and can see TM1 scanning the XIDs from HQ in server 1, but we are definitely doing the node name detection correctly and not rolling them back. That said, I do have some output in the console showing the transaction has timed out; could this be your issue? I have attached the console output post server 2 crash.
Attachment: Added: server1consoleAfterServer2Killed.txt

To be clear, what I have done is:
1. Start server1 in the debugger
2. Run the producer
3. Start server2 (not in the debugger)
4. Crash server2
5. Monitor what is happening in server1 to make sure it isn't rolling back any of the XIDs HQ is providing to it

I can say for sure from what I am seeing that server1 is not rolling back the server2 XIDs as orphans. It could be something to do with the transaction timing out, but I haven't looked at your code, nor do I know how HQ reacts when a transaction times out regarding delivery. I suspect it drops the message, which would be fine in my view and would account for some lost messages depending on when you restart the server; don't forget it can take several minutes for recovery to kick in!

Thanks Tom for the feedback. I also see this in the server 1 log:
{code}
...
18:39:09,050 WARN [org.hornetq.core.transaction.impl.ResourceManagerImpl] (Thread-2 (HornetQ-scheduled-threads-1410328673)) transaction with xid XidImpl (1677246734 bq:0.0.0.0.0.0.0.0.0.0.-1.-1.10.34.3.-65.-118.112.-104.-95.80.-103.74.8.0.0.4.-27.0.0.0.0.0.0.0.0 formatID:131077 gtxid:0.0.0.0.0.0.0.0.0.0.-1.-1.10.34.3.-65.-118.112.-104.-95.80.-103.74.8.0.0.4.-28.50.48 base64:AAAAAAAAAAAAAP__CiIDv4pwmKFQmUoIAAAE5QAAAAAAAAAAAAAAAAAAAAAAAP__CiIDv4pwmKFQmUoIAAAE5DIwBwICAA== timed out
...
{code}
I'll try to increase the transaction-timeout on the jms server to see what happens.

It doesn't seem to have an effect.

I think it is the actual XA transaction that is timing out; the default for AS7 is 300. You can change it in: <coordinator-environment default-timeout="300"/> That would be changed in server2.

HornetQ has its own timeouts for transactions.

I tried to increase the transaction-timeout attribute in the jms server configuration. This didn't have an effect: the transaction timed out in 300s instead of the 600s which I set.
I tried to set transactionTimeout activation property in mdb:
{code}
@ActivationConfigProperty(propertyName = "transactionTimeout", propertyValue = "600")
{code}
This really seems to have an effect. The transaction timed out on server 1 after 10 min, but I can still see warnings from arjuna on the mdb server even though those transactions were not timed out on node 1.
I'm wondering how the hornetq recovery module is registered with the recovery manager on node 2.

[~mnovak] The transaction timeout that you saw in hornetq was for incomplete transactions, that is, not prepared, not committed. What happens is you killed server2, so the commit will never arrive. So HornetQ will roll back any transaction that has not been prepared within HornetQ's timeout. HornetQ's timeout must be greater than the TM's timeout. So I don't think the timeout should be changed. There shouldn't be any transaction taking more than 5 minutes to be completed. (Everything is taking a couple of milliseconds between start and prepare in this example.)

I think the timeout could be because the transaction is created at server 2, which is not going to complete the transaction until it recovers, which may be over 5 minutes depending upon when server2 is recovered. I think it is going to be at least 2 minutes, as that is the scan interval. Could HQ check if the destination is down when the first timeout expires and, if that is the case, wait longer to give that node a chance to be restored?

To be clear, from my interpretation, the transaction timeout is somewhat unrelated to the time-to-live on a message. A message may have a certain time-to-live value which is less than the transaction timeout. It would explain supposedly "dropped" messages, as HQ will presumably drop the message as soon as the TX commits if the TTL has expired (it may even drop them before; so long as it doesn't drop the corresponding XID we will be fine). I don't think the JMS spec mandates that the TTL > XA tx timeout, even if it is a little confusing/contradictory when it happens.

Tom, what is your suggested way to avoid (fix) the problem described in this JIRA?

Hi Fernando, Unfortunately I haven't sanity checked the JMS configuration, but:
1. It seems to me that the TTL on the JMS messages should be greater than the TX timeout (default 300s)
2. It seems strange that server2 is not getting the corresponding XID out of the RM on server2 it is expecting. From what I am reading, I am thinking that if HQ TTL-expires messages in an XA TX, it is unilaterally forgetting the branch (hence the messages about No XAResource to recover). I would suggest this is a bug in HQ if that is the case.

All I have confirmed is that server1 is not rolling back the in-doubt XIDs for server2 and that "Transaction manager is not accepting uniqueID on its configuration, causing issues with recovery in clustering" is not a true assertion. I think the Jira needs renaming. Tom

Clebert, can you check HQ's handling of the timeout (see above) please? I guess we should wait for Clebert's investigation to see what the Subject should say... either something like "timeout too soon" or "unilateral dropping of XA TX branch"(?)

[~tomjenkinson]
1. "It seems to me that the TTL on the JMS messages should be greater than the TX timeout (default 300s)" - The only TXs that are being timed out are the ones after the server was killed; they are never going to be committed, and they will never be delivered from the Resource Manager.
2. "It seems strange that server2 is not getting the corresponding XID out of the RM on server2 it is expecting..." - I don't see that happening. If this were a bug in HornetQ, it would be failing on EAP 5.1, which is basically the same code.

How do I perform step 7? I want to see how many messages are lost. Is it in reproducer2.zip somewhere?

The producer does not set any expiration time or ttl on messages. I believe that point 1. is not a problem.
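The timeout relationships discussed above can be written down as two simple invariants (a sketch only; the values are illustrative defaults, not read from any real configuration, and the class name is hypothetical):

```java
// Sanity checks for the timeout relationships discussed in the thread.
class TimeoutSanity {

    // Tom's point 1: a message TTL, if set, should exceed the TX timeout.
    // In JMS, a time-to-live of 0 means the message never expires.
    public static boolean ttlSafe(long messageTtlMs, long txTimeoutMs) {
        return messageTtlMs == 0 || messageTtlMs > txTimeoutMs;
    }

    // Clebert's rule: HornetQ's transaction timeout must be greater than
    // the TM's timeout, so the RM never times out a branch the TM still owns.
    public static boolean rmTimeoutSafe(long hornetqTxTimeoutMs, long tmTimeoutMs) {
        return hornetqTxTimeoutMs > tmTimeoutMs;
    }
}
```

In this bug the producer sets no TTL at all, so only the second invariant is in play.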
The trace log for the unfinished XA transaction shows that before server 2 was killed, the XA transaction got to the state where it was calling commit on the "inbound" transaction and failed while calling commit for the "jmsXA" transaction. Node 2 didn't get a response that this call was successful; I don't see "finished commit on xid=XidImpl..." for it. I'm suspecting that this call was actually successful on node 1 and it removed the record about this xid from the journal, but I did not manage to debug the jms server. It's possible to receive them by (it's also in reproducer2.zip, just forgot to add it to the instructions):
{code}sh start-consumer.sh localhost{code}
This command is very useful for reading hornetq journal:
{code}
export JBOSS_HOME=$PWD/server1/jboss-eap-6.0
java -cp $JBOSS_HOME/modules/org/hornetq/main/hornetq-core-2.2.22.Final-redhat-1.jar:$JBOSS_HOME/modules/org/jboss/netty/main/netty-3.2.6.Final-redhat-2.jar org.hornetq.core.persistence.impl.journal.PrintData $JBOSS_HOME/standalone/data/messagingbindings $JBOSS_HOME/standalone/data/messagingjournal > log1.txt
{code}
I have run this in the debugger from the perspective of server2 this time. (I can confirm server1 kept rattling around two XIDs and kept detecting that they didn't belong to it, so it did nothing.)

server2:
1. first scan detects 4 xids; two are part of a transaction, so it completes that transaction
2. second scan detects 2 orphan XIDs (this happened because the crash had prepared the RM but no log was written; this is a normal failure scenario), and we rolled them back:

18:50:07,842 INFO [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016013: Rolling back < formatId=131077, gtrid_length=30, bqual_length=36, tx_uid=0:ffff7f000001:14e87dd0:509aa9a7:a2b, node_name=20, branch_uid=0:ffff7f000001:14e87dd0:509aa9a7:a2c, subordinatenodename=null, eis_name=unknown eis name >
18:50:21,763 INFO [org.jboss.qa.hornetq.apps.mdb.MdbWithRemoteOutQueueToContaniner1] (Thread-38 (HornetQ-client-global-threads-1502765380)) Start of message:ID:454db0fc-2909-11e2-a876-2d2016835aec, count:943
18:50:21,867 INFO [org.jboss.qa.hornetq.apps.mdb.MdbWithRemoteOutQueueToContaniner1] (Thread-38 (HornetQ-client-global-threads-1502765380)) End of ID:454db0fc-2909-11e2-a876-2d2016835aec, count:943 in 104 ms
18:51:16,505 INFO [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016013: Rolling back < formatId=131077, gtrid_length=30, bqual_length=36, tx_uid=0:ffff7f000001:14e87dd0:509aa9a7:a2b, node_name=20, branch_uid=0:ffff7f000001:14e87dd0:509aa9a7:a98, subordinatenodename=null, eis_name=java:/JmsXA >

3. third scan: no XIDs returned from hornetq

I didn't get the error messages:

17:04:28,713 WARN [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016037: Could not find new XAResource to use for recovering non-serializable XAResource XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47d, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:480, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@2cec4462 >
17:04:28,714 WARN [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47d, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:480, subordinatenodename=null, eis_name=unknown eis name >

It may require the server to crash at a certain point to get this behavior? Also, using Miroslav's updated notes:

5000. got message id = ID:fbb1f910-290b-11e2-90da-01d39391c4f6 message counter: null
Receiver ended - number of messages 5000

I am sending 5000 as I am not quick enough to identify the process and kill it when I only do 500 ;)

"I'm suspecting that this call was actually successful on node 1 and it removed record about this xid from journal. But I did not manage to debug the jms server." - I agree that this might cause the error messages and would be considered expected behavior for JBTM, though it would not cause any message loss. According to Clebert: "Tom Jenkinson there are eventual losses caused by these XID failures. (rollback from the wrong node)" - Miroslav, please confirm what you are seeing, because my reading of your report is that there are no message losses. If there is no message loss then I don't think there is any problem here.
To be clear:

xar1.prepare()
xar2.prepare()
tm.log(xar1, xar2)
xar1.commit()
xar2.commit()
--- CRASH BEFORE WE CAN DELETE tm log record

When we recover, we scan xar* for XIDs and try to match them up to the entries in the log; clearly, as they are committed, they will never be found. I will have to check my code in the morning (not on my work laptop), but that would possibly print out that error message. I think that after a number of scans they are moved to assumed-complete and that is all there is to it. Miroslav, please confirm whether you have ever had data integrity issues here; Clebert's comment suggests there are. If so, we should try to isolate under what circumstances your test can result in data integrity issues and I will try to explain them.

Thanks Tom for looking at this issue. In the test scenario described here there is no message loss, just warnings. But I linked another jira where message loss occurs - JBPAPP-10338. It uses the same reproducer (reproducer2.zip), but there the jms server is killed and restarted, not the mdb server; otherwise the test scenario is the same. Today we started EAP 5.2.0.ER4 testing and I'll have to move my attention to it. There will not be much time to investigate this issue. Anyway, I'll try to help as much as I can.

I think it sounds like this scenario is documented here: http://docs.jboss.org/jbosstm/5.0.0.M1/guides/failure_recovery_guide/ch02.html#d0e561 I would be tempted to close this bug. Also note, I would be tempted to use Byteman to crash the server rather than "kill -9", as then you can describe the failure scenario in a cleaner manner, e.g. "RM committed, TM did not remove log" (which is what seems to have happened here). "kill -9" chooses a failure scenario at random :)

As I mentioned before, I'm not sure that this is really happening; I'm just suspecting it. I have no evidence that this is happening. Yes, Byteman could narrow the issue down, but thanks to EAP 5.2.0.ER4 I'm not sure when I will get to it.
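The crash window described above can be modeled in a few lines (a toy sketch, not the real TM; all names are hypothetical): both branches commit, the crash happens before the TM deletes its log record, and the next recovery scan finds log entries with no matching in-doubt XID in any resource manager, which is exactly what produces the ARJUNA016037/016038 warnings without any data loss.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the crash window: commit succeeds at both resource
// managers, but the TM log record is not deleted before the crash.
class CrashWindow {

    // XIDs each resource manager still reports as in-doubt after the crash.
    // Both branches committed, so neither RM has anything to report.
    public static Set<String> rmInDoubt() {
        return new HashSet<>(); // xar1.commit() and xar2.commit() both ran
    }

    // The TM log still references both branches (the delete never happened).
    public static Set<String> tmLog() {
        Set<String> log = new HashSet<>();
        log.add("xar1");
        log.add("xar2");
        return log;
    }

    // Recovery scan: log entries with no matching in-doubt XID in any RM.
    // These produce the "No XAResource to recover" warnings; after enough
    // scans JBTM moves them to assumed-complete.
    public static Set<String> unmatchedLogEntries() {
        Set<String> unmatched = new HashSet<>(tmLog());
        unmatched.removeAll(rmInDoubt());
        return unmatched;
    }
}
```

Since the branches did commit, no data is lost; the leftover log entries are noise that recovery eventually parks as assumed-complete.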
For now the provided reproducer2.zip must be sufficient.

To be clear, I don't think there is a bug here, just components working as normal. I checked this by:
1. run server1 in the debugger
2. allow the HQ RM to commit
3. kill server2 (before writing out the log)
4. recover server2

and you get those messages:

13:44:35,310 WARN [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016037: Could not find new XAResource to use for recovering non-serializable XAResource XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=30, bqual_length=36, tx_uid=0:ffff7f000001:-4332148:509bb6a2:15, node_name=20, branch_uid=0:ffff7f000001:-4332148:509bb6a2:16, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@6fad79ed >
13:44:44,238 WARN [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016037: Could not find new XAResource to use for recovering non-serializable XAResource XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=30, bqual_length=36, tx_uid=0:ffff7f000001:-4332148:509bb6a2:15, node_name=20, branch_uid=0:ffff7f000001:-4332148:509bb6a2:49, subordinatenodename=null, eis_name=java:/JmsXA >, heuristic: TwoPhaseOutcome.FINISH_OK, product: HornetQ/2.0, jndiName: java:/JmsXA com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@647a609c >
13:44:50,796 WARN [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=30, bqual_length=36, tx_uid=0:ffff7f000001:-4332148:509bb6a2:15, node_name=20, branch_uid=0:ffff7f000001:-4332148:509bb6a2:16, subordinatenodename=null, eis_name=unknown eis name >
13:44:56,030 WARN [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=30, bqual_length=36, tx_uid=0:ffff7f000001:-4332148:509bb6a2:15, node_name=20, branch_uid=0:ffff7f000001:-4332148:509bb6a2:49, subordinatenodename=null, eis_name=java:/JmsXA >
But also, there is no message loss (verified by "sh start-consumer.sh localhost" and dequeuing: "Receiver ended - number of messages 15500"). I therefore think you should close the bug with something like "not a bug".

There is a message loss, but not related to this: https://issues.jboss.org/browse/JBPAPP-10296 The only issue with this is the WARNs appearing *forever*, which I also see with EAP5. Wrong link before... https://issues.jboss.org/browse/JBPAPP-10338

Great work, Tom! This would explain why there are those warnings on server 2.
But there is one more thing, probably not related to the warnings. On the jms server there are timed-out xids:
{code}
...
18:39:09,050 WARN [org.hornetq.core.transaction.impl.ResourceManagerImpl] (Thread-2 (HornetQ-scheduled-threads-1410328673)) transaction with xid XidImpl (1677246734 bq:0.0.0.0.0.0.0.0.0.0.-1.-1.10.34.3.-65.-118.112.-104.-95.80.-103.74.8.0.0.4.-27.0.0.0.0.0.0.0.0 formatID:131077 gtxid:0.0.0.0.0.0.0.0.0.0.-1.-1.10.34.3.-65.-118.112.-104.-95.80.-103.74.8.0.0.4.-28.50.48 base64:AAAAAAAAAAAAAP__CiIDv4pwmKFQmUoIAAAE5QAAAAAAAAAAAAAAAAAAAAAAAP__CiIDv4pwmKFQmUoIAAAE5DIwBwICAA== timed out
...
{code}
Clebert mentioned that HornetQ times out its transactions which are not prepared/committed. I increased that timeout so the Recovery Manager on server 2 had a few chances to roll them back, but it seems that it's not doing it.
Hi Clebert, I assumed from your comment: https://issues.jboss.org/browse/JBPAPP-10305?focusedCommentId=12732283&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12732283 that there was message loss here? Anyway, that link I sent earlier: http://docs.jboss.org/jbosstm/5.0.0.M1/guides/failure_recovery_guide/ch02.html#d0e561 explains the problem and how you can address it.

Sorry guys.. I edited the wrong issue :(... rolling back my changes. So, there's a message loss, but it's not related to transactions at all; there was a bug on paging. If these WARNs are expected (I don't see any message loss caused by those), we should just close this issue as not a bug. These WARNs are not an issue. This is what happens:

[08:39am] tomjenkinson: xar1.prepare
[08:39am] tomjenkinson: xar2.prepare
[08:39am] tomjenkinson: tm.log (xar1, xar2)
[08:39am] tomjenkinson: xar1.commit
[08:39am] tomjenkinson: xar2.commit
[08:40am] tomjenkinson: CRASH
[08:40am] tomjenkinson: now when we recover, TM has a log which references xar1 and 2, do you see that?
[08:40am] clebert: what would happen after the crash.. if a crash didn't happen?
[08:40am] clebert: tm.unlog(xar1, xar2)?
[08:40am] tomjenkinson: delete(tm.log(xar1,xar2))

This condition is normal, and you can configure to ignore these WARNs. They are there to prevent cases where the user didn't configure Recovery properly.

[~clebertsuconic] There is one more thing. There are timed-out xids on the jms server. Should they be recovered by server 2 and not timed out? In the last comment it is mentioned that the warnings are there to prevent cases where the user didn't configure recovery properly. Could you specify what should be changed?

No, they should be timed out... anything non-prepared will be timed out. The TM didn't send a prepare yet, so the TM doesn't know about it (it didn't record the prepare anywhere), so it will be HornetQ's responsibility to time it out.
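Clebert's timeout rule above can be sketched as a small model (a toy, not HornetQ's actual implementation; types and names are hypothetical): the resource manager only times out and rolls back transactions that never reached the prepared state, since the TM knows nothing about those, while prepared branches must be kept for the TM to resolve during recovery.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the resource-manager timeout rule: only transactions that
// never reached the prepared state are timed out and rolled back by the RM;
// prepared branches are kept for the TM's recovery verdict.
class RmTimeout {

    public static class Tx {
        public final String xid;
        public final boolean prepared;
        public Tx(String xid, boolean prepared) {
            this.xid = xid;
            this.prepared = prepared;
        }
    }

    // Returns the XIDs the RM rolls back when their timeout expires.
    public static List<String> timeOutExpired(List<Tx> expired) {
        List<String> rolledBack = new ArrayList<>();
        for (Tx tx : expired) {
            if (!tx.prepared) {
                // The TM never sent prepare, so it recorded nothing;
                // the RM must clean this branch up itself.
                rolledBack.add(tx.xid);
            }
            // prepared: left alone, awaiting the TM's commit/rollback
        }
        return rolledBack;
    }
}
```

This matches the thread's conclusion: the timed-out xids on the jms server are the non-prepared leftovers from the kill, and it is correct that server 2's recovery never touches them.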
Release Notes Docs Status: Removed: Not Required
Docs QE Status: Removed: NEW
project_key: JBPAPP6

Test replication scenario:
1. Start two EAP 6 servers: an mdb server and a jms server. The mdb server is configured to connect to the jms server through remote JCA.
2. Deploy "InQueue" and "OutQueue" to the jms server.
3. Send 500 messages to "InQueue".
4. Deploy the MDB to the mdb server. The MDB reads messages from InQueue and sends them to OutQueue.
5. While the messages are being processed, kill the mdb server and restart it.
6. Wait until the MDB processes all the messages.
7. Receive the messages from "OutQueue".

Result: All messages are received. There are no lost or duplicated messages, but there are unfinished Arjuna transactions on the mdb server which are periodically (every 2 min) written to the log:
{code}
17:04:28,713 WARN [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016037: Could not find new XAResource to use for recovering non-serializable XAResource XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47d, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:480, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@2cec4462 >
17:04:28,714 WARN [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47d, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:480, subordinatenodename=null, eis_name=unknown eis name >
17:04:28,773 WARN [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016037: Could not find new XAResource to use for recovering non-serializable XAResource XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47c, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47f, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK
com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@4b7d94f8 >
17:04:28,774 WARN [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47c, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47f, subordinatenodename=null, eis_name=unknown eis name >
17:06:38,829 WARN [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016037: Could not find new XAResource to use for recovering non-serializable XAResource XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47d, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:480, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@29f9cb2c >
17:06:38,830 WARN [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47d, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:480, subordinatenodename=null, eis_name=unknown eis name >
17:06:38,881 WARN [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016037: Could not find new XAResource to use for recovering non-serializable XAResource XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47c, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47f, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@456a5c7c >
17:06:38,881 WARN [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47c, node_name=1,
branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47f, subordinatenodename=null, eis_name=unknown eis name >
17:08:48,928 WARN [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016037: Could not find new XAResource to use for recovering non-serializable XAResource XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47d, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:480, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@adea215 >
17:08:48,929 WARN [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47d, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:480, subordinatenodename=null, eis_name=unknown eis name >
17:08:48,988 WARN [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016037: Could not find new XAResource to use for recovering non-serializable XAResource XAResourceRecord < resource:null, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47c, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47f, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@1e631646 >
17:08:48,988 WARN [com.arjuna.ats.jta] (Periodic Recovery) ARJUNA016038: No XAResource to recover < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47c, node_name=1, branch_uid=0:ffff0a2203bf:4c034c2e:508ff8eb:47f, subordinatenodename=null, eis_name=unknown eis name >
{code}
The messages on the jms server are paged to disk; max-size-bytes and page-size-bytes are set to very low values.

Attached reproducer.zip:
1. Download and unzip reproducer.zip.
2. Prepare the servers: "sh prepare.sh"
3. Start the jms server: "sh start-server1.sh localhost"
4. Start the producer and send 500 messages to InQueue: "sh start-producer.sh localhost 500"
5. Start the mdb server: "sh start-server2.sh <some_other_ip>"
6. Kill (kill -9 ...) the mdb server and start it again with "sh start-server2.sh <some_other_ip>"
7. Receive all messages from OutQueue and check for the warnings in the mdb server's log.
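The result claimed in the scenario (no lost or duplicated messages despite the kill) follows from the MDB consuming from InQueue and producing to OutQueue inside a single XA transaction: a kill mid-transaction rolls the receive back, so the message is redelivered rather than lost. A toy in-memory simulation of that hand-off (this is an illustration, not JMS/MDB code; only the queue names come from the scenario):

```java
import java.util.*;

// Toy simulation of the MDB's transactional hand-off: a message moves
// from InQueue to OutQueue atomically. A "kill" mid-transaction rolls
// the message back onto InQueue for redelivery, so no message is lost
// and none is duplicated. Illustration only, not real JMS code.
public class ForwardSketch {
    final Deque<String> inQueue = new ArrayDeque<>();
    final Deque<String> outQueue = new ArrayDeque<>();

    void forwardOne(boolean killMidway) {
        String m = inQueue.poll();
        if (m == null) return;
        if (killMidway) {
            inQueue.addFirst(m); // rollback: message will be redelivered
        } else {
            outQueue.add(m);     // commit: exactly-once hand-off
        }
    }

    public static void main(String[] args) {
        ForwardSketch s = new ForwardSketch();
        s.inQueue.addAll(Arrays.asList("m1", "m2", "m3"));
        s.forwardOne(false); // m1 delivered
        s.forwardOne(true);  // server killed: m2 rolled back
        s.forwardOne(false); // m2 redelivered
        s.forwardOne(false); // m3 delivered
        System.out.println(s.outQueue); // prints [m1, m2, m3]
    }
}
```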