Sometimes messages are lost after failover of the RA from the live to the backup server (replicated journal). This can affect users/customers and have a negative impact on their production systems.

Test scenario:
1. Start a live/backup pair in dedicated topology with a replicated journal and deploy jms/queue/InQueue and jms/queue/OutQueue. The address-full-policy attribute is set to PAGE.
2. Send 5000 messages to jms/queue/InQueue.
3. Start a 3rd server with a deployed MDB. The RA is configured to connect to the remote live server. The MDB consumes a message from jms/queue/InQueue, creates a new message and sends it to jms/queue/OutQueue.
4. While the MDB is processing messages, kill the live server.
5. The RA on the 3rd server fails over to the backup server.
6. Wait for the MDB to process all messages from jms/queue/InQueue.
7. Start a consumer on jms/queue/OutQueue and receive all messages from jms/queue/OutQueue.

Result: Sometimes the consumer in step 7 receives fewer than 5000 messages, for example 4998 messages in this case; 2 messages are lost.

Attaching trace logs from all servers + a dump of the HornetQ journal from server2 + the log from the test. See instructions below on how to download them.

Message from InQueue for which there is no corresponding message in OutQueue:
messageID: 69e3259a-f532-11e2-addc-559b4b0210c8
_HQ_DUPL_ID=6d694d45-6b2b-4568-ab25-98f963629a6c

The MDB received this message and created a new message with:
messageID: e6a77483-f532-11e2-a7df-e125ec787dcb
_HQ_DUPL_ID=6d694d45-6b2b-4568-ab25-98f963629a6c
inMessageId=69e3259a-f532-11e2-addc-559b4b0210c8

I tried to track down what happened. The lost message was processed by the MDB not during but after the failover of the RA to the backup (it was in step 6), so there was no failure/failover at that time. From the Arjuna trace, the XA transaction was successfully committed and the message was sent to jms/queue/OutQueue. It seems the message gets lost when it is paged to disk (from the server2/backup log):

16:02:54,146 TRACE [org.hornetq.core.server] (Old I/O server worker (parentId: -1195762270, [id: 0xb8ba1da2, /192.168.40.2:5445])) Paging message PagedMessageImpl [queueIDs=[5], transactionID=2147525748, message=ServerMessage[messageID=2147525750,priority=4, bodySize=1024,expiration=0, durable=true, address=jms.queue.OutQueue,properties=TypedProperties[{_HQ_DUPL_ID=6d694d45-6b2b-4568-ab25-98f963629a6c, __HQ_CID=79f2797d-f532-11e2-a7df-e125ec787dcb, inMessageId=ID:69e3259a-f532-11e2-addc-559b4b0210c8}]]@1438526286] on pageStore jms.queue.OutQueue pageId=5

The consumer in step 7 can't consume it.
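For reference, a rough sketch of what the MDB in step 3 does. This is only an illustration, not the actual MDB from the test suite; the class name, JNDI names and resource adapter wiring here are assumptions, while the _HQ_DUPL_ID and inMessageId properties match the log excerpts above.

import javax.annotation.Resource;
import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.Queue;
import javax.jms.Session;

// Hedged sketch of the MDB from step 3: consume from jms/queue/InQueue, create a new
// message carrying the same _HQ_DUPL_ID plus an inMessageId property, send it to
// jms/queue/OutQueue. The real test MDB may differ.
@MessageDriven(activationConfig = {
        @ActivationConfigProperty(propertyName = "destinationType", propertyValue = "javax.jms.Queue"),
        @ActivationConfigProperty(propertyName = "destination", propertyValue = "jms/queue/InQueue")
})
public class InToOutMdb implements MessageListener {

    // Assumed JNDI names; the pooled connection factory is backed by the RA that is
    // configured to connect to the remote live/backup pair.
    @Resource(mappedName = "java:/JmsXA")
    private ConnectionFactory connectionFactory;

    @Resource(mappedName = "jms/queue/OutQueue")
    private Queue outQueue;

    @Override
    public void onMessage(Message message) {
        Connection connection = null;
        try {
            connection = connectionFactory.createConnection();
            // The session is enlisted in the MDB's XA transaction by the resource adapter,
            // so the consume from InQueue and the send to OutQueue commit atomically.
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Message newMessage = session.createTextMessage("processed " + message.getJMSMessageID());
            // Propagate the duplicate-detection id and record the originating message id,
            // matching the _HQ_DUPL_ID and inMessageId properties seen in the logs above.
            newMessage.setStringProperty("_HQ_DUPL_ID", message.getStringProperty("_HQ_DUPL_ID"));
            newMessage.setStringProperty("inMessageId", message.getJMSMessageID());
            session.createProducer(outQueue).send(newMessage);
        } catch (JMSException e) {
            throw new RuntimeException(e);
        } finally {
            if (connection != null) {
                try {
                    connection.close();
                } catch (JMSException ignored) {
                }
            }
        }
    }
}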
I did not hit this issue in EAP 6.1.0 or EAP 6.1.1.ER1-2. That's why I set the Regression flag. Or maybe I just wasn't lucky enough to hit it before.
Are you waiting for the sync between live and backup to finish before you kill the server? With replication you have to wait for the servers to catch up before failover is effective.
How often does the test fail? I've been running it continuously and have not managed to get a failure.
Update: On Friday we were able to reproduce the issue on station1/10.34.3.189 and gave Francisco access for investigation.
After a lot of investigation from me, both on this issue and on BZ963215, I believe this issue is a duplicate of BZ963215.

Looking at the print-data QE produced on their tests, a paged message was ACKed without any transaction:

operation@AddRecord;recordID=2147529766;userRecordType=39;isUpdate=false;CursorAckRecordEncoding [queueID=5, position=PagePositionImpl [pageNr=6, messageNr=-1, recordID=-1]]

This is exactly the same issue that was happening on the bridge: after a failover, an XASession would behave as non-transacted. Depending on the race, the ack could sneak in on the session as non-transacted. With the changes on BZ963215 this would always be rejected and the transaction replayed.

I will mark this as Posted due to this extensive investigation. This issue is hard to replicate using QE tests, but it is easy with the unit tests Howard and I created on BZ963215. I'm hoping this issue won't repeat, but as usual let me know if you still hit any issues after this fix. I will proceed with the component update now.
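For clarity, this is roughly the contract the fix restores: with a properly enlisted XASession, the acknowledgement rides on the XA branch, so a rollback/replay after failover brings the message back. A minimal sketch in generic JMS/JTA terms (not the actual bridge or RA code; the factory, queue and TransactionManager here are assumed):

import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.XAConnection;
import javax.jms.XAConnectionFactory;
import javax.jms.XASession;
import javax.transaction.TransactionManager;
import javax.transaction.xa.XAResource;

// Hedged sketch: the ack of the consumed message is part of the enlisted XA branch.
// The race described above made the session behave as non-transacted after failover,
// so the ack could be recorded outside any transaction and the message lost.
public class XaConsumeSketch {

    public void consumeOnce(XAConnectionFactory xacf, Queue queue, TransactionManager tm) throws Exception {
        XAConnection connection = xacf.createXAConnection();
        try {
            XASession xaSession = connection.createXASession();
            XAResource xaResource = xaSession.getXAResource();

            tm.begin();
            // Enlisting the XAResource ties the session's work (including the ack below)
            // to the global transaction.
            tm.getTransaction().enlistResource(xaResource);

            connection.start();
            MessageConsumer consumer = xaSession.createConsumer(queue);
            Message message = consumer.receive(5000);

            tm.getTransaction().delistResource(xaResource, XAResource.TMSUCCESS);
            tm.commit(); // the ack only becomes permanent with the prepare/commit of the branch
        } finally {
            connection.close();
        }
    }
}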
This has been fixed as part of HornetQ 2.3.3. This was a duplicate of one of the inner issues on the JMS failover bridge. It should be marked as fixed.
This BZ should be in ON_QA status. At this moment it appears to be fixed, but I need to run the tests many more times to be sure. I'll also check the tests from dev.
I tried to do some basic investigation of the logs.

ID of the message which has no corresponding message in OutQueue: 674eafc7-08f0-11e3-a74a-33c86ab6a2c0

Message sent to InQueue in:
18:57:17,026 Thread-16 DEBUG [org.jboss.qa.hornetq.apps.clients.ProducerClientAck:101] Producer for node: 192.168.40.1and queue: jms/queue/InQueue. Sent message with property counter: 1591, messageId:ID:674eafc7-08f0-11e3-a74a-33c86ab6a2c0, _HQ_DUPL_ID=eff715d2-643e-4108-82cb-3e2f9b3a8874

Processed by the MDB, which sent a new message (this one is actually lost) to OutQueue (this is after the failover from live to backup). Log from the server with the MDB:
18:59:36,847 DEBUG [org.jboss.qa.hornetq.apps.mdb.MdbWithRemoteOutQueueToContaniner1] (Thread-9 (HornetQ-client-global-threads-1035402077)) End of ID:674eafc7-08f0-11e3-a74a-33c86ab6a2c0, count:1590. Sending new message with inMessageId: ID:674eafc7-08f0-11e3-a74a-33c86ab6a2c0 and messageId: ID:baaf4aba-08f0-11e3-8d2a-9d4972893a18 in 133 ms

Processing of the message on the backup server:
18:59:36,847 TRACE [org.hornetq.core.server] (Old I/O server worker (parentId: 248062764, [id: 0x0ec9232c, /192.168.40.2:5445])) send(message=ServerMessage[messageID=29612,priority=4, bodySize=514,expiration=0, durable=true, address=jms.queue.OutQueue,properties=TypedProperties[{_HQ_DUPL_ID=eff715d2-643e-4108-82cb-3e2f9b3a8874, __HQ_CID=991aa458-08f0-11e3-8d2a-9d4972893a18, inMessageId=ID:674eafc7-08f0-11e3-a74a-33c86ab6a2c0}]]@697005135, direct=true) being called
18:59:36,847 TRACE [org.hornetq.core.server] (Old I/O server worker (parentId: 248062764, [id: 0x0ec9232c, /192.168.40.2:5445])) Routing message ServerMessage[messageID=29612,priority=4, bodySize=514,expiration=0, durable=true, address=jms.queue.OutQueue,properties=TypedProperties[{_HQ_DUPL_ID=eff715d2-643e-4108-82cb-3e2f9b3a8874, __HQ_CID=991aa458-08f0-11e3-8d2a-9d4972893a18, inMessageId=ID:674eafc7-08f0-11e3-a74a-33c86ab6a2c0}]]@697005135 on binding=BindingsImpl [name=jms.queue.OutQueue]
18:59:36,847 TRACE [org.hornetq.core.server] (Old I/O server worker (parentId: 248062764, [id: 0x0ec9232c, /192.168.40.2:5445])) Message after routed=ServerMessage[messageID=29612,priority=4, bodySize=514,expiration=0, durable=true, address=jms.queue.OutQueue,properties=TypedProperties[{_HQ_DUPL_ID=eff715d2-643e-4108-82cb-3e2f9b3a8874, __HQ_CID=991aa458-08f0-11e3-8d2a-9d4972893a18, inMessageId=ID:674eafc7-08f0-11e3-a74a-33c86ab6a2c0}]]@697005135

The message was paged to disk in txID 29600:
18:59:36,882 TRACE [org.hornetq.core.server] (Old I/O server worker (parentId: 248062764, [id: 0x0ec9232c, /192.168.40.2:5445])) Paging message PagedMessageImpl [queueIDs=[5], transactionID=29600, message=ServerMessage[messageID=29612,priority=4, bodySize=1028,expiration=0, durable=true, address=jms.queue.OutQueue,properties=TypedProperties[{_HQ_DUPL_ID=eff715d2-643e-4108-82cb-3e2f9b3a8874, __HQ_CID=991aa458-08f0-11e3-8d2a-9d4972893a18, inMessageId=ID:674eafc7-08f0-11e3-a74a-33c86ab6a2c0}]]@697005135] on pageStore jms.queue.OutQueue pageId=3

A transactional add record was appended for txID 29600:
18:59:36,950 TRACE [org.hornetq.journal] (Thread-5 (HornetQ-server-HornetQServerImpl::serverUUID=null-689350335)) appendAddRecordTransactional:txID=29600,id=29635, userRecordType=35, usedFile = JournalFileImpl: (hornetq-data-4.hq id = 7, recordID = 7)

There is a prepare for txID=29600:
18:59:36,950 TRACE [org.hornetq.journal] (Thread-5 (HornetQ-server-HornetQServerImpl::serverUUID=null-689350335)) appendPrepareRecord::txID=29600, usedFile = JournalFileImpl: (hornetq-data-4.hq id = 7, recordID = 7)

Calling commit for txID=29600:
18:59:37,205 TRACE [org.hornetq.core.server] (Thread-19 (HornetQ-server-HornetQServerImpl::serverUUID=null-689350335)) XAcommit into TransactionImpl [xid=XidImpl (805364243 bq:0.0.0.0.0.0.0.0.0.0.-1.-1.10.34.3.-65.-31.-43.126.-64.82.18.78.-70.0.0.49.2.0.0.0.1.0.0.0.0 formatID:131077 gtxid:0.0.0.0.0.0.0.0.0.0.-1.-1.10.34.3.-65.-31.-43.126.-64.82.18.78.-70.0.0.48.-29.49 base64:AAAAAAAAAAAAAP__CiIDv-HVfsBSEk66AAAxAgAAAAEAAAAAAAAAAAAAAAAAAP__CiIDv-HVfsBSEk66AAAw4zEHAgIA, id=29600, state=PREPARED, createTime=1376931576838(Mon Aug 19 18:59:36 CEST 2013), timeoutSeconds=300, nr operations = 5]@9fa6604, xid=XidImpl (161317831 bq:0.0.0.0.0.0.0.0.0.0.-1.-1.10.34.3.-65.-31.-43.126.-64.82.18.78.-70.0.0.49.2.0.0.0.1.0.0.0.0 formatID:131077 gtxid:0.0.0.0.0.0.0.0.0.0.-1.-1.10.34.3.-65.-31.-43.126.-64.82.18.78.-70.0.0.48.-29.49 base64:AAAAAAAAAAAAAP_n_CiIDv-HVfsBSEk66AAAxAgAAAAEAAAAAAAAAAAAAAAAAAP__CiIDv-HVfsBSEk66AAAw4zEHAgIA
18:59:37,374 TRACE [org.hornetq.journal] (Thread-18 (HornetQ-server-HornetQServerImpl::serverUUID=null-689350335)) appendCommitRecord::txID=29600, usedFile = JournalFileImpl: (hornetq-data-4.hq id = 7, recordID = 7)

I got lost here. The content of the HornetQ journal of backup/server2 is printed in: /home/jbossqa/hornetq_eap6_testsuite/eap6.1.1.ER6/server2/jboss-eap-6.1/standalone/data/journal.txt
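For orientation, the journal record names above correspond to the standard XA two-phase sequence. A minimal sketch of that sequence in plain JTA/XA terms (an illustration only, not the HornetQ internals):

import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

// Hedged sketch of the two-phase commit lifecycle behind the journal records quoted above:
// work done inside the branch shows up as appendAddRecordTransactional, prepare() as
// appendPrepareRecord, and commit() as appendCommitRecord for the same txID.
public class TwoPhaseCommitSketch {

    public void run(XAResource resource, Xid xid) throws Exception {
        resource.start(xid, XAResource.TMNOFLAGS);
        // ... sends/acks performed here are written transactionally
        //     (appendAddRecordTransactional:txID=...)
        resource.end(xid, XAResource.TMSUCCESS);

        int vote = resource.prepare(xid);          // appendPrepareRecord::txID=...
        if (vote == XAResource.XA_OK) {
            resource.commit(xid, false);           // appendCommitRecord::txID=... (two-phase)
        }
    }
}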
It's definitely not the same root cause... I'm still investigating.
It would help a lot if you didn't ack the messages when consuming, so I could verify where the message is. From what I read, the message you are looking for was acked regularly. I didn't see any anomaly in the logs, but the fact that you ack the message makes it difficult for me.
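Something along these lines would work for the verification step: a minimal sketch that inspects jms/queue/OutQueue without acking or removing anything (class and method names here are illustrative, not taken from the test suite):

import java.util.Enumeration;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.Queue;
import javax.jms.QueueBrowser;
import javax.jms.Session;

// Hedged sketch: a QueueBrowser looks at messages on the queue without consuming them,
// so the server-side state stays intact for the investigation.
public class OutQueueInspector {

    public int countMessages(ConnectionFactory cf, Queue outQueue) throws JMSException {
        Connection connection = cf.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            QueueBrowser browser = session.createBrowser(outQueue);
            connection.start();

            int count = 0;
            Enumeration<?> messages = browser.getEnumeration();
            while (messages.hasMoreElements()) {
                Message message = (Message) messages.nextElement();
                // Print the ids used for correlation in this BZ: the duplicate-detection id
                // and the originating InQueue message id.
                System.out.println(message.getJMSMessageID()
                        + " _HQ_DUPL_ID=" + message.getStringProperty("_HQ_DUPL_ID")
                        + " inMessageId=" + message.getStringProperty("inMessageId"));
                count++;
            }
            return count;
        } finally {
            connection.close();
        }
    }
}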
Your Arquillian container was using HornetQ 2.3.2... This fix was made in HornetQ 2.3.4 and it involved client changes... so if your MDBs are running with 2.3.2 then your test is invalid.
I have changed your test to use 2.3.4 now; I'm running it again.
What I found so far is that you are not using a node-id on the transaction manager. The times I have seen this fail, it was because the TM decided to roll back. I also had cases where everything worked but I needed to wait for recovery. I'm still investigating.
After I set the node-id, I can't see any pending messages even though your test is still reporting them... I can see they are available for consuming on paging, with a late commit (after recovery).
I wouldn't call this a blocker... I would of course make it high priority to fix right after the release. My reasoning:
I - If there is an issue here, it's a really rare race.
II - It's not a regression. Nothing new was introduced in the code.
III - It involves paging, recovery and failover.
Having said that, I will continue the investigation. We may raise it as a known issue if you like, and we may have a one-off after the investigation.
@Miroslav: I have changed your code on the lab to not ack messages, and I have added node-ids on the transaction managers for each node. Could you please repeat these tests? I will talk to you after I wake up.
Hi Clebert, thank you for your feedback. The hornetq.version property is used only in compatibility tests for standalone JMS clients in a different Maven profile. It's not used in this test, and a different version of the JMS client does not influence it. There is a default node-identifier set on the transaction manager on the server with the MDB. The test waits 5 minutes for all transactions to be recovered, which should provide enough time. Anyway, even with the suggested changes the test is still failing. Mirek
PR sent: https://github.com/hornetq/hornetq/pull/1278
I'm removing the word regression since this wasn't a regression... it's just that you didn't hit this issue before. It was a really rare race on failover.
`Requires doc text` flag cleared as it is too late to include this issue in the JBoss EAP 6.2.0 Release Notes.
This Bugzilla fell out of focus. There are no issues in failover tests with MDBs in EAP 6.4.0.DR12. This must have been fixed in one of the previous versions. Closing the BZ against the current release.