Bug 988706 - Sometimes messages get lost after failover of RA connected to remote server
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: HornetQ
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
: DR12
: EAP 6.4.0
Assigned To: Andy Taylor
Depends On:
Blocks: 994214
Reported: 2013-07-26 03:52 EDT by Miroslav Novak
Modified: 2014-12-04 07:40 EST
CC: 10 users

See Also:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2014-12-04 07:40:54 EST
Type: Bug
Regression: ---

Attachments: None
Description Miroslav Novak 2013-07-26 03:52:08 EDT
Sometimes messages are lost after failover of the RA from the live to the backup server (replicated journal). This can affect users/customers and have a negative impact on their production systems.

Test scenario:
1. Start a live/backup pair in a dedicated topology with a replicated journal and deploy jms/queue/InQueue and jms/queue/OutQueue. The address-full-policy attribute is set to PAGE.
2. Send 5000 messages to jms/queue/InQueue.
3. Start a 3rd server with the MDB deployed. The RA is configured to connect to the remote live server. The MDB consumes a message from jms/queue/InQueue, creates a new message, and sends it to jms/queue/OutQueue.
4. While the MDB is processing messages, kill the live server.
5. The RA on the 3rd server fails over to the backup server.
6. Wait for the MDB to process all messages from jms/queue/InQueue.
7. Start a consumer on jms/queue/OutQueue and receive all messages from it.
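For reference, the paging setup in step 1 corresponds to an address-setting in the EAP 6 messaging subsystem roughly like the following (a sketch only; the match pattern and size limits here are illustrative, not the values used in the actual test):

```xml
<address-settings>
    <!-- illustrative values; the test's real limits may differ -->
    <address-setting match="jms.queue.#">
        <address-full-policy>PAGE</address-full-policy>
        <max-size-bytes>10485760</max-size-bytes>
        <page-size-bytes>1048576</page-size-bytes>
    </address-setting>
</address-settings>
```

With PAGE, once an address exceeds max-size-bytes, further messages are written to page files on disk instead of blocking or dropping, which is why the lost message shows up in the paging trace below.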

Sometimes the consumer in step 7 receives fewer than 5000 messages; in this case, for example, 4998 messages were received, so 2 messages were lost.

Attaching trace logs from all servers, a dump of the HornetQ journal from server2, and the log from the test. See the instructions below for how to download them.

Message from InQueue for which there is no corresponding message in OutQueue: messageID: 69e3259a-f532-11e2-addc-559b4b0210c8, _HQ_DUPL_ID=6d694d45-6b2b-4568-ab25-98f963629a6c
The MDB received this message and created a new message with:
messageID: e6a77483-f532-11e2-a7df-e125ec787dcb, _HQ_DUPL_ID=6d694d45-6b2b-4568-ab25-98f963629a6c, inMessageId=69e3259a-f532-11e2-addc-559b4b0210c8
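The shared _HQ_DUPL_ID is what should make the MDB's resend safe: HornetQ's server-side duplicate-ID cache rejects a second message carrying the same _HQ_DUPL_ID on the same address. The following is a minimal sketch of that contract, not the actual HornetQ DuplicateIDCache implementation (the class and method names here are made up for illustration):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the duplicate-detection contract implied by _HQ_DUPL_ID:
// a resend carrying the same _HQ_DUPL_ID is dropped, not delivered twice.
public class DuplDetectionSketch {

    private final Set<String> duplIdCache = new HashSet<>();

    /** Returns true if the message is accepted, false if rejected as a duplicate. */
    public boolean route(Map<String, String> headers) {
        String duplId = headers.get("_HQ_DUPL_ID");
        if (duplId != null && !duplIdCache.add(duplId)) {
            return false; // already seen: reject the resend
        }
        return true;
    }

    public static void main(String[] args) {
        DuplDetectionSketch server = new DuplDetectionSketch();
        Map<String, String> msg = Map.of(
            "_HQ_DUPL_ID", "6d694d45-6b2b-4568-ab25-98f963629a6c",
            "inMessageId", "69e3259a-f532-11e2-addc-559b4b0210c8");
        System.out.println(server.route(msg)); // first delivery accepted: true
        System.out.println(server.route(msg)); // resend after failover rejected: false
    }
}
```

In the failing scenario the forwarded message shares the inbound message's _HQ_DUPL_ID, so a replayed send after failover should be dropped as a duplicate rather than doubled; the bug here is that the message is lost entirely instead.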

I tried to track down what happened. The lost message was processed by the MDB not during but after the failover of the RA to the backup (in step 6), so there was no failure/failover at that time. According to the Arjuna trace, the XA transaction was successfully committed and the message was sent to jms/queue/OutQueue.

It seems that the message gets lost when it is paged to disk (from the server2/backup log):
16:02:54,146 TRACE [org.hornetq.core.server] (Old I/O server worker (parentId: -1195762270, [id: 0xb8ba1da2, /])) Paging message PagedMessageImpl [queueIDs=[5], transactionID=2147525748, message=ServerMessage[messageID=2147525750,priority=4, bodySize=1024,expiration=0, durable=true, address=jms.queue.OutQueue,properties=TypedProperties[{_HQ_DUPL_ID=6d694d45-6b2b-4568-ab25-98f963629a6c, __HQ_CID=79f2797d-f532-11e2-a7df-e125ec787dcb, inMessageId=ID:69e3259a-f532-11e2-addc-559b4b0210c8}]]@1438526286] on pageStore jms.queue.OutQueue pageId=5

The consumer in step 7 can't consume it.
Comment 2 Miroslav Novak 2013-07-26 04:04:35 EDT
I did not hit this issue in EAP 6.1.0 or EAP 6.1.1.ER1-2. That's why I set the Regression flag. Or maybe I just wasn't lucky enough to hit it before.
Comment 5 Clebert Suconic 2013-07-30 09:51:07 EDT
Are you waiting for the sync between live and backup to complete before you kill the server?

With replication you have to wait for the servers to catch up before failover is effective.
Comment 6 Francisco Borges 2013-08-01 07:56:48 EDT
How often does the test fail? I've been running it continuously and have not managed to get a failure.
Comment 7 Miroslav Novak 2013-08-05 05:40:34 EDT
Update: On Friday we were able to reproduce the issue on station1/ and gave Francisco access for investigation.
Comment 11 Clebert Suconic 2013-08-07 15:59:15 EDT
After a lot of investigation on both this issue and BZ963215,

I believe this issue is a duplicate of BZ963215.

Looking at the print-data QE produced on their tests, a paged message was ACKed without any transaction:

operation@AddRecord;recordID=2147529766;userRecordType=39;isUpdate=false;CursorAckRecordEncoding [queueID=5, position=PagePositionImpl [pageNr=6, messageNr=-1, recordID=-1]]

This is exactly the same issue that was happening on the bridge: after a failover, an XASession would behave as non-transacted. Depending on the race, the ack could sneak in on the session as non-transacted.

With the changes on BZ963215 this would always be rejected and the transaction replayed.

I will mark this as Posted due to this extensive investigation. This issue is hard to replicate using the QE tests, but it is easy with the unit tests Howard and I created on BZ963215. I'm hoping this issue won't appear again, but as usual, let me know if you still hit any issues after this fix.

I will proceed with the component update now.
Comment 13 Clebert Suconic 2013-08-19 08:28:11 EDT
This has been fixed as part of HornetQ 2.3.3. This was a duplicate of one of the inner issues on the JMS failover bridge. It should be marked as fixed.
Comment 14 Miroslav Novak 2013-08-19 08:35:30 EDT
This BZ should be in ON_QA status. 

At this moment it appears to be fixed, but I need to run the tests many times to be sure. I'll also check the tests from dev.
Comment 16 Miroslav Novak 2013-08-19 13:52:51 EDT
I tried to do some basic investigation of the logs:
ID of the message which has no corresponding message in OutQueue: 674eafc7-08f0-11e3-a74a-33c86ab6a2c0

Message sent to InQueue in:
18:57:17,026 Thread-16 DEBUG [org.jboss.qa.hornetq.apps.clients.ProducerClientAck:101] Producer for node: queue: jms/queue/InQueue. Sent message with property counter: 1591, messageId:ID:674eafc7-08f0-11e3-a74a-33c86ab6a2c0, _HQ_DUPL_ID=eff715d2-643e-4108-82cb-3e2f9b3a8874

Processed by the MDB, and the new message (this one is actually lost) was sent to OutQueue (this is after the failover from live to backup):
Log from server with MDB:
18:59:36,847 DEBUG [org.jboss.qa.hornetq.apps.mdb.MdbWithRemoteOutQueueToContaniner1] (Thread-9 (HornetQ-client-global-threads-1035402077)) End of ID:674eafc7-08f0-11e3-a74a-33c86ab6a2c0, count:1590. Sending new message with inMessageId: ID:674eafc7-08f0-11e3-a74a-33c86ab6a2c0 and messageId: ID:baaf4aba-08f0-11e3-8d2a-9d4972893a18 in 133 ms

Processing of message on backup server:
18:59:36,847 TRACE [org.hornetq.core.server] (Old I/O server worker (parentId: 248062764, [id: 0x0ec9232c, /])) send(message=ServerMessage[messageID=29612,priority=4, bodySize=514,expiration=0, durable=true, address=jms.queue.OutQueue,properties=TypedProperties[{_HQ_DUPL_ID=eff715d2-643e-4108-82cb-3e2f9b3a8874, __HQ_CID=991aa458-08f0-11e3-8d2a-9d4972893a18, inMessageId=ID:674eafc7-08f0-11e3-a74a-33c86ab6a2c0}]]@697005135, direct=true) being called
18:59:36,847 TRACE [org.hornetq.core.server] (Old I/O server worker (parentId: 248062764, [id: 0x0ec9232c, /])) Routing message ServerMessage[messageID=29612,priority=4, bodySize=514,expiration=0, durable=true, address=jms.queue.OutQueue,properties=TypedProperties[{_HQ_DUPL_ID=eff715d2-643e-4108-82cb-3e2f9b3a8874, __HQ_CID=991aa458-08f0-11e3-8d2a-9d4972893a18, inMessageId=ID:674eafc7-08f0-11e3-a74a-33c86ab6a2c0}]]@697005135 on binding=BindingsImpl [name=jms.queue.OutQueue]
18:59:36,847 TRACE [org.hornetq.core.server] (Old I/O server worker (parentId: 248062764, [id: 0x0ec9232c, /])) Message after routed=ServerMessage[messageID=29612,priority=4, bodySize=514,expiration=0, durable=true, address=jms.queue.OutQueue,properties=TypedProperties[{_HQ_DUPL_ID=eff715d2-643e-4108-82cb-3e2f9b3a8874, __HQ_CID=991aa458-08f0-11e3-8d2a-9d4972893a18, inMessageId=ID:674eafc7-08f0-11e3-a74a-33c86ab6a2c0}]]@697005135

Message was paged to disk in txID 29600:
18:59:36,882 TRACE [org.hornetq.core.server] (Old I/O server worker (parentId: 248062764, [id: 0x0ec9232c, /])) Paging message PagedMessageImpl [queueIDs=[5], transactionID=29600, message=ServerMessage[messageID=29612,priority=4, bodySize=1028,expiration=0, durable=true, address=jms.queue.OutQueue,properties=TypedProperties[{_HQ_DUPL_ID=eff715d2-643e-4108-82cb-3e2f9b3a8874, __HQ_CID=991aa458-08f0-11e3-8d2a-9d4972893a18, inMessageId=ID:674eafc7-08f0-11e3-a74a-33c86ab6a2c0}]]@697005135] on pageStore jms.queue.OutQueue pageId=3

A record was appended transactionally for txID=29600:
18:59:36,950 TRACE [org.hornetq.journal] (Thread-5 (HornetQ-server-HornetQServerImpl::serverUUID=null-689350335)) appendAddRecordTransactional:txID=29600,id=29635, userRecordType=35, usedFile = JournalFileImpl: (hornetq-data-4.hq id = 7, recordID = 7)
There is a prepare for txID=29600:
18:59:36,950 TRACE [org.hornetq.journal] (Thread-5 (HornetQ-server-HornetQServerImpl::serverUUID=null-689350335)) appendPrepareRecord::txID=29600, usedFile = JournalFileImpl: (hornetq-data-4.hq id = 7, recordID = 7)
Calling commit for txID=29600:
18:59:37,205 TRACE [org.hornetq.core.server] (Thread-19 (HornetQ-server-HornetQServerImpl::serverUUID=null-689350335)) XAcommit into TransactionImpl [xid=XidImpl (805364243 bq: formatID:131077 gtxid: base64:AAAAAAAAAAAAAP__CiIDv-HVfsBSEk66AAAxAgAAAAEAAAAAAAAAAAAAAAAAAP__CiIDv-HVfsBSEk66AAAw4zEHAgIA, id=29600, state=PREPARED, createTime=1376931576838(Mon Aug 19 18:59:36 CEST 2013), timeoutSeconds=300, nr operations = 5]@9fa6604, xid=XidImpl (161317831 bq: formatID:131077 gtxid: base64:AAAAAAAAAAAAAP_n_CiIDv-HVfsBSEk66AAAxAgAAAAEAAAAAAAAAAAAAAAAAAP__CiIDv-HVfsBSEk66AAAw4zEHAgIA
18:59:37,374 TRACE [org.hornetq.journal] (Thread-18 (HornetQ-server-HornetQServerImpl::serverUUID=null-689350335)) appendCommitRecord::txID=29600, usedFile = JournalFileImpl: (hornetq-data-4.hq id = 7, recordID = 7)

I got lost here. The content of the HornetQ journal of backup/server2 is printed in:
Comment 17 Clebert Suconic 2013-08-19 16:11:33 EDT
It's definitely not the same root cause... I'm still investigating.
Comment 18 Clebert Suconic 2013-08-19 16:44:02 EDT
It would help a lot if you didn't ack the messages on consuming, so I could verify where the message is.

It seems from what I read that the message you are looking for was acked normally. I didn't see any anomaly in the logs, but the fact that you ack the messages makes it difficult for me.
Comment 19 Clebert Suconic 2013-08-19 16:48:58 EDT
Your Arquillian container was using HornetQ 2.3.2... This fix was made in HornetQ 2.3.4 and it involved client changes... so if your MDBs are running with 2.3.2 then your test is invalid.
Comment 20 Clebert Suconic 2013-08-19 16:49:22 EDT
I have changed your test to use 2.3.4 now; I'm running it again.
Comment 21 Clebert Suconic 2013-08-19 23:12:46 EDT
What I found so far is that you are not using a node-id on the transaction manager.

The times I have seen this fail, it was because the TM decided to roll back.

I also had cases where everything worked but I needed to wait for recovery.

I'm still investigating
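For reference, the node-id discussed here is the node-identifier on the transactions subsystem in standalone.xml. A sketch of what that looks like in EAP 6 (the namespace version and the identifier value are illustrative, not taken from the test environment):

```xml
<subsystem xmlns="urn:jboss:domain:transactions:1.4">
    <!-- node-identifier must be unique per server so XA recovery can
         match in-doubt transactions to the node that started them -->
    <core-environment node-identifier="node1">
        <process-id>
            <uuid/>
        </process-id>
    </core-environment>
    <recovery-environment socket-binding="txn-recovery-environment"
                          status-socket-binding="txn-status-manager"/>
</subsystem>
```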
Comment 22 Clebert Suconic 2013-08-19 23:39:25 EDT
After I set the node-id, I can't see any pending messages even though your test is still reporting them... I can see they are available for consumption on paging with a late commit (after recovery).
Comment 23 Clebert Suconic 2013-08-19 23:53:24 EDT
I wouldn't call this a blocker...

I would of course make it a high priority to fix right after the release.

My reasoning:

I - If there is an issue here, it's a really rare race.
II - It's not a regression; nothing new was introduced in the code.
III - It involves paging, recovery, and failover.

Having said that, I will continue the investigation. We may raise it as a known issue if you like, and we may have a one-off after the investigation.

@Miroslav: I have changed your code in the lab to not ack messages, and I have added node-ids on the transaction managers for each node. Could you please repeat these tests? I will talk to you after I wake up.
Comment 24 Miroslav Novak 2013-08-20 06:26:17 EDT
Hi Clebert,

thank you for your feedback.

The property hornetq.version is used only in compatibility tests for standalone JMS clients in a different Maven profile. It's not used in this test, and a different JMS client version does not influence this test.

There is a default node-identifier set on the transaction manager on the server with the MDB. The test waits 5 minutes to recover all transactions, which should provide enough time. Anyway, even with the suggested changes the test is still failing.

Comment 27 Clebert Suconic 2013-09-13 11:42:50 EDT
PR sent: https://github.com/hornetq/hornetq/pull/1278
Comment 28 Clebert Suconic 2013-09-18 17:23:12 EDT
I'm removing the word regression since this wasn't a regression...
It's just that you didn't hit this issue before. It was a really rare race on failover.
Comment 33 Russell Dickenson 2013-12-03 08:20:41 EST
`Requires doc text` flag cleared as it is too late to include this issue in the JBoss EAP 6.2.0 Release Notes.
Comment 36 Miroslav Novak 2014-12-04 07:40:54 EST
This Bugzilla fell out of focus.

There are no issues in failover tests with MDBs in EAP 6.4.0.DR12. It must have been fixed in one of the previous versions.

Closing the BZ as fixed in the current release.
