Bug 1019378 - Message Redistribution could lead to loss of messages if paging and reading with batched Transactions
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: HornetQ
Hardware: Unspecified, OS: Unspecified
Priority: unspecified, Severity: urgent
Target Milestone: CR1
Target Release: EAP 6.2.0
Assigned To: Clebert Suconic
QA Contact: Miroslav Novak
Docs Contact: Russell Dickenson
Depends On: 1026553
Reported: 2013-10-15 11:11 EDT by Miroslav Novak
Modified: 2013-12-15 11:17 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
In rare circumstances, if messages were acknowledged very quickly in large batches on a HornetQ server, message redistribution could read a record before the transaction was instantiated on the paging system, which resulted in message loss. This issue has been fixed in this release of JBoss EAP 6: the paging system now correctly instantiates the page transaction and writes the file only after the page transaction is instantiated. As a result, messages are no longer lost under the same circumstances.
Story Points: ---
Clone Of:
Last Closed: 2013-12-15 11:17:00 EST
Type: Bug

Attachments
reproducer.zip (44.98 KB, application/zip), 2013-10-15 11:12 EDT, Miroslav Novak
logs.zip (6.02 MB, application/zip), 2013-10-15 11:13 EDT, Miroslav Novak
client-maven-project.zip (140.91 KB, application/zip), 2013-10-15 11:33 EDT, Miroslav Novak

Description Miroslav Novak 2013-10-15 11:11:52 EDT
Messages are lost in a HornetQ cluster. This can have a severe impact on a production environment, for example during maintenance or a server failure.

Test scenario:
1. Start two EAP 6.2.0.ER5 servers in an HQ cluster with the queue "InQueue" deployed.
2. Send 6000 messages to the 1st server. The producer uses a transacted session and commits after every 1000 messages.
3. Immediately after the producer finishes, kill or cleanly shut down the 2nd server.
4. Start the 2nd server again.
5. Start a consumer connected to the 1st server and read all messages.

There are some messages missing.
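
The batching in step 2 can be sketched as follows. This is a minimal illustration, not the reproducer's code: `TxSession` and `InMemorySession` are hypothetical stand-ins for a transacted JMS session (where `Session.commit()` would play the role of `commit()` here), used only so the sketch runs without a broker.

```java
import java.util.ArrayList;
import java.util.List;

class BatchedProducerSketch {
    // Sends `total` messages and commits after every `batchSize` messages.
    static void produce(TxSession session, int total, int batchSize) {
        for (int i = 1; i <= total; i++) {
            session.send("message-" + i);
            if (i % batchSize == 0) {
                session.commit();
            }
        }
        // Commit a trailing partial batch, if there is one.
        if (total % batchSize != 0) {
            session.commit();
        }
    }

    public static void main(String[] args) {
        InMemorySession session = new InMemorySession();
        produce(session, 6000, 1000);
        System.out.println(session.commits + " commits, "
                + session.committed.size() + " messages");
    }
}

// Hypothetical stand-in for a transacted session; not a real JMS or HornetQ type.
interface TxSession {
    void send(String body); // buffer a message in the current transaction
    void commit();          // make all buffered messages visible
}

// In-memory implementation, present only to make the sketch runnable.
class InMemorySession implements TxSession {
    private final List<String> pending = new ArrayList<>();
    final List<String> committed = new ArrayList<>();
    int commits = 0;

    @Override
    public void send(String body) {
        pending.add(body);
    }

    @Override
    public void commit() {
        committed.addAll(pending);
        pending.clear();
        commits++;
    }
}
```

With 6000 messages and a batch size of 1000, each commit hands the server a large chunk at once, which matches the condition the later comments tie to the problem.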

How to use reproducer.zip: in the unzipped "reproducer" directory, run the following commands:
1. "sh prepare.sh"
2. "sh start-server1.sh first_ip"
3. "sh start-server2.sh second_ip"
4. "sh start-producer.sh first_ip jms/queue/InQueue 6000" and kill/shutdown 2nd server (this must be done as soon as producer finishes)
5. "sh start-server2.sh second_ip"
6. "sh start-consumer.sh first_ip jms/queue/InQueue"

In my run I can see lost message:

Logs from both of the servers are attached (logs.zip)
Comment 1 Miroslav Novak 2013-10-15 11:12:31 EDT
Created attachment 812582 (reproducer.zip)
Comment 2 Miroslav Novak 2013-10-15 11:13:39 EDT
Created attachment 812583 (logs.zip)
Comment 3 Miroslav Novak 2013-10-15 11:33:39 EDT
Created attachment 812592 (client-maven-project.zip)
Comment 4 Justin Bertram 2013-10-15 14:12:40 EDT
I've reproduced the behavior you're seeing and I'm investigating further.
Comment 5 Justin Bertram 2013-10-15 14:27:04 EDT
Observations so far...

When the producer sends messages to server1, HornetQ load-balances the messages between server1 and server2 in a round-robin fashion. The messages meant for server1 go straight into "InQueue". However, messages meant for server2 go into a special queue on server1 and are moved to server2 via the cluster bridge. When server2 is killed, this special queue on server1 still has all the messages meant for server2. When server2 is restarted, the cluster is put back together and those messages are finally moved to server2. At the end of this process both servers have 3,000 messages in "InQueue". I verified this using jboss-cli.sh and this command on each server:


So far, everything seems to be working as expected.
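
The routing described above can be modelled with a small in-memory simulation. This is only a sketch of the observed behaviour, not HornetQ code: the class, the three queues, and the `route`/`bridge` methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.Queue;

class RoundRobinSketch {
    final Queue<Integer> server1InQueue = new ArrayDeque<>(); // "InQueue" on server1
    final Queue<Integer> forwardQueue = new ArrayDeque<>();   // special queue on server1 for server2
    final Queue<Integer> server2InQueue = new ArrayDeque<>(); // "InQueue" on server2
    private boolean nextGoesLocal = true;

    // Alternates each incoming message between the local queue and the forward queue.
    void route(int messageId) {
        if (nextGoesLocal) {
            server1InQueue.add(messageId);
        } else {
            forwardQueue.add(messageId);
        }
        nextGoesLocal = !nextGoesLocal;
    }

    // Models the cluster bridge draining the forward queue once server2 is back.
    void bridge() {
        while (!forwardQueue.isEmpty()) {
            server2InQueue.add(forwardQueue.poll());
        }
    }

    public static void main(String[] args) {
        RoundRobinSketch cluster = new RoundRobinSketch();
        for (int i = 0; i < 6000; i++) {
            cluster.route(i);
        }
        System.out.println("server2 down: local=" + cluster.server1InQueue.size()
                + " held=" + cluster.forwardQueue.size());
        cluster.bridge();
        System.out.println("server2 back: server1=" + cluster.server1InQueue.size()
                + " server2=" + cluster.server2InQueue.size());
    }
}
```

In this model 6,000 messages split evenly, so each server ends with 3,000 once the bridge has drained, which is the state reported above.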

When I start the consumer, it runs to completion but reports that it has not received all 6,000 messages. However, according to the server (using the same command as above), there are 0 messages left in "InQueue" on both servers.

Obviously this doesn't add up.  I'm continuing to investigate.
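
One way to pin down which messages went missing is to compare the received set against the expected range. The sketch below assumes each message carries a sequence number from 0 to N-1; that is an assumption made for illustration, and the reproducer's actual bookkeeping may differ. `LossCheckSketch` and `findMissing` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class LossCheckSketch {
    // Returns the sequence numbers in [0, expectedCount) that never arrived.
    static List<Integer> findMissing(int expectedCount, Iterable<Integer> receivedIds) {
        BitSet seen = new BitSet(expectedCount);
        for (int id : receivedIds) {
            if (id >= 0 && id < expectedCount) {
                seen.set(id); // duplicates are harmless here
            }
        }
        List<Integer> missing = new ArrayList<>();
        for (int id = seen.nextClearBit(0); id < expectedCount; id = seen.nextClearBit(id + 1)) {
            missing.add(id);
        }
        return missing;
    }

    public static void main(String[] args) {
        // Simulated consumer result: everything except two messages arrived.
        List<Integer> received = new ArrayList<>();
        for (int i = 0; i < 6000; i++) {
            if (i != 1000 && i != 1001) {
                received.add(i);
            }
        }
        System.out.println("missing: " + findMissing(6000, received));
    }
}
```

Knowing the exact gaps, rather than only a count, helps tell whether losses cluster around a commit boundary or a page boundary.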
Comment 6 Justin Bertram 2013-10-15 18:14:28 EDT
I've reproduced the problem in the HornetQ test suite. The problem appears directly related to paging. My investigation is continuing.
Comment 7 Clebert Suconic 2013-10-24 16:28:06 EDT
This is related to paging... it seems you have the bridged queue in page mode... 

Also.. I've written a similar test to this and it all passes with redeliveryDelay=1 (or 0)

I will keep investigating this.. but I don't think it's a blocker.

I will aim this for next week.
Comment 8 Miroslav Novak 2013-10-25 04:44:45 EDT
Thanks for the feedback. This issue appears only when the producer is sending a large number of messages per transaction (it commits after every 1000 messages).
Comment 10 Clebert Suconic 2013-10-28 16:23:58 EDT
This is not a blocker...

it happened because you had paging on the bridged address, and set the number of messages >> min-page-size.. and had a failure in between...

I'm still working on this but I couldn't find a fix in time for 6.2... I will be working on this issue now, but I wouldn't block the release... I worked for a week without finding the cause and I would likely need another 3 or 4 days.

I don't see a reason to hold the release on this.
Comment 11 Clebert Suconic 2013-10-28 16:24:55 EDT
Definitely not a regression, BTW.
Comment 12 mark yarborough 2013-10-29 10:59:05 EDT
Marked negotiable blocker for 6.2.

myarboro: clebert: https://bugzilla.redhat.com/show_bug.cgi?id=1019378  <== how is it looking ?
[10:55am] unifiedbot: [1019378] Kill/Shutdown of server in cluster leads to lost messages [JBoss Enterprise Application Platform 6] [csuconic@redhat.com:ASSIGNED]
[10:56am] clebert: myarboro: I'm writing another testcase... let me tell you at the end of today?
[10:56am] myarboro: you bet
[10:56am] clebert: myarboro: I'm really convinced it's not a blocker though.. but I will get there
[10:56am] rsvoboda_: pgier, qa_ack granted
[10:57am] myarboro: clebert, jdoyle: marking 1019378 as negotiable blocker… if no fix en route by eod, we'll remove from list <== okay ?
Comment 14 Clebert Suconic 2013-10-31 21:47:37 EDT
I have the fix already. 

Since it's in place.. it's better to have a cut with this.
Comment 15 Clebert Suconic 2013-11-01 17:10:01 EDT
I have replicated this on a testsuite.. my test is now passing..

However, the original reproducer is still failing...

I will need some extra time to investigate this...

I will work on this over part of the weekend, and I want to hold the release at least until Monday.

Still a negotiable blocker.. but we should at least post the fix I have in place now:

Comment 16 Miroslav Novak 2013-11-11 05:46:54 EST
I was on PTO. 

This issue is/was a problem because PAGE mode is the default value and the scenario is quite simple. It can be hit, for example, during maintenance.

We'll verify it with CR1.
Comment 17 Martin Svehla 2013-11-11 07:47:08 EST
This issue was verified using the 6.2.0.CR1 preview bits.

I'm no longer able to hit it with the reproducer, while I was hitting it with ER7 every time. It's either gone or has become a rare(r) occurrence, so I'm setting this as verified.
