There are lost messages in a HornetQ cluster. This can have a severe impact on a production environment during maintenance or after a server failure.

Test scenario:
1. Start two EAP 6.2.0.ER5 servers in a HornetQ cluster with a deployed queue "InQueue".
2. Send 6000 messages to the 1st server. The producer uses a transacted session and commits every 1000th message (see the sketch below).
3. Immediately after the producer finishes, kill or cleanly shut down the 2nd server.
4. Start the 2nd server again.
5. Start a consumer connected to the 1st server and read all messages.

Result: Some messages are missing.

How to use reproducer.zip - in the unzipped "reproducer" directory run the following commands:
1. "sh prepare.sh"
2. "sh start-server1.sh first_ip"
3. "sh start-server2.sh second_ip"
4. "sh start-producer.sh first_ip jms/queue/InQueue 6000", then kill/shutdown the 2nd server (this must be done as soon as the producer finishes)
5. "sh start-server2.sh second_ip"
6. "sh start-consumer.sh first_ip jms/queue/InQueue"

In my run I can see a lost message: ID:4916afc1-35ab-11e3-94ef-57666a0472c9

Logs from both servers are attached (logs.zip).
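For illustration, a minimal sketch of the producer logic described in step 2. The connection details (remote naming factory, port 4447, JNDI names, credentials) are typical EAP 6 placeholders and may not match the attached client-maven-project.zip.

import java.util.Properties;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.naming.Context;
import javax.naming.InitialContext;

public class TransactedProducerSketch {
    public static void main(String[] args) throws Exception {
        String host = args[0];                       // e.g. first_ip
        int total = Integer.parseInt(args[1]);       // e.g. 6000

        Properties env = new Properties();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "org.jboss.naming.remote.client.InitialContextFactory");
        env.put(Context.PROVIDER_URL, "remote://" + host + ":4447");
        Context ctx = new InitialContext(env);
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/RemoteConnectionFactory"); // assumed JNDI name
        Queue queue = (Queue) ctx.lookup("jms/queue/InQueue");

        Connection connection = cf.createConnection("guest", "guest"); // placeholder credentials
        // Transacted session: sent messages only become visible when the transaction commits.
        Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
        MessageProducer producer = session.createProducer(queue);

        for (int i = 1; i <= total; i++) {
            producer.send(session.createTextMessage("message " + i));
            if (i % 1000 == 0) {
                session.commit();                    // commit every 1000th message, as in the scenario
            }
        }
        session.commit();                            // commit any remainder
        connection.close();
        ctx.close();
    }
}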
Created attachment 812582 [details] reproducer.zip
Created attachment 812583 [details] logs.zip
Created attachment 812592 [details] client-maven-project.zip
I've reproduced the behavior you're seeing and I'm investigating further.
Observations so far...

When the producer sends messages to server1, HornetQ load-balances the messages between server1 and server2 in a round-robin fashion. The messages meant for server1 go straight into "InQueue". However, messages meant for server2 go into a special internal queue on server1 and are moved to server2 via the cluster bridge. When server2 is killed, this special queue on server1 still holds all the messages meant for server2. When server2 is restarted, the cluster is put back together and those messages are finally moved to server2. At the end of this process both servers have 3,000 messages in "InQueue". I verified this using jboss-cli.sh and this command on each server:

/subsystem=messaging/hornetq-server=default/jms-queue=InQueue/:count-messages

So far, everything seems to be working as expected.

When I start the consumer, it runs to completion but reports that it has not received all 6,000 messages. However, according to the server (using the same command as above), there are 0 messages left in "InQueue" on both servers. Obviously this doesn't add up. I'm continuing to investigate.
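For reference, the drain-and-count step described above can be done with a plain JMS client along these lines. This is a minimal sketch, not the attached reproducer client; the naming factory, port, JNDI names and credentials are placeholder EAP 6 defaults.

import java.util.Properties;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.naming.Context;
import javax.naming.InitialContext;

public class CountingConsumerSketch {
    public static void main(String[] args) throws Exception {
        String host = args[0];                       // e.g. first_ip

        Properties env = new Properties();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "org.jboss.naming.remote.client.InitialContextFactory");
        env.put(Context.PROVIDER_URL, "remote://" + host + ":4447");
        Context ctx = new InitialContext(env);
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/RemoteConnectionFactory"); // assumed JNDI name
        Queue queue = (Queue) ctx.lookup("jms/queue/InQueue");

        Connection connection = cf.createConnection("guest", "guest"); // placeholder credentials
        connection.start();
        Session session = connection.createSession(false, Session.CLIENT_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(queue);

        int received = 0;
        Message message;
        while ((message = consumer.receive(10000)) != null) {   // stop after 10s of silence
            message.acknowledge();
            received++;
        }
        connection.close();
        ctx.close();

        System.out.println("Received " + received + " messages (expected 6000)");
    }
}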
I've reproduced the problem in the HornetQ test-suite. The problem appears directly related to paging. My investigation is continuing.
This is related to paging... it seems you have the bridged queue in page mode... Also... I've written a similar test to this and it all passes with redeliveryDelay=1 (or 0). I will keep investigating this.. but I don't think it's a blocker. I will aim this for next week.
Thanks for the feedback. This issue appears only when the producer sends a large number of messages (committing every 1000th message).
This is not a blocker... it happened because you had paging on the bridged address, set the number of messages well above min-page-size, and had a failure in between... I'm still working on this but I couldn't find a fix in time for 6.2... I will keep working on this issue now, but I wouldn't block the release... I worked for a week without finding the cause and would likely need another 3 or 4 days. I don't see a reason to hold the release on this.
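For context, the paging behaviour involved here is controlled by the address settings. Expressed with the HornetQ core API, EAP 6's default address-setting for match="#" roughly corresponds to the sketch below; the numeric values are the defaults as I recall them and should be treated as assumptions, and the reproducer's own configuration may differ.

import org.hornetq.core.settings.impl.AddressFullMessagePolicy;
import org.hornetq.core.settings.impl.AddressSettings;

public class PagingSettingsSketch {
    public static void main(String[] args) {
        // Rough equivalent of the default <address-setting match="#"> in the messaging subsystem
        // (numeric values are assumptions; check the server profile for the real ones).
        AddressSettings settings = new AddressSettings();
        settings.setMaxSizeBytes(10 * 1024 * 1024);   // max-size-bytes: above this in-memory size
                                                      // the address switches to paging
        settings.setPageSizeBytes(2 * 1024 * 1024);   // page-size-bytes: size of each page file
        settings.setAddressFullMessagePolicy(AddressFullMessagePolicy.PAGE); // PAGE is the default policy

        // Because match="#" covers every address, it also covers the internal queue that the
        // cluster bridge uses, so a large burst of messages destined for the other node can
        // push that address into page mode, which is the condition described in this comment.
        System.out.println("paging starts above " + settings.getMaxSizeBytes() + " bytes per address");
    }
}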
Definitely not a regression, BTW.
Marked negotiable blocker for 6.2.

myarboro: clebert: https://bugzilla.redhat.com/show_bug.cgi?id=1019378 <== how is it looking ?
[10:55am] unifiedbot: [1019378] Kill/Shutdown of server in cluster leads to lost messages [JBoss Enterprise Application Platform 6] [csuconic:ASSIGNED]
[10:56am] clebert: myarboro: I'm writing another testcase... let me tell you at the end of today?
[10:56am] myarboro: you bet
[10:56am] clebert: myarboro: I'm really convinced it's not a blocker though.. but I will get there
[10:56am] rsvoboda_: pgier, qa_ack granted
[10:57am] myarboro: clebert, jdoyle: marking 1019378 as negotiable blocker… if no fix en route by eod, we'll remove from list <== okay ?
I have the fix already. Since it's in place... it's better to have a cut with this included.
I have replicated this in a testsuite... my test is now passing. However, the original reproducer is still failing... I will need some extra time to investigate this... I will work on this over part of the weekend and I want to hold the release at least until Monday. Still a negotiable blocker.. but we should at least post the fix I have in place now: https://github.com/hornetq/hornetq/pull/1360
I was on PTO. This issue is/was a problem because PAGE mode is the default and the scenario is quite easy to hit, for example during maintenance. We'll verify it with CR1.
This issue was verified using the 6.2.0.CR1 preview bits. I'm not able to hit it with the reproducer anymore, while I was hitting it with ER7 every time. It's either gone or has become a rare(r) occurrence, so I'm setting this as verified.