Bug 1019378
| Summary: | Message Redistribution could lead to loss of messages if paging and reading with batched Transactions | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | [JBoss] JBoss Enterprise Application Platform 6 | Reporter: | Miroslav Novak <mnovak> | ||||||||
| Component: | HornetQ | Assignee: | Clebert Suconic <csuconic> | ||||||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Miroslav Novak <mnovak> | ||||||||
| Severity: | urgent | Docs Contact: | Russell Dickenson <rdickens> | ||||||||
| Priority: | unspecified | ||||||||||
| Version: | 6.2.0 | CC: | ataylor, brian.stansberry, csuconic, jbertram, jmesnil, lcosti, msvehla, myarboro | ||||||||
| Target Milestone: | CR1 | ||||||||||
| Target Release: | EAP 6.2.0 | ||||||||||
| Hardware: | Unspecified | ||||||||||
| OS: | Unspecified | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||||
| Doc Text: |
In rare circumstances, if messages were being acknowledged too fast in big chunks on a HornetQ server, a message redistribution could read a record before the transaction was instantiated on the page system. This situation would result in message loss.
This issue has been fixed in this release of JBoss EAP 6 by making sure the paging system will correctly instantiate a page transaction, and only writing the file after the page transaction is instantiated.
As a result of this fix, under the same circumstances there will be no lost messages.
|
Story Points: | --- | ||||||||
| Clone Of: | Environment: | ||||||||||
| Last Closed: | 2013-12-15 16:17:00 UTC | Type: | Bug | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Embargoed: | |||||||||||
| Bug Depends On: | 1026553 | ||||||||||
| Bug Blocks: | |||||||||||
| Attachments: |
|
||||||||||
|
Description
Miroslav Novak
2013-10-15 15:11:52 UTC
Created attachment 812582 [details]
reproducer.zip
Created attachment 812583 [details]
logs.zip
Created attachment 812592 [details]
client-maven-project.zip
I've reproduced the behavior you're seeing and I'm investigating further. Observations so far...

When the producer sends messages to server1, HornetQ load-balances the messages between server1 and server2 in a round-robin fashion. The messages meant for server1 go straight into "InQueue". However, messages meant for server2 go into a special queue on server1 and are moved to server2 via the cluster bridge. When server2 is killed, this special queue on server1 still has all those messages meant for server2. When server2 is restarted, the cluster is put back together and those messages are finally moved to server2. At the end of this process both servers have 3,000 messages in "InQueue". I verified this using jboss-cli.sh and this command on each server:

/subsystem=messaging/hornetq-server=default/jms-queue=InQueue/:count-messages

So far, everything seems to be working as expected. When I start the consumer, it runs until completion, but it reports that it has not received all 6,000 messages. However, according to the server (using the same command as above), there are 0 messages left in "InQueue" on both servers. Obviously this doesn't add up. I'm continuing to investigate.

I've reproduced the problem on the HornetQ test-suite. The problem appears directly related to paging. My investigation is continuing.

This is related to paging... it seems you have the bridged queue in page mode... Also, I've written a similar test to this and it all passes with redeliveryDelay=1 (or 0). I will keep investigating this, but I don't think it's a blocker. I will aim this for next week.

Thanks for feedback. This issue appears only when the producer is sending a large number of messages (commits every 1000th message).

This is not a blocker... it happened because you had paging on the bridged address, set the number of messages >> min-page-size, and had a failure in between... I'm still working on this but I couldn't find a fix in time for 6.2...
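To make the message counts in the observations above easier to follow, here is a minimal, self-contained sketch of the round-robin split. It is a toy model only, not HornetQ code: the class `RoundRobinSketch`, its `Node` type, and the `send` and `bridge` methods are hypothetical names invented for illustration, and the "special queue" on server1 is modeled as a plain in-memory deque.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RoundRobinSketch {
    /** Hypothetical stand-in for a cluster node; it only holds the local "InQueue". */
    public static final class Node {
        public final Deque<Integer> inQueue = new ArrayDeque<>();
    }

    /**
     * Sends ids 0..total-1 to server1. Every other message is kept locally;
     * the rest wait in a store-and-forward queue on server1 for server2.
     */
    public static Deque<Integer> send(int total, Node server1) {
        Deque<Integer> storeAndForward = new ArrayDeque<>();
        for (int id = 0; id < total; id++) {
            if (id % 2 == 0) {
                server1.inQueue.add(id);   // local delivery
            } else {
                storeAndForward.add(id);   // waits for the cluster bridge
            }
        }
        return storeAndForward;
    }

    /** Models the cluster bridge draining the pending queue once server2 is back. */
    public static void bridge(Deque<Integer> storeAndForward, Node server2) {
        while (!storeAndForward.isEmpty()) {
            server2.inQueue.add(storeAndForward.poll());
        }
    }

    public static void main(String[] args) {
        Node server1 = new Node();
        Node server2 = new Node();
        Deque<Integer> pending = send(6000, server1);   // server2 is "down" here
        System.out.println("server1 InQueue=" + server1.inQueue.size()
                + ", pending for server2=" + pending.size());
        bridge(pending, server2);                       // server2 restarted
        System.out.println("server2 InQueue=" + server2.inQueue.size());
    }
}
```

In this model both nodes end with 3,000 messages, matching the counts reported above; the model deliberately ignores paging, which is where the actual loss occurred.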
I will be working on this issue now, but I wouldn't block the release... I worked for a week without finding the cause and likely I would work another 3 or 4 days. I don't see a reason to hold the release on this. Definitely not a regression, BTW.

Marked negotiable blocker for 6.2.

myarboro: clebert: https://bugzilla.redhat.com/show_bug.cgi?id=1019378 <== how is it looking ?
[10:55am] unifiedbot: [1019378] Kill/Shutdown of server in cluster leads to lost messages [JBoss Enterprise Application Platform 6] [csuconic:ASSIGNED]
[10:56am] clebert: myarboro: I'm writing another testcase... let me tell you at the end of today?
[10:56am] myarboro: you bet
[10:56am] clebert: myarboro: I'm really convinced it's not a blocker though.. but I will get there
[10:56am] rsvoboda_: pgier, qa_ack granted
[10:57am] myarboro: clebert, jdoyle: marking 1019378 as negotiable blocker… if no fix en route by eod, we'll remove from list <== okay ?

I have the fix already. Since it's in place, it's better to have a cut with this.

I have replicated this on a testsuite and my test is now passing. However, the original reproducer is still failing... I will need some extra time to investigate this. I will work on this over part of the weekend and I want to hold the release at least until Monday.

Still a negotiable blocker, but we should at least post the fix I have in place now: https://github.com/hornetq/hornetq/pull/1360

I was on PTO. This issue is/was a problem because PAGE mode is the default value and the scenario is quite easy to hit, for example during maintenance. We'll verify it with CR1.

This issue was verified using the 6.2.0.CR1 preview bits. Not able to hit it with the reproducer anymore, while I was hitting it with ER7 every time. It's either gone or has become a rare(r) occurrence, so I'm setting this as verified.
|
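The Doc Text for this bug describes the fix as instantiating the page transaction before the page file is written. The ordering problem can be illustrated with a small deterministic sketch. This is a toy model, not the HornetQ implementation: `PageTxOrderingSketch`, `PagingModel`, and their methods are hypothetical names, and the rule that a reader discards a record whose transaction is unknown is an assumption made here to reproduce the described message loss.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PageTxOrderingSketch {
    /** Hypothetical stand-in for the paging system; not HornetQ's API. */
    public static final class PagingModel {
        private final Set<Long> pageTransactions = new HashSet<>();
        private final List<long[]> pageFile = new ArrayList<>(); // {txId, messageId}
        private int cursor = 0;
        public final List<Long> delivered = new ArrayList<>();
        public final List<Long> dropped = new ArrayList<>();

        void registerTx(long txId) {
            pageTransactions.add(txId);
        }

        void writeRecord(long txId, long messageId) {
            pageFile.add(new long[] {txId, messageId});
        }

        /** Reads all unread records. Assumption: an unknown transaction means the record is discarded. */
        void readAvailable() {
            while (cursor < pageFile.size()) {
                long[] rec = pageFile.get(cursor++);
                if (pageTransactions.contains(rec[0])) {
                    delivered.add(rec[1]);
                } else {
                    dropped.add(rec[1]);
                }
            }
        }
    }

    /** Runs one write with the reader interleaved at the worst possible moment. */
    public static PagingModel run(boolean registerFirst) {
        PagingModel model = new PagingModel();
        long txId = 1L;
        long messageId = 42L;
        if (registerFirst) {
            model.registerTx(txId);            // ordering described by the fix
            model.writeRecord(txId, messageId);
            model.readAvailable();
        } else {
            model.writeRecord(txId, messageId);
            model.readAvailable();             // reader sees the record too early
            model.registerTx(txId);
        }
        return model;
    }

    public static void main(String[] args) {
        PagingModel before = run(false);
        PagingModel after = run(true);
        System.out.println("write-first:    delivered=" + before.delivered + " dropped=" + before.dropped);
        System.out.println("register-first: delivered=" + after.delivered + " dropped=" + after.dropped);
    }
}
```

With the write-first ordering the single message ends up in `dropped`; with the register-first ordering it is delivered. The real paging code also tracks commit state and cursor acknowledgements, which this sketch leaves out.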