There are lost messages in a HornetQ cluster. This can have a severe impact on a production environment during maintenance or after a server failure.

Test scenario:
1. Start two EAP 6.2.0.ER5 servers in a HornetQ cluster with a deployed queue "InQueue".
2. Send 6000 messages to the 1st server. The producer uses a transacted session and commits every 1000th message (see the sketch below).
3. Immediately after the producer finishes, kill or cleanly shut down the 2nd server.
4. Start the 2nd server again.
5. Start a consumer connected to the 1st server and read all messages.

Result: Some messages are missing.

How to use reproducer.zip - in the unzipped "reproducer" directory run the following commands:
1. "sh prepare.sh"
2. "sh start-server1.sh first_ip"
3. "sh start-server2.sh second_ip"
4. "sh start-producer.sh first_ip jms/queue/InQueue 6000", then kill/shutdown the 2nd server (this must be done as soon as the producer finishes)
5. "sh start-server2.sh second_ip"
6. "sh start-consumer.sh first_ip jms/queue/InQueue"

In my run I can see a lost message: ID:4916afc1-35ab-11e3-94ef-57666a0472c9

Logs from both servers are attached (logs.zip).
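For illustration, a minimal sketch of the producer logic described in step 2. The connection details (remote naming factory, port 4447, JNDI names, credentials) are typical EAP 6 placeholders and may not match the attached client-maven-project.zip.

import java.util.Properties;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.naming.Context;
import javax.naming.InitialContext;

public class TransactedProducerSketch {
    public static void main(String[] args) throws Exception {
        String host = args[0];                       // e.g. first_ip
        int total = Integer.parseInt(args[1]);       // e.g. 6000

        Properties env = new Properties();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "org.jboss.naming.remote.client.InitialContextFactory");
        env.put(Context.PROVIDER_URL, "remote://" + host + ":4447");
        Context ctx = new InitialContext(env);
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/RemoteConnectionFactory"); // assumed JNDI name
        Queue queue = (Queue) ctx.lookup("jms/queue/InQueue");

        Connection connection = cf.createConnection("guest", "guest"); // placeholder credentials
        // Transacted session: sent messages only become visible when the transaction commits.
        Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
        MessageProducer producer = session.createProducer(queue);

        for (int i = 1; i <= total; i++) {
            producer.send(session.createTextMessage("message " + i));
            if (i % 1000 == 0) {
                session.commit();                    // commit every 1000th message, as in the scenario
            }
        }
        session.commit();                            // commit any remainder
        connection.close();
        ctx.close();
    }
}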
Created attachment 812582 [details] reproducer.zip
Created attachment 812583 [details] logs.zip
Created attachment 812592 [details] client-maven-project.zip
I've reproduced the behavior you're seeing and I'm investigating further.
Observations so far...

When the producer sends messages to server1, HornetQ load-balances the messages between server1 and server2 in a round-robin fashion. The messages meant for server1 go straight into "InQueue". However, messages meant for server2 go into a special internal queue on server1 and are moved to server2 via the cluster bridge. When server2 is killed, this special queue on server1 still holds all the messages meant for server2. When server2 is restarted, the cluster is put back together and those messages are finally moved to server2. At the end of this process both servers have 3,000 messages in "InQueue". I verified this using jboss-cli.sh and this command on each server:

/subsystem=messaging/hornetq-server=default/jms-queue=InQueue/:count-messages

So far, everything seems to be working as expected.

When I start the consumer, it runs to completion but reports that it has not received all 6,000 messages. However, according to the server (using the same command as above), there are 0 messages left in "InQueue" on both servers. Obviously this doesn't add up. I'm continuing to investigate.
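For reference, the drain-and-count step described above can be done with a plain JMS client along these lines. This is a minimal sketch, not the attached reproducer client; the naming factory, port, JNDI names and credentials are placeholder EAP 6 defaults.

import java.util.Properties;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.naming.Context;
import javax.naming.InitialContext;

public class CountingConsumerSketch {
    public static void main(String[] args) throws Exception {
        String host = args[0];                       // e.g. first_ip

        Properties env = new Properties();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "org.jboss.naming.remote.client.InitialContextFactory");
        env.put(Context.PROVIDER_URL, "remote://" + host + ":4447");
        Context ctx = new InitialContext(env);
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/RemoteConnectionFactory"); // assumed JNDI name
        Queue queue = (Queue) ctx.lookup("jms/queue/InQueue");

        Connection connection = cf.createConnection("guest", "guest"); // placeholder credentials
        connection.start();
        Session session = connection.createSession(false, Session.CLIENT_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(queue);

        int received = 0;
        Message message;
        while ((message = consumer.receive(10000)) != null) {   // stop after 10s of silence
            message.acknowledge();
            received++;
        }
        connection.close();
        ctx.close();

        System.out.println("Received " + received + " messages (expected 6000)");
    }
}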
I've reproduced the problem in the HornetQ test-suite. The problem appears directly related to paging. My investigation is continuing.
This is related to paging... it seems you have the bridged queue in page mode... Also... I've written a similar test to this and it all passes with redeliveryDelay=1 (or 0). I will keep investigating this.. but I don't think it's a blocker. I will aim this for next week.
Thanks for the feedback. This issue appears only when the producer sends a large number of messages (committing every 1000th message).
This is not a blocker... it happened because you had paging on the bridged address, set the number of messages well above min-page-size, and had a failure in between... I'm still working on this but I couldn't find a fix in time for 6.2... I will keep working on this issue now, but I wouldn't block the release... I worked for a week without finding the cause and would likely need another 3 or 4 days. I don't see a reason to hold the release on this.
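For context, the paging behaviour involved here is controlled by the address settings. Expressed with the HornetQ core API, EAP 6's default address-setting for match="#" roughly corresponds to the sketch below; the numeric values are the defaults as I recall them and should be treated as assumptions, and the reproducer's own configuration may differ.

import org.hornetq.core.settings.impl.AddressFullMessagePolicy;
import org.hornetq.core.settings.impl.AddressSettings;

public class PagingSettingsSketch {
    public static void main(String[] args) {
        // Rough equivalent of the default <address-setting match="#"> in the messaging subsystem
        // (numeric values are assumptions; check the server profile for the real ones).
        AddressSettings settings = new AddressSettings();
        settings.setMaxSizeBytes(10 * 1024 * 1024);   // max-size-bytes: above this in-memory size
                                                      // the address switches to paging
        settings.setPageSizeBytes(2 * 1024 * 1024);   // page-size-bytes: size of each page file
        settings.setAddressFullMessagePolicy(AddressFullMessagePolicy.PAGE); // PAGE is the default policy

        // Because match="#" covers every address, it also covers the internal queue that the
        // cluster bridge uses, so a large burst of messages destined for the other node can
        // push that address into page mode, which is the condition described in this comment.
        System.out.println("paging starts above " + settings.getMaxSizeBytes() + " bytes per address");
    }
}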
Definitely not a regression, BTW.
Marked negotiable blocker for 6.2.

myarboro: clebert: https://bugzilla.redhat.com/show_bug.cgi?id=1019378 <== how is it looking ?
[10:55am] unifiedbot: [1019378] Kill/Shutdown of server in cluster leads to lost messages [JBoss Enterprise Application Platform 6] [csuconic:ASSIGNED]
[10:56am] clebert: myarboro: I'm writing another testcase... let me tell you at the end of today?
[10:56am] myarboro: you bet
[10:56am] clebert: myarboro: I'm really convinced it's not a blocker though.. but I will get there
[10:56am] rsvoboda_: pgier, qa_ack granted
[10:57am] myarboro: clebert, jdoyle: marking 1019378 as negotiable blocker… if no fix en route by eod, we'll remove from list <== okay ?
I have the fix already. Since it's in place... it's better to have a cut with this included.
I have replicated this in a testsuite... my test is now passing. However, the original reproducer is still failing... I will need some extra time to investigate this... I will work on this over part of the weekend and I want to hold the release at least until Monday. Still a negotiable blocker.. but we should at least post the fix I have in place now: https://github.com/hornetq/hornetq/pull/1360
I was on PTO. This issue is/was a problem because PAGE mode is the default and the scenario is quite easy to hit, for example during maintenance. We'll verify it with CR1.
This issue was verified using the 6.2.0.CR1 preview bits. I'm not able to hit it with the reproducer anymore, while I was hitting it with ER7 every time. It's either gone or has become a rare(r) occurrence, so I'm setting this as verified.