Description of problem:
The countDelta attribute is supposed to contain a positive integer representing the number of messages added to the corresponding destination since the last message counter update. However, countDelta sometimes shows a negative value.

Version-Release number of selected component (if applicable):
JBoss-EAP-6.4.0
HornetQ-2.3.25

How reproducible:

Steps to Reproduce:
1. Create two collocated live-backup servers, NodeA and NodeB.
2. Configure max-size-bytes to a low enough value to force the destination to page.
3. Send 20000 messages to NodeA.
4. Issue the following CLI command on NodeB to check the message statistics: ./subsystem=messaging/hornetq-server=live-server/jms-queue=A:list-message-counter-as-json()
5. Attach a consumer on NodeB, consume a few messages, and stop the consumer.
6. Invoke "list-message-counter-as-json()" periodically.
7. countDelta shows a negative value: http://pastebin.test.redhat.com/412555

Actual results:

Expected results:

Additional info:
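For context, countDelta is derived from successive snapshots of the queue's messagesAdded value. A minimal sketch of that arithmetic (class and method names here are illustrative, not the actual MessageCounter API) shows how a shrinking messagesAdded figure turns the delta negative:

```java
// Hedged sketch of the message-counter arithmetic; names are illustrative only.
public class CounterSketch {
    private long lastMessagesAdded;
    private long countTotal;
    private long lastCountTotal;

    // Called on each counter sample with the queue's current messagesAdded.
    public long sample(long latestMessagesAdded) {
        long newMessagesAdded = latestMessagesAdded - lastMessagesAdded;
        lastMessagesAdded = latestMessagesAdded;
        countTotal += newMessagesAdded;        // shrinks if messagesAdded regressed
        long countDelta = countTotal - lastCountTotal;
        lastCountTotal = countTotal;
        return countDelta;                     // negative when messagesAdded went down
    }
}
```

If the queue's reported messagesAdded ever decreases between two samples, the next countDelta is negative.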
PRs for this bug: https://github.com/hornetq/hornetq/pull/2103 https://github.com/hornetq/hornetq/pull/2104
The 7.0 branch is tracked by Jira ARTEMIS-875; 7.1 and WF are fixed by ARTEMIS-812.
ARTEMIS-875 moved to Red Hat tracking system as JBEAP-7800
This causes a regression: it breaks slow consumer detection when paging is configured. We have a test scenario where we set the KILL slow consumer policy with a threshold of 20 and set max-size-bytes to 10240 to force paging. We then start a consumer that reads 1 message per second and after a while check whether it got disconnected. After these changes it stays connected, because the paged messages don't seem to be taken into account when determining the queue rate, so the rate never gets higher than the threshold and the condition at [1] stays true.

To reproduce this, run SlowConsumersTestCase#testSlowReceiverKill from our internal testsuite:

1. Clone the eap6.4.x branch from git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git
2. Go to eap-tests-hornetq/scripts and run the PrepareServers.groovy script with the -DEAP_VERSION=6.4.13.CP.CR1 parameter:
   groovy -DEAP_VERSION=6.4.13.CP.CR1 PrepareServers.groovy
   (The script will prepare 4 servers, server1..4, in your current directory.)
3. Export the paths to the server directories plus a multicast address:
   export JBOSS_HOME_1=$PWD/server1/jboss-eap
   export JBOSS_HOME_2=$PWD/server2/jboss-eap
   export JBOSS_HOME_3=$PWD/server3/jboss-eap
   export JBOSS_HOME_4=$PWD/server4/jboss-eap
   export MCAST_ADDR=235.3.4.5
4. Go to jboss-hornetq-testsuite/ in our testsuite and run:
   mvn clean test -Darquillian.xml=arquillian-4-nodes.xml -Dtest=SlowConsumersTestCase#testSlowReceiverKill

[1] https://github.com/hornetq/hornetq/blob/9d61fbe948bd5c53c1f87879859da36284344e64/hornetq-server/src/main/java/org/hornetq/core/server/impl/QueueImpl.java#L3255
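The guard at [1] compares the queue's own message rate with the configured threshold. A rough sketch of the effect described above (field and method names are assumptions, not the exact QueueImpl code):

```java
// Hedged sketch of the slow-consumer reaper guard; names are illustrative only.
public class SlowConsumerGuardSketch {
    // Returns true when consumer checks should be skipped because the queue
    // itself is adding messages more slowly than the threshold.
    public static boolean skipConsumerCheck(long messagesAdded,
                                            long lastMessagesAdded,
                                            long periodSeconds,
                                            float thresholdPerSecond) {
        float queueRate = (messagesAdded - lastMessagesAdded) / (float) periodSeconds;
        // If paged messages are not reflected in messagesAdded, queueRate stays
        // below the threshold and slow consumers are never killed.
        return queueRate < thresholdPerSecond;
    }
}
```

With paged messages excluded from the add count, the computed rate stays under the threshold and the reaper returns early on every check, so the slow consumer is never disconnected.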
Hi Peter,

When I ran groovy -DEAP_VERSION=6.4.13.CP.CR1 PrepareServers.groovy I got this error:

Validating Url file:///home/hudson/static_build_env/eap/6.4.13.CP.CR1/jboss-eap-6.4.13.CP.CR1.zip
Url file:///home/hudson/static_build_env/eap/6.4.13.CP.CR1/jboss-eap-6.4.13.CP.CR1.zip does not exists.

I checked the URL and found that the zip named jboss-eap-6.4.13.CP.CR1-full-build.zip might be the one. How can I change the script/options to fetch this file?

Thanks
Howard
Ah I see, I'm on master. Thanks Peter!
Hi Peter,

I'm currently having problems downloading the zip; the network seems very slow right now. I'll update you with the result once I get the test up and running. In the meantime I prepared a simple unit test that simulates the scenario. Here is what my test does:

1. Create a queue with addressSettings as follows (same as yours, I think):

   addressSettings.setAddressFullMessagePolicy(AddressFullMessagePolicy.PAGE);
   addressSettings.setMaxSizeBytes(10240);
   addressSettings.setRedeliveryDelay(1000);
   addressSettings.setRedistributionDelay(1000);
   addressSettings.setPageSizeBytes(1024);
   addressSettings.setSlowConsumerCheckPeriod(1);
   addressSettings.setSlowConsumerThreshold(20);
   addressSettings.setSlowConsumerPolicy(SlowConsumerPolicy.KILL);

2. Start a thread sending 10000 (max) messages to the queue; I can observe the messages entering paging mode.
3. Start another thread receiving messages, sleeping 1 sec between receives.
4. After receiving one message I can see the consumer is closed due to its slow rate.

So basically I have failed to reproduce the issue so far. I'll keep digging and will update you once I make any progress on this.

Thanks
Howard
I could replicate the issue by changing our own SlowConsumerTest in our testsuite. I should have a fix soon. What should I do? Send another TAG?
I have cut SP16 including a commit for this fix.
New fix in HornetQ: https://github.com/hornetq/hornetq/commit/3855d95709a7fe5284f898b33e7b95bde7d039e1
Hi, the slow consumer regression is gone, thanks for fixing that. However, I still managed to get a negative value for count delta. The issue seems to be gone when using the original steps to reproduce, but I have now discovered another scenario:

1. Two collocated live-backup pairs with low enough max-size-bytes to force paging.
2. Send enough messages to nodeA to force paging.
3. Call list-message-counter-as-json() on nodeB.
4. Consume ALL messages from nodeB (or nodeA; it doesn't seem to matter, as long as the node where list-message-counter is called is also consumed from).
5. Call list-message-counter-as-json() again after all messages have been received. Count delta will be >= zero at first, but eventually becomes negative.

The difference is that now countDelta only becomes negative some time after all messages were received (messageCount == 0). As you can see from the pastebin Tyronne provided, this was not the case before.

I prepared a reproducer in our internal testsuite. Just follow the steps from my previous comment [1]. Make sure you are on the eap6.4.x branch and use -Dtest=CountDeltaAttributeTestCase#testCountDeltaNotNegativeAfterLastMessage

Here is a (very) brief summary of what I've been able to tell is happening from a quick look with a debugger:

   long latestMessagesAdded = serverQueue.getMessagesAdded(); [2]
   long newMessagesAdded = latestMessagesAdded - lastMessagesAdded;
   countTotal += newMessagesAdded;

newMessagesAdded can become negative here, which causes countTotal to decrease, which in turn results in countDelta being negative as well. This happens because serverQueue.getMessagesAdded() returns a lower value than before once the counter on its pageSubscription has delete() called on it, since up to that point most of the getMessagesAdded() return value comes from the pageSubscription (return messagesAdded + pageSubscription.getCounter().getValueAdded()). I'm happy to provide more detailed info if needed.
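To illustrate the drop described above: the queue reports its own add count plus the page-subscription counter, so deleting the page counter makes the reported total fall. A tiny model of this (names simplified from the quoted code, not the real HornetQ classes):

```java
// Hedged sketch: how deleting the page counter shrinks getMessagesAdded().
public class PagedQueueSketch {
    private long messagesAdded;       // non-paged adds tracked by the queue
    private long pageCounterAdded;    // pageSubscription.getCounter().getValueAdded()

    public PagedQueueSketch(long messagesAdded, long pageCounterAdded) {
        this.messagesAdded = messagesAdded;
        this.pageCounterAdded = pageCounterAdded;
    }

    public long getMessagesAdded() {
        return messagesAdded + pageCounterAdded;
    }

    // Models the page counter's delete(): its contribution is simply lost.
    public void deletePageCounter() {
        pageCounterAdded = 0;
    }
}
```

Once the page counter is deleted, the next getMessagesAdded() sample is lower than the previous one, which is exactly what drives newMessagesAdded (and hence countDelta) negative.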
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1377703#c5 [2]: https://github.com/hornetq/hornetq/blob/9264bd1fbdc0f58336f8deddcc7485f6d2fade38/hornetq-server/src/main/java/org/hornetq/core/messagecounter/MessageCounter.java#L121
Make that a new issue... it's a very specific corner case involving colocated servers. I don't think we should/want to retag because of this.
are you sure this is related to colocated or anything like that?
Anyway, can we create a new issue and follow up as part of that?
The reason seems to be the deletion of the page counter. After the counter is cleaned up, its value is reset to zero. I'm working on a fix that adds the current values to the queue before the reset. The QA test passes with my fix, but I still need to add a unit test and make sure my fix won't cause a regression. Howard
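The fix described here can be sketched like this (a minimal model under my reading of the comment, not the actual HornetQ patch): before the page counter is reset, its accumulated value is folded into the queue's own counter so the reported total never goes backwards.

```java
// Hedged sketch of the fix: fold the page counter into the queue before reset.
public class PageCounterFixSketch {
    private long messagesAdded;       // queue's own add count
    private long pageCounterAdded;    // page counter's contribution

    public PageCounterFixSketch(long messagesAdded, long pageCounterAdded) {
        this.messagesAdded = messagesAdded;
        this.pageCounterAdded = pageCounterAdded;
    }

    public long getMessagesAdded() {
        return messagesAdded + pageCounterAdded;
    }

    // Credit the page counter's value to the queue before zeroing it,
    // so getMessagesAdded() stays monotonic across the cleanup.
    public void deletePageCounter() {
        messagesAdded += pageCounterAdded;
        pageCounterAdded = 0;
    }
}
```

With this fold-in, the sampled total is unchanged by the page counter cleanup, so newMessagesAdded (and hence countDelta) can no longer go negative from this path.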
I'd agree with Jiri. While achievable by different steps, negative values can still occur, so I wouldn't consider this fixed. +1 for moving the issue to CP14.
Same symptom, but a different issue... Before, the calculation was wrong. Now the page counter is deleted, so the value is affected after the page is cleaned up (only at that point). Still worth mentioning the previous fix in the release notes, and tracking this one as a known issue to fix in the next release.
I don't think anyone will care about a HornetQ JIRA. Just a BZ is fine... I haven't been using HornetQ JIRAs to manage EAP fixes on HornetQ for some time already.
ok, just a new bug then.
I believe it's fixed on 2.3.25.x already. It wasn't an issue on artemis/master or artemis/1.x.
Verified with EAP 6.4.13.CP.CR2. The new issue is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1414747
Released with EAP 6.4.13 on Feb 02 2017.