Description of problem:
The countDelta attribute is supposed to contain a positive integer representing the number of messages added to the corresponding destination since the last message counter update. However, countDelta sometimes shows a negative value.

Version-Release number of selected component (if applicable):
JBoss-EAP-6.4.0
HornetQ-2.3.25

How reproducible:

Steps to Reproduce:
1. Create two collocated live-backup servers, NodeA and NodeB.
2. Configure max-size-bytes to a low enough value to force the destination to page.
3. Send 20000 messages to NodeA.
4. Issue the following CLI command on NodeB to check the message statistics: ./subsystem=messaging/hornetq-server=live-server/jms-queue=A:list-message-counter-as-json()
5. Attach a consumer on NodeB, consume a few messages, and stop the consumer.
6. Invoke "list-message-counter-as-json()" periodically.
7. countDelta shows a negative value: http://pastebin.test.redhat.com/412555

Actual results:

Expected results:

Additional info:
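For context, countDelta is derived from successive snapshots of the queue's messagesAdded value. A minimal sketch of that arithmetic (class and method names here are illustrative, not the actual MessageCounter API) shows how a shrinking messagesAdded figure turns the delta negative:

```java
// Hedged sketch of the message-counter arithmetic; names are illustrative only.
public class CounterSketch {
    private long lastMessagesAdded;
    private long countTotal;
    private long lastCountTotal;

    // Called on each counter sample with the queue's current messagesAdded.
    public long sample(long latestMessagesAdded) {
        long newMessagesAdded = latestMessagesAdded - lastMessagesAdded;
        lastMessagesAdded = latestMessagesAdded;
        countTotal += newMessagesAdded;        // shrinks if messagesAdded regressed
        long countDelta = countTotal - lastCountTotal;
        lastCountTotal = countTotal;
        return countDelta;                     // negative when messagesAdded went down
    }
}
```

If the queue's reported messagesAdded ever decreases between two samples, the next countDelta is negative.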
PRs for this bug: https://github.com/hornetq/hornetq/pull/2103 https://github.com/hornetq/hornetq/pull/2104
The 7.0 branch is tracked by Jira ARTEMIS-875; 7.1 and WF are fixed by ARTEMIS-812.
ARTEMIS-875 moved to Red Hat tracking system as JBEAP-7800
This causes a regression: it breaks slow consumer detection when paging is configured. We have a test scenario where we set the KILL slow consumer policy with a threshold of 20 and set max-size-bytes to 10240 to force paging. We then start a consumer that reads 1 message per second and after a while check whether it got disconnected. After these changes it stays connected, because the paged messages don't seem to be taken into account when determining the queue rate, so the rate never gets higher than the threshold and the condition at [1] stays true.

To reproduce this, run SlowConsumersTestCase#testSlowReceiverKill from our internal testsuite:

1. Clone the eap6.4.x branch from git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git
2. Go to eap-tests-hornetq/scripts and run the PrepareServers.groovy script with the -DEAP_VERSION=6.4.13.CP.CR1 parameter:
   groovy -DEAP_VERSION=6.4.13.CP.CR1 PrepareServers.groovy
   (The script will prepare 4 servers, server1..4, in your current directory.)
3. Export the paths to the server directories plus a multicast address:
   export JBOSS_HOME_1=$PWD/server1/jboss-eap
   export JBOSS_HOME_2=$PWD/server2/jboss-eap
   export JBOSS_HOME_3=$PWD/server3/jboss-eap
   export JBOSS_HOME_4=$PWD/server4/jboss-eap
   export MCAST_ADDR=235.3.4.5
4. Go to jboss-hornetq-testsuite/ in our testsuite and run:
   mvn clean test -Darquillian.xml=arquillian-4-nodes.xml -Dtest=SlowConsumersTestCase#testSlowReceiverKill

[1] https://github.com/hornetq/hornetq/blob/9d61fbe948bd5c53c1f87879859da36284344e64/hornetq-server/src/main/java/org/hornetq/core/server/impl/QueueImpl.java#L3255
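The guard at [1] compares the queue's own message rate with the configured threshold. A rough sketch of the effect described above (field and method names are assumptions, not the exact QueueImpl code):

```java
// Hedged sketch of the slow-consumer reaper guard; names are illustrative only.
public class SlowConsumerGuardSketch {
    // Returns true when consumer checks should be skipped because the queue
    // itself is adding messages more slowly than the threshold.
    public static boolean skipConsumerCheck(long messagesAdded,
                                            long lastMessagesAdded,
                                            long periodSeconds,
                                            float thresholdPerSecond) {
        float queueRate = (messagesAdded - lastMessagesAdded) / (float) periodSeconds;
        // If paged messages are not reflected in messagesAdded, queueRate stays
        // below the threshold and slow consumers are never killed.
        return queueRate < thresholdPerSecond;
    }
}
```

With paged messages excluded from the add count, the computed rate stays under the threshold and the reaper returns early on every check, so the slow consumer is never disconnected.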
Hi Peter,

When I ran groovy -DEAP_VERSION=6.4.13.CP.CR1 PrepareServers.groovy I got this error:

Validating Url file:///home/hudson/static_build_env/eap/6.4.13.CP.CR1/jboss-eap-6.4.13.CP.CR1.zip
Url file:///home/hudson/static_build_env/eap/6.4.13.CP.CR1/jboss-eap-6.4.13.CP.CR1.zip does not exists.

I checked the URL and found that the zip named jboss-eap-6.4.13.CP.CR1-full-build.zip might be the one. How can I change the script/options to fetch this file?

Thanks
Howard
Ah I see, I'm on master. Thanks Peter!
Hi Peter,

I'm currently having problems downloading the zip; the network seems very slow right now. I'll update you with the result once I get the test up and running. In the meantime I prepared a simple unit test that simulates the scenario. Here is what my test does:

1. Create a queue with addressSettings as follows (same as yours, I think):

   addressSettings.setAddressFullMessagePolicy(AddressFullMessagePolicy.PAGE);
   addressSettings.setMaxSizeBytes(10240);
   addressSettings.setRedeliveryDelay(1000);
   addressSettings.setRedistributionDelay(1000);
   addressSettings.setPageSizeBytes(1024);
   addressSettings.setSlowConsumerCheckPeriod(1);
   addressSettings.setSlowConsumerThreshold(20);
   addressSettings.setSlowConsumerPolicy(SlowConsumerPolicy.KILL);

2. Start a thread sending 10000 (max) messages to the queue; I can observe the messages entering paging mode.
3. Start another thread receiving messages, sleeping 1 sec between receives.
4. After receiving one message I can see the consumer is closed due to its slow rate.

So basically I have failed to reproduce the issue so far. I'll keep digging and will update you once I make any progress on this.

Thanks
Howard
I could replicate the issue by changing our own SlowConsumerTest in our testsuite. I should have a fix soon. What should I do? Send another TAG?
I have cut SP16 including a commit for this fix.
New fix in HornetQ: https://github.com/hornetq/hornetq/commit/3855d95709a7fe5284f898b33e7b95bde7d039e1
Hi, the slow consumer regression is gone, thanks for fixing that. However, I still managed to get a negative value for count delta. The issue seems to be gone when using the original steps to reproduce, but I have now discovered another scenario:

1. Two collocated live-backup pairs with low enough max-size-bytes to force paging.
2. Send enough messages to nodeA to force paging.
3. Call list-message-counter-as-json() on nodeB.
4. Consume ALL messages from nodeB (or nodeA; it doesn't seem to matter, as long as the node where list-message-counter is called is also consumed from).
5. Call list-message-counter-as-json() again after all messages have been received. Count delta will be >= zero at first, but eventually becomes negative.

The difference is that now countDelta only becomes negative some time after all messages were received (messageCount == 0). As you can see from the pastebin Tyronne provided, this was not the case before.

I prepared a reproducer in our internal testsuite. Just follow the steps from my previous comment [1]. Make sure you are on the eap6.4.x branch and use -Dtest=CountDeltaAttributeTestCase#testCountDeltaNotNegativeAfterLastMessage

Here is a (very) brief summary of what I've been able to tell is happening from a quick look with a debugger:

   long latestMessagesAdded = serverQueue.getMessagesAdded(); [2]
   long newMessagesAdded = latestMessagesAdded - lastMessagesAdded;
   countTotal += newMessagesAdded;

newMessagesAdded can become negative here, which causes countTotal to decrease, which in turn results in countDelta being negative as well. This happens because serverQueue.getMessagesAdded() returns a lower value than before once the counter on its pageSubscription has delete() called on it, since up to that point most of the getMessagesAdded() return value comes from the pageSubscription (return messagesAdded + pageSubscription.getCounter().getValueAdded()). I'm happy to provide more detailed info if needed.
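To illustrate the drop described above: the queue reports its own add count plus the page-subscription counter, so deleting the page counter makes the reported total fall. A tiny model of this (names simplified from the quoted code, not the real HornetQ classes):

```java
// Hedged sketch: how deleting the page counter shrinks getMessagesAdded().
public class PagedQueueSketch {
    private long messagesAdded;       // non-paged adds tracked by the queue
    private long pageCounterAdded;    // pageSubscription.getCounter().getValueAdded()

    public PagedQueueSketch(long messagesAdded, long pageCounterAdded) {
        this.messagesAdded = messagesAdded;
        this.pageCounterAdded = pageCounterAdded;
    }

    public long getMessagesAdded() {
        return messagesAdded + pageCounterAdded;
    }

    // Models the page counter's delete(): its contribution is simply lost.
    public void deletePageCounter() {
        pageCounterAdded = 0;
    }
}
```

Once the page counter is deleted, the next getMessagesAdded() sample is lower than the previous one, which is exactly what drives newMessagesAdded (and hence countDelta) negative.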
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1377703#c5 [2]: https://github.com/hornetq/hornetq/blob/9264bd1fbdc0f58336f8deddcc7485f6d2fade38/hornetq-server/src/main/java/org/hornetq/core/messagecounter/MessageCounter.java#L121
Make that a new issue... it's a very specific corner case involving colocated servers. I don't think we should/want to retag because of this.
are you sure this is related to colocated or anything like that?
Anyway, can we create a new issue and follow up as part of that?
The reason seems to be the deletion of the page counter. After the counter is cleaned up, its value is reset to zero. I'm working on a fix that adds the current values to the queue before the reset. The QA test passes with my fix, but I still need to add a unit test and make sure my fix won't cause a regression. Howard
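The fix described here can be sketched like this (a minimal model under my reading of the comment, not the actual HornetQ patch): before the page counter is reset, its accumulated value is folded into the queue's own counter so the reported total never goes backwards.

```java
// Hedged sketch of the fix: fold the page counter into the queue before reset.
public class PageCounterFixSketch {
    private long messagesAdded;       // queue's own add count
    private long pageCounterAdded;    // page counter's contribution

    public PageCounterFixSketch(long messagesAdded, long pageCounterAdded) {
        this.messagesAdded = messagesAdded;
        this.pageCounterAdded = pageCounterAdded;
    }

    public long getMessagesAdded() {
        return messagesAdded + pageCounterAdded;
    }

    // Credit the page counter's value to the queue before zeroing it,
    // so getMessagesAdded() stays monotonic across the cleanup.
    public void deletePageCounter() {
        messagesAdded += pageCounterAdded;
        pageCounterAdded = 0;
    }
}
```

With this fold-in, the sampled total is unchanged by the page counter cleanup, so newMessagesAdded (and hence countDelta) can no longer go negative from this path.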
I'd agree with Jiri. While achievable by different steps, negative values can still occur, so I wouldn't consider this fixed. +1 for moving the issue to CP14.
Same symptom, but a different issue... Before, the calculation was wrong. Now the page counter is deleted, so the value is affected after the page is cleaned up (only at that point). Still worth mentioning the previous fix in the release notes, and tracking this one as a known issue to fix in the next release.
I don't think anyone will care about a HornetQ JIRA. Just a BZ is fine... I haven't been using HornetQ JIRAs to manage EAP fixes on HornetQ for some time already.
ok, just a new bug then.
I believe it's fixed on 2.3.25.x already. It wasn't an issue on artemis/master or artemis/1.x.
Verified with EAP 6.4.13.CP.CR2. The new issue is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1414747
Released with EAP 6.4.13 on Feb 02 2017.