Bug 507313 - clustered qpidd msgstore throws 'Enqueue capacity threshold exceeded on queue' in cases when standalone qpidd+msgstore does not
Summary: clustered qpidd msgstore throws 'Enqueue capacity threshold exceeded on queue...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.1.1
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: 1.1.6
Assignee: Kim van der Riet
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-06-22 10:00 UTC by Frantisek Reznicek
Modified: 2015-11-16 00:07 UTC
CC: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-06-23 18:03:43 UTC
Target Upstream Version:
Embargoed:



Description Frantisek Reznicek 2009-06-22 10:00:16 UTC
Description of problem:
The qpidd msgstore plugin causes qpidd to throw the following messages and close the client connection, which I believe is not correct; there should be much more space per queue.

  2009-jun-22 11:36:50 warning Journal "perf_iqcG1LhNY0": Enqueue capacity threshold exceeded on queue "perf_iqcG1LhNY0".
  2009-jun-22 11:36:50 debug Exception constructed: Enqueue capacity threshold exceeded on queue "perf_iqcG1LhNY0". (JournalImpl.cpp:575)
  2009-jun-22 11:36:50 error Unexpected exception: Enqueue capacity threshold exceeded on queue "perf_iqcG1LhNY0". (JournalImpl.cpp:575)
  2009-jun-22 11:36:50 debug 10.34.33.200:20571(READY/error) Connection error 66015 unresolved: 10.34.33.200:20571
  2009-jun-22 11:36:50 trace MCAST 10.34.33.200:20571-0: {ClusterErrorCheckBody: type=2; frame-seq=66015; }
  2009-jun-22 11:36:50 error Connection 127.0.0.1:58619 closed by error: Enqueue capacity threshold exceeded on queue "perf_iqcG1LhNY0". (JournalImpl.cpp:575)(501)
  2009-jun-22 11:36:50 trace 10.34.33.200:20571(READY/error) DLVR: Event[10.34.33.200:20571-0 control 27 bytes]
  2009-jun-22 11:36:50 trace SENT [127.0.0.1:58619]: Frame[BEbe; channel=0; {ConnectionCloseBody: reply-code=501; reply-text=Enqueue capacity threshold exceeded on queue "perf_iqcG1LhNY0". (JournalImpl.cpp:575); }]


qpid-tool reports the following statistics for the perf_iqcG1LhNY0 queue:
Object of type org.apache.qpid.broker:queue: (last sample time: 09:37:03)
    Type       Element                113
    ===============================================================================
    property   vhostRef               103
    property   name                   perf_iqcG1LhNY0
    property   durable                True
    property   autoDelete             False
    property   exclusive              False
    property   arguments              {u'qpid.max_size': 0, u'qpid.max_count': 0}
    statistic  msgTotalEnqueues       6489 messages
    statistic  msgTotalDequeues       6489
    statistic  msgTxnEnqueues         0
    statistic  msgTxnDequeues         0
    statistic  msgPersistEnqueues     6489
    statistic  msgPersistDequeues     6489
    statistic  msgDepth               0
    statistic  byteDepth              0 octets
    statistic  byteTotalEnqueues      8883441
    statistic  byteTotalDequeues      8883441
    statistic  byteTxnEnqueues        0
    statistic  byteTxnDequeues        0
    statistic  bytePersistEnqueues    8883441
    statistic  bytePersistDequeues    8883441
    statistic  consumerCount          1 consumer
    statistic  consumerCountHigh      1
    statistic  consumerCountLow       1
    statistic  bindingCount           1 binding
    statistic  bindingCountHigh       1
    statistic  bindingCountLow        1
    statistic  unackedMessages        0 messages
    statistic  unackedMessagesHigh    0
    statistic  unackedMessagesLow     0
    statistic  messageLatencySamples  0
    statistic  messageLatencyMin      0
    statistic  messageLatencyMax      0
    statistic  messageLatencyAverage  0

I suspect that there should be more space per queue.

Version-Release number of selected component (if applicable):
openais-0.80.3-22.el5_3.8
openais-debuginfo-0.80.3-22.el5_3.8
openais-devel-0.80.3-22.el5_3.8
python-qpid-0.5.752581-3.el5
qpidc-0.5.752581-17.el5
qpidc-debuginfo-0.5.752581-17.el5
qpidc-devel-0.5.752581-17.el5
qpidc-perftest-0.5.752581-17.el5
qpidc-rdma-0.5.752581-17.el5
qpidc-ssl-0.5.752581-17.el5
qpidd-0.5.752581-17.el5
qpidd-acl-0.5.752581-17.el5
qpidd-cluster-0.5.752581-17.el5
qpidd-devel-0.5.752581-17.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-17.el5
qpidd-ssl-0.5.752581-17.el5
qpidd-xml-0.5.752581-17.el5
qpid-java-client-0.5.751061-7.el5
qpid-java-common-0.5.751061-7.el5
rhm-0.5.3206-1.el5
rhm-docs-0.5.756148-1.el5


How reproducible:
100%

Steps to Reproduce:
1. launch broker like this:
  qpidd -p <PORT> --auth no --log-enable trace+ --cluster-name dhcp-lab-200.englab.brq.redhat.com_cluster_V.MRG.M.0043_1/2_w1 --data-dir /home/freznice/prjs/rhts_tests/tests/distribution/MRG_Messaging/qpid_ptest_cluster_perftest/rhts_qpidd/20090622_113627/qpidd.0565
  The cluster name can be omitted, I believe.
2. Launch multiple perftest instances in parallel in the background (a modified perftest that allows queue-name specification):
[11:36:45] ..perftest(0) --unique-data yes --log-enable trace+ -p 38692 --count 14555 --size 1369 --durable 1 --queue-name perf_iqcG1LhNY launched
[11:36:45] ..perftest(1) --unique-data yes --log-enable trace+ -p 38692 --count 29912 --size 438 --durable 0 --queue-name perf_BfXljCD7zzJdkgIZVw0Li4RWGm launched
[11:36:45] ..perftest(2) --unique-data yes --log-enable trace+ -p 38692 --count 16320 --size 1056 --durable 1 --queue-name perf_xUZr75bnkZT3 launched
[11:36:46] ..perftest(3) --unique-data yes --log-enable trace+ -p 38692 --count 12203 --size 1276 --durable 1 --queue-name perf_zwL
3. Check the perftest and qpidd logfiles
  
Actual results:
Durable perftest runs hang (do not finish) with '...Enqueue capacity threshold exceeded on queue'.

Expected results:
I expect there to be more space per queue; the above exception should not be thrown.

Additional info:

The data for this case, including the modified perftest, are on mrg3.lab.bos.redhat.com:/root/:

The test, including journal files:
  /root/qpid_ptest_cluster_perftest.2.fail53x.tar.bz2

pstack & gdb backtraces of the hung perftest clients and the running qpidd:
  /root/qpid_ptest_cluster_perftest.2.fail53x.analysis.log

Comment 1 Frantisek Reznicek 2009-06-22 10:51:44 UTC
I just updated the BZ title; the msgstore queue capacity thresholds definitely differ between standalone and clustered qpidd. I see the expected behavior when qpidd runs standalone, but when it is clustered I see queue capacity exceptions.

So the issue is triggered by a mid-range message count and size (perftest) together with clustered qpidd (even a 1-node cluster).


Much bigger messages and queues are sent/received correctly when a standalone broker is used:
perf_ZeHEe40qloTCb5ipD56IsY0  --durable --max-queue-size=0 --max-queue-count=0 msg|byteEnq/Deq:35665/35665|22397620/22397620 unacked:0 consumer_cnt:0 binding_cnt:1
perf_zmf5XPALU970  --durable --max-queue-size=0 --max-queue-count=0 msg|byteEnq/Deq:11215/11215|17125305/17125305 unacked:0 consumer_cnt:0 binding_cnt:1
perf_dFMqZXr8BCwYFN8QLSOI7aC0  --durable --max-queue-size=0 --max-queue-count=0 msg|byteEnq/Deq:17616/17616|7821504/7821504 unacked:0 consumer_cnt:0 binding_cnt:1
perf_IbPDZVF30  --durable --max-queue-size=0 --max-queue-count=0 msg|byteEnq/Deq:26351/26351|35784658/35784658 unacked:0 consumer_cnt:0 binding_cnt:1
perf_sryFAH2jIly0y0  --durable --max-queue-size=0 --max-queue-count=0 msg|byteEnq/Deq:34024/34024|39808080/39808080 unacked:0 consumer_cnt:0 binding_cnt:1
perf_lYkjLh1g7saFJy4b470  --max-queue-size=0 --max-queue-count=0 msg|byteEnq/Deq:22050/22050|32964750/32964750 unacked:0 consumer_cnt:0 binding_cnt:1
perf_ucjqITk4J7PcaEt6jFd0  --max-queue-size=0 --max-queue-count=0 msg|byteEnq/Deq:18575/18575|22772950/22772950 unacked:0 consumer_cnt:0 binding_cnt:1
perf_nyblTYSZHM6Zh70  --max-queue-size=0 --max-queue-count=0 msg|byteEnq/Deq:27129/27129|14595402/14595402 unacked:0 consumer_cnt:0 binding_cnt:1
perf_OTdwJZGJSmTZsAEEt2JH0zdVhG7HqFIE0  --max-queue-size=0 --max-queue-count=0 msg|byteEnq/Deq:34010/34010|35234360/35234360 unacked:0 consumer_cnt:0 binding_cnt:1
perf_IQSwTo80  --max-queue-size=0 --max-queue-count=0 msg|byteEnq/Deq:0/0|0/0 unacked:0 consumer_cnt:1 binding_cnt:1

Comment 2 Frantisek Reznicek 2009-06-22 11:47:08 UTC
The same behavior is also observed with the following packages (RHEL 5.3, x86_64):
openais-0.80.3-22.el5_3.8
openais-debuginfo-0.80.3-22.el5_3.8
openais-devel-0.80.3-22.el5_3.8
python-qpid-0.5.752581-3.el5
qpidc-0.5.752581-19.el5
qpidc-debuginfo-0.5.752581-17.el5
qpidc-devel-0.5.752581-19.el5
qpidc-perftest-0.5.752581-19.el5
qpidc-rdma-0.5.752581-19.el5
qpidc-ssl-0.5.752581-19.el5
qpidd-0.5.752581-19.el5
qpidd-acl-0.5.752581-19.el5
qpidd-cluster-0.5.752581-19.el5
qpidd-devel-0.5.752581-19.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-19.el5
qpidd-ssl-0.5.752581-19.el5
qpidd-xml-0.5.752581-19.el5
qpid-java-client-0.5.751061-7.el5
qpid-java-common-0.5.751061-7.el5
rhm-0.5.3206-5.el5
rhm-docs-0.5.756148-1.el5

Comment 3 Frantisek Reznicek 2009-06-22 12:28:49 UTC
One clarification:
The above-mentioned behavior (the capacity-exceeded exception) is seen only in the case of durable messages (perftest --durable).

Comment 4 Frantisek Reznicek 2009-06-22 14:36:44 UTC
A second correction:
This BZ highlights the "Enqueue capacity threshold exceeded on queue" exception.
Please read all 'policy exceeded exception' occurrences above as "Enqueue capacity threshold exceeded on queue" exceptions.

In reality the two terms refer to different cases, so sorry for the confusion.

Comment 5 Frantisek Reznicek 2009-06-23 12:06:58 UTC
The issue has now also been reproduced on the qpidc*-0.5.752581-20.el5 package set while running a durable failover soak test [MSG:413370, DUR:1, MDL:/usr/lib64/qpid/daemon, MAN:yes, QPIDD_CONFIG:all, QPIDD_NO_TCPDELAY:yes].

Comment 6 Kim van der Riet 2009-06-23 18:03:43 UTC
A default queue journal has a capacity of 12MB (8 files of 1.5MB each). The enqueue capacity threshold triggers when the message to be enqueued, plus 20% of the journal capacity, together fall in a file that still contains an enqueued record. If this condition triggers, the connection performing the enqueue is closed. This strategy amounts to an approximately 70-80% threshold, but it can be lower. Note further that this test is applied to enqueue records only; dequeues and transaction records continue (on other connections) as long as there is free space in the journal.
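
To make the numbers concrete (arithmetic derived from the figures above, taking the best case where only the 20% margin trips the check):

  journal capacity   = 8 files * 1.5MB/file = 12MB
  threshold margin   = 20% of 12MB          = 2.4MB
  usable capacity   <= 12MB - 2.4MB         = ~9.6MB (~80%)

Because the check works on whole journal files rather than exact byte counts, the effective threshold can also land below 80%.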

Looking at the perftest parameters above, instance 0 will enqueue 14555 * 1369 bytes/msg = 20MB. Similarly, instance 1 will enqueue 13MB, 2 will enqueue 17MB and 3 will enqueue 15MB.

Under normal (non-clustered) conditions, these tests can pass using a 12MB journal because the rate of dequeue relative to the rate of enqueue is sufficiently high that the journal does not reach the threshold. However, when run on a cluster, the dequeue-enqueue ratio is significantly lower, and the journal quickly fills up to the threshold.

Note that because perftest uses separate connections for producing and consuming messages, when the producing connection is closed by an enqueue threshold exception, the consuming connection continues to dequeue messages until all the records that _did_ make it in before the exception are dequeued. This explains why there are no messages remaining after the test. However, the numbers of messages that _were_ enqueued and dequeued show the point at which the exception occurred during the test.

In the example above (from qpid-tool), 6489 of the expected 14555 messages were enqueued before the threshold exception was hit. This corresponds to 8.8MB/12MB ~ 75% full. Looking at perftest cluster performance, enqueues are so much faster than dequeues that often only 30-40 records are dequeued in the time it takes to enqueue 8000 records (this is a separate issue).
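
As a cross-check (arithmetic only, using the qpid-tool figures quoted in the description):

  6489 msgs * 1369 bytes/msg = 8,883,441 bytes  (= byteTotalEnqueues above)
  8,883,441 bytes / 12MB journal ~ 75% full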

RECOMMENDATION: When testing clusters with perftest, size the journal so that all the messages to be sent fit into the journal under the threshold. For the tests above, the largest of which enqueues 20MB, make the journal ~30MB, which should hold all the enqueued messages.
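
For example (a sketch only; the option names below are the 0.5-era store/qpid-config ones and should be verified against the installed release), the journal can be sized per queue at declare time:

  qpid-config add queue perf_iqcG1LhNY0 --durable --file-count 16 --file-size 32

which gives 16 files * 32 pages * 64KiB/page = 32MB, or broker-wide via the store module defaults:

  qpidd ... --num-jfiles 16 --jfile-size-pgs 32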

Closing as NOTABUG. If I missed anything, please set back to ASSIGNED.

