Description of problem: The qpidd msgstore plugin causes qpidd to emit the following messages and close the client connection, which I believe is not correct; there should be much more space per queue.

2009-jun-22 11:36:50 warning Journal "perf_iqcG1LhNY0": Enqueue capacity threshold exceeded on queue "perf_iqcG1LhNY0".
2009-jun-22 11:36:50 debug Exception constructed: Enqueue capacity threshold exceeded on queue "perf_iqcG1LhNY0". (JournalImpl.cpp:575)
2009-jun-22 11:36:50 error Unexpected exception: Enqueue capacity threshold exceeded on queue "perf_iqcG1LhNY0". (JournalImpl.cpp:575)
2009-jun-22 11:36:50 debug 10.34.33.200:20571(READY/error) Connection error 66015 unresolved: 10.34.33.200:20571
2009-jun-22 11:36:50 trace MCAST 10.34.33.200:20571-0: {ClusterErrorCheckBody: type=2; frame-seq=66015; }
2009-jun-22 11:36:50 error Connection 127.0.0.1:58619 closed by error: Enqueue capacity threshold exceeded on queue "perf_iqcG1LhNY0". (JournalImpl.cpp:575)(501)
2009-jun-22 11:36:50 trace 10.34.33.200:20571(READY/error) DLVR: Event[10.34.33.200:20571-0 control 27 bytes]
2009-jun-22 11:36:50 trace SENT [127.0.0.1:58619]: Frame[BEbe; channel=0; {ConnectionCloseBody: reply-code=501; reply-text=Enqueue capacity threshold exceeded on queue "perf_iqcG1LhNY0". (JournalImpl.cpp:575); }]

qpid-tool reports the following statistics for the perf_iqcG1LhNY0 queue:

Object of type org.apache.qpid.broker:queue: (last sample time: 09:37:03)
Type       Element                113
===============================================================================
property   vhostRef               103
property   name                   perf_iqcG1LhNY0
property   durable                True
property   autoDelete             False
property   exclusive              False
property   arguments              {u'qpid.max_size': 0, u'qpid.max_count': 0}
statistic  msgTotalEnqueues       6489 messages
statistic  msgTotalDequeues       6489
statistic  msgTxnEnqueues         0
statistic  msgTxnDequeues         0
statistic  msgPersistEnqueues     6489
statistic  msgPersistDequeues     6489
statistic  msgDepth               0
statistic  byteDepth              0 octets
statistic  byteTotalEnqueues      8883441
statistic  byteTotalDequeues      8883441
statistic  byteTxnEnqueues        0
statistic  byteTxnDequeues        0
statistic  bytePersistEnqueues    8883441
statistic  bytePersistDequeues    8883441
statistic  consumerCount          1 consumer
statistic  consumerCountHigh      1
statistic  consumerCountLow       1
statistic  bindingCount           1 binding
statistic  bindingCountHigh       1
statistic  bindingCountLow        1
statistic  unackedMessages        0 messages
statistic  unackedMessagesHigh    0
statistic  unackedMessagesLow     0
statistic  messageLatencySamples  0
statistic  messageLatencyMin      0
statistic  messageLatencyMax      0
statistic  messageLatencyAverage  0

I suspect that there should be more space per queue...
Version-Release number of selected component (if applicable):
openais-0.80.3-22.el5_3.8
openais-debuginfo-0.80.3-22.el5_3.8
openais-devel-0.80.3-22.el5_3.8
python-qpid-0.5.752581-3.el5
qpidc-0.5.752581-17.el5
qpidc-debuginfo-0.5.752581-17.el5
qpidc-devel-0.5.752581-17.el5
qpidc-perftest-0.5.752581-17.el5
qpidc-rdma-0.5.752581-17.el5
qpidc-ssl-0.5.752581-17.el5
qpidd-0.5.752581-17.el5
qpidd-acl-0.5.752581-17.el5
qpidd-cluster-0.5.752581-17.el5
qpidd-devel-0.5.752581-17.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-17.el5
qpidd-ssl-0.5.752581-17.el5
qpidd-xml-0.5.752581-17.el5
qpid-java-client-0.5.751061-7.el5
qpid-java-common-0.5.751061-7.el5
rhm-0.5.3206-1.el5
rhm-docs-0.5.756148-1.el5

How reproducible:
100%

Steps to Reproduce:
1. Launch the broker like this (the cluster name can be omitted, I believe):
qpidd -p <PORT> --auth no --log-enable trace+ --cluster-name dhcp-lab-200.englab.brq.redhat.com_cluster_V.MRG.M.0043_1/2_w1 --data-dir /home/freznice/prjs/rhts_tests/tests/distribution/MRG_Messaging/qpid_ptest_cluster_perftest/rhts_qpidd/20090622_113627/qpidd.0565
2. Launch multiple perftest instances in parallel in the background (a modified perftest which allows queue name specification):
[11:36:45] ..perftest(0) --unique-data yes --log-enable trace+ -p 38692 --count 14555 --size 1369 --durable 1 --queue-name perf_iqcG1LhNY launched
[11:36:45] ..perftest(1) --unique-data yes --log-enable trace+ -p 38692 --count 29912 --size 438 --durable 0 --queue-name perf_BfXljCD7zzJdkgIZVw0Li4RWGm launched
[11:36:45] ..perftest(2) --unique-data yes --log-enable trace+ -p 38692 --count 16320 --size 1056 --durable 1 --queue-name perf_xUZr75bnkZT3 launched
[11:36:46] ..perftest(3) --unique-data yes --log-enable trace+ -p 38692 --count 12203 --size 1276 --durable 1 --queue-name perf_zwL
3. Check the perftest and qpidd logfiles

Actual results:
Durable perftest runs did not end (they hang) with '...Enqueue capacity threshold exceeded on queue'.

Expected results:
I expect that there is more space, and the above exception should not be thrown.

Additional info:
The data for this case, including the modified perftest, are on mrg3.lab.bos.redhat.com:/root/:
The test including journal files: /root/qpid_ptest_cluster_perftest.2.fail53x.tar.bz2
pstack & gdb backtraces of the hung perftest clients and running qpidd: /root/qpid_ptest_cluster_perftest.2.fail53x.analysis.log
I just updated the BZ title; the msgstore queue policy thresholds definitely differ between standalone and clustered qpidd. I see the expected behavior when qpidd runs standalone, but when clustered I see queue policy exceptions... So the issue is triggered by a mid-range message count and size (perftest) together with a clustered qpidd (even a 1-node cluster). Much bigger messages and queues are sent/received correctly when a standalone broker is used:

perf_ZeHEe40qloTCb5ipD56IsY0 --durable --max-queue-size=0 --max-queue-count=0 msg|byteEnq/Deq:35665/35665|22397620/22397620 unacked:0 consumer_cnt:0 binding_cnt:1
perf_zmf5XPALU970 --durable --max-queue-size=0 --max-queue-count=0 msg|byteEnq/Deq:11215/11215|17125305/17125305 unacked:0 consumer_cnt:0 binding_cnt:1
perf_dFMqZXr8BCwYFN8QLSOI7aC0 --durable --max-queue-size=0 --max-queue-count=0 msg|byteEnq/Deq:17616/17616|7821504/7821504 unacked:0 consumer_cnt:0 binding_cnt:1
perf_IbPDZVF30 --durable --max-queue-size=0 --max-queue-count=0 msg|byteEnq/Deq:26351/26351|35784658/35784658 unacked:0 consumer_cnt:0 binding_cnt:1
perf_sryFAH2jIly0y0 --durable --max-queue-size=0 --max-queue-count=0 msg|byteEnq/Deq:34024/34024|39808080/39808080 unacked:0 consumer_cnt:0 binding_cnt:1
perf_lYkjLh1g7saFJy4b470 --max-queue-size=0 --max-queue-count=0 msg|byteEnq/Deq:22050/22050|32964750/32964750 unacked:0 consumer_cnt:0 binding_cnt:1
perf_ucjqITk4J7PcaEt6jFd0 --max-queue-size=0 --max-queue-count=0 msg|byteEnq/Deq:18575/18575|22772950/22772950 unacked:0 consumer_cnt:0 binding_cnt:1
perf_nyblTYSZHM6Zh70 --max-queue-size=0 --max-queue-count=0 msg|byteEnq/Deq:27129/27129|14595402/14595402 unacked:0 consumer_cnt:0 binding_cnt:1
perf_OTdwJZGJSmTZsAEEt2JH0zdVhG7HqFIE0 --max-queue-size=0 --max-queue-count=0 msg|byteEnq/Deq:34010/34010|35234360/35234360 unacked:0 consumer_cnt:0 binding_cnt:1
perf_IQSwTo80 --max-queue-size=0 --max-queue-count=0 msg|byteEnq/Deq:0/0|0/0 unacked:0 consumer_cnt:1 binding_cnt:1
The same behavior is observed also with these packages (RHEL 5.3, x86_64):
openais-0.80.3-22.el5_3.8
openais-debuginfo-0.80.3-22.el5_3.8
openais-devel-0.80.3-22.el5_3.8
python-qpid-0.5.752581-3.el5
qpidc-0.5.752581-19.el5
qpidc-debuginfo-0.5.752581-17.el5
qpidc-devel-0.5.752581-19.el5
qpidc-perftest-0.5.752581-19.el5
qpidc-rdma-0.5.752581-19.el5
qpidc-ssl-0.5.752581-19.el5
qpidd-0.5.752581-19.el5
qpidd-acl-0.5.752581-19.el5
qpidd-cluster-0.5.752581-19.el5
qpidd-devel-0.5.752581-19.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-19.el5
qpidd-ssl-0.5.752581-19.el5
qpidd-xml-0.5.752581-19.el5
qpid-java-client-0.5.751061-7.el5
qpid-java-common-0.5.751061-7.el5
rhm-0.5.3206-5.el5
rhm-docs-0.5.756148-1.el5
One clarification: the above-mentioned behavior (the policy exceeded exception) is seen only in the case of durable messages (perftest --durable).
A second correction: this BZ highlights the "Enqueue capacity threshold exceeded on queue" exception. Please read all 'policy exceeded exception' occurrences above as "Enqueue capacity threshold exceeded on queue" exceptions. The two terms actually refer to different cases, so sorry for the confusion.
The issue has now been reproduced also on the qpidc*-0.5.752581-20.el5 package set running a durable failover soak [MSG:413370, DUR:1, MDL:/usr/lib64/qpid/daemon, MAN:yes, QPIDD_CONFIG:all, QPIDD_NO_TCPDELAY:yes].
A default queue journal has a capacity of 12MB (8 files of 1.5MB each). The enqueue capacity threshold triggers when the message to be enqueued plus 20% of the journal capacity together falls in a file containing an enqueued record. If this condition triggers, the connection performing the enqueue is closed. This strategy amounts to an approximately 70-80% threshold, but it can be lower. Note further that this test is applied to enqueue records only; dequeues and transaction records continue (on other connections) as long as there is free space in the journal.

Looking at the perftest parameters above, instance 0 will enqueue 14555 msgs * 1369 bytes/msg = ~20MB. Similarly, instance 1 will enqueue 13MB, instance 2 will enqueue 17MB, and instance 3 will enqueue 15.5MB.

Under normal (non-clustered) conditions, these tests can pass using a 12MB journal because the rate of dequeue relative to the rate of enqueue is sufficiently high that the journal does not reach the threshold. However, when run on a cluster, the dequeue-to-enqueue ratio is significantly lower, and the journal quickly fills up to the threshold.

Note that because perftest uses separate connections for producing and consuming messages, when the producing connection is closed because of an enqueue threshold exception, the consuming connection continues to dequeue messages until all those records that _did_ make it in before the exception are dequeued. This explains why there are no remaining messages after the test. However, the number of messages that _were_ enqueued and dequeued shows the point at which the exception occurred during the test. In the example above (from qpid-tool), 6489 of the 14555 expected messages were enqueued before the threshold exception was reached. This corresponds to 8.8MB/12MB ~ 75% full.

Looking at perftest cluster performance, enqueues are so much faster than dequeues that often only 30-40 records are dequeued in the time it takes to enqueue 8000 records (this is a separate issue).
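The back-of-the-envelope arithmetic above can be reproduced as follows. This models only the byte totals quoted in this comment, not the store's actual file-granular threshold check in JournalImpl.cpp:

```python
# Default journal geometry quoted above: 8 files of 1.5 MB each = 12 MB.
JOURNAL_CAPACITY = 8 * int(1.5 * 1024 * 1024)

# Bytes each perftest instance will try to enqueue (count * message size).
runs = {0: (14555, 1369), 1: (29912, 438), 2: (16320, 1056), 3: (12203, 1276)}
for i, (count, size) in runs.items():
    print(i, count * size)      # instance 0 -> 19925795 bytes (~20 MB)

# Journal fill level at the point the exception fired:
# qpid-tool reported 8883441 bytes enqueued before the connection was closed.
fill = 8883441 / JOURNAL_CAPACITY
print(round(fill * 100, 1))     # 70.6 -- inside the ~70-80% threshold band
```

Every instance's total payload exceeds the 12 MB journal, so each run depends on dequeues keeping pace with enqueues, which they do not in the clustered case.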
RECOMMENDATION: When testing clusters using perftest, size the journal so that all the messages to be sent will fit into the journal under the threshold. For the tests above, the largest of which enqueues 20MB, make the journal ~30MB, which should hold all the enqueued messages. Closing as NOTABUG. If I missed anything, please set back to ASSIGNED.
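A quick way to translate that recommendation into journal geometry is sketched below. It assumes the defaults quoted above (file size counted in 64 KiB pages, 24 pages = 1.5 MB per file). With the stores of this era the geometry is typically set per queue at declare time or via broker store options; the exact argument names vary by store version, so verify them against your store's documentation:

```python
import math

PAGE = 64 * 1024   # journal file size counted in 64 KiB pages (assumed)
FILE_PGS = 24      # default: 24 pages -> 1.5 MB per journal file

def journal_files_needed(total_bytes, headroom=0.5, file_pgs=FILE_PGS):
    """Journal files needed so the total payload plus headroom stays under the
    enqueue threshold; 50% headroom mirrors the ~30 MB advice for a 20 MB test."""
    file_size = file_pgs * PAGE
    return math.ceil(total_bytes * (1 + headroom) / file_size)

# Largest test above: 14555 messages of 1369 bytes = ~20 MB of payload.
print(journal_files_needed(14555 * 1369))   # 20 files of 1.5 MB = 30 MB journal
```

The 50% headroom figure is an assumption chosen to match the ~30 MB recommendation; payload bytes understate real journal usage slightly because each record also carries headers.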