Description of problem:
A clustered persistent qpidd broker unexpectedly throws a journal exception to the client (qpid-perftest):

[root@mrg-qe-10 cluster_test_bz674338]# qpid-perftest --durable true --count 50000 --size 8 --summary --username guest --password guest
2011-02-10 11:49:38 warning Broker closed connection: 501, Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
PublishThread exception: framing-error: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)

The exception is thrown at moments when no exception should occur. This defect relates to the observation mentioned in bug 509796 comment 19.

The configuration is as follows:
- openais is configured and running
- the broker is started as a service with the following config:
    cluster-mechanism=ANONYMOUS
    auth=yes
    #auth=no
    log-to-file=/tmp/qpidd.log
    log-enable=info+
    #log-enable=debug+:cluster
    cluster-name=fclusterA
    mgmt-pub-interval=2
    truncate=yes
- qpid-cluster reports one node in the cluster:
    [root@mrg-qe-10 cluster_test_bz674338]# qpid-cluster
      Cluster Name: fcluster
      Cluster Status: ACTIVE
      Cluster Size: 1
        Members: ID=10.34.45.10:5636 URL=amqp:tcp:10.34.33.63:5672,tcp:10.34.44.10:5672,tcp:10.34.45.10:5672
- then the qpid-perftest client is run:
    [root@mrg-qe-10 cluster_test_bz674338]# qpid-perftest --durable true --count 50000 --size 8 --summary --username guest --password guest
    2011-02-10 11:49:38 warning Broker closed connection: 501, Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
    PublishThread exception: framing-error: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)

At this point I expect that the perftest client CANNOT trigger the journal exception 'Enqueue capacity threshold exceeded on queue...'.

More surprisingly, I have two identical machines with identical HW, identical RHEL (5.6) and identical architecture (x86_64); on the machine above I see the exception, while on the other one (mrg-qe-09) I never see it.

To double-check that there is nothing wrong with the mrg-qe-10 config, I restarted the service and ran the test again:

[root@mrg-qe-10 _bz]# qpid-perftest --durable true --count 50000 --size 8 --summary --username guest --password guest
2011-02-10 14:59:30 warning Broker closed connection: 501, Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
PublishThread exception: framing-error: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)

[root@mrg-qe-10 _bz]# qpid-perftest --durable true --count 500 --size 8 --summary --username guest --password guest
30224.3 3230.43 16135.5 0.123104

The latter command shows that the client is able to put/get messages, so there has to be something wrong with the threshold for this exception.
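(Context note, not part of the original report: the "Enqueue capacity threshold exceeded" error is raised when the durable queue's store journal fills up, so its trigger point is a function of the journal size. Below is a minimal sketch of enlarging the journal broker-wide, assuming the legacy-store defaults of 8 journal files of 24 x 64 KiB pages each; the option names belong to the store module and should be verified with 'qpidd --help' on this build once the store plugin is loaded.)

    # /etc/qpidd.conf -- sketch only; verify option names on this build
    num-jfiles=16        # assumed default: 8 journal files
    jfile-size-pgs=48    # assumed default: 24 pages (64 KiB each) per file

With a roughly fourfold larger journal, the enqueue threshold (about 80% of journal capacity, if the store defaults are as assumed) is reached much later, so a 50000-message durable backlog should fit regardless of consumer timing.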
Version-Release number of selected component (if applicable):
[root@mrg-qe-10 _bz]# rpm -qa | grep qpid | sort
python-qpid-0.7.946106-15.el5
qpid-cpp-client-0.7.946106-28.el5
qpid-cpp-client-devel-0.7.946106-28.el5
qpid-cpp-client-devel-docs-0.7.946106-28.el5
qpid-cpp-client-rdma-0.7.946106-28.el5
qpid-cpp-client-ssl-0.7.946106-28.el5
qpid-cpp-mrg-debuginfo-0.7.946106-28.el5
qpid-cpp-server-0.7.946106-28.el5
qpid-cpp-server-cluster-0.7.946106-28.el5
qpid-cpp-server-devel-0.7.946106-28.el5
qpid-cpp-server-rdma-0.7.946106-28.el5
qpid-cpp-server-ssl-0.7.946106-28.el5
qpid-cpp-server-store-0.7.946106-28.el5
qpid-cpp-server-xml-0.7.946106-28.el5
qpid-dotnet-0.4.738274-2.el5
qpid-java-client-0.7.946106-15.el5
qpid-java-common-0.7.946106-15.el5
qpid-java-example-0.7.946106-15.el5
qpid-tools-0.7.946106-12.el5
rh-qpid-cpp-tests-0.7.946106-28.el5

How reproducible:
On mrg-qe-10: 100%. On mrg-qe-09: never.

Steps to Reproduce:
1. service openais restart
2. service openais start
3. qpid-perftest --durable true --count 50000 --size 8 --summary --username guest --password guest
(A consolidated script for these steps is sketched below.)

Actual results:
qpid-perftest --durable true --count 50000 --size 8 throws the 'Enqueue capacity threshold exceeded on queue...' exception.

Expected results:
qpid-perftest --durable true --count 50000 --size 8 should not throw the 'Enqueue capacity threshold exceeded on queue...' exception.

Additional info:
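The reproduction steps collapse into one shell sketch. Step 2 of the report reads "service openais start"; the description also says the broker is "started with service", so a qpidd start is shown here as well and marked as an assumption.

    #!/bin/sh
    # Reproduction sketch mirroring "Steps to Reproduce" above.
    service openais restart
    service openais start
    service qpidd start      # assumed: description says the broker is started as a service
    qpid-perftest --durable true --count 50000 --size 8 --summary \
        --username guest --password guest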
Created attachment 478056 [details]
The journals, logs and terminal transcripts

The attachment shows the same scenario run on two identical machines with different results (on one the journal exception is thrown, on the other it is not). The qpidd journals from both machines are included so that the contents of the data-dirs can be compared.
Interesting observation: I can easily reproduce the problem as described, but if I set "auth=no" in the configuration the problem goes away. So it appears to be related to authentication in some way, but I don't know what the connection might be.

Host info where I reproduced:
mrg32.lab.bos.redhat.com
2.6.18-238.el5 x86_64: 16050Mb 2493MHz 8-core/2-cpu Intel(R) Xeon(R) CPU E5420 @ 2.50GHz
Red Hat Enterprise Linux Server release 5.6 (Tikanga)
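For reference, the only configuration change between the two runs compared here is the auth line from the config in the problem description (sketch of the relevant line only):

    # qpidd.conf variant that reproduces the ETE
    auth=yes
    # qpidd.conf variant where the problem goes away
    auth=no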
Created attachment 478244 [details]
Analysis of journals from comment #1

I have examined the two journals from mrg-qe-09 and mrg-qe-10, and neither shows any irregularity in the journal itself. I checked the enqueue threshold calculation from the mrg-qe-10 journal and found it to be correct. All analysis details are in the attached file.

There is a distinct difference in the enqueue/dequeue patterns in the journals. The journal from mrg-qe-09 had a maximum depth of 27311 records, while the journal from mrg-qe-10 had a depth of 36548 records at the time of the enqueue failure. This analysis shows that the enqueue/dequeue patterns are very different on these two machines, but it does not shed any light on why that might be the case.
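To put the two depths in perspective, here is a rough back-of-envelope calculation. The numbers are mine, not from the attachment: I am assuming the default journal geometry of 8 files x 24 x 64 KiB pages, an enqueue threshold of roughly 80% of journal capacity, and a per-record footprint of around 256 bytes for an 8-byte durable message once record headers, AMQP message headers and block alignment are counted.

    journal capacity    ~ 8 files x 24 pages x 64 KiB   ~ 12 MiB
    enqueue threshold   ~ 0.8 x 12 MiB                  ~ 9.6 MiB
    records to hit it   ~ 9.6 MiB / ~256 B per record   ~ 37,500 records

If those assumptions are close, a backlog of 36548 records sits right at the threshold while 27311 is comfortably below it, which would explain why one machine fails and the other does not without any irregularity in either journal.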
Setting NEEDINFO for aconway. Alan, any further thoughts on this? It seems that the two nodes are seeing very different enqueue/dequeue patterns, hence triggering an enqueue threshold exceeded (ETE) exception on one node that is not seen on the other.
I ran this against a stand-alone broker:

  qpid-send --durable yes --messages 50000 --content-size 8 -a 'q;{create:always,node:{durable:1}}'

and the store overflowed. So the message load here is bigger than the default store capacity, and it is therefore a matter of timing whether it overflows or not. In the clustered configuration it appears that messages are produced much faster than they are consumed. I think this is a performance issue, not a correctness issue. I would still like to find out why the differences arise, but I consider it low priority/urgency.
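If the aim is simply to let the 50000-message durable run fit, the store capacity can also be raised for the test queue alone instead of broker-wide. A sketch using qpid-config follows; the --file-count/--file-size options come from the store-enabled qpid-config in qpid-tools, and the file size is (as far as I recall) given in 64 KiB pages, so both should be double-checked with 'qpid-config add queue --help' on this build.

    # Sketch: pre-create the queue with a larger per-queue journal, then rerun
    # the standalone test from the comment above.
    qpid-config add queue q --durable --file-count 16 --file-size 48
    qpid-send --durable yes --messages 50000 --content-size 8 -a q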
This product has been discontinued or is no longer tracked in Red Hat Bugzilla.