Bug 676627

Summary: persistent clustered qpidd broker unpredictably throws a journal exception (Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616))
Product: Red Hat Enterprise MRG
Reporter: Frantisek Reznicek <freznice>
Component: qpid-cpp
Assignee: messaging-bugs <messaging-bugs>
Status: CLOSED UPSTREAM
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: low
Priority: low
Version: 1.3
CC: gsim, kim.vdriet
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2025-02-10 03:13:38 UTC
Type: ---
Attachments:
  The journals, logs and terminal transcripts
  Analysis of journals from comment #1

Description Frantisek Reznicek 2011-02-10 14:03:37 UTC
Description of problem:

I suspect that the clustered persistent qpidd broker throws a journal exception to the client (qpid-perftest):

  [root@mrg-qe-10 cluster_test_bz674338]# qpid-perftest --durable true --count 50000 --size 8 --summary --username guest --password guest
  2011-02-10 11:49:38 warning Broker closed connection: 501, Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
  PublishThread exception: framing-error: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)

at a point when no exception should be thrown.

This defect relates to the observation mentioned in bug 509796 comment 19.


Let's have the following config:
- openais service configured and running

- broker started as a service with the following config:
cluster-mechanism=ANONYMOUS
auth=yes
#auth=no
log-to-file=/tmp/qpidd.log
log-enable=info+
#log-enable=debug+:cluster
cluster-name=fclusterA
mgmt-pub-interval=2
truncate=yes

- qpid-cluster says that there is one node in the cluster:
[root@mrg-qe-10 cluster_test_bz674338]# qpid-cluster
  Cluster Name: fcluster
Cluster Status: ACTIVE
  Cluster Size: 1
       Members: ID=10.34.45.10:5636 URL=amqp:tcp:10.34.33.63:5672,tcp:10.34.44.10:5672,tcp:10.34.45.10:5672

- then you run the qpid-perftest client:
[root@mrg-qe-10 cluster_test_bz674338]# qpid-perftest --durable true --count 50000 --size 8 --summary --username guest --password guest
2011-02-10 11:49:38 warning Broker closed connection: 501, Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
PublishThread exception: framing-error: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)

At this point I expect that the perftest client CANNOT trigger the journal exception 'Enqueue capacity threshold exceeded on queue...'.
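
To compare what the two machines are doing, the queue depth can be sampled while qpid-perftest runs, to see whether the subscribers keep pace with the publishers. A minimal sketch only, assuming qpid-stat from qpid-tools can reach the broker (with auth=yes it may need the same guest/guest credentials); the loop and the grep pattern are mine:

  # Sample the depth of the perftest queues once per second during the run
  # (assumes qpid-stat from qpid-tools and a broker listening on localhost:5672).
  while true; do
      qpid-stat -q localhost:5672 | grep qpid-perftest
      sleep 1
  done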


More surprisingly, I have two identical machines with identical hardware, identical RHEL (5.6), and identical architecture (x86_64); on the one above (mrg-qe-10) I can see the exception, while on the other (mrg-qe-09) I am not seeing it.

To double-check that there is nothing wrong with the mrg-qe-10 config, I restarted the service and ran the test again:
[root@mrg-qe-10 _bz]# qpid-perftest --durable true --count 50000 --size 8 --summary --username guest --password guest
2011-02-10 14:59:30 warning Broker closed connection: 501, Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
PublishThread exception: framing-error: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)

[root@mrg-qe-10 _bz]# qpid-perftest --durable true --count 500 --size 8 --summary --username guest --password guest
30224.3 3230.43 16135.5 0.123104

The latter command shows that the client is able to put/get messages.

There must therefore be something wrong with the threshold for this exception.


Version-Release number of selected component (if applicable):
[root@mrg-qe-10 _bz]# rpm -qa | grep qpid | sort
python-qpid-0.7.946106-15.el5
qpid-cpp-client-0.7.946106-28.el5
qpid-cpp-client-devel-0.7.946106-28.el5
qpid-cpp-client-devel-docs-0.7.946106-28.el5
qpid-cpp-client-rdma-0.7.946106-28.el5
qpid-cpp-client-ssl-0.7.946106-28.el5
qpid-cpp-mrg-debuginfo-0.7.946106-28.el5
qpid-cpp-server-0.7.946106-28.el5
qpid-cpp-server-cluster-0.7.946106-28.el5
qpid-cpp-server-devel-0.7.946106-28.el5
qpid-cpp-server-rdma-0.7.946106-28.el5
qpid-cpp-server-ssl-0.7.946106-28.el5
qpid-cpp-server-store-0.7.946106-28.el5
qpid-cpp-server-xml-0.7.946106-28.el5
qpid-dotnet-0.4.738274-2.el5
qpid-java-client-0.7.946106-15.el5
qpid-java-common-0.7.946106-15.el5
qpid-java-example-0.7.946106-15.el5
qpid-tools-0.7.946106-12.el5
rh-qpid-cpp-tests-0.7.946106-28.el5


How reproducible:
On mrg-qe-10: 100% of the time; on mrg-qe-09: never.

Steps to Reproduce:
1. service openais restart
2. service openais start
3. qpid-perftest --durable true --count 50000 --size 8 --summary --username guest --password guest
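
The three steps can be wrapped into a single script if the run has to be repeated many times. A sketch only: it mirrors the steps above, and the qpidd restart in the middle is my reading of step 2 (the config from the description is assumed to be in place):

  #!/bin/sh
  # Restart the cluster infrastructure, (re)start the broker, then drive
  # 50000 durable 8-byte messages through qpid-perftest.
  service openais restart
  service qpidd restart    # assumption: step 2 is read as restarting the broker service
  qpid-perftest --durable true --count 50000 --size 8 --summary \
                --username guest --password guest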
  

Actual results:
qpid-perftest --durable true --count 50000 --size 8 throws 'Enqueue capacity threshold exceeded on queue...' exception.

Expected results:
qpid-perftest --durable true --count 50000 --size 8 should not throw 'Enqueue capacity threshold exceeded on queue...' exception.

Additional info:

Comment 1 Frantisek Reznicek 2011-02-10 14:07:24 UTC
Created attachment 478056 [details]
The journals, logs and terminal transcripts

The above attachment shows the same scenario on two identical machines with different results (on one machine the journal exception is thrown, on the other it is not).

The qpidd journals from both machines are included so the contents of the data-dirs can be compared.
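
For anyone repeating the comparison, a file-level diff of the two data-dirs is a quick first pass. A sketch; the directory names are placeholders for wherever the attachment is unpacked:

  # Compare on-disk journal file sizes between the two attached data-dirs.
  du -ab mrg-qe-09-data-dir/ | sort -k2 > 09.du
  du -ab mrg-qe-10-data-dir/ | sort -k2 > 10.du
  diff -u 09.du 10.du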

Comment 2 Alan Conway 2011-02-10 14:55:35 UTC
Interesting observation: I can easily reproduce the problem as described, but if I set "auth=no" in the configuration the problem goes away. So it appears to be related to authentication in some way, but I don't know what the connection might be.
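
A quick way to A/B this without touching the service configuration is to start the broker by hand and flip only the auth switch. A sketch, mirroring the config from the description; the data-dir paths and the use of a separate directory per run are mine:

  # Case A: SASL authentication on (the case that reproduces the failure here)
  qpidd --auth yes --cluster-name fclusterA --cluster-mechanism ANONYMOUS \
        --data-dir /tmp/qpidd-auth-yes --truncate yes

  # Case B: authentication off (the case where the problem goes away)
  qpidd --auth no --cluster-name fclusterA \
        --data-dir /tmp/qpidd-auth-no --truncate yes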

Host info where I reproduced:

mrg32.lab.bos.redhat.com 2.6.18-238.el5 x86_64: 16050Mb 2493MHz 8-core/2-cpu
  Intel(R) Xeon(R) CPU           E5420  @ 2.50GHz
  Red Hat Enterprise Linux Server release 5.6 (Tikanga)

Comment 3 Kim van der Riet 2011-02-11 14:10:42 UTC
Created attachment 478244 [details]
Analysis of journals from comment #1

I have examined the two journals from mrg-qe-09 and mrg-qe-10, and neither shows any irregularity in the journal itself. I checked the enqueue threshold calculation from the mrg-qe-10 journal, and found it to be correct.

All analysis details are in the attached file.

There is a distinct difference in the patterns of enqueue/dequeue in the journals. The journal from mrg-qe-09 had a maximum depth of 27311 records, while the journal from mrg-qe-10 had a depth of 36548 records at the time of the enqueue failure.

This analysis shows that the enqueue/dequeue patterns are very different on these two machines, but does not shed any light on why that might be the case.

Comment 4 Kim van der Riet 2011-07-29 13:05:05 UTC
Setting NEEDINFO for aconway.

Alan, any further thoughts on this? It seems that the two nodes are seeing very different patterns of enqueueing/dequeuing, hence triggering an ETE (enqueue threshold exceeded) on one node which is not seen on the other.

Comment 5 Alan Conway 2011-07-29 13:43:04 UTC
I ran this against a stand-alone broker:
 qpid-send --durable yes --messages 50000 --content-size 8 -a 'q;{create:always,node:{durable:1}}'

and the store overflowed. So the message load here is bigger than the default store capacity; whether it overflows is therefore a matter of timing. In the clustered configuration it appears that messages are produced much faster than they are consumed.

I think this is a performance issue, not a correctness issue. I would still like to find out why the differences arise but I think it's low priority/urgency.
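
To put rough numbers on that: a back-of-the-envelope sketch, assuming the legacystore defaults of num-jfiles=8 and jfile-size-pgs=24 with 64 KiB pages (about 12 MiB of journal per queue), and assuming every enqueue and dequeue record occupies at least one 128-byte journal block. 50000 durable messages then generate on the order of 100000 * 128 bytes ≈ 12 MiB of journal traffic, i.e. roughly the whole default journal, so whether the enqueue threshold trips depends entirely on how quickly dequeues let old journal files be reclaimed. If the test is meant to fit comfortably, the journal can be enlarged through the store options (option names and defaults as I recall them, values illustrative):

  # qpidd.conf additions to enlarge the per-queue journal (assumed store options)
  num-jfiles=16        # journal files in the ring; default is believed to be 8
  jfile-size-pgs=48    # 64 KiB pages per journal file; default is believed to be 24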

Comment 8 Red Hat Bugzilla 2025-02-10 03:13:38 UTC
This product has been discontinued or is no longer tracked in Red Hat Bugzilla.