Bug 509796

| Field | Value |
|---|---|
| Summary | error sending update: Enqueue capacity threshold exceeded |
| Product | Red Hat Enterprise MRG |
| Component | qpid-cpp |
| Version | 1.1.1 |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | high |
| Reporter | Gordon Sim <gsim> |
| Assignee | Alan Conway <aconway> |
| QA Contact | Frantisek Reznicek <freznice> |
| CC | esammons, freznice, iboverma, jross, kim.vdriet, ppecka, tross, trusnak |
| Keywords | Reopened |
| Target Milestone | 1.3.3 |
| Target Release | --- |
| Hardware | All |
| OS | Linux |
| Doc Type | Bug Fix |
| Doc Text | Adding a new node to a cluster no longer fails with a journal capacity exception. |
| Last Closed | 2012-12-06 21:06:53 UTC |
| Bug Depends On | 639994 |

Description (Gordon Sim, 2009-07-06 09:23:35 UTC)

In the example above, if I use the journal of the new node (the one that failed to join) in recovery, then on inspection the queue depth of the queue in question is 314568 bytes, which seems well below capacity. The journal footprint for these perftest messages is 256 bytes. (See Bug 509573, comment 1 for a detailed explanation of storage budgets and capacity variability.) This places the total enqueued message depth at 12.8 MB, which is greater than the default journal capacity of 12.5 MB. However, perftest enqueues and dequeues simultaneously on separate threads; as long as the enqueue rate roughly matches the dequeue rate, capacity is not an issue. Even if enqueues outpace dequeues, the dequeue rate "buys time" for the test to complete before capacity is reached.

Initial examination of the failed secondary node's journal shows ~39k enqueues and no dequeues, which is right at the threshold for a default journal. What is unclear is which of these enqueues can be attributed to the initial sync with the primary node and which occurred subsequently (if any). This exposes two possible modes of failure:

a) the primary is (somehow) able to maintain its 39k+ messages while the secondary is reaching its threshold (a size mismatch), and the failure occurs within the initial sync period;

b) the initial sync completes correctly, but subsequent updates are selective and only the enqueues reach the secondary as perftest continues its activity on the primary.

Investigating further, the message size of 8 in this test allows messages on the primary to fit in 1 dblk (with 3 bytes to spare), but on the secondary, because of the additional header info, 2 dblks are consumed per message. This means the secondary consumes twice the journal space for the same number of messages as the primary. Rerunning this test with a message size of 12 rather than 8 (which forces both the primary and the secondary to consume 2 dblks) causes an almost immediate enqueue capacity threshold error on the primary. As this behaviour is now understood and expected, this bug is set to CLOSED/NOTABUG.

Reopening this bug, as a question has arisen about whether different nodes should have differing header sizes. This behaviour greatly increases the probability that cluster nodes will exit one at a time as they hit the enqueue threshold. The desired behaviour is a synchronized exception coming from all nodes at the same time, which results in a reject of the enqueued message rather than the nodes exiting.

What is the question? I think we should recommend that cluster nodes have identical configuration in all respects. Is there something particular about the journal parameters?

I didn't ask any question... however, the issue appears to be that a message takes up more space on new nodes than it does on nodes that were part of the cluster when it was published. I believe it was the exchange header that was different(?). See comment #3 above.

Understood, I'll investigate why this discrepancy arises.

Fixed in r889813

Reproduced on RHEL5/x86_64 with qpidd-0.5.752581-33.el5 and rhm-0.5.3206-26.el5:

```
# grep -i "Enqueue capacity threshold exceeded" /tmp/qpidd*.log
/tmp/qpidd1.log:2010-oct-04 08:46:56 warning Broker closed connection: 501, Enqueue capacity threshold exceeded on queue "perftest0". (JournalImpl.cpp:611)
/tmp/qpidd1.log:2010-oct-04 08:46:56 error Unbinding update queue perftest0: framing-error: Enqueue capacity threshold exceeded on queue "perftest0". (JournalImpl.cpp:611)
/tmp/qpidd1.log:2010-oct-04 08:46:56 error 10.16.64.153:26294(UPDATER) error sending update: Enqueue capacity threshold exceeded on queue "perftest0". (JournalImpl.cpp:611)
/tmp/qpidd2.log:2010-oct-04 08:46:56 warning Journal "perftest0": Enqueue capacity threshold exceeded on queue "perftest0".
/tmp/qpidd2.log:2010-oct-04 08:46:56 error Unexpected exception: Enqueue capacity threshold exceeded on queue "perftest0". (JournalImpl.cpp:611)
/tmp/qpidd2.log:2010-oct-04 08:46:56 error Connection 10.16.64.153:41592 closed by error: Enqueue capacity threshold exceeded on queue "perftest0". (JournalImpl.cpp:611)(501)
```

Regression found on RHEL5/x86_64 with current packages:

```
qpid-cpp-client-ssl-0.7.946106-17.el5
qpid-java-common-0.7.946106-10.el5
qpid-cpp-server-devel-0.7.946106-17.el5
qpid-cpp-client-0.7.946106-17.el5
qpid-cpp-server-ssl-0.7.946106-17.el5
qpid-tools-0.7.946106-11.el5
qpid-cpp-client-devel-0.7.946106-17.el5
python-qpid-0.7.946106-14.el5
qpid-cpp-client-devel-docs-0.7.946106-17.el5
qpid-java-client-0.7.946106-10.el5
qpidc-debuginfo-0.5.752581-33.el5
qpid-cpp-server-store-0.7.946106-17.el5
qpid-cpp-server-0.7.946106-17.el5
qpid-dotnet-0.4.738274-2.el5
qpid-cpp-server-cluster-0.7.946106-17.el5
qpid-cpp-server-xml-0.7.946106-17.el5
```

Broker log:

```
2010-10-04 10:08:18 notice cluster(10.16.64.153:28518 READY) last broker standing, update queue policies
2010-10-04 10:08:19 warning Timer callback overran by 3ms [taking 8000ns]
2010-10-04 10:08:19 warning Timer callback overran by 3ms [taking 2000ns]
2010-10-04 10:08:19 warning Journal "qpid-perftest0": Enqueue capacity threshold exceeded on queue "qpid-perftest0".
2010-10-04 10:08:19 error Unexpected exception: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
2010-10-04 10:08:19 error Connection 127.0.0.1:42476 closed by error: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)(501)
```

Perftest:

```
23825.8 14544.4 24645.4 0.188029
23148.9 18034.8 37104.3 0.283083
2010-10-04 10:08:19 warning Broker closed connection: 501, Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
PublishThread exception: framing-error: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
```

Regression confirmed also for RHEL5/x86 packages.

From the log in comment 12 it appears that the capacity exception occurred on the established broker, not on the broker that was joining the cluster. The "last broker standing" message is only logged on an established broker. That would not be a bug; it is simply the perftest filling up the queue faster than it is consumed and exceeding the store limit.

Thanks Alan, then it's not a regression. Now we have to wait for bug 639994, which blocks us - qpidd crash during shutdown.

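The space accounting described in the opening analysis can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the dblk size of 128 bytes is an assumption, the ~12.5 MB capacity and 256-byte record footprint are the figures quoted above, and the 80% threshold is the behaviour described later in this thread.

```bash
#!/bin/bash
# Back-of-the-envelope check of the journal space accounting discussed above
# (assumed values; not taken from the store source).

MSGS=50000            # perftest --count 50000
DBLK=128              # assumed journal data-block (dblk) size in bytes
CAPACITY=12500000     # default journal capacity quoted above (~12.5 MB)
THRESHOLD_PCT=80      # enqueue threshold, per the store description later in the thread

# Per the analysis: an 8-byte message fits in 1 dblk on the primary, but needs
# 2 dblks on a newly joined node because of the additional header info.
primary_total=$((1 * DBLK * MSGS))     #  6,400,000 bytes
secondary_total=$((2 * DBLK * MSGS))   # 12,800,000 bytes, i.e. the 12.8 MB > 12.5 MB above

echo "primary total:   ${primary_total} bytes"
echo "secondary total: ${secondary_total} bytes (capacity ${CAPACITY})"

# Number of 2-dblk records that fit before the ~80% enqueue threshold trips:
threshold_records=$(( CAPACITY * THRESHOLD_PCT / 100 / (2 * DBLK) ))
echo "records until threshold: ~${threshold_records}"   # ~39000, matching the ~39k enqueues seen
```

Under these assumptions the new node runs out of space roughly 39k messages into the update, even though the established node, whose records are half the size, is nowhere near its own limit.
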
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Adding a new node to a cluster no longer fails with a journal capacity exception.

This is still marked needinfo from me - what info is needed?

Retested with an interesting result: the above exception is thrown on RHEL5 x86_64 only with auth=yes; all other cases run without the exception. At the moment the broker does not leave the cluster. Packages:

```
python-qpid-0.7.946106-15.el5
qpid-cpp-client-0.7.946106-27.el5
qpid-cpp-client-devel-0.7.946106-27.el5
qpid-cpp-client-devel-docs-0.7.946106-27.el5
qpid-cpp-client-rdma-0.7.946106-27.el5
qpid-cpp-client-ssl-0.7.946106-27.el5
qpid-cpp-mrg-debuginfo-0.7.946106-27.el5
qpid-cpp-server-0.7.946106-27.el5
qpid-cpp-server-cluster-0.7.946106-27.el5
qpid-cpp-server-devel-0.7.946106-27.el5
qpid-cpp-server-rdma-0.7.946106-27.el5
qpid-cpp-server-ssl-0.7.946106-27.el5
qpid-cpp-server-store-0.7.946106-27.el5
qpid-cpp-server-xml-0.7.946106-27.el5
qpid-dotnet-0.4.738274-2.el5
qpid-java-client-0.7.946106-14.el5
qpid-java-common-0.7.946106-14.el5
qpid-java-example-0.7.946106-14.el5
qpid-tools-0.7.946106-12.el5
rh-qpid-cpp-tests-0.7.946106-27.el5
```

OK, the issue has now been retested, with an interesting observation:

```
# broker as cluster node
[root@mrg-qe-10 bz509796_639994]# qpid-cluster
  Cluster Name: fcluster
Cluster Status: ACTIVE
  Cluster Size: 1
       Members: ID=10.34.45.10:11062 URL=amqp:tcp:10.34.33.63:5672,tcp:10.34.44.10:5672,tcp:10.34.45.10:5672
[root@mrg-qe-10 bz509796_639994]# qpid-config queues | grep perft
[root@mrg-qe-10 bz509796_639994]# qpid-perftest --durable true --count 50000 --size 8 --summary --username guest --password guest
2011-01-25 16:40:27 warning Broker closed connection: 501, Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
PublishThread exception: framing-error: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)

# broker as standalone
[root@mrg-qe-10 bz509796_639994]# vi /etc/qpidd.conf
[root@mrg-qe-10 bz509796_639994]# service qpidd restart
Stopping Qpid AMQP daemon: qpidd
Starting Qpid AMQP daemon: qpidd
[root@mrg-qe-10 bz509796_639994]# qpid-stat -b
Brokers
  broker          cluster       uptime  conn  sess  exch  queue
  ===============================================================
  localhost:5672  <standalone>  8s      2     2     8     6
[root@mrg-qe-10 bz509796_639994]# qpid-perftest --durable true --count 50000 --size 8 --summary --username guest --password guest
41096.7 41028.9 80447.2 0.613763

# broker as cluster node
[root@mrg-qe-10 bz509796_639994]# vi /etc/qpidd.conf
[root@mrg-qe-10 bz509796_639994]# service qpidd restart
Stopping Qpid AMQP daemon: qpidd
Starting Qpid AMQP daemon: qpidd
[root@mrg-qe-10 bz509796_639994]# qpid-stat -b
Brokers
  broker            cluster           uptime  conn  sess  exch  queue
  =====================================================================
  10.34.33.63:5672  fcluster(ACTIVE)  4s      2     2     8     6
[root@mrg-qe-10 bz509796_639994]# qpid-perftest --durable true --count 50000 --size 8 --summary --username guest --password guest
2011-01-25 16:42:20 warning Broker closed connection: 501, Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
PublishThread exception: framing-error: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)

[root@mrg-qe-10 ~]# uname -a
Linux mrg-qe-10.lab.eng.brq.redhat.com 2.6.18-238.1.1.el5 #1 SMP Tue Jan 4 13:32:19 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
[root@mrg-qe-10 ~]# rpm -qa | grep qpid | sort
python-qpid-0.7.946106-15.el5
qpid-cpp-client-0.7.946106-27.el5
qpid-cpp-client-devel-0.7.946106-27.el5
qpid-cpp-client-devel-docs-0.7.946106-27.el5
qpid-cpp-client-rdma-0.7.946106-27.el5
qpid-cpp-client-ssl-0.7.946106-27.el5
qpid-cpp-mrg-debuginfo-0.7.946106-27.el5
qpid-cpp-server-0.7.946106-27.el5
qpid-cpp-server-cluster-0.7.946106-27.el5
qpid-cpp-server-devel-0.7.946106-27.el5
qpid-cpp-server-rdma-0.7.946106-27.el5
qpid-cpp-server-ssl-0.7.946106-27.el5
qpid-cpp-server-store-0.7.946106-27.el5
qpid-cpp-server-xml-0.7.946106-27.el5
qpid-dotnet-0.4.738274-2.el5
qpid-java-client-0.7.946106-14.el5
qpid-java-common-0.7.946106-14.el5
qpid-java-example-0.7.946106-14.el5
qpid-tools-0.7.946106-12.el5
rh-qpid-cpp-tests-0.7.946106-27.el5
```

The lines above clearly show that running

```
qpid-perftest --durable true --count 50000 --size 8 --summary --username guest --password guest
```

fails with the journal exception when authentication is enabled on the broker, the broker is in a cluster, and the machine is 64-bit - but it should not. There is no such behaviour on a 64-bit machine with auth=no, and no such behaviour on a 32-bit machine. Another interesting fact is that I was unable to reproduce this on a RHEL 5.6 x86_64 VM, which indicates that it might be hardware related (the affected machines are AMD Opterons).

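For reference, the manual comparison above (clustered with auth, standalone, clustered again) could be scripted roughly as follows. This is only a sketch of the procedure, not part of the original report: it assumes the broker is switched between modes by toggling the auth= and cluster-name= lines in /etc/qpidd.conf, which is what the vi steps above appear to do.

```bash
#!/bin/bash
# Hypothetical helper reproducing the comparison above: run the same durable
# perftest against the broker in different configurations. Assumes qpidd reads
# /etc/qpidd.conf and that auth/cluster-name are the only settings toggled.

CONF=/etc/qpidd.conf
PERFTEST='qpid-perftest --durable true --count 50000 --size 8 --summary --username guest --password guest'

run_case() {
    local label=$1 auth=$2 comment=$3
    # rewrite the two relevant settings, leaving the rest of the file alone
    sed -i -e "s/^auth=.*/auth=${auth}/" \
           -e "s/^#*cluster-name=/${comment}cluster-name=/" "$CONF"
    service qpidd restart
    sleep 2
    echo "=== ${label} ==="
    eval "$PERFTEST"
}

run_case "cluster, auth=yes"    yes ""    # observed above: enqueue capacity exception
run_case "standalone, auth=yes" yes "#"   # observed above: passes
run_case "cluster, auth=no"     no  ""    # observed above: passes
```
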
Retested according to the described test scenario; the issue becomes visible after a couple of hours of looping. I believe the cause of this issue is connected with comment 19. The test result is as follows (after about a 5-hour run).

Test tail transcript:

```
run11063: 22061.2 19026.1 38660.9 0.294959
-------------------------------------------------------
run11064: 31266.8 30040.2 61601.2 0.46998
-------------------------------------------------------
run11065: 30808.2 28911 59241.2 0.451975
-------------------------------------------------------
run11066: 31865.7 30435.4 62445.9 0.476425
-------------------------------------------------------
2011-01-25 22:45:02 warning Broker closed connection: 501, Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
2011-01-25 22:45:02 warning Connection [34469 localhost:5672] closed
2011-01-25 22:45:02 warning Connection [34470 localhost:5672] closed
run11067: PublishThread exception: framing-error: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
Controller exception: Connection [34470 localhost:5672] closed
SubscribeThread exception: Connection [34469 localhost:5672] closed
-------------------------------------------------------
2011-01-25 22:45:02 warning Connection [34476 localhost:5672] closed
run11068: Connection [34476 localhost:5672] closed
-------------------------------------------------------
broker exited!
```

Broker log tail:

```
2011-01-25 22:45:00 notice cluster(10.34.45.9:24684 UPDATER) sending update to 10.34.45.11:1401 at amqp:tcp:10.34.33.64:5672,tcp:10.34.44.11:5672,tcp:10.34.45.11:5672
2011-01-25 22:45:00 info Connection connected to tcp:10.34.33.64:5672
2011-01-25 22:45:01 info Connection connected to tcp:10.34.33.64:5672
2011-01-25 22:45:01 info Connection connected to tcp:10.34.33.64:5672
2011-01-25 22:45:01 info Connection connected to tcp:10.34.33.64:5672
2011-01-25 22:45:01 info Connection connected to tcp:10.34.33.64:5672
2011-01-25 22:45:01 warning Broker closed connection: 200, OK
2011-01-25 22:45:01 notice cluster(10.34.45.9:24684 UPDATER) update sent
2011-01-25 22:45:01 info cluster(10.34.45.9:24684 READY) member update: 10.34.45.9:24684(member) 10.34.45.11:1401(member) frameSeq=1727220900
2011-01-25 22:45:01 notice Running in a cluster, marking store dirty.
2011-01-25 22:45:01 notice cluster(10.34.45.9:24684 READY) last broker standing joined by 1 replicas, updating queue policies.
2011-01-25 22:45:02 warning Journal "qpid-perftest0": Enqueue capacity threshold exceeded on queue "qpid-perftest0".
2011-01-25 22:45:02 error Unexpected exception: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
2011-01-25 22:45:02 error Connection 127.0.0.1:34468 closed by error: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)(501)
2011-01-25 22:45:02 critical cluster(10.34.45.9:24684 READY/error) local error 1727363037 did not occur on member 10.34.45.11:1401: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
2011-01-25 22:45:02 critical Error delivering frames: local error did not occur on all cluster members : Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616) (qpid/cluster/ErrorCheck.cpp:89)
2011-01-25 22:45:02 notice cluster(10.34.45.9:24684 LEFT/error) leaving cluster fclusterA
2011-01-25 22:45:02 notice Shut down
```

Reproducers:
- ./node1.sh on the machine where one broker runs continuously and perftest loops
- ./node2.sh on the machine which repeatedly restarts its broker

```
[root@mrg-qe-09 bz509796_639994]# cat ./node1.sh
#!/bin/bash
service qpidd stop
rm -rf /var/lib/qpidd/*
service qpidd start
netstat -nlp | grep qpidd
qpid-cluster
i=0
while true; do
  out=$(qpid-perftest --durable true --count $((50000+${RANDOM}%1000)) --size 8 --summary)
  ecode=$?
  echo "run${i}: ${out} -------------------------------------------------------"
  [ "${ecode}" != "0" ] && break
  let "i++"
  usleep $(( ${RANDOM} + ${RANDOM} ))
done
echo "broker exited!"
# eof

[root@mrg-qe-09 bz509796_639994]# cat ./node2.sh
#!/bin/bash
#2 node thread
service qpidd stop
rm -rf /var/lib/qpidd/*
i=0
while true; do
  echo "run${i} ---------------------------------------------------------------"
  service qpidd start
  # wait 1 + 2typ (0...4)
  usleep $(( 1000000 + ((${RANDOM} % 2000) * 2000) ))
  qpid-cluster
  service qpidd stop
  # wait 0.5 + 1typ (0...2)
  usleep $(( 500000 + ((${RANDOM} % 2000) * 1000) ))
  let "i++"
done
# eof
```

The broker config was:

```
cluster-mechanism=ANONYMOUS
#auth=yes
auth=no
log-to-file=/tmp/qpidd.log
log-enable=info+
#log-enable=debug+:cluster
cluster-name=fclusterA
mgmt-pub-interval=2
```

Moving the defect to ASSIGNED, as I am able to see cluster width reduction due to the journal exception, which should not be thrown: running

```
qpid-perftest --durable true --count $((50000+${RANDOM}%1000)) --size 8 --summary
```

should not trigger the journal exception.

The above test data show that the journal exception was thrown after a couple of hours, during the 11067th run of the perftest. -> ASSIGNED

Comment 18 and comment 19 are a different issue: they are the expected behaviour of a client that overruns queue limits. auth=yes adds an additional header to the messages, and I'm guessing that journal entries take more space on 64-bit (kpvdr?), so the two must be combining to push the total size over the journal limit. Comment 20 looks similar to the original issue and needs investigation. Can I just confirm that both brokers in the test had identical config with auth=no? Is there any chance that one broker had auth=yes and the other auth=no?

Thanks for the update. Regarding comment 18 and comment 19: if it is confirmed that authentication info takes more space on 64-bit, then this is fine. Regarding comment 20: I can confirm that the case from comment 20 was run with auth=no on both machines (RHEL 5.6, one i[36]86, the other x86_64).

Using the reproducer in comment 20 I reproduced the issue on two x86_64 hosts, so the difference in architecture is probably not an issue. In this case it was node1 (the stable node) that crashed, not node2 (the one joining and leaving).

I don't think this is a regression. I think it's an example of why we originally introduced the inconsistent error mechanism: brokers with different histories do not have identical capacity to enqueue messages (e.g. in this case the new broker's journal was probably more compact because it didn't have as many dequeue records). So I think this is a case of the design working as expected (although obviously not as desired): the broker that is unable to store messages that are successfully stored on other brokers leaves the cluster. A better solution to the persistent cluster problem is probably something for the new cluster work, and may also require more transparency from the store.

I'm going to retest to confirm Alan's comment 23. ETC 2011-02-11. (Raising NEEDINFO for myself to keep this in the top 10 TODO tasks.)

The issue from comment 19 is addressed as separate bug 676627.

I'm reliably able to trigger the original issue with the modified reproducer from comment 20 on these packages:

```
python-qpid-0.7.946106-15.el5
qpid-cpp-client-0.7.946106-28.el5
qpid-cpp-client-devel-0.7.946106-28.el5
qpid-cpp-client-devel-docs-0.7.946106-28.el5
qpid-cpp-client-rdma-0.7.946106-28.el5
qpid-cpp-client-ssl-0.7.946106-28.el5
qpid-cpp-mrg-debuginfo-0.7.946106-28.el5
qpid-cpp-server-0.7.946106-28.el5
qpid-cpp-server-cluster-0.7.946106-28.el5
qpid-cpp-server-devel-0.7.946106-28.el5
qpid-cpp-server-rdma-0.7.946106-28.el5
qpid-cpp-server-ssl-0.7.946106-28.el5
qpid-cpp-server-store-0.7.946106-28.el5
qpid-cpp-server-xml-0.7.946106-28.el5
qpid-dotnet-0.4.738274-2.el5
qpid-java-client-0.7.946106-15.el5
qpid-java-common-0.7.946106-15.el5
qpid-java-example-0.7.946106-15.el5
qpid-tools-0.7.946106-12.el5
rh-qpid-cpp-tests-0.7.946106-28.el5
```

The repro patch is:

```
- out=$(qpid-perftest --durable true --count $((50000+${RANDOM}%1000)) --size 8 --summary)
+ out=$(qpid-perftest --durable true --count $((50000+${RANDOM}%1000)) --size 8 --summary --username guest --password guest)
```

and of course auth=yes in qpidd.conf. Keeping ASSIGNED.

I can confirm the issue is seen on 1.3.3 packages:

```
# rpm -qa | grep qpid | sort -u
python-qpid-0.7.946106-15.el5
qpid-cpp-client-0.7.946106-34.el5
qpid-cpp-client-devel-0.7.946106-34.el5
qpid-cpp-client-devel-docs-0.7.946106-34.el5
qpid-cpp-client-ssl-0.7.946106-34.el5
qpid-cpp-server-0.7.946106-34.el5
qpid-cpp-server-cluster-0.7.946106-34.el5
qpid-cpp-server-devel-0.7.946106-34.el5
qpid-cpp-server-ssl-0.7.946106-34.el5
qpid-cpp-server-store-0.7.946106-34.el5
qpid-cpp-server-xml-0.7.946106-34.el5
qpid-java-client-0.10-6.el5
qpid-java-common-0.10-6.el5
qpid-java-example-0.10-6.el5
qpid-tools-0.7.946106-12.el5
```

```
2011-08-02 13:16:07 debug cluster(10.34.44.23:9252 READY) replicated connection 10.34.33.84:5672-10.34.33.84:47004(10.34.44.22:25262-6304 shadow)
2011-08-02 13:16:07 debug cluster(10.34.44.23:9252 READY) closed connection 127.0.0.1:5672-127.0.0.1:34159(10.34.44.23:9252-4052 local)
2011-08-02 13:16:08 debug cluster(10.34.44.23:9252 READY/error) connection error 56218379 on 10.34.33.84:5672-10.34.33.84:47002(10.34.44.22:25262-6302 shadow) must be resolved with: 10.34.44.21:16272 10.34.44.22:25262 10.34.44.23:9252 10.34.44.24:2664 : Enqueue capacity threshold exceeded on queue "mrg-qe-17.lab.eng.brq.redhat.com_5406630000". (JournalImpl.cpp:616)
2011-08-02 13:16:08 debug cluster(10.34.44.23:9252 READY/error) error 56218379 resolved with 10.34.44.23:9252
2011-08-02 13:16:08 debug cluster(10.34.44.23:9252 READY/error) error 56218379 must be resolved with 10.34.44.21:16272 10.34.44.22:25262 10.34.44.24:2664
2011-08-02 13:16:08 critical cluster(10.34.44.23:9252 READY/error) local error 56218379 did not occur on member 10.34.44.21:16272: Enqueue capacity threshold exceeded on queue "mrg-qe-17.lab.eng.brq.redhat.com_5406630000". (JournalImpl.cpp:616)
2011-08-02 13:16:08 critical Error delivering frames: local error did not occur on all cluster members : Enqueue capacity threshold exceeded on queue "mrg-qe-17.lab.eng.brq.redhat.com_5406630000". (JournalImpl.cpp:616) (qpid/cluster/ErrorCheck.cpp:89)
2011-08-02 13:16:08 notice cluster(10.34.44.23:9252 LEFT/error) leaving cluster mrg-qe-1.3.3-clusterA
```

(In reply to comment #0) On 1.3.3 this happened on a settled 4-node cluster; we then continued without joining a new node, using the test scenario described in Bug 726379, section "Steps to Reproduce". The cluster worked for 3 hours 5 minutes, then 2 of the 4 nodes left the cluster with the result from comment #28.

> Description of problem:
>
> Adding a new node to a cluster failed with a journal capacity exception.
> However the new node was using the default journal size options, as was the
> existing cluster node, so it is not clear why this error should occur.
>
> Version-Release number of selected component (if applicable):
>
> qpidd-0.5.752581-22.el5
> rhm-0.5.3206-5.el5
>
> How reproducible:
>
> Fairly easily
>
> Steps to Reproduce:
> 1. start one cluster node
> 2. run: while perftest --durable true --count 50000 --size 8 --summary; do true; done
> 3. while this is running, add a new node to the cluster, then stop it, then re-add it, and repeat this several times until eventually you see an error
>
> Actual results:
>
> 2009-jul-06 05:07:10 notice Journal "perftest0": Created
> 2009-jul-06 05:07:11 warning Journal "perftest0": Enqueue capacity threshold exceeded on queue "perftest0".
> 2009-jul-06 05:07:11 error Unexpected exception: Enqueue capacity threshold exceeded on queue "perftest0". (JournalImpl.cpp:576)
> 2009-jul-06 05:07:11 error Connection 10.16.44.221:46092 closed by error: Enqueue capacity threshold exceeded on queue "perftest0". (JournalImpl.cpp:576)(501)
> 2009-jul-06 05:07:11 error Channel exception: not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:40)
> 2009-jul-06 05:07:11 critical 10.16.44.221:19499(UPDATEE) catch-up connection closed prematurely 10.16.44.221:19499-1(local,catchup)
> 2009-jul-06 05:07:11 notice 10.16.44.221:19499(LEFT) leaving cluster grs-mrg14-test-cluster
> 2009-jul-06 05:07:11 notice Shut down
>
> Expected results:
>
> Should always be able to add the new node, since the queue on the original node in the cluster has the same capacity.
>
> Additional info:

From comment #2, 'the total enqueued message depth at 12.8MB - which is greater than the default journal capacity of 12.5MB', i.e. if testing with 50,000 messages.

It is important to distinguish two cases: (i) general capacity issues for nodes already in the cluster (comment #29 sounds like it is in this category, where two nodes leave a cluster), and (ii) a capacity error on updating a new member (which was the original issue and seems to me to also be the case for comment #20).

The first case cannot be ruled out, though it is unlikely. The exact history may differ between nodes, and this history may affect the actual capacity. Thus the actual capacity may vary by node and may, in theory, on occasion be insufficient for the defined perftest run.

I can't, however, explain why the second case is ever legitimate if the nodes all have the same configuration and are on the same architecture. The new member should, if anything, have *greater* capacity than the member from which it is receiving the update. A new member by definition has no history and therefore has the maximum actual capacity when receiving the update. Since the existing member giving the update has not hit its capacity, I don't understand why the new member ever would.

For the purposes of any automated testing around this bug, I would suggest that we reduce the count or increase the journal to the point where we can be certain no journal capacity problems can ever occur, even if all the messages are enqueued at once. That will rule out any false negatives.

From comment 30: another approach to avoid this in tests would be to set a discard policy on the queues that is lower than the journal capacity. Queue policy calculations are consistent in a cluster, so this won't cause brokers to crash.

That would still result in occasional hangs (or errors, if reject was used instead) in perftest. You really want the test to run smoothly unless there is a genuine bug uncovered.

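As an illustration of the two mitigations suggested above (a larger journal, or a queue limit below the journal capacity), a test setup could look roughly like the sketch below. This is only a sketch: the store option names (num-jfiles, jfile-size-pgs) and the 64 KiB page size are assumptions based on the MRG-era legacy store and should be checked against the installed broker's help output.

```bash
#!/bin/bash
# Sketch: make the store large enough that a full perftest run fits even if
# nothing is dequeued. Store option names are assumptions; verify with
# `qpidd --help` on the target system.

MSGS=51000          # upper bound used by the reproducer (50000 + RANDOM%1000)
REC_BYTES=256       # worst-case per-record footprint discussed above (2 dblks)
PAGE=65536          # assumed 64 KiB journal page size

need=$(( MSGS * REC_BYTES * 100 / 80 ))   # stay below the ~80% enqueue threshold
echo "journal must hold at least ${need} bytes"

# Assumed defaults are 8 files x 24 pages, i.e. the ~12.5 MB mentioned above.
# 16 files x 48 pages x 64 KiB is roughly 50 MB, comfortably above 'need'.
cat >> /etc/qpidd.conf <<'EOF'
num-jfiles=16
jfile-size-pgs=48
EOF
service qpidd restart
```

Alternatively, as suggested in the comment above, a queue limit policy set lower than the journal capacity (for example a ring or reject policy configured via qpid-config) would keep the queue depth bounded, at the cost of perftest occasionally stalling or losing messages.
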
Assuming that the size of headers is always the same, let me shed some light on how enqueue threshold exception decisions are made in the store:

* The store uses a fixed number of fixed-size files in a circular buffer.
* Each file contains a header that is overwritten each time that file is written.
* Any file that contains an unconsumed record (or part of one) cannot be overwritten at all, even if that record is at the end of the file, because once the header is overwritten it becomes more problematic to recover.
* The method for determining whether to enqueue a record is as follows: starting with the current write position in the currently written file, add the size of the record to be written and then jump ahead another 20% of the total journal size. In whichever file this places you, check whether any unconsumed record is present. If not, the enqueue proceeds; if there is a record anywhere in this file, the enqueue is rejected with the Enqueue Threshold error.

This method has an inherent uncertainty of one journal file in the point of rejection, depending on where the current write point is relative to the next unconsumed record. In the most optimistic scenario the error occurs at 80% capacity; in the most pessimistic scenario it occurs at 80% less one file size. Clearly the number of files used in a journal influences this uncertainty. If the minimum number of files is used (4), then one file represents 25% of the capacity, so the error can occur anywhere between 55% and 80%. Eight files (the default) yields 67.5% to 80%, and the maximum number (currently 64 files) yields 78.4% to 80%.

Clusters whose nodes join at different times are subject to this uncertainty. The simplest strategy to mitigate this risk in testing is to ensure the store is big enough to accommodate all messages. However, a large number of small files gives a narrower uncertainty window than a similarly sized store composed of fewer, larger files. Of course, using more files consumes file handles, so care needs to be taken that there are sufficient handles available for the test in question.

I think the summary is: if your store isn't large enough to handle your peak load, then it's possible for clustered brokers to exit with this error. We should re-run the tests with a large enough store and see if there are still any issues.

Yes, agreed (though I still maintain that there should *never* be a capacity error on an update).
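
To make the uncertainty window described above concrete, here is a small sketch that reproduces the quoted percentages from the number of journal files. It only encodes the rule stated in the explanation (error at 80% in the best case, 80% minus one file in the worst case); it is not taken from the store source.

```bash
#!/bin/bash
# Threshold window as a function of the number of journal files, per the
# description above: best case 80% of capacity, worst case 80% minus one file.

for nfiles in 4 8 64; do
    # work in hundredths of a percent to keep the arithmetic in integers
    best=8000
    worst=$(( 8000 - 10000 / nfiles ))
    printf "%2d files: error between %d.%02d%% and %d.%02d%% of capacity\n" \
        "$nfiles" $((worst / 100)) $((worst % 100)) $((best / 100)) $((best % 100))
done
# prints: 4 files 55.00-80.00%, 8 files 67.50-80.00%, 64 files 78.44-80.00%
# (the last matching the 78.4% quoted above)
```

This reproduces the 55-80%, 67.5-80% and 78.4-80% windows quoted above, and shows why many small files give a tighter, more predictable rejection point than a few large ones.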