Bug 509796

| Field | Value |
|---|---|
| Summary | error sending update: Enqueue capacity threshold exceeded |
| Product | Red Hat Enterprise MRG |
| Component | qpid-cpp |
| Version | 1.1.1 |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | high |
| Reporter | Gordon Sim <gsim> |
| Assignee | Alan Conway <aconway> |
| QA Contact | Frantisek Reznicek <freznice> |
| CC | esammons, freznice, iboverma, jross, kim.vdriet, ppecka, tross, trusnak |
| Keywords | Reopened |
| Target Milestone | 1.3.3 |
| Target Release | --- |
| Hardware | All |
| OS | Linux |
| Doc Type | Bug Fix |
| Doc Text | Adding a new node to a cluster no longer fails with a journal capacity exception. |
| Last Closed | 2012-12-06 21:06:53 UTC |
| Bug Depends On | 639994 |

Description (Gordon Sim, 2009-07-06 09:23:35 UTC)

In the example above, if I use the journal of the new node (the one that failed to join) in recovery, then on inspection the queue depth of the queue in question is 314568 bytes, which seems well below capacity. The journal footprint for these perftest messages is 256 bytes. (See Bug 509573, comment 1 for a detailed explanation of storage budgets and capacity variability.) This places the total enqueued message depth at 12.8 MB, which is greater than the default journal capacity of 12.5 MB. However, perftest enqueues and dequeues simultaneously on separate threads; as long as the enqueue rate roughly matches the dequeue rate, capacity is not an issue. Even if enqueues outpace dequeues, the dequeue rate "buys time" for the test to complete before capacity is reached.

Initial examination of the failed secondary node's journal shows ~39k enqueues and no dequeues, which is right at the threshold for a default journal. What is unclear is which of these enqueues can be attributed to the initial sync with the primary node and which occurred subsequently (if any). This exposes two possible modes of failure:

a) the primary is (somehow) able to maintain its 39k+ messages while the secondary is reaching its threshold (a size mismatch), and the failure occurs within the initial sync period;

b) the initial sync completes correctly, but subsequent updates are selective and only the enqueues reach the secondary as perftest continues its activity on the primary.

Investigating further, the message size of 8 in this test allows messages on the primary to fit in 1 dblk (with 3 bytes to spare), but on the secondary, because of the additional header info, 2 dblks are consumed per message. This means the secondary consumes twice the journal space for the same number of messages as the primary. Rerunning this test with a message size of 12 rather than 8 (which forces both the primary and the secondary to consume 2 dblks) causes an almost immediate enqueue capacity threshold error on the primary. As this behaviour is now understood and expected, this bug is set to CLOSED/NOTABUG.

Reopening this bug, as a question has arisen about whether different nodes should have differing header sizes. This behaviour greatly increases the probability that cluster nodes will exit one at a time as they hit the enqueue threshold. The desired behaviour is a synchronized exception coming from all nodes at the same time, which results in a reject of the enqueued message rather than the nodes exiting.

What is the question? I think we should recommend that cluster nodes have identical configuration in all respects. Is there something particular about the journal parameters?

I didn't ask any question... however, the issue appears to be that a message takes up more space on new nodes than it does on nodes that were part of the cluster when it was published. I believe it was the exchange header that was different(?). See comment #3 above.

Understood, I'll investigate why this discrepancy arises.

Fixed in r889813

Reproduced on RHEL5/x86_64 with qpidd-0.5.752581-33.el5 and rhm-0.5.3206-26.el5:

```
# grep -i "Enqueue capacity threshold exceeded" /tmp/qpidd*.log
/tmp/qpidd1.log:2010-oct-04 08:46:56 warning Broker closed connection: 501, Enqueue capacity threshold exceeded on queue "perftest0". (JournalImpl.cpp:611)
/tmp/qpidd1.log:2010-oct-04 08:46:56 error Unbinding update queue perftest0: framing-error: Enqueue capacity threshold exceeded on queue "perftest0". (JournalImpl.cpp:611)
/tmp/qpidd1.log:2010-oct-04 08:46:56 error 10.16.64.153:26294(UPDATER) error sending update: Enqueue capacity threshold exceeded on queue "perftest0". (JournalImpl.cpp:611)
/tmp/qpidd2.log:2010-oct-04 08:46:56 warning Journal "perftest0": Enqueue capacity threshold exceeded on queue "perftest0".
/tmp/qpidd2.log:2010-oct-04 08:46:56 error Unexpected exception: Enqueue capacity threshold exceeded on queue "perftest0". (JournalImpl.cpp:611)
/tmp/qpidd2.log:2010-oct-04 08:46:56 error Connection 10.16.64.153:41592 closed by error: Enqueue capacity threshold exceeded on queue "perftest0". (JournalImpl.cpp:611)(501)
```

Regression found on RHEL5/x86_64 with current packages:

```
qpid-cpp-client-ssl-0.7.946106-17.el5
qpid-java-common-0.7.946106-10.el5
qpid-cpp-server-devel-0.7.946106-17.el5
qpid-cpp-client-0.7.946106-17.el5
qpid-cpp-server-ssl-0.7.946106-17.el5
qpid-tools-0.7.946106-11.el5
qpid-cpp-client-devel-0.7.946106-17.el5
python-qpid-0.7.946106-14.el5
qpid-cpp-client-devel-docs-0.7.946106-17.el5
qpid-java-client-0.7.946106-10.el5
qpidc-debuginfo-0.5.752581-33.el5
qpid-cpp-server-store-0.7.946106-17.el5
qpid-cpp-server-0.7.946106-17.el5
qpid-dotnet-0.4.738274-2.el5
qpid-cpp-server-cluster-0.7.946106-17.el5
qpid-cpp-server-xml-0.7.946106-17.el5
```

Broker log:

```
2010-10-04 10:08:18 notice cluster(10.16.64.153:28518 READY) last broker standing, update queue policies
2010-10-04 10:08:19 warning Timer callback overran by 3ms [taking 8000ns]
2010-10-04 10:08:19 warning Timer callback overran by 3ms [taking 2000ns]
2010-10-04 10:08:19 warning Journal "qpid-perftest0": Enqueue capacity threshold exceeded on queue "qpid-perftest0".
2010-10-04 10:08:19 error Unexpected exception: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
2010-10-04 10:08:19 error Connection 127.0.0.1:42476 closed by error: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)(501)
```

Perftest:

```
23825.8 14544.4 24645.4 0.188029
23148.9 18034.8 37104.3 0.283083
2010-10-04 10:08:19 warning Broker closed connection: 501, Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
PublishThread exception: framing-error: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
```

Regression confirmed also for RHEL5/x86 packages.

From the log in comment 12 it appears that the capacity exception occurred on the established broker, not on the broker that was joining the cluster. The "last broker standing" message is only logged on an established broker. That would not be a bug; it is simply the perftest filling up the queue faster than it is consumed and exceeding the store limit.

Thanks Alan, then it's not a regression. Now we have to wait for bug 639994, which blocks us - qpidd crash during shutdown.

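The space accounting described in the opening analysis can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the dblk size of 128 bytes is an assumption, the ~12.5 MB capacity and 256-byte record footprint are the figures quoted above, and the 80% threshold is the behaviour described later in this thread.

```bash
#!/bin/bash
# Back-of-the-envelope check of the journal space accounting discussed above
# (assumed values; not taken from the store source).

MSGS=50000            # perftest --count 50000
DBLK=128              # assumed journal data-block (dblk) size in bytes
CAPACITY=12500000     # default journal capacity quoted above (~12.5 MB)
THRESHOLD_PCT=80      # enqueue threshold, per the store description later in the thread

# Per the analysis: an 8-byte message fits in 1 dblk on the primary, but needs
# 2 dblks on a newly joined node because of the additional header info.
primary_total=$((1 * DBLK * MSGS))     #  6,400,000 bytes
secondary_total=$((2 * DBLK * MSGS))   # 12,800,000 bytes, i.e. the 12.8 MB > 12.5 MB above

echo "primary total:   ${primary_total} bytes"
echo "secondary total: ${secondary_total} bytes (capacity ${CAPACITY})"

# Number of 2-dblk records that fit before the ~80% enqueue threshold trips:
threshold_records=$(( CAPACITY * THRESHOLD_PCT / 100 / (2 * DBLK) ))
echo "records until threshold: ~${threshold_records}"   # ~39000, matching the ~39k enqueues seen
```

Under these assumptions the new node runs out of space roughly 39k messages into the update, even though the established node, whose records are half the size, is nowhere near its own limit.
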
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Adding a new node to a cluster no longer fails with a journal capacity exception.

This is still marked needinfo from me - what info is needed?

Retested with an interesting result: the above exception is thrown on RHEL5 x86_64 only with auth=yes; all other cases run without the exception. At the moment the broker does not leave the cluster. Packages:

```
python-qpid-0.7.946106-15.el5
qpid-cpp-client-0.7.946106-27.el5
qpid-cpp-client-devel-0.7.946106-27.el5
qpid-cpp-client-devel-docs-0.7.946106-27.el5
qpid-cpp-client-rdma-0.7.946106-27.el5
qpid-cpp-client-ssl-0.7.946106-27.el5
qpid-cpp-mrg-debuginfo-0.7.946106-27.el5
qpid-cpp-server-0.7.946106-27.el5
qpid-cpp-server-cluster-0.7.946106-27.el5
qpid-cpp-server-devel-0.7.946106-27.el5
qpid-cpp-server-rdma-0.7.946106-27.el5
qpid-cpp-server-ssl-0.7.946106-27.el5
qpid-cpp-server-store-0.7.946106-27.el5
qpid-cpp-server-xml-0.7.946106-27.el5
qpid-dotnet-0.4.738274-2.el5
qpid-java-client-0.7.946106-14.el5
qpid-java-common-0.7.946106-14.el5
qpid-java-example-0.7.946106-14.el5
qpid-tools-0.7.946106-12.el5
rh-qpid-cpp-tests-0.7.946106-27.el5
```

OK, the issue has now been retested, with an interesting observation:

```
# broker as cluster node
[root@mrg-qe-10 bz509796_639994]# qpid-cluster
  Cluster Name: fcluster
Cluster Status: ACTIVE
  Cluster Size: 1
       Members: ID=10.34.45.10:11062 URL=amqp:tcp:10.34.33.63:5672,tcp:10.34.44.10:5672,tcp:10.34.45.10:5672
[root@mrg-qe-10 bz509796_639994]# qpid-config queues | grep perft
[root@mrg-qe-10 bz509796_639994]# qpid-perftest --durable true --count 50000 --size 8 --summary --username guest --password guest
2011-01-25 16:40:27 warning Broker closed connection: 501, Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
PublishThread exception: framing-error: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)

# broker as standalone
[root@mrg-qe-10 bz509796_639994]# vi /etc/qpidd.conf
[root@mrg-qe-10 bz509796_639994]# service qpidd restart
Stopping Qpid AMQP daemon: qpidd
Starting Qpid AMQP daemon: qpidd
[root@mrg-qe-10 bz509796_639994]# qpid-stat -b
Brokers
  broker          cluster       uptime  conn  sess  exch  queue
  ===============================================================
  localhost:5672  <standalone>  8s      2     2     8     6
[root@mrg-qe-10 bz509796_639994]# qpid-perftest --durable true --count 50000 --size 8 --summary --username guest --password guest
41096.7 41028.9 80447.2 0.613763

# broker as cluster node
[root@mrg-qe-10 bz509796_639994]# vi /etc/qpidd.conf
[root@mrg-qe-10 bz509796_639994]# service qpidd restart
Stopping Qpid AMQP daemon: qpidd
Starting Qpid AMQP daemon: qpidd
[root@mrg-qe-10 bz509796_639994]# qpid-stat -b
Brokers
  broker            cluster           uptime  conn  sess  exch  queue
  =====================================================================
  10.34.33.63:5672  fcluster(ACTIVE)  4s      2     2     8     6
[root@mrg-qe-10 bz509796_639994]# qpid-perftest --durable true --count 50000 --size 8 --summary --username guest --password guest
2011-01-25 16:42:20 warning Broker closed connection: 501, Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
PublishThread exception: framing-error: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)

[root@mrg-qe-10 ~]# uname -a
Linux mrg-qe-10.lab.eng.brq.redhat.com 2.6.18-238.1.1.el5 #1 SMP Tue Jan 4 13:32:19 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
[root@mrg-qe-10 ~]# rpm -qa | grep qpid | sort
python-qpid-0.7.946106-15.el5
qpid-cpp-client-0.7.946106-27.el5
qpid-cpp-client-devel-0.7.946106-27.el5
qpid-cpp-client-devel-docs-0.7.946106-27.el5
qpid-cpp-client-rdma-0.7.946106-27.el5
qpid-cpp-client-ssl-0.7.946106-27.el5
qpid-cpp-mrg-debuginfo-0.7.946106-27.el5
qpid-cpp-server-0.7.946106-27.el5
qpid-cpp-server-cluster-0.7.946106-27.el5
qpid-cpp-server-devel-0.7.946106-27.el5
qpid-cpp-server-rdma-0.7.946106-27.el5
qpid-cpp-server-ssl-0.7.946106-27.el5
qpid-cpp-server-store-0.7.946106-27.el5
qpid-cpp-server-xml-0.7.946106-27.el5
qpid-dotnet-0.4.738274-2.el5
qpid-java-client-0.7.946106-14.el5
qpid-java-common-0.7.946106-14.el5
qpid-java-example-0.7.946106-14.el5
qpid-tools-0.7.946106-12.el5
rh-qpid-cpp-tests-0.7.946106-27.el5
```

The lines above clearly show that running

```
qpid-perftest --durable true --count 50000 --size 8 --summary --username guest --password guest
```

fails with the journal exception when authentication is enabled on the broker, the broker is in a cluster, and the machine is 64-bit - but it should not. There is no such behaviour on a 64-bit machine with auth=no, and no such behaviour on a 32-bit machine. Another interesting fact is that I was unable to reproduce this on a RHEL 5.6 x86_64 VM, which indicates that it might be hardware related (the affected machines are AMD Opterons).

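For reference, the manual comparison above (clustered with auth, standalone, clustered again) could be scripted roughly as follows. This is only a sketch of the procedure, not part of the original report: it assumes the broker is switched between modes by toggling the auth= and cluster-name= lines in /etc/qpidd.conf, which is what the vi steps above appear to do.

```bash
#!/bin/bash
# Hypothetical helper reproducing the comparison above: run the same durable
# perftest against the broker in different configurations. Assumes qpidd reads
# /etc/qpidd.conf and that auth/cluster-name are the only settings toggled.

CONF=/etc/qpidd.conf
PERFTEST='qpid-perftest --durable true --count 50000 --size 8 --summary --username guest --password guest'

run_case() {
    local label=$1 auth=$2 comment=$3
    # rewrite the two relevant settings, leaving the rest of the file alone
    sed -i -e "s/^auth=.*/auth=${auth}/" \
           -e "s/^#*cluster-name=/${comment}cluster-name=/" "$CONF"
    service qpidd restart
    sleep 2
    echo "=== ${label} ==="
    eval "$PERFTEST"
}

run_case "cluster, auth=yes"    yes ""    # observed above: enqueue capacity exception
run_case "standalone, auth=yes" yes "#"   # observed above: passes
run_case "cluster, auth=no"     no  ""    # observed above: passes
```
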
Retested according to the described test scenario; the issue becomes visible after a couple of hours of looping. I believe the cause of this issue is connected with comment 19. The test result is as follows (after about a 5-hour run).

Test tail transcript:

```
run11063: 22061.2 19026.1 38660.9 0.294959
-------------------------------------------------------
run11064: 31266.8 30040.2 61601.2 0.46998
-------------------------------------------------------
run11065: 30808.2 28911 59241.2 0.451975
-------------------------------------------------------
run11066: 31865.7 30435.4 62445.9 0.476425
-------------------------------------------------------
2011-01-25 22:45:02 warning Broker closed connection: 501, Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
2011-01-25 22:45:02 warning Connection [34469 localhost:5672] closed
2011-01-25 22:45:02 warning Connection [34470 localhost:5672] closed
run11067: PublishThread exception: framing-error: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
Controller exception: Connection [34470 localhost:5672] closed
SubscribeThread exception: Connection [34469 localhost:5672] closed
-------------------------------------------------------
2011-01-25 22:45:02 warning Connection [34476 localhost:5672] closed
run11068: Connection [34476 localhost:5672] closed
-------------------------------------------------------
broker exited!
```

Broker log tail:

```
2011-01-25 22:45:00 notice cluster(10.34.45.9:24684 UPDATER) sending update to 10.34.45.11:1401 at amqp:tcp:10.34.33.64:5672,tcp:10.34.44.11:5672,tcp:10.34.45.11:5672
2011-01-25 22:45:00 info Connection connected to tcp:10.34.33.64:5672
2011-01-25 22:45:01 info Connection connected to tcp:10.34.33.64:5672
2011-01-25 22:45:01 info Connection connected to tcp:10.34.33.64:5672
2011-01-25 22:45:01 info Connection connected to tcp:10.34.33.64:5672
2011-01-25 22:45:01 info Connection connected to tcp:10.34.33.64:5672
2011-01-25 22:45:01 warning Broker closed connection: 200, OK
2011-01-25 22:45:01 notice cluster(10.34.45.9:24684 UPDATER) update sent
2011-01-25 22:45:01 info cluster(10.34.45.9:24684 READY) member update: 10.34.45.9:24684(member) 10.34.45.11:1401(member) frameSeq=1727220900
2011-01-25 22:45:01 notice Running in a cluster, marking store dirty.
2011-01-25 22:45:01 notice cluster(10.34.45.9:24684 READY) last broker standing joined by 1 replicas, updating queue policies.
2011-01-25 22:45:02 warning Journal "qpid-perftest0": Enqueue capacity threshold exceeded on queue "qpid-perftest0".
2011-01-25 22:45:02 error Unexpected exception: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
2011-01-25 22:45:02 error Connection 127.0.0.1:34468 closed by error: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)(501)
2011-01-25 22:45:02 critical cluster(10.34.45.9:24684 READY/error) local error 1727363037 did not occur on member 10.34.45.11:1401: Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616)
2011-01-25 22:45:02 critical Error delivering frames: local error did not occur on all cluster members : Enqueue capacity threshold exceeded on queue "qpid-perftest0". (JournalImpl.cpp:616) (qpid/cluster/ErrorCheck.cpp:89)
2011-01-25 22:45:02 notice cluster(10.34.45.9:24684 LEFT/error) leaving cluster fclusterA
2011-01-25 22:45:02 notice Shut down
```

Reproducers:
- ./node1.sh on the machine where one broker runs continuously and perftest loops
- ./node2.sh on the machine which repeatedly restarts its broker

```
[root@mrg-qe-09 bz509796_639994]# cat ./node1.sh
#!/bin/bash
service qpidd stop
rm -rf /var/lib/qpidd/*
service qpidd start
netstat -nlp | grep qpidd
qpid-cluster
i=0
while true; do
  out=$(qpid-perftest --durable true --count $((50000+${RANDOM}%1000)) --size 8 --summary)
  ecode=$?
  echo "run${i}: ${out} -------------------------------------------------------"
  [ "${ecode}" != "0" ] && break
  let "i++"
  usleep $(( ${RANDOM} + ${RANDOM} ))
done
echo "broker exited!"
# eof

[root@mrg-qe-09 bz509796_639994]# cat ./node2.sh
#!/bin/bash
#2 node thread
service qpidd stop
rm -rf /var/lib/qpidd/*
i=0
while true; do
  echo "run${i} ---------------------------------------------------------------"
  service qpidd start
  # wait 1 + 2typ (0...4)
  usleep $(( 1000000 + ((${RANDOM} % 2000) * 2000) ))
  qpid-cluster
  service qpidd stop
  # wait 0.5 + 1typ (0...2)
  usleep $(( 500000 + ((${RANDOM} % 2000) * 1000) ))
  let "i++"
done
# eof
```

The broker config was:

```
cluster-mechanism=ANONYMOUS
#auth=yes
auth=no
log-to-file=/tmp/qpidd.log
log-enable=info+
#log-enable=debug+:cluster
cluster-name=fclusterA
mgmt-pub-interval=2
```

Moving the defect to ASSIGNED, as I am able to see cluster width reduction due to the journal exception, which should not be thrown: running

```
qpid-perftest --durable true --count $((50000+${RANDOM}%1000)) --size 8 --summary
```

should not trigger the journal exception.

The above test data show that the journal exception was thrown after a couple of hours, during the 11067th run of the perftest. -> ASSIGNED

Comment 18 and comment 19 are a different issue: they are the expected behaviour of a client that overruns queue limits. auth=yes adds an additional header to the messages, and I'm guessing that journal entries take more space on 64-bit (kpvdr?), so the two must be combining to push the total size over the journal limit. Comment 20 looks similar to the original issue and needs investigation. Can I just confirm that both brokers in the test had identical config with auth=no? Is there any chance that one broker had auth=yes and the other auth=no?

Thanks for the update. Regarding comment 18 and comment 19: if it is confirmed that authentication info takes more space on 64-bit, then this is fine. Regarding comment 20: I can confirm that the case from comment 20 was run with auth=no on both machines (RHEL 5.6, one i[36]86, the other x86_64).

Using the reproducer in comment 20 I reproduced the issue on two x86_64 hosts, so the difference in architecture is probably not an issue. In this case it was node1 (the stable node) that crashed, not node2 (the one joining and leaving).

I don't think this is a regression. I think it's an example of why we originally introduced the inconsistent error mechanism: brokers with different histories do not have identical capacity to enqueue messages (e.g. in this case the new broker's journal was probably more compact because it didn't have as many dequeue records). So I think this is a case of the design working as expected (although obviously not as desired): the broker that is unable to store messages that are successfully stored on other brokers leaves the cluster. A better solution to the persistent cluster problem is probably something for the new cluster work, and may also require more transparency from the store.

I'm going to retest to confirm Alan's comment 23. ETC 2011-02-11. (Raising NEEDINFO for myself to keep this in the top 10 TODO tasks.)

The issue from comment 19 is addressed as separate bug 676627.

I'm reliably able to trigger the original issue with the modified reproducer from comment 20 on these packages:

```
python-qpid-0.7.946106-15.el5
qpid-cpp-client-0.7.946106-28.el5
qpid-cpp-client-devel-0.7.946106-28.el5
qpid-cpp-client-devel-docs-0.7.946106-28.el5
qpid-cpp-client-rdma-0.7.946106-28.el5
qpid-cpp-client-ssl-0.7.946106-28.el5
qpid-cpp-mrg-debuginfo-0.7.946106-28.el5
qpid-cpp-server-0.7.946106-28.el5
qpid-cpp-server-cluster-0.7.946106-28.el5
qpid-cpp-server-devel-0.7.946106-28.el5
qpid-cpp-server-rdma-0.7.946106-28.el5
qpid-cpp-server-ssl-0.7.946106-28.el5
qpid-cpp-server-store-0.7.946106-28.el5
qpid-cpp-server-xml-0.7.946106-28.el5
qpid-dotnet-0.4.738274-2.el5
qpid-java-client-0.7.946106-15.el5
qpid-java-common-0.7.946106-15.el5
qpid-java-example-0.7.946106-15.el5
qpid-tools-0.7.946106-12.el5
rh-qpid-cpp-tests-0.7.946106-28.el5
```

The repro patch is:

```
- out=$(qpid-perftest --durable true --count $((50000+${RANDOM}%1000)) --size 8 --summary)
+ out=$(qpid-perftest --durable true --count $((50000+${RANDOM}%1000)) --size 8 --summary --username guest --password guest)
```

and of course auth=yes in qpidd.conf. Keeping ASSIGNED.

I can confirm the issue is seen on 1.3.3 packages:

```
# rpm -qa | grep qpid | sort -u
python-qpid-0.7.946106-15.el5
qpid-cpp-client-0.7.946106-34.el5
qpid-cpp-client-devel-0.7.946106-34.el5
qpid-cpp-client-devel-docs-0.7.946106-34.el5
qpid-cpp-client-ssl-0.7.946106-34.el5
qpid-cpp-server-0.7.946106-34.el5
qpid-cpp-server-cluster-0.7.946106-34.el5
qpid-cpp-server-devel-0.7.946106-34.el5
qpid-cpp-server-ssl-0.7.946106-34.el5
qpid-cpp-server-store-0.7.946106-34.el5
qpid-cpp-server-xml-0.7.946106-34.el5
qpid-java-client-0.10-6.el5
qpid-java-common-0.10-6.el5
qpid-java-example-0.10-6.el5
qpid-tools-0.7.946106-12.el5
```

```
2011-08-02 13:16:07 debug cluster(10.34.44.23:9252 READY) replicated connection 10.34.33.84:5672-10.34.33.84:47004(10.34.44.22:25262-6304 shadow)
2011-08-02 13:16:07 debug cluster(10.34.44.23:9252 READY) closed connection 127.0.0.1:5672-127.0.0.1:34159(10.34.44.23:9252-4052 local)
2011-08-02 13:16:08 debug cluster(10.34.44.23:9252 READY/error) connection error 56218379 on 10.34.33.84:5672-10.34.33.84:47002(10.34.44.22:25262-6302 shadow) must be resolved with: 10.34.44.21:16272 10.34.44.22:25262 10.34.44.23:9252 10.34.44.24:2664 : Enqueue capacity threshold exceeded on queue "mrg-qe-17.lab.eng.brq.redhat.com_5406630000". (JournalImpl.cpp:616)
2011-08-02 13:16:08 debug cluster(10.34.44.23:9252 READY/error) error 56218379 resolved with 10.34.44.23:9252
2011-08-02 13:16:08 debug cluster(10.34.44.23:9252 READY/error) error 56218379 must be resolved with 10.34.44.21:16272 10.34.44.22:25262 10.34.44.24:2664
2011-08-02 13:16:08 critical cluster(10.34.44.23:9252 READY/error) local error 56218379 did not occur on member 10.34.44.21:16272: Enqueue capacity threshold exceeded on queue "mrg-qe-17.lab.eng.brq.redhat.com_5406630000". (JournalImpl.cpp:616)
2011-08-02 13:16:08 critical Error delivering frames: local error did not occur on all cluster members : Enqueue capacity threshold exceeded on queue "mrg-qe-17.lab.eng.brq.redhat.com_5406630000". (JournalImpl.cpp:616) (qpid/cluster/ErrorCheck.cpp:89)
2011-08-02 13:16:08 notice cluster(10.34.44.23:9252 LEFT/error) leaving cluster mrg-qe-1.3.3-clusterA
```

(In reply to comment #0) On 1.3.3 this happened on a settled 4-node cluster; we then continued without joining a new node, using the test scenario described in Bug 726379, section "Steps to Reproduce". The cluster worked for 3 hours 5 minutes, then 2 of the 4 nodes left the cluster with the result from comment #28.

> Description of problem:
>
> Adding a new node to a cluster failed with a journal capacity exception.
> However the new node was using the default journal size options, as was the
> existing cluster node, so it is not clear why this error should occur.
>
> Version-Release number of selected component (if applicable):
>
> qpidd-0.5.752581-22.el5
> rhm-0.5.3206-5.el5
>
> How reproducible:
>
> Fairly easily
>
> Steps to Reproduce:
> 1. start one cluster node
> 2. run: while perftest --durable true --count 50000 --size 8 --summary; do true; done
> 3. while this is running, add a new node to the cluster, then stop it, then re-add it, and repeat this several times until eventually you see an error
>
> Actual results:
>
> 2009-jul-06 05:07:10 notice Journal "perftest0": Created
> 2009-jul-06 05:07:11 warning Journal "perftest0": Enqueue capacity threshold exceeded on queue "perftest0".
> 2009-jul-06 05:07:11 error Unexpected exception: Enqueue capacity threshold exceeded on queue "perftest0". (JournalImpl.cpp:576)
> 2009-jul-06 05:07:11 error Connection 10.16.44.221:46092 closed by error: Enqueue capacity threshold exceeded on queue "perftest0". (JournalImpl.cpp:576)(501)
> 2009-jul-06 05:07:11 error Channel exception: not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:40)
> 2009-jul-06 05:07:11 critical 10.16.44.221:19499(UPDATEE) catch-up connection closed prematurely 10.16.44.221:19499-1(local,catchup)
> 2009-jul-06 05:07:11 notice 10.16.44.221:19499(LEFT) leaving cluster grs-mrg14-test-cluster
> 2009-jul-06 05:07:11 notice Shut down
>
> Expected results:
>
> Should always be able to add the new node, since the queue on the original node in the cluster has the same capacity.
>
> Additional info:

From comment #2, 'the total enqueued message depth at 12.8MB - which is greater than the default journal capacity of 12.5MB', i.e. if testing with 50,000 messages.

It is important to distinguish two cases: (i) general capacity issues for nodes already in the cluster (comment #29 sounds like it is in this category, where two nodes leave a cluster), and (ii) a capacity error on updating a new member (which was the original issue and seems to me to also be the case for comment #20).

The first case cannot be ruled out, though it is unlikely. The exact history may differ between nodes, and this history may affect the actual capacity. Thus the actual capacity may vary by node and may, in theory, on occasion be insufficient for the defined perftest run.

I can't, however, explain why the second case is ever legitimate if the nodes all have the same configuration and are on the same architecture. The new member should, if anything, have *greater* capacity than the member from which it is receiving the update. A new member by definition has no history and therefore has the maximum actual capacity when receiving the update. Since the existing member giving the update has not hit its capacity, I don't understand why the new member ever would.

For the purposes of any automated testing around this bug, I would suggest that we reduce the count or increase the journal to the point where we can be certain no journal capacity problems can ever occur, even if all the messages are enqueued at once. That will rule out any false negatives.

From comment 30: another approach to avoid this in tests would be to set a discard policy on the queues that is lower than the journal capacity. Queue policy calculations are consistent in a cluster, so this won't cause brokers to crash.

That would still result in occasional hangs (or errors, if reject was used instead) in perftest. You really want the test to run smoothly unless there is a genuine bug uncovered.

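As an illustration of the two mitigations suggested above (a larger journal, or a queue limit below the journal capacity), a test setup could look roughly like the sketch below. This is only a sketch: the store option names (num-jfiles, jfile-size-pgs) and the 64 KiB page size are assumptions based on the MRG-era legacy store and should be checked against the installed broker's help output.

```bash
#!/bin/bash
# Sketch: make the store large enough that a full perftest run fits even if
# nothing is dequeued. Store option names are assumptions; verify with
# `qpidd --help` on the target system.

MSGS=51000          # upper bound used by the reproducer (50000 + RANDOM%1000)
REC_BYTES=256       # worst-case per-record footprint discussed above (2 dblks)
PAGE=65536          # assumed 64 KiB journal page size

need=$(( MSGS * REC_BYTES * 100 / 80 ))   # stay below the ~80% enqueue threshold
echo "journal must hold at least ${need} bytes"

# Assumed defaults are 8 files x 24 pages, i.e. the ~12.5 MB mentioned above.
# 16 files x 48 pages x 64 KiB is roughly 50 MB, comfortably above 'need'.
cat >> /etc/qpidd.conf <<'EOF'
num-jfiles=16
jfile-size-pgs=48
EOF
service qpidd restart
```

Alternatively, as suggested in the comment above, a queue limit policy set lower than the journal capacity (for example a ring or reject policy configured via qpid-config) would keep the queue depth bounded, at the cost of perftest occasionally stalling or losing messages.
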
Assuming that the size of headers is always the same, let me shed some light on how enqueue threshold exception decisions are made in the store:

* The store uses a fixed number of fixed-size files in a circular buffer.
* Each file contains a header that is overwritten each time that file is written.
* Any file that contains an unconsumed record (or part of one) cannot be overwritten at all, even if that record is at the end of the file, because once the header is overwritten it becomes more problematic to recover.
* The method for determining whether to enqueue a record is as follows: starting with the current write position in the currently written file, add the size of the record to be written and then jump ahead another 20% of the total journal size. In whichever file this places you, check whether any unconsumed record is present. If not, the enqueue proceeds; if there is a record anywhere in this file, the enqueue is rejected with the Enqueue Threshold error.

This method has an inherent uncertainty of one journal file in the point of rejection, depending on where the current write point is relative to the next unconsumed record. In the most optimistic scenario the error occurs at 80% capacity; in the most pessimistic scenario it occurs at 80% less one file size. Clearly the number of files used in a journal influences this uncertainty. If the minimum number of files is used (4), then one file represents 25% of the capacity, so the error can occur anywhere between 55% and 80%. Eight files (the default) yields 67.5% to 80%, and the maximum number (currently 64 files) yields 78.4% to 80%.

Clusters whose nodes join at different times are subject to this uncertainty. The simplest strategy to mitigate this risk in testing is to ensure the store is big enough to accommodate all messages. However, a large number of small files gives a narrower uncertainty window than a similarly sized store composed of fewer, larger files. Of course, using more files consumes file handles, so care needs to be taken that there are sufficient handles available for the test in question.

I think the summary is: if your store isn't large enough to handle your peak load, then it's possible for clustered brokers to exit with this error. We should re-run the tests with a large enough store and see if there are still any issues.

Yes, agreed (though I still maintain that there should *never* be a capacity error on an update).
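
To make the uncertainty window described above concrete, here is a small sketch that reproduces the quoted percentages from the number of journal files. It only encodes the rule stated in the explanation (error at 80% in the best case, 80% minus one file in the worst case); it is not taken from the store source.

```bash
#!/bin/bash
# Threshold window as a function of the number of journal files, per the
# description above: best case 80% of capacity, worst case 80% minus one file.

for nfiles in 4 8 64; do
    # work in hundredths of a percent to keep the arithmetic in integers
    best=8000
    worst=$(( 8000 - 10000 / nfiles ))
    printf "%2d files: error between %d.%02d%% and %d.%02d%% of capacity\n" \
        "$nfiles" $((worst / 100)) $((worst % 100)) $((best / 100)) $((best % 100))
done
# prints: 4 files 55.00-80.00%, 8 files 67.50-80.00%, 64 files 78.44-80.00%
# (the last matching the 78.4% quoted above)
```

This reproduces the 55-80%, 67.5-80% and 78.4-80% windows quoted above, and shows why many small files give a tighter, more predictable rejection point than a few large ones.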