Created attachment 488764 [details]
Stack trace

While running scalability tests against in-tree builds of r.1085065/r.4447, the following assertion was observed:

lt-qpidd: ./qpid/broker/AsyncCompletion.h:167: void qpid::broker::AsyncCompletion::end(qpid::broker::AsyncCompletion::Callback&): Assertion `completionsNeeded.get() > 0' failed.

Two test loops in a row have resulted in this failure at around iteration 8 or 9 (see below), but other test loops have succeeded without showing this problem, so it may be probabilistic in nature.

Steps to reproduce:

Two boxes, mrg42 and mrg43, with 10g interfaces enabled as 20.0.10.{42,43}.

Both boxes have modified environments:
  limits.conf: nofile: 65536
  syscfg.conf: fs.aio-max-nr: 262144

In addition, mrg42 (which runs qpid-perftest) has:
  ulimits.conf: nproc: 65536

The broker is run as follows on mrg43:

  rm -rf /tmp/rhm; ./qpidd --auth no -m no --max-connections 65100 --load-module /home/kpvdr/mrg/store/lib/.libs/msgstore.so --store-dir /tmp --jfile-size-pgs 48 --num-jfiles 16 --log-enable info+

The client is run 10 times in a row against the broker in a bash loop on mrg42 using the 10g interface:

  ./qpid-perftest --mode shared --summary --pub-confirm no --sync-publish no --sub-ack 0 -b 20.0.10.43 --npubs 1 --qt 10000 --nsubs 1 --count 100

Note that although the store is loaded, the test is a transient test.
Created attachment 488771 [details]
Second stack trace

This stack trace shows that the broker was almost idle at the time of the failure.
Most likely caused by a code path that is "completing" the enqueue more than once. No obvious candidate in the stack trace - will have to reproduce.
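To illustrate the suspected failure mode, here is a minimal sketch of a completion counter that gets "completed" once more than it was started. This is NOT the actual qpid::broker::AsyncCompletion code: the class CompletionCounter, its methods start(), finish() and outstanding(), and the helper doubleCompletionDetected() are hypothetical names made up for this example. Where the broker asserts, the sketch returns false so that the extra completion can be observed without aborting.

```cpp
#include <atomic>

// Hypothetical, simplified stand-in for a completion counter.
// It is not the real qpid::broker::AsyncCompletion implementation.
class CompletionCounter {
    std::atomic<int> needed{0};   // completions still outstanding

public:
    // Register one more party that must complete before the operation is done.
    void start() { needed.fetch_add(1); }

    // Record one completion. Returns false when there is nothing left to
    // complete, i.e. the caller has "completed" more often than start()
    // was called. The broker's assertion would fire in that situation.
    bool finish() {
        int cur = needed.load();
        while (cur > 0) {
            // On failure compare_exchange_weak reloads cur, so the loop
            // re-checks the current value before retrying.
            if (needed.compare_exchange_weak(cur, cur - 1)) return true;
        }
        return false;
    }

    int outstanding() const { return needed.load(); }
};

// One start() followed by two finish() calls: the second call is the
// kind of extra completion suspected in this report.
bool doubleCompletionDetected() {
    CompletionCounter c;
    c.start();
    bool first = c.finish();    // legitimate completion
    bool second = c.finish();   // extra completion, rejected
    return first && !second;
}
```

In this sketch the second finish() is rejected instead of driving the count negative. The point is only that any code path calling the completion routine twice for a single enqueue would trip a check of the form "count > 0".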
Please note similarity to stack trace that I found in https://bugzilla.redhat.com/show_bug.cgi?id=692546 . ( Thanks for noticing it, Gordon! ) The stacks are identical from level 9 up! I have an ultra-low-frequency (almost useless) reproducer. I would rather not close that bug yet, just in case that avenue leads to an idea.
*** Bug 692546 has been marked as a duplicate of this bug. ***
Upstream JIRA: https://issues.apache.org/jira/browse/QPID-3174
Potential fix submitted upstream: Committed revision 1087868. http://svn.apache.org/viewvc?view=revision&revision=1087868
A further, similar change was committed as http://svn.apache.org/viewvc?rev=1088539&view=rev and both changes were merged to the 0.10 release branch: http://svn.apache.org/viewvc?rev=1088634&view=rev
This BZ was fixed prior to -4 release - I missed setting the state to "MODIFIED"...
Verified on rhel5 / rhel6 - both i686 / x86_64 (1000 runs)

rpm -qa | grep qpid
python-qpid-0.10-1.el5
qpid-cpp-server-xml-0.10-7.el5
qpid-qmf-devel-0.10-10.el5
qpid-cpp-client-0.10-7.el5
qpid-java-client-0.10-6.el5
qpid-cpp-client-devel-0.10-7.el5
qpid-cpp-server-devel-0.10-7.el5
qpid-java-common-0.10-6.el5
qpid-qmf-0.10-10.el5
qpid-cpp-client-ssl-0.10-7.el5
qpid-cpp-server-cluster-0.10-7.el5
qpid-cpp-server-0.10-7.el5
qpid-java-example-0.10-6.el5
python-qpid-qmf-0.10-10.el5
qpid-cpp-client-devel-docs-0.10-7.el5
qpid-cpp-server-ssl-0.10-7.el5
qpid-tools-0.10-5.el5
qpid-cpp-server-store-0.10-7.el5

--> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0890.html