Description of problem:
Customer is seeing a clustered broker exit with an invalid-argument error.

Version-Release number of selected component (if applicable):

How reproducible:
Unknown. The customer seems to see it reliably.

Steps to Reproduce:
Unknown.

Actual results:
Clustered broker exits with an invalid-argument error.

Expected results:
No error.

Additional info:
Created attachment 481447 [details]
Log files provided by the customer

3 sets of logs: mail1 only has normal logging, so it doesn't tell us much. mail2 and mail3 have debug+:cluster and trace+:management. Run qpid/cpp/src/tests/cluster_test_logs.py in the directory with the logs; it will split them into corresponding sections, filter out expected differences, and report any unexpected differences it finds.

mail2: extra management events on qpidd-rh1.log, survivor trace:

  debug Changed V1 statistics com.redhat.rhm.store:journal:ASIC len=NN
  debug SEND V1 Multicast ContentInd to=console.obj.1.0.com.redhat.rhm.store.journal props=0 stats=1 len=NN
  trace Changed V1 statistics org.apache.qpid.broker:binding:org.apache.qpid.broker:exchange:qpid.management,org.apache.qpid.broker:queue:topic-frgpasic-vc1.ffx.jfh.com.au.12833.1,console.obj.*.*.com.redhat.rhm.store.# len=NN
  trace Changed V1 statistics org.apache.qpid.broker:binding:org.apache.qpid.broker:exchange:qpid.management,org.apache.qpid.broker:queue:topic-frgpasic-vc1.ffx.jfh.com.au.12833.2,console.obj.*.*.com.redhat.rhm.store.# len=NN

These are due to changing the journal properties in a journal timer thread. However, they don't generate an extra SEND, so it's not clear how this discrepancy causes the failure.

mail3:

1-3 vs 2-3:
- extra: debug SEND raiseEvent (v1) class=com.redhat.rhm.store.created - no relation to update

2-2 vs 1-3:
- initial extra object: com.redhat.rhm.store:journal:TplStore
- extra: debug SEND raiseEvent (v1) class=com.redhat.rhm.store.created
- 2011-02-23 08:02:17 debug Suppressing management message to console.event.1.0.com.redhat.rhm.store.created

I can't see how we are getting the extra store.created event, as this is raised from Queue::create, which is always (I think) called in a cluster-safe context. I have tried to reproduce a crash like the customer's on the -27 builds, no luck so far. I can easily reproduce the journal props discrepancy, but without triggering a failure.
I think the lazy initialisation of the TplStore is the issue. I can reproduce the error fairly simply:

1. Start one cluster node.
2. Run qpid-txtest against it.
3. Start a second cluster node.
4. Run qpid-printevents against the second node.
5. Re-run qpid-txtest.
6. Restart the second cluster node.
7. Repeat from step 5 as necessary until you see the first node shutting down.

There is no management object for the TplStore on the first broker, because the management agent is not yet set when the store is created (init() is called before managementInit() in the store plugin). However, the second node gets a truncateInit(), which happens after management is initialised, so that second node does have a TplStore management object (visible via "list journal" in qpid-tool). The TplStore is initialised on demand, and it is on initialisation that the event is raised. This can therefore happen on one node and not the other.
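The steps above can be sketched as a shell script. This is a rough sketch, not the attached reproducer: the cluster name, ports, data directories, and loop structure are assumptions, and a configured corosync/openais cluster plus the qpid tools on PATH are required.

```shell
#!/bin/sh
# Sketch of the reproduction steps (hypothetical names: cluster name,
# ports and data directories are illustrative, not from the bug report).
CLUSTER=test-cluster

# 1. Start the first cluster node.
qpidd --cluster-name $CLUSTER --port 5555 --data-dir /tmp/node1 --daemon

# 2. Run the transaction test against it.
qpid-txtest --broker localhost:5555

# 3. Start a second cluster node.
qpidd --cluster-name $CLUSTER --port 5556 --data-dir /tmp/node2 --daemon

# 4. Watch management events on the second node.
qpid-printevents localhost:5556 &

# 5-7. Re-run the test and restart the second node until node 1 exits.
while qpidd --check --port 5555 >/dev/null 2>&1; do
    qpid-txtest --broker localhost:5555
    qpidd --quit --port 5556
    qpidd --cluster-name $CLUSTER --port 5556 --data-dir /tmp/node2 --daemon
done
echo "Broker 5555 down"
```

The key point is step 6: restarting the second node forces a truncateInit() after management is up, giving that node a TplStore management object the first node lacks.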
Created attachment 481708 [details]
Shell script to reproduce the issue.

The attached script reproduces the problem: run it and note that the output ends with "Broker 5555 down" and only one broker is left running. It should instead output "Brokers 5555 5556 alive", with both brokers still running.
Fix for the above reproducer committed to the store repo, r4445.
The fix mentioned in comment 4 does address one of the issues in the customer's logs, but there is a second issue, so the fix is not complete.
Committed to store repo r4446. The complete fix is:

------------------------------------------------------------------------
r4446 | aconway | 2011-03-04 13:57:42 -0500 (Fri, 04 Mar 2011) | 6 lines

Disable management of store and journal in a cluster.

The store updates statistics and properties of store and journal
managed objects in timer threads, which can create inconsistencies
in a cluster.
------------------------------------------------------------------------
r4445 | aconway | 2011-03-01 15:08:48 -0500 (Tue, 01 Mar 2011) | 10 lines

Bug 681026 - Clustered broker exits with invalid-argument error.

In a normal start-up the TplStore had no management object because it
was created before the management infrastructure. In a cluster start-up,
when joining an existing cluster and truncating the store, the TplStore
was created after the management infrastructure and so did have a
management object. This could create a cluster inconsistency. The fix
is to never create a management object for the TplStore.
------------------------------------------------------------------------
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: running a cluster with a persistent message store, durable messages, and a management agent (e.g. qpid-tool, cumin).
Consequence: occasionally a broker would shut down with an "invalid argument" message.
Fix: corrected issues in the store that caused inconsistencies in a cluster.
Result: brokers no longer shut down.
Testing with the attached reproducer suggests the error is fixed. A second issue is mentioned in Comment 5 and Comment 6. Is that issue covered by the reproducer as well? If not, please provide a description of how to test it.
Both issues are covered by the same reproducer. The second issue was showing up less frequently than the first.
This issue has been fixed.

Verified on RHEL 5.6 and RHEL 6.1, architectures i386 and x86_64.
Successfully tested with the attached reproducer in a loop.
Test loops: 10000
Test duration: over 30 hours
The invalid-argument error did not occur.

Packages installed:
python-qpid-0.10-1.el5.noarch
python-qpid-qmf-0.10-6.el5.x86_64
qpid-cpp-client-0.10-6.el5.x86_64
qpid-cpp-client-devel-0.10-6.el5.x86_64
qpid-cpp-client-devel-docs-0.10-6.el5.x86_64
qpid-cpp-client-ssl-0.10-6.el5.x86_64
qpid-cpp-mrg-debuginfo-0.10-6.el5.x86_64
qpid-cpp-server-0.10-6.el5.x86_64
qpid-cpp-server-cluster-0.10-6.el5.x86_64
qpid-cpp-server-devel-0.10-6.el5.x86_64
qpid-cpp-server-ssl-0.10-6.el5.x86_64
qpid-cpp-server-store-0.10-6.el5.x86_64
qpid-cpp-server-xml-0.10-6.el5.x86_64
qpid-dotnet-0.4.738274-2.el5.x86_64
qpid-java-client-0.10-4.el5.noarch
qpid-java-common-0.10-4.el5.noarch
qpid-java-example-0.10-4.el5.noarch
qpid-qmf-0.10-6.el5.x86_64
qpid-qmf-debuginfo-0.10-6.el5.x86_64
qpid-qmf-devel-0.10-6.el5.x86_64
qpid-tools-0.10-4.el5.noarch

-> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0890.html