Description of problem:

An active-active clustered qpidd broker crashes in ha.so under the cluster stress generated by failover_soak. The crashes are observed during active-active clustering when the new ha plugin is installed but not requested (ha-cluster defaults to 0/no).

All qpidd crashes (SIGSEGV) are located around qpid::ha::HaPlugin::earlyInitialize() -> qpid::ha::HaBroker::HaBroker():

./core.8882: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from 'qpidd --cluster-name soakTestCluster_38f11756-fce3-4683-8e14-5251d6f6499b --aut'

Core was generated by `qpidd --cluster-name soakTestCluster_38f11756-fce3-4683-8e14-5251d6f6499b --aut'.
Program terminated with signal 11, Segmentation fault.
#0 getSystemId (this=0x99e1978, b=..., s=...) at qpid/broker/System.h:53
53 framing::Uuid getSystemId() const { return systemId; }
(gdb)
eax 0x0 0
ecx 0x6b6f7242 1802465858
...
edi 0x6336e4 6502116
eip 0x5f02e5 0x5f02e5 <qpid::ha::HaBroker::HaBroker(qpid::broker::Broker&, qpid::ha::Settings const&)+133>
eflags 0x10246 [ PF ZF IF RF ]
...
(*): Shared library is missing debugging information.
(gdb)
  2 Thread 0xb785bb70 (LWP 8883) 0x005a2424 in __kernel_vsyscall ()
* 1 Thread 0xb785c730 (LWP 8882) getSystemId (this=0x99e1978, b=..., s=...) at qpid/broker/System.h:53

Thread 1 (Thread 0xb785c730 (LWP 8882)):
#0 getSystemId (this=0x99e1978, b=..., s=...) at qpid/broker/System.h:53
#1 qpid::ha::HaBroker::HaBroker (this=0x99e1978, b=..., s=...) at qpid/ha/HaBroker.cpp:69
#2 0x005f6376 in qpid::ha::HaPlugin::earlyInitialize (this=0x6336e0, target=...) at qpid/ha/HaPlugin.cpp:74
#3 0x00441a41 in operator() (t=...) at /usr/include/boost/bind/mem_fn_template.hpp:162
...

Crashes were seen on all supported OSes / archs, and all of them followed the path qpid::ha::HaPlugin::earlyInitialize() -> qpid::ha::HaBroker::HaBroker() [-> getSystemId ()].
Version-Release number of selected component (if applicable):

python-qpid-0.18-4.el6.noarch
python-qpid-qmf-0.18-6.el6.i686
python-saslwrapper-0.18-1.el6_3.i686
qpid-cpp-client-0.18-8.el6.i686
qpid-cpp-client-devel-0.18-8.el6.i686
qpid-cpp-client-devel-docs-0.14-22.el6_3.noarch
qpid-cpp-client-rdma-0.18-8.el6.i686
qpid-cpp-client-ssl-0.18-8.el6.i686
qpid-cpp-debuginfo-0.18-8.el6.i686
qpid-cpp-server-0.18-8.el6.i686
qpid-cpp-server-cluster-0.18-8.el6.i686
qpid-cpp-server-devel-0.18-8.el6.i686
qpid-cpp-server-ha-0.18-8.el6.i686
qpid-cpp-server-rdma-0.18-8.el6.i686
qpid-cpp-server-ssl-0.18-8.el6.i686
qpid-cpp-server-store-0.18-8.el6.i686
qpid-cpp-server-xml-0.18-8.el6.i686
qpid-java-client-0.18-5.el6.noarch
qpid-java-common-0.18-5.el6.noarch
qpid-java-example-0.18-5.el6.noarch
qpid-qmf-0.18-6.el6.i686
qpid-qmf-debuginfo-0.18-6.el6.i686
qpid-qmf-devel-0.18-6.el6.i686
qpid-tests-0.18-2.el6.noarch
qpid-tools-0.18-5.el6.noarch
rh-qpid-cpp-tests-0.18-8.el6.i686
ruby-1.8.7.352-7.el6_2.i686
ruby-devel-1.8.7.352-7.el6_2.i686
ruby-libs-1.8.7.352-7.el6_2.i686
ruby-qpid-0.7.946106-2.el6.i686
ruby-qpid-qmf-0.18-6.el6.i686
ruby-saslwrapper-0.18-1.el6_3.i686
saslwrapper-0.18-1.el6_3.i686
saslwrapper-debuginfo-0.18-1.el6_3.i686
saslwrapper-devel-0.18-1.el6_3.i686
sesame-1.0-8.el6.i686
sesame-debuginfo-1.0-8.el6.i686

python-qpid-0.18-4.el5
python-qpid-qmf-0.18-6.el5
python-saslwrapper-0.18-1.el5
qpid-cpp-client-0.18-7.el5
qpid-cpp-client-devel-0.18-7.el5
qpid-cpp-client-devel-docs-0.18-7.el5
qpid-cpp-client-rdma-0.18-7.el5
qpid-cpp-client-ssl-0.18-7.el5
qpid-cpp-mrg-debuginfo-0.18-7.el5
qpid-cpp-server-0.18-7.el5
qpid-cpp-server-cluster-0.18-7.el5
qpid-cpp-server-devel-0.18-7.el5
qpid-cpp-server-ha-0.18-7.el5
qpid-cpp-server-rdma-0.18-7.el5
qpid-cpp-server-ssl-0.18-7.el5
qpid-cpp-server-store-0.18-7.el5
qpid-cpp-server-xml-0.18-7.el5
qpid-java-client-0.18-5.el5
qpid-java-common-0.18-5.el5
qpid-java-example-0.18-5.el5
qpid-qmf-0.18-6.el5
qpid-qmf-debuginfo-0.18-6.el5
qpid-qmf-devel-0.18-6.el5
qpid-tests-0.18-2.el5
qpid-tools-0.18-5.el5
rhm-docs-0.10-2.el5
rh-qpid-cpp-tests-0.18-7.el5
ruby-1.8.5-24.el5
ruby-devel-1.8.5-24.el5
ruby-libs-1.8.5-24.el5
ruby-qpid-qmf-0.18-6.el5
ruby-saslwrapper-0.18-1.el5
saslwrapper-0.18-1.el5
saslwrapper-debuginfo-0.18-1.el5
saslwrapper-devel-0.18-1.el5
sesame-1.0-7.el5
sesame-debuginfo-1.0-7.el5

How reproducible:
80%

Steps to Reproduce:
1. Run failover_soak in a loop, as qpid_ptest_cluster_failover_soak does.
2. Watch for qpidd crashes.

Actual results:
qpidd crashes.

Expected results:
qpidd should not crash.

Additional info:
Is it possible that management was disabled on the broker where these crashes occurred, i.e. with the configuration setting mgmt-enable=no?

There is a bug in the case where mgmt-enable=no that would give exactly these results.
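The failure mode suggested here can be sketched in a few lines of C++. This is only an illustration, not the actual qpid code and not the actual fix: `System`, `Broker`, and `systemIdOrThrow` are hypothetical stand-ins, and the sketch assumes that the system object is created only when management is enabled, which would be consistent with the null value in eax at the point of the crash.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the broker's system object (NOT the real qpid class).
class System {
  public:
    explicit System(const std::string& id) : systemId(id) {}
    const std::string& getSystemId() const { return systemId; }
  private:
    std::string systemId;
};

// Hypothetical stand-in for the broker (NOT the real qpid class).
// Assumption for illustration: the system object exists only when
// management is enabled (mgmt-enable=yes).
class Broker {
  public:
    explicit Broker(bool mgmtEnabled) {
        if (mgmtEnabled) system = std::make_shared<System>("example-id");
    }
    std::shared_ptr<System> getSystem() const { return system; }
  private:
    std::shared_ptr<System> system;
};

// Unguarded pattern (crashes when management is off):
//     b.getSystem()->getSystemId()
// Guarded pattern: check the pointer before dereferencing it.
std::string systemIdOrThrow(const Broker& b) {
    std::shared_ptr<System> s = b.getSystem();
    if (!s) {
        throw std::runtime_error("system object unavailable (management disabled?)");
    }
    return s->getSystemId();
}
```

With management enabled the helper returns the id; with management disabled it reports an error instead of dereferencing a null pointer. Whether the real code should throw, skip the plugin, or obtain the id another way is a design decision this sketch does not settle.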
(In reply to comment #4)
> Is it possible that management was disabled on the broker where these
> crashes occurred? I.e. configuration setting mgmt-enable=no
>
> There is bug in the case where mgmt-enable=no that would give exactly these
> results.

I can confirm that in all the cases when we saw it:
- management was turned off (mgmt-enable=no)
- tcp-no-delay was used

[15:25:46] .running core test (./failover_soak ... ) MSG:256352, DUR:1, MDL:/usr/lib/qpid/daemon, MAN:no, QPIDD_CONFIG:all, QPIDD_NO_TCPDELAY:yes, N_QUEUES:3, N_BROKERS:3
./runtest.sh: line 294: 8881 Aborted (core dumped) ./failover_soak $MODULES ./declare_queues ./replaying_sender ./resuming_receiver $MESSAGES $REPORT_FREQUENCY $VERBOSITY $DURABILITY $N_QUEUES $N_BROKERS > ${TEMP_FILE} 2>&1
[15:25:47] ..ERROR:core test failed! (ecode:1340000,client err_cnt:0|0,broker err_cnt:0|0)

[15:32:42] .running core test (./failover_soak ... ) MSG:208389, DUR:0, MDL:/usr/lib/qpid/daemon, MAN:no, QPIDD_CONFIG:all, QPIDD_NO_TCPDELAY:yes, N_QUEUES:1, N_BROKERS:3
./runtest.sh: line 294: 12637 Aborted (core dumped) ./failover_soak $MODULES ./declare_queues ./replaying_sender ./resuming_receiver $MESSAGES $REPORT_FREQUENCY $VERBOSITY $DURABILITY $N_QUEUES $N_BROKERS > ${TEMP_FILE} 2>&1
[15:32:43] ..ERROR:core test failed! (ecode:1340000,client err_cnt:0|0,broker err_cnt:0|0)

[17:28:35] .running core test (./failover_soak ... ) MSG:277935, DUR:1, MDL:/usr/lib64/qpid/daemon, MAN:no, QPIDD_CONFIG:all, QPIDD_NO_TCPDELAY:yes, N_QUEUES:1, N_BROKERS:3
./runtest.sh: line 294: 92195 Aborted (core dumped) ./failover_soak $MODULES ./declare_queues ./replaying_sender ./resuming_receiver $MESSAGES $REPORT_FREQUENCY $VERBOSITY $DURABILITY $N_QUEUES $N_BROKERS > ${TEMP_FILE} 2>&1
[17:28:36] ..ERROR:core test failed! (ecode:1340000,client err_cnt:0|0,broker err_cnt:0|0)

[17:31:02] .running core test (./failover_soak ... ) MSG:119763, DUR:0, MDL:/usr/lib64/qpid/daemon, MAN:no, QPIDD_CONFIG:all, QPIDD_NO_TCPDELAY:yes, N_QUEUES:3, N_BROKERS:3
./runtest.sh: line 294: 93937 Aborted (core dumped) ./failover_soak $MODULES ./declare_queues ./replaying_sender ./resuming_receiver $MESSAGES $REPORT_FREQUENCY $VERBOSITY $DURABILITY $N_QUEUES $N_BROKERS > ${TEMP_FILE} 2>&1
[17:31:03] ..ERROR:core test failed! (ecode:1340000,client err_cnt:0|0,broker err_cnt:0|0)

[11:33:45] .running core test (./failover_soak ... ) MSG:239310, DUR:1, MDL:/usr/lib64/qpid/daemon, MAN:no, QPIDD_CONFIG:all, QPIDD_NO_TCPDELAY:yes, N_QUEUES:2, N_BROKERS:3
./runtest.sh: line 294: 13690 Aborted (core dumped) ./failover_soak $MODULES ./declare_queues ./replaying_sender ./resuming_receiver $MESSAGES $REPORT_FREQUENCY $VERBOSITY $DURABILITY $N_QUEUES $N_BROKERS > ${TEMP_FILE} 2>&1
[11:33:45] ..ERROR:core test failed! (ecode:1340000,client err_cnt:0|0,broker err_cnt:0|0)

[11:34:46] .running core test (./failover_soak ... ) MSG:162971, DUR:0, MDL:/usr/lib64/qpid/daemon, MAN:no, QPIDD_CONFIG:all, QPIDD_NO_TCPDELAY:yes, N_QUEUES:2, N_BROKERS:3
./runtest.sh: line 294: 14418 Aborted (core dumped) ./failover_soak $MODULES ./declare_queues ./replaying_sender ./resuming_receiver $MESSAGES $REPORT_FREQUENCY $VERBOSITY $DURABILITY $N_QUEUES $N_BROKERS > ${TEMP_FILE} 2>&1
[11:34:46] ..ERROR:core test failed! (ecode:1340000,client err_cnt:0|0,broker err_cnt:0|0)
...
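To check the same correlation over a longer test log, the parameter portion of each "running core test" entry can be parsed mechanically. The following is a minimal sketch under stated assumptions: the "KEY:value, KEY:value" layout is inferred from the excerpts above, MAN:no is taken to mean mgmt-enable=no, and `parseParams` and `matchesCrashConfig` are hypothetical helper names, not part of the test suite.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Parse a "KEY:value, KEY:value, ..." parameter string into a map.
// Only the first ':' in each field separates key from value, so paths
// such as MDL:/usr/lib/qpid/daemon are kept intact.
std::map<std::string, std::string> parseParams(const std::string& line) {
    std::map<std::string, std::string> out;
    std::istringstream in(line);
    std::string field;
    while (std::getline(in, field, ',')) {
        std::string::size_type start = field.find_first_not_of(' ');
        if (start == std::string::npos) continue;   // empty field
        std::string::size_type colon = field.find(':', start);
        if (colon == std::string::npos) continue;   // no key/value separator
        out[field.substr(start, colon - start)] = field.substr(colon + 1);
    }
    return out;
}

// True when a run matches the configuration reported for every crash:
// management off (MAN:no) and TCP no-delay on (QPIDD_NO_TCPDELAY:yes).
bool matchesCrashConfig(const std::map<std::string, std::string>& p) {
    std::map<std::string, std::string>::const_iterator man = p.find("MAN");
    std::map<std::string, std::string>::const_iterator nd = p.find("QPIDD_NO_TCPDELAY");
    return man != p.end() && man->second == "no" &&
           nd != p.end() && nd->second == "yes";
}
```

The input is expected to be just the parameter list that follows "(./failover_soak ... )" in a log entry, for example "MSG:256352, DUR:1, MDL:/usr/lib/qpid/daemon, MAN:no, ...".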
The patch from comment 8 seems to have moved the problem to another place; that issue is tracked as bug 882149.
The issue has been fixed; no other crashes were detected.

Tested on RHEL5.9rc/6.3 i[36]86/x86_64 using packages:

python-qpid-0.18-4.el6.noarch
python-qpid-qmf-0.18-13.el6.i686
python-saslwrapper-0.18-1.el6_3.i686
qpid-cpp-client-0.18-13.el6.i686
qpid-cpp-client-devel-0.18-13.el6.i686
qpid-cpp-client-devel-docs-0.14-22.el6_3.noarch
qpid-cpp-client-rdma-0.18-13.el6.i686
qpid-cpp-client-ssl-0.18-13.el6.i686
qpid-cpp-debuginfo-0.18-13.el6.i686
qpid-cpp-server-0.18-13.el6.i686
qpid-cpp-server-cluster-0.18-13.el6.i686
qpid-cpp-server-devel-0.18-13.el6.i686
qpid-cpp-server-ha-0.18-13.el6.i686
qpid-cpp-server-rdma-0.18-13.el6.i686
qpid-cpp-server-ssl-0.18-13.el6.i686
qpid-cpp-server-store-0.18-13.el6.i686
qpid-cpp-server-xml-0.18-13.el6.i686
qpid-java-client-0.18-6.el6.noarch
qpid-java-common-0.18-6.el6.noarch
qpid-java-example-0.18-6.el6.noarch
qpid-qmf-0.18-13.el6.i686
qpid-qmf-debuginfo-0.18-13.el6.i686
qpid-qmf-devel-0.18-13.el6.i686
qpid-tests-0.18-2.el6.noarch
qpid-tools-0.18-7.el6_3.noarch
rhm-docs-0.10-2.el6.noarch
rh-qpid-cpp-tests-0.18-13.el6.i686
ruby-qpid-qmf-0.18-13.el6.i686
ruby-saslwrapper-0.18-1.el6_3.i686
saslwrapper-0.18-1.el6_3.i686
saslwrapper-debuginfo-0.18-1.el6_3.i686
saslwrapper-devel-0.18-1.el6_3.i686
sesame-1.0-8.el6.i686
sesame-debuginfo-1.0-8.el6.i686

-> VERIFIED