Bug 875660 - Active-active clustered qpidd broker crashes under cluster stress by failover_soak in ha.so around qpid::ha::HaPlugin::earlyInitialize() -> qpid::ha::HaBroker::HaBroker()
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: Development
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: 2.3
Target Release: ---
Assigned To: Alan Conway
QA Contact: Frantisek Reznicek
Keywords: OtherQA
Depends On: 882149
Blocks:
Reported: 2012-11-12 05:18 EST by Frantisek Reznicek
Modified: 2015-11-15 20:14 EST
CC List: 5 users

See Also:
Fixed In Version: qpid-cpp-0.18-10
Doc Type: Bug Fix
Last Closed: 2013-03-19 12:37:34 EDT
Type: Bug
Attachments: None

Description Frantisek Reznicek 2012-11-12 05:18:28 EST
Description of problem:

An active-active clustered qpidd broker crashes in ha.so under the cluster stress generated by failover_soak.

The crashes are observed during active-active clustering when the new HA plugin is installed but not requested (ha-cluster defaults to 0/no).

All qpidd crashes (SIGSEGV) occur in the path qpid::ha::HaPlugin::earlyInitialize() -> qpid::ha::HaBroker::HaBroker():


  ./core.8882: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from 'qpidd --cluster-name soakTestCluster_38f11756-fce3-4683-8e14-5251d6f6499b --aut'
    Core was generated by `qpidd --cluster-name soakTestCluster_38f11756-fce3-4683-8e14-5251d6f6499b --aut'.
    Program terminated with signal 11, Segmentation fault.
    #0  getSystemId (this=0x99e1978, b=..., s=...) at qpid/broker/System.h:53
    53	    framing::Uuid getSystemId() const  { return systemId; }
    (gdb) eax            0x0	0
    ecx            0x6b6f7242	1802465858
    ...
    edi            0x6336e4	6502116
    eip            0x5f02e5	0x5f02e5 <qpid::ha::HaBroker::HaBroker(qpid::broker::Broker&, qpid::ha::Settings const&)+133>
    eflags         0x10246	[ PF ZF IF RF ]
    ...
    (*): Shared library is missing debugging information.
    (gdb)   2 Thread 0xb785bb70 (LWP 8883)  0x005a2424 in __kernel_vsyscall ()
    * 1 Thread 0xb785c730 (LWP 8882)  getSystemId (this=0x99e1978, b=..., s=...) at qpid/broker/System.h:53
    Thread 1 (Thread 0xb785c730 (LWP 8882)):
    #0  getSystemId (this=0x99e1978, b=..., s=...) at qpid/broker/System.h:53
    #1  qpid::ha::HaBroker::HaBroker (this=0x99e1978, b=..., s=...) at qpid/ha/HaBroker.cpp:69
    #2  0x005f6376 in qpid::ha::HaPlugin::earlyInitialize (this=0x6336e0, target=...) at qpid/ha/HaPlugin.cpp:74
    #3  0x00441a41 in operator() (t=...) at /usr/include/boost/bind/mem_fn_template.hpp:162
    ...


Crashes were seen on all supported OSes and architectures, and all of them occurred in the path
qpid::ha::HaPlugin::earlyInitialize() -> qpid::ha::HaBroker::HaBroker() [-> getSystemId()].

Version-Release number of selected component (if applicable):
  python-qpid-0.18-4.el6.noarch
  python-qpid-qmf-0.18-6.el6.i686
  python-saslwrapper-0.18-1.el6_3.i686
  qpid-cpp-client-0.18-8.el6.i686
  qpid-cpp-client-devel-0.18-8.el6.i686
  qpid-cpp-client-devel-docs-0.14-22.el6_3.noarch
  qpid-cpp-client-rdma-0.18-8.el6.i686
  qpid-cpp-client-ssl-0.18-8.el6.i686
  qpid-cpp-debuginfo-0.18-8.el6.i686
  qpid-cpp-server-0.18-8.el6.i686
  qpid-cpp-server-cluster-0.18-8.el6.i686
  qpid-cpp-server-devel-0.18-8.el6.i686
  qpid-cpp-server-ha-0.18-8.el6.i686
  qpid-cpp-server-rdma-0.18-8.el6.i686
  qpid-cpp-server-ssl-0.18-8.el6.i686
  qpid-cpp-server-store-0.18-8.el6.i686
  qpid-cpp-server-xml-0.18-8.el6.i686
  qpid-java-client-0.18-5.el6.noarch
  qpid-java-common-0.18-5.el6.noarch
  qpid-java-example-0.18-5.el6.noarch
  qpid-qmf-0.18-6.el6.i686
  qpid-qmf-debuginfo-0.18-6.el6.i686
  qpid-qmf-devel-0.18-6.el6.i686
  qpid-tests-0.18-2.el6.noarch
  qpid-tools-0.18-5.el6.noarch
  rh-qpid-cpp-tests-0.18-8.el6.i686
  ruby-1.8.7.352-7.el6_2.i686
  ruby-devel-1.8.7.352-7.el6_2.i686
  ruby-libs-1.8.7.352-7.el6_2.i686
  ruby-qpid-0.7.946106-2.el6.i686
  ruby-qpid-qmf-0.18-6.el6.i686
  ruby-saslwrapper-0.18-1.el6_3.i686
  saslwrapper-0.18-1.el6_3.i686
  saslwrapper-debuginfo-0.18-1.el6_3.i686
  saslwrapper-devel-0.18-1.el6_3.i686
  sesame-1.0-8.el6.i686
  sesame-debuginfo-1.0-8.el6.i686


  python-qpid-0.18-4.el5
  python-qpid-qmf-0.18-6.el5
  python-saslwrapper-0.18-1.el5
  qpid-cpp-client-0.18-7.el5
  qpid-cpp-client-devel-0.18-7.el5
  qpid-cpp-client-devel-docs-0.18-7.el5
  qpid-cpp-client-rdma-0.18-7.el5
  qpid-cpp-client-ssl-0.18-7.el5
  qpid-cpp-mrg-debuginfo-0.18-7.el5
  qpid-cpp-server-0.18-7.el5
  qpid-cpp-server-cluster-0.18-7.el5
  qpid-cpp-server-devel-0.18-7.el5
  qpid-cpp-server-ha-0.18-7.el5
  qpid-cpp-server-rdma-0.18-7.el5
  qpid-cpp-server-ssl-0.18-7.el5
  qpid-cpp-server-store-0.18-7.el5
  qpid-cpp-server-xml-0.18-7.el5
  qpid-java-client-0.18-5.el5
  qpid-java-common-0.18-5.el5
  qpid-java-example-0.18-5.el5
  qpid-qmf-0.18-6.el5
  qpid-qmf-debuginfo-0.18-6.el5
  qpid-qmf-devel-0.18-6.el5
  qpid-tests-0.18-2.el5
  qpid-tools-0.18-5.el5
  rhm-docs-0.10-2.el5
  rh-qpid-cpp-tests-0.18-7.el5
  ruby-1.8.5-24.el5
  ruby-devel-1.8.5-24.el5
  ruby-libs-1.8.5-24.el5
  ruby-qpid-qmf-0.18-6.el5
  ruby-saslwrapper-0.18-1.el5
  saslwrapper-0.18-1.el5
  saslwrapper-debuginfo-0.18-1.el5
  saslwrapper-devel-0.18-1.el5
  sesame-1.0-7.el5
  sesame-debuginfo-1.0-7.el5


How reproducible:
80%

Steps to Reproduce:
1. Run failover_soak in a loop, as qpid_ptest_cluster_failover_soak does.
2. Watch for qpidd crashes.

  
Actual results:
qpidd crashes.

Expected results:
qpidd should not crash.

Additional info:
Comment 4 Alan Conway 2012-11-13 14:21:10 EST
Is it possible that management was disabled on the broker where these crashes occurred, i.e. with the configuration setting mgmt-enable=no?

There is a bug in the case where mgmt-enable=no that would give exactly these results.
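
The failure mode suggested here can be sketched as follows. This is a minimal illustration, not the actual qpid source: Broker, System, systemIdOrEmpty and checkGuard are hypothetical stand-ins, and the sketch assumes (the report does not confirm this) that the broker's system object is only created when management is enabled, so that getSystem() returns null under mgmt-enable=no.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for the broker's system object.
// NOT the real qpid::broker::System.
struct System {
    std::string systemId;
    const std::string& getSystemId() const { return systemId; }
};

// Hypothetical stand-in for the broker.
// NOT the real qpid::broker::Broker.
class Broker {
  public:
    // Assumption: the system object only exists when management is enabled.
    explicit Broker(bool mgmtEnabled) {
        if (mgmtEnabled) {
            system_ = std::make_shared<System>();
            system_->systemId = "example-system-id";
        }
    }
    // Returns null when management is disabled.
    const System* getSystem() const { return system_.get(); }

  private:
    std::shared_ptr<System> system_;
};

// Unguarded code such as broker.getSystem()->getSystemId() would
// dereference a null pointer when management is disabled.
// This guarded variant checks the pointer first and falls back to an
// empty id instead of crashing.
std::string systemIdOrEmpty(const Broker& broker) {
    const System* system = broker.getSystem();
    if (system == nullptr) return std::string();
    return system->getSystemId();
}

// Small self-check that can be called from main().
void checkGuard() {
    assert(systemIdOrEmpty(Broker(true)) == "example-system-id");
    assert(systemIdOrEmpty(Broker(false)).empty());
}
```

With the guard in place, systemIdOrEmpty(Broker(false)) returns an empty string rather than reading through a null pointer. Whether the actual fix took this form is not stated in this report.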
Comment 6 Frantisek Reznicek 2012-11-14 04:29:41 EST
(In reply to comment #4)
> Is it possible that management was disabled on the broker where these
> crashes occurred? I.e. configuration setting mgmt-enable=no
> 
> There is bug in the case where mgmt-enable=no that would give exactly these
> results.


I can confirm that in all the cases where we saw the crash:
- management was turned off (mgmt-enable=no)
- tcp-no-delay was used


[15:25:46] .running core test (./failover_soak ... ) MSG:256352, DUR:1, MDL:/usr/lib/qpid/daemon, MAN:no, QPIDD_CONFIG:all, QPIDD_NO_TCPDELAY:yes, N_QUEUES:3, N_BROKERS:3
./runtest.sh: line 294:  8881 Aborted                 (core dumped) ./failover_soak $MODULES ./declare_queues ./replaying_sender ./resuming_receiver $MESSAGES $REPORT_FREQUENCY $VERBOSITY $DURABILITY $N_QUEUES $N_BROKERS > ${TEMP_FILE} 2>&1
[15:25:47] ..ERROR:core test failed! (ecode:1340000,client err_cnt:0|0,broker err_cnt:0|0)
[15:32:42] .running core test (./failover_soak ... ) MSG:208389, DUR:0, MDL:/usr/lib/qpid/daemon, MAN:no, QPIDD_CONFIG:all, QPIDD_NO_TCPDELAY:yes, N_QUEUES:1, N_BROKERS:3
./runtest.sh: line 294: 12637 Aborted                 (core dumped) ./failover_soak $MODULES ./declare_queues ./replaying_sender ./resuming_receiver $MESSAGES $REPORT_FREQUENCY $VERBOSITY $DURABILITY $N_QUEUES $N_BROKERS > ${TEMP_FILE} 2>&1
[15:32:43] ..ERROR:core test failed! (ecode:1340000,client err_cnt:0|0,broker err_cnt:0|0)


[17:28:35] .running core test (./failover_soak ... ) MSG:277935, DUR:1, MDL:/usr/lib64/qpid/daemon, MAN:no, QPIDD_CONFIG:all, QPIDD_NO_TCPDELAY:yes, N_QUEUES:1, N_BROKERS:3
./runtest.sh: line 294: 92195 Aborted                 (core dumped) ./failover_soak $MODULES ./declare_queues ./replaying_sender ./resuming_receiver $MESSAGES $REPORT_FREQUENCY $VERBOSITY $DURABILITY $N_QUEUES $N_BROKERS > ${TEMP_FILE} 2>&1
[17:28:36] ..ERROR:core test failed! (ecode:1340000,client err_cnt:0|0,broker err_cnt:0|0)
[17:31:02] .running core test (./failover_soak ... ) MSG:119763, DUR:0, MDL:/usr/lib64/qpid/daemon, MAN:no, QPIDD_CONFIG:all, QPIDD_NO_TCPDELAY:yes, N_QUEUES:3, N_BROKERS:3
./runtest.sh: line 294: 93937 Aborted                 (core dumped) ./failover_soak $MODULES ./declare_queues ./replaying_sender ./resuming_receiver $MESSAGES $REPORT_FREQUENCY $VERBOSITY $DURABILITY $N_QUEUES $N_BROKERS > ${TEMP_FILE} 2>&1
[17:31:03] ..ERROR:core test failed! (ecode:1340000,client err_cnt:0|0,broker err_cnt:0|0)

[11:33:45] .running core test (./failover_soak ... ) MSG:239310, DUR:1, MDL:/usr/lib64/qpid/daemon, MAN:no, QPIDD_CONFIG:all, QPIDD_NO_TCPDELAY:yes, N_QUEUES:2, N_BROKERS:3
./runtest.sh: line 294: 13690 Aborted                 (core dumped) ./failover_soak $MODULES ./declare_queues ./replaying_sender ./resuming_receiver $MESSAGES $REPORT_FREQUENCY $VERBOSITY $DURABILITY $N_QUEUES $N_BROKERS > ${TEMP_FILE} 2>&1
[11:33:45] ..ERROR:core test failed! (ecode:1340000,client err_cnt:0|0,broker err_cnt:0|0)
[11:34:46] .running core test (./failover_soak ... ) MSG:162971, DUR:0, MDL:/usr/lib64/qpid/daemon, MAN:no, QPIDD_CONFIG:all, QPIDD_NO_TCPDELAY:yes, N_QUEUES:2, N_BROKERS:3
./runtest.sh: line 294: 14418 Aborted                 (core dumped) ./failover_soak $MODULES ./declare_queues ./replaying_sender ./resuming_receiver $MESSAGES $REPORT_FREQUENCY $VERBOSITY $DURABILITY $N_QUEUES $N_BROKERS > ${TEMP_FILE} 2>&1
[11:34:46] ..ERROR:core test failed! (ecode:1340000,client err_cnt:0|0,broker err_cnt:0|0)
...
Comment 9 Frantisek Reznicek 2012-11-30 04:42:52 EST
The patch from comment 8 seems to have moved the problem elsewhere; the new crash is tracked as bug 882149.
Comment 10 Frantisek Reznicek 2012-12-16 06:14:06 EST
The issue has been fixed; no further crashes were detected.
Tested on RHEL 5.9rc/6.3, i[36]86/x86_64, using packages:

  python-qpid-0.18-4.el6.noarch
  python-qpid-qmf-0.18-13.el6.i686
  python-saslwrapper-0.18-1.el6_3.i686
  qpid-cpp-client-0.18-13.el6.i686
  qpid-cpp-client-devel-0.18-13.el6.i686
  qpid-cpp-client-devel-docs-0.14-22.el6_3.noarch
  qpid-cpp-client-rdma-0.18-13.el6.i686
  qpid-cpp-client-ssl-0.18-13.el6.i686
  qpid-cpp-debuginfo-0.18-13.el6.i686
  qpid-cpp-server-0.18-13.el6.i686
  qpid-cpp-server-cluster-0.18-13.el6.i686
  qpid-cpp-server-devel-0.18-13.el6.i686
  qpid-cpp-server-ha-0.18-13.el6.i686
  qpid-cpp-server-rdma-0.18-13.el6.i686
  qpid-cpp-server-ssl-0.18-13.el6.i686
  qpid-cpp-server-store-0.18-13.el6.i686
  qpid-cpp-server-xml-0.18-13.el6.i686
  qpid-java-client-0.18-6.el6.noarch
  qpid-java-common-0.18-6.el6.noarch
  qpid-java-example-0.18-6.el6.noarch
  qpid-qmf-0.18-13.el6.i686
  qpid-qmf-debuginfo-0.18-13.el6.i686
  qpid-qmf-devel-0.18-13.el6.i686
  qpid-tests-0.18-2.el6.noarch
  qpid-tools-0.18-7.el6_3.noarch
  rhm-docs-0.10-2.el6.noarch
  rh-qpid-cpp-tests-0.18-13.el6.i686
  ruby-qpid-qmf-0.18-13.el6.i686
  ruby-saslwrapper-0.18-1.el6_3.i686
  saslwrapper-0.18-1.el6_3.i686
  saslwrapper-debuginfo-0.18-1.el6_3.i686
  saslwrapper-devel-0.18-1.el6_3.i686
  sesame-1.0-8.el6.i686
  sesame-debuginfo-1.0-8.el6.i686


-> VERIFIED
