Bug 875660

Summary: Active-active clustered qpidd broker crashes under cluster stress by failover_soak in ha.so around qpid::ha::HaPlugin::earlyInitialize() -> qpid::ha::HaBroker::HaBroker()
Product: Red Hat Enterprise MRG
Reporter: Frantisek Reznicek <freznice>
Component: qpid-cpp
Assignee: Alan Conway <aconway>
Status: CLOSED CURRENTRELEASE
QA Contact: Frantisek Reznicek <freznice>
Severity: urgent
Docs Contact:
Priority: high
Version: Development
CC: esammons, iboverma, jross, lzhaldyb, mcressma
Target Milestone: 2.3
Keywords: OtherQA
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: qpid-cpp-0.18-10
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-03-19 16:37:34 UTC
Type: Bug
Bug Depends On: 882149    
Bug Blocks:    

Description Frantisek Reznicek 2012-11-12 10:18:28 UTC
Description of problem:

An active-active clustered qpidd broker crashes in ha.so under cluster stress generated by failover_soak.

The qpidd crashes are observed during active-active clustering when the new HA plugin is installed but not requested (ha-cluster defaults to 0/no).

All qpidd crashes (SIGSEGV) are located around qpid::ha::HaPlugin::earlyInitialize() -> qpid::ha::HaBroker::HaBroker():


  ./core.8882: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from 'qpidd --cluster-name soakTestCluster_38f11756-fce3-4683-8e14-5251d6f6499b --aut'
    Core was generated by `qpidd --cluster-name soakTestCluster_38f11756-fce3-4683-8e14-5251d6f6499b --aut'.
    Program terminated with signal 11, Segmentation fault.
    #0  getSystemId (this=0x99e1978, b=..., s=...) at qpid/broker/System.h:53
    53	    framing::Uuid getSystemId() const  { return systemId; }
    (gdb) eax            0x0	0
    ecx            0x6b6f7242	1802465858
    ...
    edi            0x6336e4	6502116
    eip            0x5f02e5	0x5f02e5 <qpid::ha::HaBroker::HaBroker(qpid::broker::Broker&, qpid::ha::Settings const&)+133>
    eflags         0x10246	[ PF ZF IF RF ]
    ...
    (*): Shared library is missing debugging information.
    (gdb)   2 Thread 0xb785bb70 (LWP 8883)  0x005a2424 in __kernel_vsyscall ()
    * 1 Thread 0xb785c730 (LWP 8882)  getSystemId (this=0x99e1978, b=..., s=...) at qpid/broker/System.h:53
    Thread 1 (Thread 0xb785c730 (LWP 8882)):
    #0  getSystemId (this=0x99e1978, b=..., s=...) at qpid/broker/System.h:53
    #1  qpid::ha::HaBroker::HaBroker (this=0x99e1978, b=..., s=...) at qpid/ha/HaBroker.cpp:69
    #2  0x005f6376 in qpid::ha::HaPlugin::earlyInitialize (this=0x6336e0, target=...) at qpid/ha/HaPlugin.cpp:74
    #3  0x00441a41 in operator() (t=...) at /usr/include/boost/bind/mem_fn_template.hpp:162
    ...


Crashes were seen on all supported OSes and architectures, and all of them occurred on the path
qpid::ha::HaPlugin::earlyInitialize() -> qpid::ha::HaBroker::HaBroker() [-> getSystemId ()].

Version-Release number of selected component (if applicable):
  python-qpid-0.18-4.el6.noarch
  python-qpid-qmf-0.18-6.el6.i686
  python-saslwrapper-0.18-1.el6_3.i686
  qpid-cpp-client-0.18-8.el6.i686
  qpid-cpp-client-devel-0.18-8.el6.i686
  qpid-cpp-client-devel-docs-0.14-22.el6_3.noarch
  qpid-cpp-client-rdma-0.18-8.el6.i686
  qpid-cpp-client-ssl-0.18-8.el6.i686
  qpid-cpp-debuginfo-0.18-8.el6.i686
  qpid-cpp-server-0.18-8.el6.i686
  qpid-cpp-server-cluster-0.18-8.el6.i686
  qpid-cpp-server-devel-0.18-8.el6.i686
  qpid-cpp-server-ha-0.18-8.el6.i686
  qpid-cpp-server-rdma-0.18-8.el6.i686
  qpid-cpp-server-ssl-0.18-8.el6.i686
  qpid-cpp-server-store-0.18-8.el6.i686
  qpid-cpp-server-xml-0.18-8.el6.i686
  qpid-java-client-0.18-5.el6.noarch
  qpid-java-common-0.18-5.el6.noarch
  qpid-java-example-0.18-5.el6.noarch
  qpid-qmf-0.18-6.el6.i686
  qpid-qmf-debuginfo-0.18-6.el6.i686
  qpid-qmf-devel-0.18-6.el6.i686
  qpid-tests-0.18-2.el6.noarch
  qpid-tools-0.18-5.el6.noarch
  rh-qpid-cpp-tests-0.18-8.el6.i686
  ruby-1.8.7.352-7.el6_2.i686
  ruby-devel-1.8.7.352-7.el6_2.i686
  ruby-libs-1.8.7.352-7.el6_2.i686
  ruby-qpid-0.7.946106-2.el6.i686
  ruby-qpid-qmf-0.18-6.el6.i686
  ruby-saslwrapper-0.18-1.el6_3.i686
  saslwrapper-0.18-1.el6_3.i686
  saslwrapper-debuginfo-0.18-1.el6_3.i686
  saslwrapper-devel-0.18-1.el6_3.i686
  sesame-1.0-8.el6.i686
  sesame-debuginfo-1.0-8.el6.i686


  python-qpid-0.18-4.el5
  python-qpid-qmf-0.18-6.el5
  python-saslwrapper-0.18-1.el5
  qpid-cpp-client-0.18-7.el5
  qpid-cpp-client-devel-0.18-7.el5
  qpid-cpp-client-devel-docs-0.18-7.el5
  qpid-cpp-client-rdma-0.18-7.el5
  qpid-cpp-client-ssl-0.18-7.el5
  qpid-cpp-mrg-debuginfo-0.18-7.el5
  qpid-cpp-server-0.18-7.el5
  qpid-cpp-server-cluster-0.18-7.el5
  qpid-cpp-server-devel-0.18-7.el5
  qpid-cpp-server-ha-0.18-7.el5
  qpid-cpp-server-rdma-0.18-7.el5
  qpid-cpp-server-ssl-0.18-7.el5
  qpid-cpp-server-store-0.18-7.el5
  qpid-cpp-server-xml-0.18-7.el5
  qpid-java-client-0.18-5.el5
  qpid-java-common-0.18-5.el5
  qpid-java-example-0.18-5.el5
  qpid-qmf-0.18-6.el5
  qpid-qmf-debuginfo-0.18-6.el5
  qpid-qmf-devel-0.18-6.el5
  qpid-tests-0.18-2.el5
  qpid-tools-0.18-5.el5
  rhm-docs-0.10-2.el5
  rh-qpid-cpp-tests-0.18-7.el5
  ruby-1.8.5-24.el5
  ruby-devel-1.8.5-24.el5
  ruby-libs-1.8.5-24.el5
  ruby-qpid-qmf-0.18-6.el5
  ruby-saslwrapper-0.18-1.el5
  saslwrapper-0.18-1.el5
  saslwrapper-debuginfo-0.18-1.el5
  saslwrapper-devel-0.18-1.el5
  sesame-1.0-7.el5
  sesame-debuginfo-1.0-7.el5


How reproducible:
80%

Steps to Reproduce:
1. Run failover_soak in a loop, as qpid_ptest_cluster_failover_soak does.
2. Watch for qpidd crashes.

  
Actual results:
qpidd crashes.

Expected results:
qpidd should not crash.

Additional info:

Comment 4 Alan Conway 2012-11-13 19:21:10 UTC
Is it possible that management was disabled on the broker where these crashes occurred? I.e. configuration setting mgmt-enable=no

There is a bug in the case where mgmt-enable=no that would give exactly these results.
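
For illustration only, here is a minimal sketch of the kind of failure this would be, assuming the object behind getSystemId() exists only when management is enabled. The eax value of 0x0 in the register dump is consistent with a null dereference, but this report does not show the real code or the actual fix. The System and Broker types below are hypothetical stand-ins, not the real qpid classes, and systemIdOrEmpty() is an invented helper name.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for the broker's system object (NOT the real qpid class).
struct System {
    std::string systemId;
    const std::string& getSystemId() const { return systemId; }
};

// Hypothetical stand-in for the broker (NOT the real qpid class).
struct Broker {
    // Assumption: the system object is only created when management is enabled,
    // so with mgmt-enable=no this pointer stays null.
    std::shared_ptr<System> system;

    explicit Broker(bool mgmtEnabled) {
        if (mgmtEnabled) {
            system = std::make_shared<System>();
            system->systemId = "broker-uuid";
        }
    }

    std::shared_ptr<System> getSystem() const { return system; }
};

// Guarded lookup: calling b.getSystem()->getSystemId() unconditionally would
// dereference a null pointer when management is off. Checking the pointer
// first returns an empty id instead of crashing.
std::string systemIdOrEmpty(const Broker& b) {
    std::shared_ptr<System> s = b.getSystem();
    return s ? s->getSystemId() : std::string();
}
```

With these stand-ins, systemIdOrEmpty(Broker(false)) returns an empty string rather than crashing. Whether an empty id is an acceptable fallback depends on what the caller does with it, so this shows only the guard, not a recommended behaviour.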

Comment 6 Frantisek Reznicek 2012-11-14 09:29:41 UTC
(In reply to comment #4)
> Is it possible that management was disabled on the broker where these
> crashes occurred? I.e. configuration setting mgmt-enable=no
> 
> There is bug in the case where mgmt-enable=no that would give exactly these
> results.


I can confirm that in all the cases when we saw it:
- management was turned off (mgmt-enable=no)
- tcp-no-delay was used


[15:25:46] .running core test (./failover_soak ... ) MSG:256352, DUR:1, MDL:/usr/lib/qpid/daemon, MAN:no, QPIDD_CONFIG:all, QPIDD_NO_TCPDELAY:yes, N_QUEUES:3, N_BROKERS:3
./runtest.sh: line 294:  8881 Aborted                 (core dumped) ./failover_soak $MODULES ./declare_queues ./replaying_sender ./resuming_receiver $MESSAGES $REPORT_FREQUENCY $VERBOSITY $DURABILITY $N_QUEUES $N_BROKERS > ${TEMP_FILE} 2>&1
[15:25:47] ..ERROR:core test failed! (ecode:1340000,client err_cnt:0|0,broker err_cnt:0|0)
[15:32:42] .running core test (./failover_soak ... ) MSG:208389, DUR:0, MDL:/usr/lib/qpid/daemon, MAN:no, QPIDD_CONFIG:all, QPIDD_NO_TCPDELAY:yes, N_QUEUES:1, N_BROKERS:3
./runtest.sh: line 294: 12637 Aborted                 (core dumped) ./failover_soak $MODULES ./declare_queues ./replaying_sender ./resuming_receiver $MESSAGES $REPORT_FREQUENCY $VERBOSITY $DURABILITY $N_QUEUES $N_BROKERS > ${TEMP_FILE} 2>&1
[15:32:43] ..ERROR:core test failed! (ecode:1340000,client err_cnt:0|0,broker err_cnt:0|0)


[17:28:35] .running core test (./failover_soak ... ) MSG:277935, DUR:1, MDL:/usr/lib64/qpid/daemon, MAN:no, QPIDD_CONFIG:all, QPIDD_NO_TCPDELAY:yes, N_QUEUES:1, N_BROKERS:3
./runtest.sh: line 294: 92195 Aborted                 (core dumped) ./failover_soak $MODULES ./declare_queues ./replaying_sender ./resuming_receiver $MESSAGES $REPORT_FREQUENCY $VERBOSITY $DURABILITY $N_QUEUES $N_BROKERS > ${TEMP_FILE} 2>&1
[17:28:36] ..ERROR:core test failed! (ecode:1340000,client err_cnt:0|0,broker err_cnt:0|0)
[17:31:02] .running core test (./failover_soak ... ) MSG:119763, DUR:0, MDL:/usr/lib64/qpid/daemon, MAN:no, QPIDD_CONFIG:all, QPIDD_NO_TCPDELAY:yes, N_QUEUES:3, N_BROKERS:3
./runtest.sh: line 294: 93937 Aborted                 (core dumped) ./failover_soak $MODULES ./declare_queues ./replaying_sender ./resuming_receiver $MESSAGES $REPORT_FREQUENCY $VERBOSITY $DURABILITY $N_QUEUES $N_BROKERS > ${TEMP_FILE} 2>&1
[17:31:03] ..ERROR:core test failed! (ecode:1340000,client err_cnt:0|0,broker err_cnt:0|0)

[11:33:45] .running core test (./failover_soak ... ) MSG:239310, DUR:1, MDL:/usr/lib64/qpid/daemon, MAN:no, QPIDD_CONFIG:all, QPIDD_NO_TCPDELAY:yes, N_QUEUES:2, N_BROKERS:3
./runtest.sh: line 294: 13690 Aborted                 (core dumped) ./failover_soak $MODULES ./declare_queues ./replaying_sender ./resuming_receiver $MESSAGES $REPORT_FREQUENCY $VERBOSITY $DURABILITY $N_QUEUES $N_BROKERS > ${TEMP_FILE} 2>&1
[11:33:45] ..ERROR:core test failed! (ecode:1340000,client err_cnt:0|0,broker err_cnt:0|0)
[11:34:46] .running core test (./failover_soak ... ) MSG:162971, DUR:0, MDL:/usr/lib64/qpid/daemon, MAN:no, QPIDD_CONFIG:all, QPIDD_NO_TCPDELAY:yes, N_QUEUES:2, N_BROKERS:3
./runtest.sh: line 294: 14418 Aborted                 (core dumped) ./failover_soak $MODULES ./declare_queues ./replaying_sender ./resuming_receiver $MESSAGES $REPORT_FREQUENCY $VERBOSITY $DURABILITY $N_QUEUES $N_BROKERS > ${TEMP_FILE} 2>&1
[11:34:46] ..ERROR:core test failed! (ecode:1340000,client err_cnt:0|0,broker err_cnt:0|0)
...

Comment 9 Frantisek Reznicek 2012-11-30 09:42:52 UTC
The patch from comment 8 seems to have moved the problem to another place; that is tracked as bug 882149.

Comment 10 Frantisek Reznicek 2012-12-16 11:14:06 UTC
The issue has been fixed; no further crashes were detected.
Tested on RHEL 5.9rc/6.3, i[36]86/x86_64, using packages:

  python-qpid-0.18-4.el6.noarch
  python-qpid-qmf-0.18-13.el6.i686
  python-saslwrapper-0.18-1.el6_3.i686
  qpid-cpp-client-0.18-13.el6.i686
  qpid-cpp-client-devel-0.18-13.el6.i686
  qpid-cpp-client-devel-docs-0.14-22.el6_3.noarch
  qpid-cpp-client-rdma-0.18-13.el6.i686
  qpid-cpp-client-ssl-0.18-13.el6.i686
  qpid-cpp-debuginfo-0.18-13.el6.i686
  qpid-cpp-server-0.18-13.el6.i686
  qpid-cpp-server-cluster-0.18-13.el6.i686
  qpid-cpp-server-devel-0.18-13.el6.i686
  qpid-cpp-server-ha-0.18-13.el6.i686
  qpid-cpp-server-rdma-0.18-13.el6.i686
  qpid-cpp-server-ssl-0.18-13.el6.i686
  qpid-cpp-server-store-0.18-13.el6.i686
  qpid-cpp-server-xml-0.18-13.el6.i686
  qpid-java-client-0.18-6.el6.noarch
  qpid-java-common-0.18-6.el6.noarch
  qpid-java-example-0.18-6.el6.noarch
  qpid-qmf-0.18-13.el6.i686
  qpid-qmf-debuginfo-0.18-13.el6.i686
  qpid-qmf-devel-0.18-13.el6.i686
  qpid-tests-0.18-2.el6.noarch
  qpid-tools-0.18-7.el6_3.noarch
  rhm-docs-0.10-2.el6.noarch
  rh-qpid-cpp-tests-0.18-13.el6.i686
  ruby-qpid-qmf-0.18-13.el6.i686
  ruby-saslwrapper-0.18-1.el6_3.i686
  saslwrapper-0.18-1.el6_3.i686
  saslwrapper-debuginfo-0.18-1.el6_3.i686
  saslwrapper-devel-0.18-1.el6_3.i686
  sesame-1.0-8.el6.i686
  sesame-debuginfo-1.0-8.el6.i686


-> VERIFIED