Bug 591292

Summary: MRG-M Heartbeat causes core
Product: Red Hat Enterprise MRG
Reporter: Scott Spurrier <spurrier>
Component: qpid-cpp
Assignee: Gordon Sim <gsim>
Status: CLOSED ERRATA
QA Contact: Frantisek Reznicek <freznice>
Severity: high
Docs Contact:
Priority: high
Version: 1.2
CC: esammons, freznice, gsim, tao
Target Milestone: 1.3   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, on a 2 node MRG cluster, one of the clients running on it could core after shutting down a single network interface on one of the broker nodes. With this update, clients no longer core.
Last Closed: 2010-10-14 16:07:31 UTC
Bug Blocks: 510475    
Attachments:
Description             Flags
Stack Trace             none
test config             none
amqpTest.latest         none
amqpTest_src_v1.2.2     none
another reproducer      none

Description Scott Spurrier 2010-05-11 19:48:56 UTC
Description of problem:
While testing client failover on a 2 node MRG cluster, a client cored after a single network interface was taken down on one of the broker nodes.

How reproducible:
Sporadic client cores.

Steps to Reproduce:
1. Start 25 clients that send a message to the broker cluster every 5 seconds, with the session heartbeat set to 2 seconds.
2. Start 5 clients that receive the messages, also with the session heartbeat set to 2 seconds.
3. Select one of the broker nodes and shut down its public network interface (ifdown eth0).
4. Watch for client cores. If there are none, bring the public interface back up.
5. On the other broker node, shut down its public interface (ifdown eth0).
6. Look for client cores. If there are none, repeat steps 1 - 5 until a client cores.

Comment 1 Scott Spurrier 2010-05-11 19:51:45 UTC
Created attachment 413243
Stack Trace

Comment 3 Scott Spurrier 2010-06-02 15:18:47 UTC
Created attachment 419069
test config

Comment 4 Scott Spurrier 2010-06-02 15:19:49 UTC
Created attachment 419070
amqpTest.latest

Comment 5 Scott Spurrier 2010-06-02 15:21:28 UTC
Created attachment 419071
amqpTest_src_v1.2.2

Comment 7 Gordon Sim 2010-06-04 17:31:12 UTC
Created attachment 421304
another reproducer

This program starts a variable number of connections that intermittently send and receive messages at random times, with heartbeats turned on. The first command line arg (if present) specifies the number of connections; the second defines the base name for the queues (e.g. ./test 10 abc will create 10 connections using queues abc_1 through abc_10). The heartbeat interval (2 secs by default) and the port and host of the initial broker can also be specified.
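The queue naming scheme described above can be sketched as follows. This is only an illustration of the naming convention (base name, underscore, 1-based index); the function name queueNames is hypothetical and is not taken from the attached reproducer.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: builds the queue names base_1 .. base_count,
// matching the "abc_1 through abc_10" example in the comment above.
std::vector<std::string> queueNames(const std::string& base, int count) {
    std::vector<std::string> names;
    for (int i = 1; i <= count; ++i) {
        names.push_back(base + "_" + std::to_string(i));
    }
    return names;
}
```

Under this assumption, ./test 10 abc would correspond to queueNames("abc", 10), one queue per connection.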

I used it to reproduce the problem by starting several processes, each with a different number of connections to a two node cluster, and then periodically forcing failover from one node to the other by sending the first node a STOP signal (with kill), then after a short wait sending a CONT to let it continue. This emulates failover due to a timeout, as would be observed with the loss of the network connection.

Comment 8 Gordon Sim 2010-06-04 17:48:54 UTC
Reproducer hint: a higher number of connections in the same process seems to speed up reproducing the error (e.g. ./test 100 cores pretty frequently on failover).

Comment 10 Gordon Sim 2010-06-09 11:42:23 UTC
Though much rarer, this is still an issue on the latest packages (qpid-cpp-client-0.7.946106-2):

Core was generated by `./test 1 d'.
Program terminated with signal 6, Aborted.
[New process 9569]
[New process 8382]
[New process 10221]
[New process 9552]
[New process 9536]
[New process 9527]
#0  0x00000038f9c30265 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00000038f9c30265 in raise () from /lib64/libc.so.6
#1  0x00000038f9c31d10 in abort () from /lib64/libc.so.6
#2  0x00000038fccbec44 in __gnu_cxx::__verbose_terminate_handler () from /usr/lib64/libstdc++.so.6
#3  0x00000038fccbcdb6 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x00000038fccbcde3 in std::terminate () from /usr/lib64/libstdc++.so.6
#5  0x00000038fccbd2ef in __cxa_pure_virtual () from /usr/lib64/libstdc++.so.6
#6  0x00000035b6260e76 in fire (this=<value optimized out>) at qpid/client/ConnectionImpl.cpp:152
#7  0x00000035b5dfaf60 in qpid::sys::Timer::run (this=<value optimized out>) at qpid/sys/Timer.cpp:119
#8  0x00000035b5d238da in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#9  0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#10 0x00000038f9cd3c2d in clone () from /lib64/libc.so.6
(gdb) thread apply all bt

Thread 6 (process 9527):
#0  0x00000038fa407955 in pthread_join () from /lib64/libpthread.so.0
#1  0x00000035b5d23e2d in qpid::sys::Thread::join (this=<value optimized out>) at qpid/sys/posix/Thread.cpp:70
#2  0x000000000040893d in std::for_each<boost::void_ptr_iterator<__gnu_cxx::__normal_iterator<void**, std::vector<void*, std::allocator<void*> > >, Test>, boost::_bi::bind_t<void, boost::_mfi::mf0<void, Test>, boost::_bi::list1<boost::arg<1> > > > (__first={iter_ = {_M_current = 0x0}}, __last=
        {iter_ = {_M_current = 0x2540}}, __f={f_ = {f_ = 0x409790 <Test::join()>}, l_ = {a1_ = {<No data fields>}}})
    at /usr/include/boost/bind/mem_fn_template.hpp:55
#3  0x000000000040854a in main (argc=299481680, argv=0x7fff9a063228) at test.cpp:127

Thread 5 (process 9536):
#0  0x00000038fa40ad09 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000035b62a2091 in qpid::client::StateManager::waitFor (this=<value optimized out>, desired=<value optimized out>)
    at ../include/qpid/sys/posix/Condition.h:63
#2  0x00000035b6259411 in qpid::client::ConnectionHandler::waitForOpen (this=<value optimized out>) at qpid/client/ConnectionHandler.cpp:144
#3  0x00000035b6266648 in qpid::client::ConnectionImpl::open (this=<value optimized out>) at qpid/client/ConnectionImpl.cpp:282
#4  0x00000035b6258115 in qpid::client::Connection::open (this=<value optimized out>, settings=<value optimized out>)
    at qpid/client/Connection.cpp:125
#5  0x00000035b627f4f1 in qpid::client::FailoverManager::attempt (this=<value optimized out>, c=<value optimized out>, s=<value optimized out>)
    at qpid/client/FailoverManager.cpp:121
#6  0x00000035b628047c in qpid::client::FailoverManager::attempt (this=<value optimized out>, c=<value optimized out>, s=<value optimized out>, 
    urls=<value optimized out>) at qpid/client/FailoverManager.cpp:111
#7  0x00000035b6280af5 in qpid::client::FailoverManager::connect (this=<value optimized out>, brokers=<value optimized out>)
    at qpid/client/FailoverManager.cpp:82
#8  0x00000035b628128b in qpid::client::FailoverManager::execute (this=<value optimized out>, c=<value optimized out>)
    at qpid/client/FailoverManager.cpp:44
#9  0x00000000004067bd in Test::run (this=0x11d9ba50) at test.cpp:97
#10 0x00000035b5d238da in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#11 0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#12 0x00000038f9cd3c2d in clone () from /lib64/libc.so.6

Thread 4 (process 9552):
#0  0x00000038f9cd4018 in epoll_wait () from /lib64/libc.so.6
#1  0x00000035b5d2c33f in qpid::sys::Poller::wait (this=<value optimized out>, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:524
#2  0x00000035b5d2cd22 in qpid::sys::Poller::run (this=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:479
#3  0x00000035b5d238da in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#4  0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#5  0x00000038f9cd3c2d in clone () from /lib64/libc.so.6

Thread 3 (process 10221):
#0  0x00000038f9cd4018 in epoll_wait () from /lib64/libc.so.6
#1  0x00000035b5d2c33f in qpid::sys::Poller::wait (this=<value optimized out>, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:524
#2  0x00000035b5d2cd22 in qpid::sys::Poller::run (this=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:479
#3  0x00000035b5d238da in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#4  0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#5  0x00000038f9cd3c2d in clone () from /lib64/libc.so.6

Thread 2 (process 8382):
#0  0x00000038f9cd4018 in epoll_wait () from /lib64/libc.so.6
#1  0x00000035b5d2c33f in qpid::sys::Poller::wait (this=<value optimized out>, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:524
#2  0x00000035b5d2cd22 in qpid::sys::Poller::run (this=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:479
#3  0x00000035b5d238da in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#4  0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#5  0x00000038f9cd3c2d in clone () from /lib64/libc.so.6

Thread 1 (process 9569):
#0  0x00000038f9c30265 in raise () from /lib64/libc.so.6
#1  0x00000038f9c31d10 in abort () from /lib64/libc.so.6
#2  0x00000038fccbec44 in __gnu_cxx::__verbose_terminate_handler () from /usr/lib64/libstdc++.so.6
#3  0x00000038fccbcdb6 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x00000038fccbcde3 in std::terminate () from /usr/lib64/libstdc++.so.6
#5  0x00000038fccbd2ef in __cxa_pure_virtual () from /usr/lib64/libstdc++.so.6
#6  0x00000035b6260e76 in fire (this=<value optimized out>) at qpid/client/ConnectionImpl.cpp:152
#7  0x00000035b5dfaf60 in qpid::sys::Timer::run (this=<value optimized out>) at qpid/sys/Timer.cpp:119
#8  0x00000035b5d238da in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#9  0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#10 0x00000038f9cd3c2d in clone () from /lib64/libc.so.6
(gdb)
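
In the trace above, the timer thread reaches __cxa_pure_virtual from fire() (frames #5 to #7). That handler runs when a pure virtual function is called through an object whose derived part is not (or is no longer) alive, which is a typical symptom of a timer callback racing with destruction of the object it calls into. The following is a generic sketch of one common guard against that hazard, a timer task that holds only a weak reference to its owner. All names here (Owner, HeartbeatTask, onHeartbeatTimeout) are hypothetical; this is not the qpid code and not necessarily the fix that was applied.

```cpp
#include <atomic>
#include <memory>
#include <utility>

// Hypothetical stand-in for the connection object a heartbeat timer calls into.
struct Owner {
    std::atomic<int> timeouts{0};
    void onHeartbeatTimeout() { ++timeouts; }
};

// Hypothetical timer task. It holds only a weak reference, so it never
// extends the owner's lifetime and never calls into a destroyed owner.
class HeartbeatTask {
public:
    explicit HeartbeatTask(std::weak_ptr<Owner> owner) : owner_(std::move(owner)) {}

    // Called by the timer thread. Returns true if the owner was still alive
    // and the callback was delivered, false if the owner is already gone.
    bool fire() {
        // lock() yields a shared_ptr that keeps the owner alive for the call,
        // or an empty pointer if the last strong reference has been released.
        if (std::shared_ptr<Owner> owner = owner_.lock()) {
            owner->onHeartbeatTimeout();
            return true;
        }
        return false;
    }

private:
    std::weak_ptr<Owner> owner_;
};
```

With this arrangement a task that fires after its owner has been released simply does nothing, at the cost of requiring the owner to be managed by a shared_ptr.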

Comment 11 Gordon Sim 2010-06-09 14:05:09 UTC
*** Bug 602268 has been marked as a duplicate of this bug. ***

Comment 12 Gordon Sim 2010-06-09 15:19:31 UTC
Fixed on trunk (r953032) and in release repo (http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=079143676f0881d15138059d328a5531ef6a307e).

Comment 13 Gordon Sim 2010-06-11 09:32:22 UTC
Further fix applied to trunk (953610) and release repo (http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=6227acefa2c27ad6b66ae7616e0b738ba3f9a754).

Comment 14 Frantisek Reznicek 2010-06-17 10:25:10 UTC
The issue has been fixed. Tested on RHEL 4.8 / 5.5, i386 / x86_64, with packages:
python-qmf-0.7.946106-3.el5
python-qpid-0.7.946106-1.el5
qmf-0.7.946106-3.el5
qmf-devel-0.7.946106-3.el5
qpid-cpp-client-0.7.946106-3.el5
qpid-cpp-client-devel-0.7.946106-3.el5
qpid-cpp-client-devel-docs-0.7.946106-3.el5
qpid-cpp-client-rdma-0.7.946106-3.el5
qpid-cpp-client-ssl-0.7.946106-3.el5
qpid-cpp-mrg-debuginfo-0.7.946106-3.el5
qpid-cpp-server-0.7.946106-3.el5
qpid-cpp-server-cluster-0.7.946106-3.el5
qpid-cpp-server-devel-0.7.946106-3.el5
qpid-cpp-server-rdma-0.7.946106-3.el5
qpid-cpp-server-ssl-0.7.946106-3.el5
qpid-cpp-server-store-0.7.946106-3.el5
qpid-cpp-server-xml-0.7.946106-3.el5
qpid-java-client-0.7.946106-3.el5
qpid-java-common-0.7.946106-3.el5
qpid-tools-0.7.946106-4.el5
rh-qpid-cpp-tests-0.7.946106-3.el5
ruby-qmf-0.7.946106-3.el5

-> VERIFIED

Comment 15 Martin Prpič 2010-10-08 11:42:44 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Previously, on a 2 node MRG cluster, one of the clients running on it could core after shutting down a single network interface on one of the broker nodes. With this update, clients no longer core.

Comment 17 errata-xmlrpc 2010-10-14 16:07:31 UTC
An advisory has been issued which should help resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html