Bug 591292 - MRG-M Heartbeat causes core
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.2
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 1.3
Assigned To: Gordon Sim
QA Contact: Frantisek Reznicek
Duplicates: 602268
Depends On:
Blocks: 510475
Reported: 2010-05-11 15:48 EDT by Scott Spurrier
Modified: 2015-11-15 20:12 EST (History)
CC: 4 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, on a 2-node MRG cluster, a client could dump core after a single network interface on one of the broker nodes was shut down. With this update, clients no longer core.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-10-14 12:07:31 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Stack Trace (54.56 KB, application/x-download) - 2010-05-11 15:51 EDT, Scott Spurrier
test config (3.88 KB, application/zip) - 2010-06-02 11:18 EDT, Scott Spurrier
amqpTest.latest (2.74 MB, application/octet-stream) - 2010-06-02 11:19 EDT, Scott Spurrier
amqpTest_src_v1.2.2 (27.33 KB, application/x-gzip) - 2010-06-02 11:21 EDT, Scott Spurrier
another reproducer (3.94 KB, text/x-c++src) - 2010-06-04 13:31 EDT, Gordon Sim


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2010:0773 normal SHIPPED_LIVE Moderate: Red Hat Enterprise MRG Messaging and Grid Version 1.3 2010-10-14 11:56:44 EDT

Description Scott Spurrier 2010-05-11 15:48:56 EDT
Description of problem:
While testing client failover on a 2 node MRG cluster, the client cored after taking a single network interface down on one of the broker nodes.

How reproducible:
Sporadic client cores.

Steps to Reproduce:
1. Start 25 clients that each send a message to the broker cluster every 5 seconds, with the session heartbeat set to 2 seconds.
2. Start 5 clients that receive the messages, also with the session heartbeat set to 2 seconds.
3. Select one of the broker nodes and shut down its public network interface (ifdown eth0).
4. Watch for any client cores. If there are no cores, bring the public interface back up.
5. On the other broker node, shut down its public interface (ifdown eth0).
6. Look for client cores; if none exist, repeat steps 1 - 5 until a client cores.
Comment 1 Scott Spurrier 2010-05-11 15:51:45 EDT
Created attachment 413243 [details]
Stack Trace
Comment 3 Scott Spurrier 2010-06-02 11:18:47 EDT
Created attachment 419069 [details]
test config
Comment 4 Scott Spurrier 2010-06-02 11:19:49 EDT
Created attachment 419070 [details]
amqpTest.latest
Comment 5 Scott Spurrier 2010-06-02 11:21:28 EDT
Created attachment 419071 [details]
amqpTest_src_v1.2.2
Comment 7 Gordon Sim 2010-06-04 13:31:12 EDT
Created attachment 421304 [details]
another reproducer

This program starts a variable number of connections that intermittently send and receive messages at random intervals, with heartbeats turned on. The first command line argument (if present) specifies the number of connections; the second defines the base name for the queues (e.g. ./test 10 abc will create 10 connections using queues abc_1 through abc_10). The heartbeat interval (2 seconds by default), and the port and host of the initial broker, can also be specified.

I used it to reproduce the problem by starting several processes, each with a different number of connections to a two node cluster, and then periodically forcing failover from one node to the other by sending the first node a STOP signal (with kill) and, after a short wait, a CONT signal to let it continue. This emulates failover due to a timeout, as would be observed on loss of the network connection.
Comment 8 Gordon Sim 2010-06-04 13:48:54 EDT
reproducer hint: A higher number of connections in the same process seems to speed up reproduction of the error (e.g. ./test 100 cores pretty frequently on failover)
Comment 10 Gordon Sim 2010-06-09 07:42:23 EDT
Though much rarer, this is still an issue on the latest packages (qpid-cpp-client-0.7.946106-2):

Core was generated by `./test 1 d'.
Program terminated with signal 6, Aborted.
[New process 9569]
[New process 8382]
[New process 10221]
[New process 9552]
[New process 9536]
[New process 9527]
#0  0x00000038f9c30265 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00000038f9c30265 in raise () from /lib64/libc.so.6
#1  0x00000038f9c31d10 in abort () from /lib64/libc.so.6
#2  0x00000038fccbec44 in __gnu_cxx::__verbose_terminate_handler () from /usr/lib64/libstdc++.so.6
#3  0x00000038fccbcdb6 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x00000038fccbcde3 in std::terminate () from /usr/lib64/libstdc++.so.6
#5  0x00000038fccbd2ef in __cxa_pure_virtual () from /usr/lib64/libstdc++.so.6
#6  0x00000035b6260e76 in fire (this=<value optimized out>) at qpid/client/ConnectionImpl.cpp:152
#7  0x00000035b5dfaf60 in qpid::sys::Timer::run (this=<value optimized out>) at qpid/sys/Timer.cpp:119
#8  0x00000035b5d238da in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#9  0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#10 0x00000038f9cd3c2d in clone () from /lib64/libc.so.6
(gdb) thread apply all bt

Thread 6 (process 9527):
#0  0x00000038fa407955 in pthread_join () from /lib64/libpthread.so.0
#1  0x00000035b5d23e2d in qpid::sys::Thread::join (this=<value optimized out>) at qpid/sys/posix/Thread.cpp:70
#2  0x000000000040893d in std::for_each<boost::void_ptr_iterator<__gnu_cxx::__normal_iterator<void**, std::vector<void*, std::allocator<void*> > >, Test>, boost::_bi::bind_t<void, boost::_mfi::mf0<void, Test>, boost::_bi::list1<boost::arg<1> > > > (__first={iter_ = {_M_current = 0x0}}, __last=
        {iter_ = {_M_current = 0x2540}}, __f={f_ = {f_ = 0x409790 <Test::join()>}, l_ = {a1_ = {<No data fields>}}})
    at /usr/include/boost/bind/mem_fn_template.hpp:55
#3  0x000000000040854a in main (argc=299481680, argv=0x7fff9a063228) at test.cpp:127

Thread 5 (process 9536):
#0  0x00000038fa40ad09 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000035b62a2091 in qpid::client::StateManager::waitFor (this=<value optimized out>, desired=<value optimized out>)
    at ../include/qpid/sys/posix/Condition.h:63
#2  0x00000035b6259411 in qpid::client::ConnectionHandler::waitForOpen (this=<value optimized out>) at qpid/client/ConnectionHandler.cpp:144
#3  0x00000035b6266648 in qpid::client::ConnectionImpl::open (this=<value optimized out>) at qpid/client/ConnectionImpl.cpp:282
#4  0x00000035b6258115 in qpid::client::Connection::open (this=<value optimized out>, settings=<value optimized out>)
    at qpid/client/Connection.cpp:125
#5  0x00000035b627f4f1 in qpid::client::FailoverManager::attempt (this=<value optimized out>, c=<value optimized out>, s=<value optimized out>)
    at qpid/client/FailoverManager.cpp:121
#6  0x00000035b628047c in qpid::client::FailoverManager::attempt (this=<value optimized out>, c=<value optimized out>, s=<value optimized out>, 
    urls=<value optimized out>) at qpid/client/FailoverManager.cpp:111
#7  0x00000035b6280af5 in qpid::client::FailoverManager::connect (this=<value optimized out>, brokers=<value optimized out>)
    at qpid/client/FailoverManager.cpp:82
#8  0x00000035b628128b in qpid::client::FailoverManager::execute (this=<value optimized out>, c=<value optimized out>)
    at qpid/client/FailoverManager.cpp:44
#9  0x00000000004067bd in Test::run (this=0x11d9ba50) at test.cpp:97
#10 0x00000035b5d238da in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#11 0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#12 0x00000038f9cd3c2d in clone () from /lib64/libc.so.6

Thread 4 (process 9552):
#0  0x00000038f9cd4018 in epoll_wait () from /lib64/libc.so.6
#1  0x00000035b5d2c33f in qpid::sys::Poller::wait (this=<value optimized out>, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:524
#2  0x00000035b5d2cd22 in qpid::sys::Poller::run (this=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:479
#3  0x00000035b5d238da in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#4  0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#5  0x00000038f9cd3c2d in clone () from /lib64/libc.so.6

Thread 3 (process 10221):
#0  0x00000038f9cd4018 in epoll_wait () from /lib64/libc.so.6
#1  0x00000035b5d2c33f in qpid::sys::Poller::wait (this=<value optimized out>, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:524
#2  0x00000035b5d2cd22 in qpid::sys::Poller::run (this=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:479
#3  0x00000035b5d238da in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#4  0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#5  0x00000038f9cd3c2d in clone () from /lib64/libc.so.6

Thread 2 (process 8382):
#0  0x00000038f9cd4018 in epoll_wait () from /lib64/libc.so.6
#1  0x00000035b5d2c33f in qpid::sys::Poller::wait (this=<value optimized out>, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:524
#2  0x00000035b5d2cd22 in qpid::sys::Poller::run (this=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:479
#3  0x00000035b5d238da in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#4  0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#5  0x00000038f9cd3c2d in clone () from /lib64/libc.so.6

Thread 1 (process 9569):
#0  0x00000038f9c30265 in raise () from /lib64/libc.so.6
#1  0x00000038f9c31d10 in abort () from /lib64/libc.so.6
#2  0x00000038fccbec44 in __gnu_cxx::__verbose_terminate_handler () from /usr/lib64/libstdc++.so.6
#3  0x00000038fccbcdb6 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x00000038fccbcde3 in std::terminate () from /usr/lib64/libstdc++.so.6
#5  0x00000038fccbd2ef in __cxa_pure_virtual () from /usr/lib64/libstdc++.so.6
#6  0x00000035b6260e76 in fire (this=<value optimized out>) at qpid/client/ConnectionImpl.cpp:152
#7  0x00000035b5dfaf60 in qpid::sys::Timer::run (this=<value optimized out>) at qpid/sys/Timer.cpp:119
#8  0x00000035b5d238da in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#9  0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#10 0x00000038f9cd3c2d in clone () from /lib64/libc.so.6
(gdb)
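Frames #5-#7 of threads 1 and 6 (Timer::run -> fire -> __cxa_pure_virtual) suggest a lifetime race: the timer thread fires the heartbeat task while the object it dispatches into is partway through destruction, so the virtual call lands on the pure-virtual stub and std::terminate is invoked. A minimal sketch of one way such a task can be made safe, using a weak_ptr guard (illustrative names only; this is not the actual qpid fix):

```cpp
#include <memory>

// Stand-in for the connection object the timer task calls into.
struct Connection {
    int heartbeatsSent = 0;
    void sendHeartbeat() { ++heartbeatsSent; }
};

// A timer task that does not keep the connection alive and safely
// skips firing once the connection has been destroyed.
struct HeartbeatTask {
    std::weak_ptr<Connection> conn;  // non-owning reference
    // Returns true if a heartbeat was actually sent.
    bool fire() {
        if (auto c = conn.lock()) {  // connection still alive?
            c->sendHeartbeat();
            return true;
        }
        return false;  // connection gone; no-op instead of crashing
    }
};
```

With this pattern, a task that outlives its connection simply does nothing when the timer fires, rather than dispatching through a dangling vtable.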
Comment 11 Gordon Sim 2010-06-09 10:05:09 EDT
*** Bug 602268 has been marked as a duplicate of this bug. ***
Comment 12 Gordon Sim 2010-06-09 11:19:31 EDT
Fixed on trunk (r953032) and in release repo (http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=079143676f0881d15138059d328a5531ef6a307e).
Comment 13 Gordon Sim 2010-06-11 05:32:22 EDT
Further fix applied to trunk (953610) and release repo (http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=6227acefa2c27ad6b66ae7616e0b738ba3f9a754).
Comment 14 Frantisek Reznicek 2010-06-17 06:25:10 EDT
The issue has been fixed; tested on RHEL 4.8 / 5.5 i386 / x86_64 with the following packages:
python-qmf-0.7.946106-3.el5
python-qpid-0.7.946106-1.el5
qmf-0.7.946106-3.el5
qmf-devel-0.7.946106-3.el5
qpid-cpp-client-0.7.946106-3.el5
qpid-cpp-client-devel-0.7.946106-3.el5
qpid-cpp-client-devel-docs-0.7.946106-3.el5
qpid-cpp-client-rdma-0.7.946106-3.el5
qpid-cpp-client-ssl-0.7.946106-3.el5
qpid-cpp-mrg-debuginfo-0.7.946106-3.el5
qpid-cpp-server-0.7.946106-3.el5
qpid-cpp-server-cluster-0.7.946106-3.el5
qpid-cpp-server-devel-0.7.946106-3.el5
qpid-cpp-server-rdma-0.7.946106-3.el5
qpid-cpp-server-ssl-0.7.946106-3.el5
qpid-cpp-server-store-0.7.946106-3.el5
qpid-cpp-server-xml-0.7.946106-3.el5
qpid-java-client-0.7.946106-3.el5
qpid-java-common-0.7.946106-3.el5
qpid-tools-0.7.946106-4.el5
rh-qpid-cpp-tests-0.7.946106-3.el5
ruby-qmf-0.7.946106-3.el5

-> VERIFIED
Comment 15 Martin Prpič 2010-10-08 07:42:44 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, on a 2-node MRG cluster, a client could dump core after a single network interface on one of the broker nodes was shut down. With this update, clients no longer core.
Comment 17 errata-xmlrpc 2010-10-14 12:07:31 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html
