Bug 591292
Field | Value
---|---
Summary | MRG-M Heartbeat causes core
Product | Red Hat Enterprise MRG
Component | qpid-cpp
Version | 1.2
Status | CLOSED ERRATA
Severity | high
Priority | high
Reporter | Scott Spurrier <spurrier>
Assignee | Gordon Sim <gsim>
QA Contact | Frantisek Reznicek <freznice>
CC | esammons, freznice, gsim, tao
Target Milestone | 1.3
Hardware | All
OS | Linux
Doc Type | Bug Fix
Doc Text | Previously, on a two-node MRG cluster, a client connected to the cluster could dump core after a single network interface was shut down on one of the broker nodes. With this update, clients no longer dump core in this situation.
Last Closed | 2010-10-14 16:07:31 UTC
Bug Blocks | 510475
Description
Scott Spurrier
2010-05-11 19:48:56 UTC
Created attachment 413243: Stack Trace
Created attachment 419069: test config
Created attachment 419070: amqpTest.latest
Created attachment 419071: amqpTest_src_v1.2.2
Created attachment 421304: another reproducer
This program starts a configurable number of connections, each with heartbeats turned on, that send and receive messages at random intervals. The first command-line argument (if present) specifies the number of connections; the second defines the base name for the queues (e.g. ./test 10 abc will create 10 connections using queues abc_1 through abc_10). The heartbeat interval (2 secs by default) and the port and host of the initial broker can also be specified.
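The queue naming described above can be sketched as follows. This is only an illustration of the scheme implied by the `./test 10 abc` example; the function name `queue_names` is hypothetical and is not taken from the attached reproducer.

```python
from typing import List


def queue_names(base: str, connections: int) -> List[str]:
    """Return one queue name per connection: base_1 .. base_N.

    Hypothetical helper mirroring the naming in the report's example
    (./test 10 abc uses queues abc_1 through abc_10).
    """
    return [f"{base}_{i}" for i in range(1, connections + 1)]
```

For instance, `queue_names("abc", 10)` yields ten names starting at `abc_1` and ending at `abc_10`.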
I used it to reproduce the problem by starting several processes, each with a different number of connections to a two-node cluster, and then periodically forcing failover from one node to the other: I sent the first node a STOP signal (with kill) and, after a short wait, a CONT signal to let it continue. This emulates failover due to a timeout, as would be observed with the loss of the network connection.
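A minimal sketch of that stop/continue cycle, assuming the broker's process ID is already known. The function names, the number of cycles and the durations are all assumptions; the report only says "a short wait", which presumably has to be long enough for the clients' heartbeat timeout to expire.

```python
import os
import signal
import time


def pause_and_resume(pid: int, pause_seconds: float) -> None:
    """Suspend process `pid`, wait, then let it continue."""
    os.kill(pid, signal.SIGSTOP)      # same effect as: kill -STOP <pid>
    try:
        time.sleep(pause_seconds)     # the process is unresponsive in this window
    finally:
        os.kill(pid, signal.SIGCONT)  # same effect as: kill -CONT <pid>


def force_failovers(pid: int, cycles: int,
                    pause_seconds: float, gap_seconds: float) -> int:
    """Repeat the stop/continue cycle; return the number of cycles completed.

    All timing values are chosen by the caller; none come from the report.
    """
    done = 0
    for _ in range(cycles):
        pause_and_resume(pid, pause_seconds)
        done += 1
        time.sleep(gap_seconds)       # let clients settle before the next cycle
    return done
```

The `finally` clause sends CONT even if the wait is interrupted, so the paused process is not left suspended.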
reproducer hint: A higher number of connections in the same process seems to speed up reproducing the error (e.g. ./test 100 cores pretty frequently on failover).

Though much more rare, this is still an issue on latest packages (qpid-cpp-client-0.7.946106-2):

Core was generated by `./test 1 d'.
Program terminated with signal 6, Aborted.
[New process 9569]
[New process 8382]
[New process 10221]
[New process 9552]
[New process 9536]
[New process 9527]
#0 0x00000038f9c30265 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00000038f9c30265 in raise () from /lib64/libc.so.6
#1 0x00000038f9c31d10 in abort () from /lib64/libc.so.6
#2 0x00000038fccbec44 in __gnu_cxx::__verbose_terminate_handler () from /usr/lib64/libstdc++.so.6
#3 0x00000038fccbcdb6 in ?? () from /usr/lib64/libstdc++.so.6
#4 0x00000038fccbcde3 in std::terminate () from /usr/lib64/libstdc++.so.6
#5 0x00000038fccbd2ef in __cxa_pure_virtual () from /usr/lib64/libstdc++.so.6
#6 0x00000035b6260e76 in fire (this=<value optimized out>) at qpid/client/ConnectionImpl.cpp:152
#7 0x00000035b5dfaf60 in qpid::sys::Timer::run (this=<value optimized out>) at qpid/sys/Timer.cpp:119
#8 0x00000035b5d238da in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#9 0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#10 0x00000038f9cd3c2d in clone () from /lib64/libc.so.6
(gdb) thread apply all bt

Thread 6 (process 9527):
#0 0x00000038fa407955 in pthread_join () from /lib64/libpthread.so.0
#1 0x00000035b5d23e2d in qpid::sys::Thread::join (this=<value optimized out>) at qpid/sys/posix/Thread.cpp:70
#2 0x000000000040893d in std::for_each<boost::void_ptr_iterator<__gnu_cxx::__normal_iterator<void**, std::vector<void*, std::allocator<void*> > >, Test>, boost::_bi::bind_t<void, boost::_mfi::mf0<void, Test>, boost::_bi::list1<boost::arg<1> > > > (__first={iter_ = {_M_current = 0x0}}, __last={iter_ = {_M_current = 0x2540}}, __f={f_ = {f_ = 0x409790 <Test::join()>}, l_ = {a1_ = {<No data fields>}}}) at /usr/include/boost/bind/mem_fn_template.hpp:55
#3 0x000000000040854a in main (argc=299481680, argv=0x7fff9a063228) at test.cpp:127

Thread 5 (process 9536):
#0 0x00000038fa40ad09 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000035b62a2091 in qpid::client::StateManager::waitFor (this=<value optimized out>, desired=<value optimized out>) at ../include/qpid/sys/posix/Condition.h:63
#2 0x00000035b6259411 in qpid::client::ConnectionHandler::waitForOpen (this=<value optimized out>) at qpid/client/ConnectionHandler.cpp:144
#3 0x00000035b6266648 in qpid::client::ConnectionImpl::open (this=<value optimized out>) at qpid/client/ConnectionImpl.cpp:282
#4 0x00000035b6258115 in qpid::client::Connection::open (this=<value optimized out>, settings=<value optimized out>) at qpid/client/Connection.cpp:125
#5 0x00000035b627f4f1 in qpid::client::FailoverManager::attempt (this=<value optimized out>, c=<value optimized out>, s=<value optimized out>) at qpid/client/FailoverManager.cpp:121
#6 0x00000035b628047c in qpid::client::FailoverManager::attempt (this=<value optimized out>, c=<value optimized out>, s=<value optimized out>, urls=<value optimized out>) at qpid/client/FailoverManager.cpp:111
#7 0x00000035b6280af5 in qpid::client::FailoverManager::connect (this=<value optimized out>, brokers=<value optimized out>) at qpid/client/FailoverManager.cpp:82
#8 0x00000035b628128b in qpid::client::FailoverManager::execute (this=<value optimized out>, c=<value optimized out>) at qpid/client/FailoverManager.cpp:44
#9 0x00000000004067bd in Test::run (this=0x11d9ba50) at test.cpp:97
#10 0x00000035b5d238da in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#11 0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#12 0x00000038f9cd3c2d in clone () from /lib64/libc.so.6

Thread 4 (process 9552):
#0 0x00000038f9cd4018 in epoll_wait () from /lib64/libc.so.6
#1 0x00000035b5d2c33f in qpid::sys::Poller::wait (this=<value optimized out>, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:524
#2 0x00000035b5d2cd22 in qpid::sys::Poller::run (this=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:479
#3 0x00000035b5d238da in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#4 0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#5 0x00000038f9cd3c2d in clone () from /lib64/libc.so.6

Thread 3 (process 10221):
#0 0x00000038f9cd4018 in epoll_wait () from /lib64/libc.so.6
#1 0x00000035b5d2c33f in qpid::sys::Poller::wait (this=<value optimized out>, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:524
#2 0x00000035b5d2cd22 in qpid::sys::Poller::run (this=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:479
#3 0x00000035b5d238da in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#4 0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#5 0x00000038f9cd3c2d in clone () from /lib64/libc.so.6

Thread 2 (process 8382):
#0 0x00000038f9cd4018 in epoll_wait () from /lib64/libc.so.6
#1 0x00000035b5d2c33f in qpid::sys::Poller::wait (this=<value optimized out>, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:524
#2 0x00000035b5d2cd22 in qpid::sys::Poller::run (this=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:479
#3 0x00000035b5d238da in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#4 0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#5 0x00000038f9cd3c2d in clone () from /lib64/libc.so.6

Thread 1 (process 9569):
#0 0x00000038f9c30265 in raise () from /lib64/libc.so.6
#1 0x00000038f9c31d10 in abort () from /lib64/libc.so.6
#2 0x00000038fccbec44 in __gnu_cxx::__verbose_terminate_handler () from /usr/lib64/libstdc++.so.6
#3 0x00000038fccbcdb6 in ?? () from /usr/lib64/libstdc++.so.6
#4 0x00000038fccbcde3 in std::terminate () from /usr/lib64/libstdc++.so.6
#5 0x00000038fccbd2ef in __cxa_pure_virtual () from /usr/lib64/libstdc++.so.6
#6 0x00000035b6260e76 in fire (this=<value optimized out>) at qpid/client/ConnectionImpl.cpp:152
#7 0x00000035b5dfaf60 in qpid::sys::Timer::run (this=<value optimized out>) at qpid/sys/Timer.cpp:119
#8 0x00000035b5d238da in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#9 0x00000038fa406617 in start_thread () from /lib64/libpthread.so.0
#10 0x00000038f9cd3c2d in clone () from /lib64/libc.so.6

*** Bug 602268 has been marked as a duplicate of this bug. ***

Fixed on trunk (r953032) and in release repo (http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=079143676f0881d15138059d328a5531ef6a307e).

Further fix applied to trunk (953610) and release repo (http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=6227acefa2c27ad6b66ae7616e0b738ba3f9a754).

The issue has been fixed, tested on RHEL 4.8 / 5.5 i386 / x86_64 on packages:
python-qmf-0.7.946106-3.el5
python-qpid-0.7.946106-1.el5
qmf-0.7.946106-3.el5
qmf-devel-0.7.946106-3.el5
qpid-cpp-client-0.7.946106-3.el5
qpid-cpp-client-devel-0.7.946106-3.el5
qpid-cpp-client-devel-docs-0.7.946106-3.el5
qpid-cpp-client-rdma-0.7.946106-3.el5
qpid-cpp-client-ssl-0.7.946106-3.el5
qpid-cpp-mrg-debuginfo-0.7.946106-3.el5
qpid-cpp-server-0.7.946106-3.el5
qpid-cpp-server-cluster-0.7.946106-3.el5
qpid-cpp-server-devel-0.7.946106-3.el5
qpid-cpp-server-rdma-0.7.946106-3.el5
qpid-cpp-server-ssl-0.7.946106-3.el5
qpid-cpp-server-store-0.7.946106-3.el5
qpid-cpp-server-xml-0.7.946106-3.el5
qpid-java-client-0.7.946106-3.el5
qpid-java-common-0.7.946106-3.el5
qpid-tools-0.7.946106-4.el5
rh-qpid-cpp-tests-0.7.946106-3.el5
ruby-qmf-0.7.946106-3.el5
-> VERIFIED

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Previously, on a two-node MRG cluster, a client connected to the cluster could dump core after a single network interface was shut down on one of the broker nodes. With this update, clients no longer dump core in this situation.

An advisory has been issued which should help with the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2010-0773.html