Bug 625540 - cluster safe assertion from within qpid::broker::SemanticState::attached()
cluster safe assertion from within qpid::broker::SemanticState::attached()
Status: CLOSED NOTABUG
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp (Show other bugs)
beta
All Linux
high Severity medium
: 1.4
: ---
Assigned To: Alan Conway
MRG Quality Engineering
:
: 634168 (view as bug list)
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2010-08-19 14:48 EDT by Gordon Sim
Modified: 2010-09-15 10:10 EDT (History)
2 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-08-23 17:07:43 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
broker log file 1 (18.07 KB, text/html)
2010-08-23 12:16 EDT, Gordon Sim
no flags Details
broker log file 2 (26.16 KB, text/html)
2010-08-23 12:16 EDT, Gordon Sim
no flags Details
client log file 1 (21.78 KB, text/html)
2010-08-23 12:17 EDT, Gordon Sim
no flags Details
client log file 2 (3.48 KB, text/html)
2010-08-23 12:17 EDT, Gordon Sim
no flags Details

  None (edit)
Description Gordon Sim 2010-08-19 14:48:45 EDT
(gdb) thread apply all bt

Thread 6 (Thread 27593):
#0  0x00000034de4d4108 in epoll_wait () from /lib64/libc.so.6
#1  0x00000034df52bb81 in qpid::sys::Poller::wait(qpid::sys::Duration) () from /usr/lib64/libqpidcommon.so.3
#2  0x00000034df52c617 in qpid::sys::Poller::run() () from /usr/lib64/libqpidcommon.so.3
#3  0x00000034e190ed62 in qpid::broker::Broker::run() () from /usr/lib64/libqpidbroker.so.3
#4  0x00000000004093f4 in ?? ()
#5  0x00000034e1930d9e in qpid::broker::Daemon::fork() () from /usr/lib64/libqpidbroker.so.3
#6  0x0000000000406c65 in ?? ()
#7  0x00000000004055af in std::ios_base::Init::~Init() ()
#8  0x00000034de41d994 in __libc_start_main () from /lib64/libc.so.6
#9  0x0000000000405019 in std::ios_base::Init::~Init() ()
#10 0x00007fffddbc45e8 in ?? ()
#11 0x0000000000000000 in ?? ()

Thread 5 (Thread 27594):
#0  0x00000034dec0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000034df5fafdf in qpid::sys::Timer::run() () from /usr/lib64/libqpidcommon.so.3
#2  0x00000034df52286a in ?? () from /usr/lib64/libqpidcommon.so.3
#3  0x00000034dec0673d in start_thread () from /lib64/libpthread.so.0
#4  0x00000034de4d3d1d in clone () from /lib64/libc.so.6

Thread 4 (Thread 27596):
#0  0x00000034dec0b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000034df5fafdf in qpid::sys::Timer::run() () from /usr/lib64/libqpidcommon.so.3
#2  0x00000034df52286a in ?? () from /usr/lib64/libqpidcommon.so.3
#3  0x00000034dec0673d in start_thread () from /lib64/libpthread.so.0
#4  0x00000034de4d3d1d in clone () from /lib64/libc.so.6

Thread 3 (Thread 27680):
#0  0x00000034dec0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002b9a1e8ba52d in qpid::sys::Waitable::wait() () from /usr/lib64/libqpidclient.so.3
#2  0x00002b9a1e8b5719 in qpid::client::SessionImpl::open(unsigned int) () from /usr/lib64/libqpidclient.so.3
#3  0x00002b9a1e87f1d8 in qpid::client::ConnectionImpl::newSession(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, unsigned short) () from /usr/lib64/libqpidclient.so.3
#4  0x00002b9a1e8732c9 in qpid::client::Connection::newSession(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int) () from /usr/lib64/libqpidclient.so.3
#5  0x00002b9a1e1bf593 in qpid::cluster::UpdateClient::run() () from /usr/lib64/qpid/daemon/cluster.so
#6  0x00000034df52286a in ?? () from /usr/lib64/libqpidcommon.so.3
#7  0x00000034dec0673d in start_thread () from /lib64/libpthread.so.0
#8  0x00000034de4d3d1d in clone () from /lib64/libc.so.6

Thread 2 (Thread 27681):
#0  0x00000034de4c60ab in write () from /lib64/libc.so.6
#1  0x00000034df51443f in qpid::sys::Socket::write(void const*, unsigned long) const () from /usr/lib64/libqpidcommon.so.3
#2  0x00000034df51e3eb in qpid::sys::posix::AsynchIO::writeable(qpid::sys::DispatchHandle&) ()
   from /usr/lib64/libqpidcommon.so.3
#3  0x00000034df5f4b27 in boost::function1<void, qpid::sys::DispatchHandle&, std::allocator<boost::function_base> >::operator()(qpid::sys::DispatchHandle&) const () from /usr/lib64/libqpidcommon.so.3
#4  0x00000034df5f4114 in qpid::sys::DispatchHandle::processEvent(qpid::sys::Poller::EventType) ()
   from /usr/lib64/libqpidcommon.so.3
#5  0x00000034df52c644 in qpid::sys::Poller::run() () from /usr/lib64/libqpidcommon.so.3
#6  0x00000034df52286a in ?? () from /usr/lib64/libqpidcommon.so.3
#7  0x00000034dec0673d in start_thread () from /lib64/libpthread.so.0
#8  0x00000034de4d3d1d in clone () from /lib64/libc.so.6

Thread 1 (Thread 27597):
#0  0x00000034de430265 in raise () from /lib64/libc.so.6
#1  0x00000034de431d10 in abort () from /lib64/libc.so.6
#2  0x00000034df5f13d4 in qpid::sys::assertClusterSafe() () from /usr/lib64/libqpidcommon.so.3
#3  0x00000034e19bce87 in qpid::broker::SemanticState::attached() () from /usr/lib64/libqpidbroker.so.3
#4  0x00000034e19dc7a9 in qpid::broker::SessionState::readyToSend() () from /usr/lib64/libqpidbroker.so.3
#5  0x00000034df5be504 in qpid::amqp_0_10::SessionHandler::attach(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) () from /usr/lib64/libqpidcommon.so.3
#6  0x00000034df57b50d in qpid::framing::AMQP_AllOperations::SessionHandler::Invoker::visit(qpid::framing::SessionAttachBody const&) () from /usr/lib64/libqpidcommon.so.3
#7  0x00000034df5be9fe in qpid::framing::Invoker::Result qpid::framing::invoke<qpid::amqp_0_10::SessionHandler>(qpid::amqp_0_10::SessionHandler&, qpid::framing::AMQMethodBody const&) () from /usr/lib64/libqpidcommon.so.3
#8  0x00000034df5b9d83 in qpid::amqp_0_10::SessionHandler::invoke(qpid::framing::AMQMethodBody const&) ()
   from /usr/lib64/libqpidcommon.so.3
#9  0x00000034df5bcdfa in qpid::amqp_0_10::SessionHandler::handleIn(qpid::framing::AMQFrame&) ()
   from /usr/lib64/libqpidcommon.so.3
#10 0x00000034e1924e29 in qpid::broker::Connection::received(qpid::framing::AMQFrame&) () from /usr/lib64/libqpidbroker.so.3
#11 0x00002b9a1e1adfdc in qpid::cluster::Connection::received(qpid::framing::AMQFrame&) ()
   from /usr/lib64/qpid/daemon/cluster.so
#12 0x00002b9a1e1ac021 in qpid::cluster::Connection::decode(char const*, unsigned long) ()
   from /usr/lib64/qpid/daemon/cluster.so
#13 0x00000034df5ef5f2 in qpid::sys::AsynchIOHandler::readbuff(qpid::sys::AsynchIO&, qpid::sys::AsynchIOBufferBase*) ()
   from /usr/lib64/libqpidcommon.so.3
#14 0x00000034df5202aa in boost::function2<void, qpid::sys::AsynchIO&, qpid::sys::AsynchIOBufferBase*, std::allocator<boost::function_base> >::operator()(qpid::sys::AsynchIO&, qpid::sys::AsynchIOBufferBase*) const ()
   from /usr/lib64/libqpidcommon.so.3
#15 0x00000034df51da13 in qpid::sys::posix::AsynchIO::readable(qpid::sys::DispatchHandle&) ()
   from /usr/lib64/libqpidcommon.so.3
#16 0x00000034df5f4b27 in boost::function1<void, qpid::sys::DispatchHandle&, std::allocator<boost::function_base> >::operator()(qpid::sys::DispatchHandle&) const () from /usr/lib64/libqpidcommon.so.3
#17 0x00000034df5f4300 in qpid::sys::DispatchHandle::processEvent(qpid::sys::Poller::EventType) ()
   from /usr/lib64/libqpidcommon.so.3
#18 0x00000034df52c644 in qpid::sys::Poller::run() () from /usr/lib64/libqpidcommon.so.3
#19 0x00000034df52286a in ?? () from /usr/lib64/libqpidcommon.so.3
#20 0x00000034dec0673d in start_thread () from /lib64/libpthread.so.0
#21 0x00000034de4d3d1d in clone () from /lib64/libc.so.6
(gdb)
Comment 1 Gordon Sim 2010-08-19 14:49:49 EDT
Reproducer not yet known, stack traces reported as above however against first beta packages.
Comment 2 Alan Conway 2010-08-23 12:09:28 EDT
The stack trace shows that the broker is connecting to give an update to a new broker, while at the same it is receiving an update connection. I don't know how this could come about. Are there any log files available or a description of what was happening at the time of the core dump?

It would make sense for a broker to refuse the erroneous update connection in this situation, if there's no more information available. I can make that change.
Comment 3 Gordon Sim 2010-08-23 12:16:21 EDT
Created attachment 440424 [details]
broker log file 1
Comment 4 Gordon Sim 2010-08-23 12:16:48 EDT
Created attachment 440425 [details]
broker log file 2
Comment 5 Gordon Sim 2010-08-23 12:17:20 EDT
Created attachment 440426 [details]
client log file 1
Comment 6 Gordon Sim 2010-08-23 12:17:42 EDT
Created attachment 440427 [details]
client log file 2
Comment 7 Gordon Sim 2010-08-23 12:20:33 EDT
The reproducer is as attached here: https://bugzilla.redhat.com/attachment.cgi?id=436252

Additionally the following information was given by the user:

"This client run was using the 'failover_exchange' cluster failover
method.

Just to confirm, the sequence of events for these logs was as follows:
- start broker 1
- start broker 2
- start client app, confirm messages are being sent/received (the
'Sent:/got message' logging)
- stop broker 1
- broker 2 becomes primary, app successfully fails over to broker 2 and
continues sending/receiving
- start broker 1. Note I did not clean broker 1's state files before
starting - I don't believe this should be necessary though.

At this point broker 2 crashes with the attached backtrace, and the
client comes to a halt. I took a number of thread dumps at different
points in the lifecyle, which are all in the client log file
(pubsubtest.out.0.tgz). Eventually the client gives up and exits. Broker
1 appears to start, but is unusable:

- tried to re-run the test client app. It is unable to connect to broker
1. Client logs inc. stack traces attached. (pubsubtest.out.1.tgz)

- manually stop broker 1.
"

The brokers are I believe running in VMs, not sure if that is relevant.
Comment 8 Alan Conway 2010-08-23 12:55:07 EDT
It appears that the broker's cluster-url has been specified incorrectly. From the log:

2010-08-23 12:11:31 notice cluster(10.34.22.65:8356 UPDATER) sending update to 10.34.22.64:16813 at amqp:tcp:10.34.22.65:5678,tcp:10.34.22.64:5678

Note that the URL includes _both_ brokers addresses. This is causing the updater broker to connect to itself rather than the new broker, causing the crash.

Each brokers cluster-url should specify only the address(es) of that one broker. The brokers collaborate to provide the clients with the URLs of all brokers in the cluster.

I will change the broker to refuse the erroneous catch-up connection attempt with a sensible error message rather than crash, and check if the documentation for cluster-url needs clarification. I've moved the bug to 1.4, I don't think its critical for the 1.3 release.
Comment 9 Alan Conway 2010-08-23 17:07:43 EDT
Additional error checking comitted ontrunk r988312

    Check for and abort invalid catchup connections.

    Detect attempt to make a catch-up connection while we are not expecting an update.
Comment 10 ppecka 2010-09-15 10:10:35 EDT
*** Bug 634168 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.