Bug 619765 - broker shut itself down
Summary: broker shut itself down
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: beta
Hardware: All
OS: Linux
urgent
medium
Target Milestone: 1.3
: ---
Assignee: Gordon Sim
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2010-07-30 13:30 UTC by Graham Biswell
Modified: 2014-08-15 01:43 UTC (History)
5 users (show)

Fixed In Version: 0.10
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-02-24 16:37:23 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Last 100K lines of trace logging from each broker. (720.00 KB, application/x-tar)
2010-07-30 13:31 UTC, Graham Biswell
no flags Details
core from amqb01 (10.27 MB, application/x-gzip)
2010-07-30 14:10 UTC, Graham Biswell
no flags Details
core from amqb02 (7.97 MB, application/x-gzip)
2010-07-30 14:13 UTC, Graham Biswell
no flags Details
Proposed fix (3.08 KB, patch)
2010-07-30 17:56 UTC, Gordon Sim
aconway: review-
Details | Diff
small reproducer for this issue (4.41 KB, application/octet-stream)
2010-08-03 11:38 UTC, Graham Biswell
no flags Details

Description Graham Biswell 2010-07-30 13:30:19 UTC
Description of problem:

MRG 1.3 beta c++ broker, shut itself down with the following message:

"critical Modified cluster state outside of cluster context"

This triggered a failover to the remaining node, which ran for a few more seconds, then shut down with the same message.

Comment 1 Graham Biswell 2010-07-30 13:31:28 UTC
Created attachment 435556 [details]
Last 100K  lines of trace logging from each broker.

The amqb02 broker shutdown first at 12:15:13. amqb01 lasted until 12:15:21.

Comment 2 Graham Biswell 2010-07-30 14:10:37 UTC
Created attachment 435568 [details]
core from amqb01

Comment 3 Graham Biswell 2010-07-30 14:13:35 UTC
Created attachment 435569 [details]
core from amqb02

Comment 4 Gordon Sim 2010-07-30 14:26:04 UTC
From attached core:

Core was generated by `/usr/sbin/qpidd --daemon --config /etc/qpid-intg1.conf --log-to-syslog no --log'.
Program terminated with signal 6, Aborted.
[New process 29623]
[New process 29626]
[New process 29625]
[New process 29622]
#0  0x00002b717a748265 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00002b717a748265 in raise () from /lib64/libc.so.6
#1  0x00002b717a749d10 in abort () from /lib64/libc.so.6
#2  0x00002b71796b93d4 in qpid::sys::assertClusterSafe () from /usr/lib64/libqpidcommon.so.3
#3  0x00002b71791b48ae in qpid::broker::Queue::clearLVQIndex () from /usr/lib64/libqpidbroker.so.3
#4  0x00002b71791bc07b in qpid::broker::Queue::purgeExpired () from /usr/lib64/libqpidbroker.so.3
#5  0x00002b71791c867e in qpid::broker::QueueCleaner::fired () from /usr/lib64/libqpidbroker.so.3
#6  0x00002b71796c2c60 in qpid::sys::Timer::run () from /usr/lib64/libqpidcommon.so.3
#7  0x00002b71795ea86a in ?? () from /usr/lib64/libqpidcommon.so.3
#8  0x00002b717aa7673d in start_thread () from /lib64/libpthread.so.0
#9  0x00002b717a7ebd1d in readahead () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()
(gdb) thread apply all bt

Thread 4 (process 29622):
#0  0x00002b717a0154c6 in std::ostream::sentry::sentry () from /usr/lib64/libstdc++.so.6
#1  0x00002b717a0161dd in std::operator<< <std::char_traits<char> > () from /usr/lib64/libstdc++.so.6
#2  0x00002b7179687c72 in qpid::framing::operator<< () from /usr/lib64/libqpidcommon.so.3
#3  0x00002b717e2b0ad9 in qpid::cluster::operator<< () from /usr/lib64/qpid/daemon/cluster.so
#4  0x00002b717e267d97 in qpid::cluster::Cluster::processFrame () from /usr/lib64/qpid/daemon/cluster.so
#5  0x00002b717e268db4 in qpid::cluster::Cluster::deliveredFrame () from /usr/lib64/qpid/daemon/cluster.so
#6  0x00002b717e27292a in boost::function1<void, qpid::cluster::EventFrame const&, std::allocator<void> >::operator() ()
   from /usr/lib64/qpid/daemon/cluster.so
#7  0x00002b717e2789bb in qpid::cluster::PollableQueue<qpid::cluster::EventFrame>::handleBatch () from /usr/lib64/qpid/daemon/cluster.so
#8  0x00002b717e26e8b6 in boost::detail::function::function_obj_invoker1<boost::_bi::bind_t<__gnu_cxx::__normal_iterator<qpid::cluster::EventFrame const*, std::vector<qpid::cluster::EventFrame, std::allocator<qpid::cluster::EventFrame> > >, boost::_mfi::mf1<__gnu_cxx::__normal_iterator<qpid::cluster::EventFrame const*, std::vector<qpid::cluster::EventFrame, std::allocator<qpid::cluster::EventFrame> > >, qpid::cluster::PollableQueue<qpid::cluster::EventFrame>, std::vector<qpid::cluster::EventFrame, std::allocator<qpid::cluster::EventFrame> > const&>, boost::_bi::list2<boost::_bi::value<qpid::cluster::PollableQueue<qpid::cluster::EventFrame>*>, boost::arg<1> > >, __gnu_cxx::__normal_iterator<qpid::cluster::EventFrame const*, std::vector<qpid::cluster::EventFrame, std::allocator<qpid::cluster::EventFrame> > >, std::vector<qpid::cluster::EventFrame, std::allocator<qpid::cluster::EventFrame> > const&>::invoke () from /usr/lib64/qpid/daemon/cluster.so
#9  0x00002b717e272b4a in boost::function1<__gnu_cxx::__normal_iterator<qpid::cluster::EventFrame const*, std::vector<qpid::cluster::EventFrame, std::allocator<qpid::cluster::EventFrame> > >, std::vector<qpid::cluster::EventFrame, std::allocator<qpid::cluster::EventFrame> > const&, std::allocator<void> >::operator() () from /usr/lib64/qpid/daemon/cluster.so
#10 0x00002b717e2782bc in qpid::sys::PollableQueue<qpid::cluster::EventFrame>::process () from /usr/lib64/qpid/daemon/cluster.so
#11 0x00002b717e27996e in qpid::sys::PollableQueue<qpid::cluster::EventFrame>::dispatch () from /usr/lib64/qpid/daemon/cluster.so
#12 0x00002b71795f082f in boost::function1<void, qpid::sys::PollableCondition&, std::allocator<boost::function_base> >::operator() ()
   from /usr/lib64/libqpidcommon.so.3
#13 0x00002b71796bcb27 in boost::function1<void, qpid::sys::DispatchHandle&, std::allocator<boost::function_base> >::operator() ()
   from /usr/lib64/libqpidcommon.so.3
#14 0x00002b71796bc300 in qpid::sys::DispatchHandle::processEvent () from /usr/lib64/libqpidcommon.so.3
#15 0x00002b71795f4644 in qpid::sys::Poller::run () from /usr/lib64/libqpidcommon.so.3
#16 0x00002b7179140d62 in qpid::broker::Broker::run () from /usr/lib64/libqpidbroker.so.3
#17 0x00000000004093f4 in ?? ()
#18 0x00002b7179162d9e in qpid::broker::Daemon::fork () from /usr/lib64/libqpidbroker.so.3
#19 0x0000000000406c65 in ?? ()
#20 0x00000000004055af in __cxa_pure_virtual ()
#21 0x00002b717a735994 in __libc_start_main () from /lib64/libc.so.6
#22 0x0000000000405019 in __cxa_pure_virtual ()
#23 0x00007fff7d4b7db8 in ?? ()
#24 0x0000000000000000 in ?? ()

Thread 3 (process 29625):
#0  0x00002b717aa7aee9 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
---Type <return> to continue, or q <return> to quit---
#1  0x000000001cdeb8e8 in ?? ()
#2  0x0000000000000000 in ?? ()

Thread 2 (process 29626):
#0  0x00002b717a7ec108 in ?? () from /lib64/libc.so.6
#1  0x0000000000000000 in ?? ()

Thread 1 (process 29623):
#0  0x00002b717a748265 in raise () from /lib64/libc.so.6
#1  0x00002b717a749d10 in abort () from /lib64/libc.so.6
#2  0x00002b71796b93d4 in qpid::sys::assertClusterSafe () from /usr/lib64/libqpidcommon.so.3
#3  0x00002b71791b48ae in qpid::broker::Queue::clearLVQIndex () from /usr/lib64/libqpidbroker.so.3
#4  0x00002b71791bc07b in qpid::broker::Queue::purgeExpired () from /usr/lib64/libqpidbroker.so.3
#5  0x00002b71791c867e in qpid::broker::QueueCleaner::fired () from /usr/lib64/libqpidbroker.so.3
#6  0x00002b71796c2c60 in qpid::sys::Timer::run () from /usr/lib64/libqpidcommon.so.3
#7  0x00002b71795ea86a in ?? () from /usr/lib64/libqpidcommon.so.3
#8  0x00002b717aa7673d in start_thread () from /lib64/libpthread.so.0
#9  0x00002b717a7ebd1d in readahead () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()
(gdb)

Comment 5 Gordon Sim 2010-07-30 16:05:10 UTC
To reproduce send messages with a ttl to an LVQ on a clustered broker and wait until queue is purged, e.g.:

/usr/sbin/qpidd --auth no --cluster-name grs --queue-purge-interval 10

then using qpid-send util from qpid svn:

./src/tests/qpid-send --address 'my-lvq; {create: always, node:{x-declare:{arguments:{qpid.last_value_queue:1}}}}' -P qpid.LVQ_key=aaa --messages 10 --ttl 10

Cluster node fails within 10 secs.

Comment 6 Gordon Sim 2010-07-30 16:11:40 UTC
This appears to be a re-emergence of https://bugzilla.redhat.com/show_bug.cgi?id=518410; certainly the test case in the bug fails in exactly the same way.

Comment 7 Gordon Sim 2010-07-30 16:15:40 UTC
To clarify previous comment, the reproducer from 518410 causes the same symptom as *this* BZ. The symptom in the original bug was different, but the same case fails. However it does need to wait until the queue is purged and by default that time is 10 minutes which I suspect is why it was not noticed when verifying the original issue.

Comment 8 Gordon Sim 2010-07-30 17:56:57 UTC
Created attachment 435615 [details]
Proposed fix

This ensures the queue cleaner is triggered on the correct thread for a clustered broker.

Comment 9 Gordon Sim 2010-08-02 11:46:13 UTC
Fix applied to trunk (981435) and release repo (http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=2c1c64fc4341fcd3af6cf0a96012e7fe27de6fab).

Comment 10 Alan Conway 2010-08-02 14:01:49 UTC
The timer task needs to have a name so it can be recognized on re-delivery. It works now because it is the only task with the name "" (empty string). Here's a patch:

diff --git a/qpid/cpp/src/qpid/broker/QueueCleaner.cpp b/qpid/cpp/src/qpid/broker/QueueCleaner.cpp
index ed98468..bba13c1 100644
--- a/qpid/cpp/src/qpid/broker/QueueCleaner.cpp
+++ b/qpid/cpp/src/qpid/broker/QueueCleaner.cpp
@@ -39,7 +39,8 @@ void QueueCleaner::start(qpid::sys::Duration p)
     timer.add(task);
 }
 
-QueueCleaner::Task::Task(QueueCleaner& p, qpid::sys::Duration d) : sys::TimerTask(d), parent(p) {}
+QueueCleaner::Task::Task(QueueCleaner& p, qpid::sys::Duration d)
+    : sys::TimerTask(d, "QueueCleaner::fired"), parent(p) {}
 
 void QueueCleaner::Task::fire()
 {

Comment 12 Graham Biswell 2010-08-03 11:38:40 UTC
Created attachment 436252 [details]
small reproducer for this issue

"I've attached a small reproducer for the durable topic failover issue. I've been running it as follows (with the 1.3-beta MRG client libs):

- Start 2-node cluster
- Start app (should see messages being sent/received)
- stop the first broker (app usually carries on as expected)
- start the first broker

At this point the second broker (the survivor) crashes with the 'modified cluster state' error.

Last few lines of broker log & core back trace, and client errors below ..

----------------------
Broker log:

2010-08-03 11:22:31 debug Known-brokers for connection: amqp:tcp:10.34.22.65:5678,tcp:10.34.22.64:5678
2010-08-03 11:22:31 trace SENT [39383 10.34.22.65:5678]: Frame[BEbe; channel=1; {SessionAttachBody: name=qpid.cluster-update; }]
2010-08-03 11:22:31 trace cluster(10.34.22.65:20510 UPDATER) RECV 10.34.22.65:39383(10.34.22.65:20510-4 local,catchup): Frame[BEbe; channel=1; {SessionAttachBody: name=qpid.cluster-update; }]
2010-08-03 11:22:31 debug SessionState::SessionState anonymous.qpid.cluster-update: 0x86e2c20
2010-08-03 11:22:31 debug anonymous.qpid.cluster-update: attached on broker.
2010-08-03 11:22:31 debug Attached channel 1 to anonymous.qpid.cluster-update
2010-08-03 11:22:31 debug anonymous.qpid.cluster-update: ready to send, activating output.
2010-08-03 11:22:31 critical Modified cluster state outside of cluster context


------------------
Core BT:

core.qpidd.20510

(gdb) bt
#0  0x00000034de430265 in raise () from /lib64/libc.so.6
#1  0x00000034de431d10 in abort () from /lib64/libc.so.6
#2  0x00000034df5f13d4 in qpid::sys::assertClusterSafe() () from /usr/lib64/libqpidcommon.so.3
#3  0x00000034e19bce87 in qpid::broker::SemanticState::attached() () from /usr/lib64/libqpidbroker.so.3
#4  0x00000034e19dc7a9 in qpid::broker::SessionState::readyToSend() () from /usr/lib64/libqpidbroker.so.3
#5  0x00000034df5be504 in qpid::amqp_0_10::SessionHandler::attach(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) () from /usr/lib64/libqpidcommon.so.3
#6  0x00000034df57b50d in qpid::framing::AMQP_AllOperations::SessionHandler::Invoker::visit(qpid::framing::SessionAttachBody const&) () from /usr/lib64/libqpidcommon.so.3
#7  0x00000034df5be9fe in qpid::framing::Invoker::Result qpid::framing::invoke<qpid::amqp_0_10::SessionHandler>(qpid::amqp_0_10::SessionHandler&, qpid::framing::AMQMethodBody const&) () from /usr/lib64/libqpidcommon.so.3
#8  0x00000034df5b9d83 in qpid::amqp_0_10::SessionHandler::invoke(qpid::framing::AMQMethodBody const&) () from /usr/lib64/libqpidcommon.so.3
#9  0x00000034df5bcdfa in qpid::amqp_0_10::SessionHandler::handleIn(qpid::framing::AMQFrame&) () from /usr/lib64/libqpidcommon.so.3
#10 0x00000034e1924e29 in qpid::broker::Connection::received(qpid::framing::AMQFrame&) () from /usr/lib64/libqpidbroker.so.3
#11 0x00002afc6fa95fdc in qpid::cluster::Connection::received(qpid::framing::AMQFrame&) () from /usr/lib64/qpid/daemon/cluster.so
#12 0x00002afc6fa94021 in qpid::cluster::Connection::decode(char const*, unsigned long) () from /usr/lib64/qpid/daemon/cluster.so
#13 0x00000034df5ef5f2 in qpid::sys::AsynchIOHandler::readbuff(qpid::sys::AsynchIO&, qpid::sys::AsynchIOBufferBase*) () from /usr/lib64/libqpidcommon.so.3
#14 0x00000034df5202aa in boost::function2<void, qpid::sys::AsynchIO&, qpid::sys::AsynchIOBufferBase*, std::allocator<boost::function_base> >::operator()(qpid::sys::AsynchIO&, qpid::sys::AsynchIOBufferBase*) const () from /usr/lib64/libqpidcommon.so.3
#15 0x00000034df51da13 in qpid::sys::posix::AsynchIO::readable(qpid::sys::DispatchHandle&) () from /usr/lib64/libqpidcommon.so.3
#16 0x00000034df5f4b27 in boost::function1<void, qpid::sys::DispatchHandle&, std::allocator<boost::function_base> >::operator()(qpid::sys::DispatchHandle&) const () from /usr/lib64/libqpidcommon.so.3
#17 0x00000034df5f4300 in qpid::sys::DispatchHandle::processEvent(qpid::sys::Poller::EventType) () from /usr/lib64/libqpidcommon.so.3
#18 0x00000034df52c644 in qpid::sys::Poller::run() () from /usr/lib64/libqpidcommon.so.3
#19 0x00000034e190ed62 in qpid::broker::Broker::run() () from /usr/lib64/libqpidbroker.so.3
#20 0x00000000004093f4 in ?? ()
#21 0x00000034e1930d9e in qpid::broker::Daemon::fork() () from /usr/lib64/libqpidbroker.so.3
#22 0x0000000000406c65 in ?? ()
#23 0x00000000004055af in std::ios_base::Init::~Init() ()
#24 0x00000034de41d994 in __libc_start_main () from /lib64/libc.so.6
#25 0x0000000000405019 in std::ios_base::Init::~Init() ()
#26 0x00007fffded3b878 in ?? ()
#27 0x0000000000000000 in ?? ()


--------------------------
Client error:

Sent: test 12
*****  got message: test 12
javax.jms.JMSException: Exception when sending message
        at org.apache.qpid.client.BasicMessageProducer_0_10.sendMessage(BasicMessageProducer_0_10.java:215)
        at org.apache.qpid.client.BasicMessageProducer.sendImpl(BasicMessageProducer.java:490)
        at org.apache.qpid.client.BasicMessageProducer.sendImpl(BasicMessageProducer.java:445)
        at org.apache.qpid.client.BasicMessageProducer.send(BasicMessageProducer.java:283)
        at DurableSubscriptionTest.<init>(DurableSubscriptionTest.java:80)
        at DurableSubscriptionTest.main(DurableSubscriptionTest.java:149)
Caused by: org.apache.qpid.transport.SessionException: timed out waiting for resume to finish
        at org.apache.qpid.transport.Session.invoke(Session.java:562)
        at org.apache.qpid.transport.Session.invoke(Session.java:518)
        at org.apache.qpid.transport.SessionInvoker.messageTransfer(SessionInvoker.java:96)
        at org.apache.qpid.client.BasicMessageProducer_0_10.sendMessage(BasicMessageProducer_0_10.java:201)
        ... 5 more
Exception in thread "Thread-6" org.apache.qpid.transport.SessionException: timed out waiting for resume to finish
        at org.apache.qpid.transport.Session.invoke(Session.java:562)
        at org.apache.qpid.transport.Session.invoke(Session.java:518)
        at org.apache.qpid.transport.SessionInvoker.messageFlush(SessionInvoker.java:136)
        at org.apache.qpid.client.BasicMessageConsumer_0_10.getMessageFromQueue(BasicMessageConsumer_0_10.java:404)
        at org.apache.qpid.client.BasicMessageConsumer.receive(BasicMessageConsumer.java:407)
        at org.apache.qpid.client.TopicSubscriberAdaptor.receive(TopicSubscriberAdaptor.java:91)
        at DurableSubscriptionTest$1.run(DurableSubscriptionTest.java:101)
        at java.lang.Thread.run(Thread.java:619)

Comment 13 Rajith Attapattu 2010-08-09 16:04:23 UTC
I have tested using the attached reproducer with the c++ broker built off rev 981711 in qpid trunk. I went through the steps outlined in comment #12 and the cluster works without any errors. The first broker is able to rejoin the cluster again without any issues.

Another thing I noticed in the attached example is the use of the round-robin method as the failover method. It's better if the failover_exchange method is used instead as it will automatically update the cluster membership and will only report a connection exception if there is a total cluster failure.


Note You need to log in before you can comment on or make changes to this bug.