Bug 804752 - TransportFailure exception deadlock
Summary: TransportFailure exception deadlock
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Target Milestone: 2.3
Assignee: Chuck Rolke
QA Contact: Frantisek Reznicek
URL:
Whiteboard:
Depends On:
Blocks: 698367
 
Reported: 2012-03-19 18:14 UTC by Jason Dillaman
Modified: 2015-11-16 01:14 UTC (History)
5 users

Fixed In Version: qpid-cpp-0.18-4
Doc Type: Bug Fix
Doc Text:
Cause: Client exceptions are not properly cleared when a transport exception is triggered by a broker failure. Consequence: Client code gets stuck in an infinite loop reprocessing old exceptions. Fix: Close the transport channel and clear the transport exception when the broker exits unexpectedly. Result: The client recovers and creates a new connection when the broker returns to service.
Clone Of:
Clones: 884803 (view as bug list)
Environment:
Last Closed: 2013-03-06 18:55:28 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Reproducer (1.05 KB, text/x-c++src)
2012-10-01 17:55 UTC, Jason Dillaman
no flags Details
Quick patch to avoid exception handling loop (739 bytes, patch)
2012-10-01 17:59 UTC, Jason Dillaman
no flags Details | Diff
QA reproducer (1.47 KB, application/x-compressed-tar)
2012-11-01 10:11 UTC, Peter Belanyi
no flags Details
QA reproducer enhanced (1.82 KB, application/x-tbz)
2012-12-04 08:39 UTC, Frantisek Reznicek
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 884803 0 medium CLOSED Client stuck in permanent loop handling TransportFailure exception 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHSA-2013:0561 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise MRG Messaging 2.3 security update 2013-03-06 23:48:13 UTC

Internal Links: 884803

Description Jason Dillaman 2012-03-19 18:14:01 UTC
Description of problem:
Upon killing and restarting a broker, client applications get stuck attempting to recover from a TransportFailure exception. The client can reconnect to the broker, but the exception is not cleared if another exception occurs while resetting one of the sessions during the reconnect process (e.g. if a session's receiver's queue doesn't exist). This means that the original transport exception is still being held by one or more of the sessions. Since "ConnectionImpl::open" first checks whether it is already connected, the exception will never be cleared, because the sessions will never be re-initialized properly.
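The stuck-exception mechanism described above can be sketched with a toy model. The MockConnection/MockSession types below are hypothetical stand-ins for qpid's ConnectionImpl/SessionImpl (not the real classes), reduced to the one behavior that matters: open() returns early when already connected, so a session's cached transport error is never cleared.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical mock of the pattern described above; the real qpid classes
// are far more involved. A session caches the transport error text and
// rethrows it on every operation until it is explicitly reset.
struct MockSession {
    std::string error;                     // cached TransportFailure text
    bool hasError() const { return !error.empty(); }
    void check() const {                   // analogous to a checkOpen() call
        if (hasError()) throw std::runtime_error(error);
    }
    void reset() { error.clear(); }        // only happens on re-initialization
};

struct MockConnection {
    bool connected = false;
    MockSession session;

    void open() {
        if (connected) return;             // the early return that keeps the
                                           // stale session state alive
        connected = true;
        session.reset();                   // sessions reset only on a real open
    }
};

// Returns true if the client can make progress after a reconnect attempt.
bool canRecover(MockConnection& c) {
    c.open();
    try { c.session.check(); return true; }
    catch (const std::runtime_error&) { return false; }
}
```

Under this model, a retry loop that calls open() while the connection already believes it is connected rethrows the same stale error forever, matching the observed infinite loop.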

Version-Release number of selected component (if applicable):
qpid-cpp-server-0.12-6_ptc_hotfix_3.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a broker and create a queue
2. Create a single connection with two sessions
3. Subscribe each session to the queue
4. Call a blocking session call (e.g. nextReceiver)
5. Restart the broker, but do not recreate the queue
  
Actual results:
Qpid client library is stuck looping on the same exception

Expected results:
Qpid client library gracefully recovers

Comment 3 Jason Dillaman 2012-10-01 17:55:14 UTC
Created attachment 619919 [details]
Reproducer

Comment 4 Jason Dillaman 2012-10-01 17:59:14 UTC
Created attachment 619924 [details]
Quick patch to avoid exception handling loop

Not the most elegant solution since it will cause a reconnection if one of the connection's sessions encounters an error during post-connection reset.  However, by disconnecting after failing to reset the sessions, it prevents the exception handling loop caused by one or more sessions not being properly reset.
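The idea behind the workaround can be illustrated with a minimal sketch. The Broker/Client types below are hypothetical, not the actual patch: if resetting the sessions after a reconnect throws, close the transport again so the next retry starts from a clean disconnected state instead of looping on a cached exception.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical model of the workaround described above.
struct Broker { bool queueExists = false; };   // broker wiring state

struct Client {
    bool connected = false;
    bool sessionsOk = false;

    void resetSessions(const Broker& b) {
        if (!b.queueExists)                    // session invalidated during reset
            throw std::runtime_error("queue not found");
        sessionsOk = true;
    }

    // One reconnect attempt; returns true once the sessions are usable again.
    bool reconnect(const Broker& b) {
        connected = true;
        try {
            resetSessions(b);
        } catch (const std::runtime_error&) {
            connected = false;   // the workaround: disconnect on reset failure
            sessionsOk = false;  // so the next retry re-initializes everything
        }
        return connected && sessionsOk;
    }
};
```

The cost, as noted above, is an extra full reconnection whenever any session fails its post-connection reset; the benefit is that no half-open connection survives with a stale exception.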

Comment 7 Peter Belanyi 2012-11-01 10:11:17 UTC
Created attachment 636536 [details]
QA reproducer

Reproducer usage:
tc804752.sh [cycles] [messages]
  cycles:     run the test [cycles] times
  messages:   add [messages] number of messages to the queues before starting main.out

Comment 8 Peter Belanyi 2012-11-01 10:17:05 UTC
The QA reproducer is not able to reproduce the issue 100% reliably. It will be updated when the fixed package is available for testing and we can see the difference between the correct and incorrect behavior.

Comment 9 Frantisek Reznicek 2012-11-05 15:09:13 UTC
It is a bit unclear what the expected behavior is.

The reproducer is written in such a way that it never exits on its own.

Let me ask how to properly distinguish the buggy behavior.

I assume that if the C++ reproducer client enters the locked state pthread_mutex_lock -> _L_lock_* -> __lll_lock_wait and is not able to get out of it, that proves the buggy situation.
If it does not enter the pthread_mutex_lock -> _L_lock_* -> __lll_lock_wait state, then it is OK.

The observed behavior was that we saw the pthread_mutex_lock -> _L_lock_* -> __lll_lock_wait state; one or two seconds later the client went somewhere else, then returned to the pthread_mutex_lock -> _L_lock_* -> __lll_lock_wait state within a second or so. Is this still the buggy behavior?


On a slightly different topic, is it necessary to put some messages into the queue[12] queues or not?

Is there a way to modify the reproducer so that it is more evident that we trigger the issue? We tried to modify the C++ reproducer in the following way:

--- /tmp/main.cpp       2012-11-05 16:05:57.000000000 +0100
+++ /tmp/main_mod.cpp   2012-11-05 16:05:49.000000000 +0100
@@ -10,9 +10,15 @@
 {
     qpid::messaging::Session session = connection.createSession();
     qpid::messaging::Receiver receiver = session.createReceiver(queueName);
+    int cnt=10;
     while (true) {
         try {
             while (session.nextReceiver(receiver, qpid::messaging::Duration::FOREVER)) {
+              if (cnt > 0)
+              {
+                  std::cout << "s.nextReceiver(" << queueName << ") call" << std::endl;
+                  cnt--;
+              }
             }
         }
         catch (...) {

But then we were unclear whether it is necessary to have messages in the queue, as session.nextReceiver() might block; in that case our messages would not be printed and we would fail to detect the permanent-loop situation...

Jason, Chuck, could you possibly answer the question please?


Raising NEEDINFO.

Comment 10 Jason Dillaman 2012-11-05 16:46:14 UTC
(In reply to comment #9)
In our environment, within the broker's init script, after starting the broker we create any needed static exchanges/queues/bindings.  Therefore, there is a race condition between starting the broker and re-creating all the internal wiring where a client could reconnect and have one of its sessions invalidated because the wiring wasn't complete (i.e. it attempts to re-subscribe to a queue that hasn't been recreated yet).  

We expect the clients to be able to eventually recover after the broker's wiring is in place.  If your test doesn't recreate the wiring, I would expect that your client would disconnect from the broker and try again to recover its sessions.  I suppose it could be argued that the client should immediately bomb-out after its initial session failure after reconnection, but for my purposes I know the broker will eventually get back into a correct wiring state and thus the client should eventually recover.
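The eventual-recovery expectation described above amounts to a bounded retry loop on the client side. The sketch below is a generic illustration assuming a hypothetical attempt() callback and retry bound; it is not part of the qpid API.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Keep retrying until the broker's wiring is back in place, rather than
// bombing out on the first post-reconnect session failure. attempt() is a
// hypothetical callback standing in for "reconnect and re-subscribe"; it
// returns true once the sessions are usable and throws on transport or
// session errors.
bool retryUntilRecovered(const std::function<bool()>& attempt, int maxTries) {
    for (int i = 0; i < maxTries; ++i) {
        try {
            if (attempt()) return true;   // sessions usable again
        } catch (const std::exception&) {
            // transport or session error: fall through and retry
        }
    }
    return false;                         // broker never came back correctly
}
```

This is the behavior the fix is expected to enable: each failed attempt fully tears down the connection, so a later attempt, made after the broker's wiring is restored, can succeed.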

Comment 12 Frantisek Reznicek 2012-12-04 08:29:06 UTC
Initial verification (using attachment 636536 [details]) went fine: on the stable version the client hangs processing an additional exception within TransportFailure, while the candidate code shows that the client can recover.

The current QA reproducer, however, does not verify that the client can actually do something else and finish - this still needs to be done.

Comment 13 Frantisek Reznicek 2012-12-04 08:39:55 UTC
Created attachment 657325 [details]
QA reproducer enhanced

The enhanced reproducer uses a different way to trigger the problem.
The steps are as follows:
 1] Start a broker and create a queue
 2] Create a single connection with two sessions
 3] Subscribe each session to the queues
 4] Call a blocking session call (e.g. session.receiver.fetch())
    Wait for a message with special content
 5] Restart the broker, but do not create the queue
 6] Send a few messages to the queue[s], the last one having special content
 7] Expect that the client will exit, i.e.
     the message with special content will be received,
     the threads will get shut down,
     the client will gracefully exit

The above steps match the comment 10 scenario more closely.

How to run:
  bz804752.sh [cycles] [messages] [cycle-length]
    cycles:       number of scenario repetitions
    messages:     number of messages published to the queues at point 2]
    cycle-length: number of seconds used to watch the process

Comment 14 Frantisek Reznicek 2012-12-04 08:57:12 UTC
The behavior was retested using the enhanced reproducer (attachment 657325 [details]) on the following packages:
  qpid-cpp*-0.18-10.el*
  qpid-qmf*-0.18-9.el*


On the stable packages the client never exited.
On the candidate packages clients exited gracefully in only around 50% of cases.

The other cases show two problems:
 a] a hang in a slightly different place
    It looks like a thread deadlock (there are still two concurrent threads in the client)
 b] a client crash in qpid::client::TCPConnector::connectFailed() -> ... -> malloc_consolidate() - this behavior will be tracked as a separate defect, as it is seen only on the latest RHEL 5.9 candidate


Summary:
The above patch improves the behavior, and clients are eventually able to leave the exception lock (which was not the case for the stable version).

Repeating the scenario 10+ times shows that in approx. 50% of cases the qpid client enters another (very similar) threading deadlock:

qpid::client::SessionImpl::checkOpen() vs. qpid::client::amqp0_10::ReceiverImpl::fetch() -> qpid::client::amqp0_10::SessionImpl::execute()

Most of the cases look like:

  [root@dhcp-37-124 bz804752]# cat pstack_004_29.log 
  Thread 6 (Thread 0x2ac666a1a940 (LWP 27978)):
  #0  0x00000031794d5428 in epoll_wait () from /lib64/libc.so.6
  #1  0x00002ac664da7b61 in qpid::sys::Poller::wait(qpid::sys::Duration) () from /usr/lib64/libqpidcommon.so.8
  #2  0x00002ac664da85d7 in qpid::sys::Poller::run() () from /usr/lib64/libqpidcommon.so.8
  #3  0x00002ac664d9f5ba in qpid::sys::(anonymous namespace)::runRunnable(void*) () from /usr/lib64/libqpidcommon.so.8
  #4  0x0000003179c0683d in start_thread () from /lib64/libpthread.so.0
  #5  0x00000031794d503d in clone () from /lib64/libc.so.6
  Thread 5 (Thread 0x2ac66832f940 (LWP 27980)):
  #0  0x000000317b807372 in ?? () from /lib64/libgcc_s.so.1
  #1  0x000000317b8079c6 in ?? () from /lib64/libgcc_s.so.1
  #2  0x000000317b80866b in ?? () from /lib64/libgcc_s.so.1
  #3  0x000000317b808cab in _Unwind_Resume () from /lib64/libgcc_s.so.1
  #4  0x00002ac664330645 in bool qpid::client::amqp0_10::SessionImpl::execute<qpid::client::amqp0_10::ReceiverImpl::Fetch>(qpid::client::amqp0_10::ReceiverImpl::Fetch&) () from /usr/lib64/libqpidmessaging.so.3
  #5  0x00002ac66432f9ac in qpid::client::amqp0_10::ReceiverImpl::fetch(qpid::messaging::Message&, qpid::messaging::Duration) () from /usr/lib64/libqpidmessaging.so.3
  #6  0x00002ac6643048ad in qpid::messaging::Receiver::fetch(qpid::messaging::Message&, qpid::messaging::Duration) () from /usr/lib64/libqpidmessaging.so.3
  #7  0x000000000040213f in thread(qpid::messaging::Connection, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
  #8  0x0000000000402d31 in void boost::_bi::list2<boost::_bi::value<qpid::messaging::Connection>, boost::_bi::value<char const*> >::operator()<void (*)(qpid::messaging::Connection, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), boost::_bi::list0>(boost::_bi::type<void>, void (*&)(qpid::messaging::Connection, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), boost::_bi::list0&, int) ()
  #9  0x0000000000402ddc in boost::_bi::bind_t<void, void (*)(qpid::messaging::Connection, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), boost::_bi::list2<boost::_bi::value<qpid::messaging::Connection>, boost::_bi::value<char const*> > >::operator()() ()
  #10 0x0000000000402dfb in boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, void (*)(qpid::messaging::Connection, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), boost::_bi::list2<boost::_bi::value<qpid::messaging::Connection>, boost::_bi::value<char const*> > >, void>::invoke(boost::detail::function::any_pointer) ()
  #11 0x00002ac66456cca7 in boost::function0<void, std::allocator<boost::function_base> >::operator()() const () from /usr/lib64/libboost_thread.so.2
  #12 0x00002ac66456c70f in ?? () from /usr/lib64/libboost_thread.so.2
  #13 0x0000003179c0683d in start_thread () from /lib64/libpthread.so.0
  #14 0x00000031794d503d in clone () from /lib64/libc.so.6
  Thread 4 (Thread 0x2ac668d30940 (LWP 27981)):
  #0  0x0000003179c0d654 in __lll_lock_wait () from /lib64/libpthread.so.0
  #1  0x0000003179c08f65 in _L_lock_1127 () from /lib64/libpthread.so.0
  #2  0x0000003179c08e63 in pthread_mutex_lock () from /lib64/libpthread.so.0
  #3  0x0000003179509a8a in dl_iterate_phdr () from /lib64/libc.so.6
  #4  0x000000317b80a6b6 in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
  #5  0x000000317b807625 in ?? () from /lib64/libgcc_s.so.1
  #6  0x000000317b808f79 in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
  #7  0x000000317e8bcf64 in __cxa_throw () from /usr/lib64/libstdc++.so.6
  #8  0x00002ac664a23a51 in qpid::client::SessionImpl::checkOpen() const () from /usr/lib64/libqpidclient.so.8
  #9  0x00002ac664a2a6c6 in qpid::client::SessionImpl::sendCommand(qpid::framing::AMQBody const&, qpid::framing::MethodContent const*) () from /usr/lib64/libqpidclient.so.8
  #10 0x00002ac664a2a9ab in qpid::client::SessionImpl::send(qpid::framing::AMQBody const&) () from /usr/lib64/libqpidclient.so.8
  #11 0x00002ac6649d39f2 in qpid::client::no_keyword::AsyncSession_0_10::messageSetFlowMode(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned char, bool) () from /usr/lib64/libqpidclient.so.8
  #12 0x00002ac66432f522 in qpid::client::amqp0_10::ReceiverImpl::fetchImpl(qpid::messaging::Message&, qpid::messaging::Duration) () from /usr/lib64/libqpidmessaging.so.3
  #13 0x00002ac6643305e9 in bool qpid::client::amqp0_10::SessionImpl::execute<qpid::client::amqp0_10::ReceiverImpl::Fetch>(qpid::client::amqp0_10::ReceiverImpl::Fetch&) () from /usr/lib64/libqpidmessaging.so.3
  #14 0x00002ac66432f9ac in qpid::client::amqp0_10::ReceiverImpl::fetch(qpid::messaging::Message&, qpid::messaging::Duration) () from /usr/lib64/libqpidmessaging.so.3
  #15 0x00002ac6643048ad in qpid::messaging::Receiver::fetch(qpid::messaging::Message&, qpid::messaging::Duration) () from /usr/lib64/libqpidmessaging.so.3
  #16 0x000000000040213f in thread(qpid::messaging::Connection, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
  #17 0x0000000000402d31 in void boost::_bi::list2<boost::_bi::value<qpid::messaging::Connection>, boost::_bi::value<char const*> >::operator()<void (*)(qpid::messaging::Connection, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), boost::_bi::list0>(boost::_bi::type<void>, void (*&)(qpid::messaging::Connection, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), boost::_bi::list0&, int) ()
  #18 0x0000000000402ddc in boost::_bi::bind_t<void, void (*)(qpid::messaging::Connection, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), boost::_bi::list2<boost::_bi::value<qpid::messaging::Connection>, boost::_bi::value<char const*> > >::operator()() ()
  #19 0x0000000000402dfb in boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, void (*)(qpid::messaging::Connection, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), boost::_bi::list2<boost::_bi::value<qpid::messaging::Connection>, boost::_bi::value<char const*> > >, void>::invoke(boost::detail::function::any_pointer) ()
  #20 0x00002ac66456cca7 in boost::function0<void, std::allocator<boost::function_base> >::operator()() const () from /usr/lib64/libboost_thread.so.2
  #21 0x00002ac66456c70f in ?? () from /usr/lib64/libboost_thread.so.2
  #22 0x0000003179c0683d in start_thread () from /lib64/libpthread.so.0
  #23 0x00000031794d503d in clone () from /lib64/libc.so.6
  Thread 3 (Thread 0x2ac669731940 (LWP 27993)):
  #0  0x00000031794d5428 in epoll_wait () from /lib64/libc.so.6
  #1  0x00002ac664da7b61 in qpid::sys::Poller::wait(qpid::sys::Duration) () from /usr/lib64/libqpidcommon.so.8
  #2  0x00002ac664da85d7 in qpid::sys::Poller::run() () from /usr/lib64/libqpidcommon.so.8
  #3  0x00002ac664d9f5ba in qpid::sys::(anonymous namespace)::runRunnable(void*) () from /usr/lib64/libqpidcommon.so.8
  #4  0x0000003179c0683d in start_thread () from /lib64/libpthread.so.0
  #5  0x00000031794d503d in clone () from /lib64/libc.so.6
  Thread 2 (Thread 0x2ac66a132940 (LWP 28029)):
  #0  0x00000031794d5428 in epoll_wait () from /lib64/libc.so.6
  #1  0x00002ac664da7b61 in qpid::sys::Poller::wait(qpid::sys::Duration) () from /usr/lib64/libqpidcommon.so.8
  #2  0x00002ac664da85d7 in qpid::sys::Poller::run() () from /usr/lib64/libqpidcommon.so.8
  #3  0x00002ac664d9f5ba in qpid::sys::(anonymous namespace)::runRunnable(void*) () from /usr/lib64/libqpidcommon.so.8
  #4  0x0000003179c0683d in start_thread () from /lib64/libpthread.so.0
  #5  0x00000031794d503d in clone () from /lib64/libc.so.6
  Thread 1 (Thread 0x2ac665140f10 (LWP 27977)):
  #0  0x0000003179c07c65 in pthread_join () from /lib64/libpthread.so.0
  #1  0x00002ac66456c3ce in boost::thread::join() () from /usr/lib64/libboost_thread.so.2
  #2  0x0000000000401fa6 in main ()

All data will be attached.

-> ASSIGNED

Comment 16 Justin Ross 2012-12-06 18:20:29 UTC
This issue is mitigated but not completely fixed.  I've created a clone, bug 884803, to track further work.  For the purposes of 2.3, I'm marking this modified, and we can pull the new bug, currently against 2.4, back into 2.3 after deliberation.

Comment 17 Frantisek Reznicek 2012-12-07 08:53:17 UTC
(In reply to comment #16)
> This is issue is mitigated but not completely fixed.  I've created a clone,
> bug 884803, to track further work.  For the purposes of 2.3, I'm marking
> this modified, and we can pull the new bug, currently against 2.4, back into
> 2.3 after deliberation.

Justin could you doublecheck please?

bug 804752 now tracks just the TransportFailure exception deadlock, i.e.
           handling multiple exceptions within a TransportFailure exception
bug 884803 still tracks the whole topic, i.e.
          'Client stuck in permanent loop handling TransportFailure exception',
           which is a superset of bug 804752.

Am I getting this right? (I need to know in order to move this one to VERIFIED)

Comment 18 Justin Ross 2012-12-07 13:20:00 UTC
(In reply to comment #17)
> (In reply to comment #16)
> > This is issue is mitigated but not completely fixed.  I've created a clone,
> > bug 884803, to track further work.  For the purposes of 2.3, I'm marking
> > this modified, and we can pull the new bug, currently against 2.4, back into
> > 2.3 after deliberation.
> 
> Justin could you doublecheck please?
> 
> bug 804752 now tracks just TransportFailure exception deadlock i.e.
>            handling multiple exceptions within TransportFailure exception
> bug 884803 tracks still the whole topic i.e.
>           'Client stuck in permanent loop handling TransportFailure
> exception'
>            which is superset of bug 804752.
> 
> Am I getting it right, please? (I need that for being able to VERI this one)

Yes, you said it better than I.

Comment 19 Frantisek Reznicek 2012-12-10 09:20:30 UTC
The issue has been fixed: the TransportFailure exception deadlock, i.e. handling multiple exceptions within a TransportFailure exception, is now fixed.

Another problem - a deadlock occurring somewhere further along - is tracked as separate bug 884803.

Retested on RHEL 5.9RC / 6.3,  i[36]86 / x86_64 on packages:
  qpid-cpp*-0.18-10.el*
  qpid-qmf*-0.18-9.el*


-> VERIFIED

Comment 21 errata-xmlrpc 2013-03-06 18:55:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0561.html

