Description of problem:
Upon killing and restarting a broker, client applications get stuck attempting to recover from a TransportFailure exception. The client can reconnect to the broker, but the exception is not cleared if another exception occurs while resetting one of the sessions during the reconnect process (i.e. if a session's receiver's queue doesn't exist). This means that the original transport exception is still being held by one or more of the sessions. Since "ConnectionImpl::open" first checks to see if it is already connected, the exception will never be cleared because the sessions will never be re-initialized properly.

Version-Release number of selected component (if applicable):
qpid-cpp-server-0.12-6_ptc_hotfix_3.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a broker and create a queue
2. Create a single connection with two sessions
3. Subscribe each session to the queues
4. Call a blocking session call (i.e. nextReceiver)
5. Restart the broker, but do not create the queue

Actual results:
Qpid client library is stuck looping on the same exception

Expected results:
Qpid client library gracefully recovers
Similar to issue witnessed by Alan Conway: https://issues.apache.org/jira/browse/QPID-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014573#comment-13014573
Created attachment 619919 [details] Reproducer
Created attachment 619924 [details] Quick patch to avoid exception handling loop

Not the most elegant solution, since it will cause a reconnection if one of the connection's sessions encounters an error during the post-connection reset. However, by disconnecting after failing to reset the sessions, it prevents the exception-handling loop caused by one or more sessions not being properly reset.
Created attachment 636536 [details] QA reproducer

Reproducer usage: tc804752.sh [cycles] [messages]
  cycles:   run the test [cycles] times
  messages: add [messages] number of messages to the queues before starting main.out
The QA reproducer is not able to reproduce the issue 100% reliably. It will be updated when the fixed package is available for testing and we can see the difference between the correct and incorrect behaviour.
It is a bit unclear what the expected behavior is. The reproducer is written in such a way that it will not exit by itself. Let me ask how to properly distinguish the buggy behavior.

I assume that if the C++ reproducer client enters the state with locks pthread_mutex_lock -> _L_lock_* -> __lll_lock_wait and is not able to get out of there, that proves the buggy situation. If it does not enter the pthread_mutex_lock -> _L_lock_* -> __lll_lock_wait state, then it is OK. The observed behavior was that we saw the pthread_mutex_lock -> _L_lock_* -> __lll_lock_wait state, and one or two more seconds later it went somewhere else, returning back to the pthread_mutex_lock -> _L_lock_* -> __lll_lock_wait state within a second or so. Is this still the buggy behavior?

On a slightly different topic: is it necessary / needed to put some messages into the queue[12] queues or not? Is there a way to modify the reproducer so that it would be more evident that we trigger the issue? We were trying to modify the C++ reproducer the following way:

--- /tmp/main.cpp	2012-11-05 16:05:57.000000000 +0100
+++ /tmp/main_mod.cpp	2012-11-05 16:05:49.000000000 +0100
@@ -10,9 +10,15 @@
 {
     qpid::messaging::Session session = connection.createSession();
     qpid::messaging::Receiver receiver = session.createReceiver(queueName);
+    int cnt=10;
     while (true) {
         try {
             while (session.nextReceiver(receiver, qpid::messaging::Duration::FOREVER)) {
+                if (cnt > 0)
+                {
+                    std::cout << "s.nextReceiver(" << queueName << ") call" << std::endl;
+                    cnt--;
+                }
             }
         }
         catch (...) {

But then we were unclear whether it is necessary to have messages in the queue or not, as session.nextReceiver() might block, in which case our messages will not be printed and we fail to detect the permanent-loop situation...

Jason, Chuck, could you possibly answer the question, please? Raising NEEDINFO.
(In reply to comment #9) In our environment, within the broker's init script, after starting the broker we create any needed static exchanges/queues/bindings. Therefore, there is a race condition between starting the broker and re-creating all the internal wiring where a client could reconnect and have one of its sessions invalidated because the wiring wasn't complete (i.e. it attempts to re-subscribe to a queue that hasn't been recreated yet). We expect the clients to be able to eventually recover after the broker's wiring is in place. If your test doesn't recreate the wiring, I would expect that your client would disconnect from the broker and try again to recover its sessions. I suppose it could be argued that the client should immediately bomb-out after its initial session failure after reconnection, but for my purposes I know the broker will eventually get back into a correct wiring state and thus the client should eventually recover.
Initial verification (using attachment 636536 [details]) went fine: on the stable version the client hangs processing an additional exception within TransportFailure, while the candidate code shows that the client can recover. The current QA reproducer, however, does not verify that the client can actually do something else and finish - which still needs to be done.
Created attachment 657325 [details] QA reproducer enhanced

The enhanced reproducer uses a different way to trigger the problem. The steps are the following:
1] Start a broker and create a queue
2] Create a single connection with two sessions
3] Subscribe each session to the queues
4] Call a blocking session call (i.e. session.receiver.fetch()) and wait for a message with special content
5] Restart the broker, but do not create the queue
6] Send a few messages to the queue[s], the last one having the special content
7] Expect that:
   - the client will exit, i.e. the message with special content will be received
   - threads will get shut down
   - the client will gracefully exit

The above steps match the comment 10 scenario more closely.

How to run:
bz804752.sh [cycles] [messages] [cycle-length]
  cycles:       number of scenario repetitions
  messages:     number of messages published to queues at point 2]
  cycle-length: number of secs used to watch the process
The behavior was retested using the enhanced reproducer (attachment 657325 [details]) on the following packages:
qpid-cpp*-0.18-10.el*
qpid-qmf*-0.18-9.el*

On stable packages the client never exited. On candidate packages clients exited gracefully in only around 50% of cases. The other cases show two problems:
a] a hang in a slightly different place; it looks like a thread deadlock (there are still two concurrent threads in the client)
b] a client crash in qpid::client::TCPConnector::connectFailed() -> ... -> malloc_consolidate() - this behavior is going to be tracked as a separate defect, as it is seen only on the latest rhel5.9 candidate

Summary:
The above patch improves the behavior and clients are eventually able to leave the exception lock (which was not the case for the stable version). Repeating the scenario 10+ times shows that in approx 50% of cases the qpid client enters another (very similar) threading deadlock: qpid::client::SessionImpl::checkOpen() vs. qpid::client::amqp0_10::ReceiverImpl::fetch() -> qpid::client::amqp0_10::SessionImpl::execute(). Most of the cases look like:

[root@dhcp-37-124 bz804752]# cat pstack_004_29.log
Thread 6 (Thread 0x2ac666a1a940 (LWP 27978)):
#0  0x00000031794d5428 in epoll_wait () from /lib64/libc.so.6
#1  0x00002ac664da7b61 in qpid::sys::Poller::wait(qpid::sys::Duration) () from /usr/lib64/libqpidcommon.so.8
#2  0x00002ac664da85d7 in qpid::sys::Poller::run() () from /usr/lib64/libqpidcommon.so.8
#3  0x00002ac664d9f5ba in qpid::sys::(anonymous namespace)::runRunnable(void*) () from /usr/lib64/libqpidcommon.so.8
#4  0x0000003179c0683d in start_thread () from /lib64/libpthread.so.0
#5  0x00000031794d503d in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x2ac66832f940 (LWP 27980)):
#0  0x000000317b807372 in ?? () from /lib64/libgcc_s.so.1
#1  0x000000317b8079c6 in ?? () from /lib64/libgcc_s.so.1
#2  0x000000317b80866b in ?? () from /lib64/libgcc_s.so.1
#3  0x000000317b808cab in _Unwind_Resume () from /lib64/libgcc_s.so.1
#4  0x00002ac664330645 in bool qpid::client::amqp0_10::SessionImpl::execute<qpid::client::amqp0_10::ReceiverImpl::Fetch>(qpid::client::amqp0_10::ReceiverImpl::Fetch&) () from /usr/lib64/libqpidmessaging.so.3
#5  0x00002ac66432f9ac in qpid::client::amqp0_10::ReceiverImpl::fetch(qpid::messaging::Message&, qpid::messaging::Duration) () from /usr/lib64/libqpidmessaging.so.3
#6  0x00002ac6643048ad in qpid::messaging::Receiver::fetch(qpid::messaging::Message&, qpid::messaging::Duration) () from /usr/lib64/libqpidmessaging.so.3
#7  0x000000000040213f in thread(qpid::messaging::Connection, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#8  0x0000000000402d31 in void boost::_bi::list2<boost::_bi::value<qpid::messaging::Connection>, boost::_bi::value<char const*> >::operator()<void (*)(qpid::messaging::Connection, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), boost::_bi::list0>(boost::_bi::type<void>, void (*&)(qpid::messaging::Connection, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), boost::_bi::list0&, int) ()
#9  0x0000000000402ddc in boost::_bi::bind_t<void, void (*)(qpid::messaging::Connection, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), boost::_bi::list2<boost::_bi::value<qpid::messaging::Connection>, boost::_bi::value<char const*> > >::operator()() ()
#10 0x0000000000402dfb in boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, void (*)(qpid::messaging::Connection, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), boost::_bi::list2<boost::_bi::value<qpid::messaging::Connection>, boost::_bi::value<char const*> > >, void>::invoke(boost::detail::function::any_pointer) ()
#11 0x00002ac66456cca7 in boost::function0<void, std::allocator<boost::function_base> >::operator()() const () from /usr/lib64/libboost_thread.so.2
#12 0x00002ac66456c70f in ?? () from /usr/lib64/libboost_thread.so.2
#13 0x0000003179c0683d in start_thread () from /lib64/libpthread.so.0
#14 0x00000031794d503d in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x2ac668d30940 (LWP 27981)):
#0  0x0000003179c0d654 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003179c08f65 in _L_lock_1127 () from /lib64/libpthread.so.0
#2  0x0000003179c08e63 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000003179509a8a in dl_iterate_phdr () from /lib64/libc.so.6
#4  0x000000317b80a6b6 in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#5  0x000000317b807625 in ?? () from /lib64/libgcc_s.so.1
#6  0x000000317b808f79 in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
#7  0x000000317e8bcf64 in __cxa_throw () from /usr/lib64/libstdc++.so.6
#8  0x00002ac664a23a51 in qpid::client::SessionImpl::checkOpen() const () from /usr/lib64/libqpidclient.so.8
#9  0x00002ac664a2a6c6 in qpid::client::SessionImpl::sendCommand(qpid::framing::AMQBody const&, qpid::framing::MethodContent const*) () from /usr/lib64/libqpidclient.so.8
#10 0x00002ac664a2a9ab in qpid::client::SessionImpl::send(qpid::framing::AMQBody const&) () from /usr/lib64/libqpidclient.so.8
#11 0x00002ac6649d39f2 in qpid::client::no_keyword::AsyncSession_0_10::messageSetFlowMode(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned char, bool) () from /usr/lib64/libqpidclient.so.8
#12 0x00002ac66432f522 in qpid::client::amqp0_10::ReceiverImpl::fetchImpl(qpid::messaging::Message&, qpid::messaging::Duration) () from /usr/lib64/libqpidmessaging.so.3
#13 0x00002ac6643305e9 in bool qpid::client::amqp0_10::SessionImpl::execute<qpid::client::amqp0_10::ReceiverImpl::Fetch>(qpid::client::amqp0_10::ReceiverImpl::Fetch&) () from /usr/lib64/libqpidmessaging.so.3
#14 0x00002ac66432f9ac in qpid::client::amqp0_10::ReceiverImpl::fetch(qpid::messaging::Message&, qpid::messaging::Duration) () from /usr/lib64/libqpidmessaging.so.3
#15 0x00002ac6643048ad in qpid::messaging::Receiver::fetch(qpid::messaging::Message&, qpid::messaging::Duration) () from /usr/lib64/libqpidmessaging.so.3
#16 0x000000000040213f in thread(qpid::messaging::Connection, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#17 0x0000000000402d31 in void boost::_bi::list2<boost::_bi::value<qpid::messaging::Connection>, boost::_bi::value<char const*> >::operator()<void (*)(qpid::messaging::Connection, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), boost::_bi::list0>(boost::_bi::type<void>, void (*&)(qpid::messaging::Connection, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), boost::_bi::list0&, int) ()
#18 0x0000000000402ddc in boost::_bi::bind_t<void, void (*)(qpid::messaging::Connection, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), boost::_bi::list2<boost::_bi::value<qpid::messaging::Connection>, boost::_bi::value<char const*> > >::operator()() ()
#19 0x0000000000402dfb in boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, void (*)(qpid::messaging::Connection, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), boost::_bi::list2<boost::_bi::value<qpid::messaging::Connection>, boost::_bi::value<char const*> > >, void>::invoke(boost::detail::function::any_pointer) ()
#20 0x00002ac66456cca7 in boost::function0<void, std::allocator<boost::function_base> >::operator()() const () from /usr/lib64/libboost_thread.so.2
#21 0x00002ac66456c70f in ?? () from /usr/lib64/libboost_thread.so.2
#22 0x0000003179c0683d in start_thread () from /lib64/libpthread.so.0
#23 0x00000031794d503d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x2ac669731940 (LWP 27993)):
#0  0x00000031794d5428 in epoll_wait () from /lib64/libc.so.6
#1  0x00002ac664da7b61 in qpid::sys::Poller::wait(qpid::sys::Duration) () from /usr/lib64/libqpidcommon.so.8
#2  0x00002ac664da85d7 in qpid::sys::Poller::run() () from /usr/lib64/libqpidcommon.so.8
#3  0x00002ac664d9f5ba in qpid::sys::(anonymous namespace)::runRunnable(void*) () from /usr/lib64/libqpidcommon.so.8
#4  0x0000003179c0683d in start_thread () from /lib64/libpthread.so.0
#5  0x00000031794d503d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x2ac66a132940 (LWP 28029)):
#0  0x00000031794d5428 in epoll_wait () from /lib64/libc.so.6
#1  0x00002ac664da7b61 in qpid::sys::Poller::wait(qpid::sys::Duration) () from /usr/lib64/libqpidcommon.so.8
#2  0x00002ac664da85d7 in qpid::sys::Poller::run() () from /usr/lib64/libqpidcommon.so.8
#3  0x00002ac664d9f5ba in qpid::sys::(anonymous namespace)::runRunnable(void*) () from /usr/lib64/libqpidcommon.so.8
#4  0x0000003179c0683d in start_thread () from /lib64/libpthread.so.0
#5  0x00000031794d503d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2ac665140f10 (LWP 27977)):
#0  0x0000003179c07c65 in pthread_join () from /lib64/libpthread.so.0
#1  0x00002ac66456c3ce in boost::thread::join() () from /usr/lib64/libboost_thread.so.2
#2  0x0000000000401fa6 in main ()

All data will be attached.

-> ASSIGNED
This issue is mitigated but not completely fixed. I've created a clone, bug 884803, to track further work. For the purposes of 2.3, I'm marking this modified, and we can pull the new bug, currently against 2.4, back into 2.3 after deliberation.
(In reply to comment #16)
> This is issue is mitigated but not completely fixed. I've created a clone,
> bug 884803, to track further work. For the purposes of 2.3, I'm marking
> this modified, and we can pull the new bug, currently against 2.4, back into
> 2.3 after deliberation.

Justin, could you double-check, please?

bug 804752 now tracks just the TransportFailure exception deadlock, i.e. handling multiple exceptions within a TransportFailure exception
bug 884803 still tracks the whole topic, i.e. 'Client stuck in permanent loop handling TransportFailure exception', which is a superset of bug 804752.

Am I getting it right, please? (I need that for being able to VERI this one)
(In reply to comment #17)
> (In reply to comment #16)
> > This is issue is mitigated but not completely fixed. I've created a clone,
> > bug 884803, to track further work. For the purposes of 2.3, I'm marking
> > this modified, and we can pull the new bug, currently against 2.4, back into
> > 2.3 after deliberation.
>
> Justin could you doublecheck please?
>
> bug 804752 now tracks just TransportFailure exception deadlock i.e.
> handling multiple exceptions within TransportFailure exception
> bug 884803 tracks still the whole topic i.e.
> 'Client stuck in permanent loop handling TransportFailure exception'
> which is superset of bug 804752.
>
> Am I getting it right, please? (I need that for being able to VERI this one)

Yes, you said it better than I.
The issue has been fixed: the TransportFailure exception deadlock (i.e. handling multiple exceptions within a TransportFailure exception) is now fixed. The other problem, a deadlock further along, is tracked as separate bug 884803.

Retested on RHEL 5.9RC / 6.3, i[36]86 / x86_64, on packages:
qpid-cpp*-0.18-10.el*
qpid-qmf*-0.18-9.el*

-> VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0561.html