Bug 874192 - Deadlock in Qpid client if connection heartbeat is missed
Summary: Deadlock in Qpid client if connection heartbeat is missed
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: Development
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: 2.3
: ---
Assignee: Andrew Stitcher
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks: 698367
TreeView+ depends on / blocked
 
Reported: 2012-11-07 16:46 UTC by Jason Dillaman
Modified: 2013-03-19 16:38 UTC (History)
6 users (show)

Fixed In Version: qpid-cpp-0.18-10
Doc Type: Bug Fix
Doc Text:
Cause: In just the wrong circumstances in the C++ client code, if a heartbeat is missed whilst still trying before a connection is actually connected, the Timer thread that processes the heartbeat failure can deadlock with the a thread that is concurrently processing the same connection. Consequence: The client deadlocks whilst trying to close a connection that has been timed out due to heartbeat failure whilst still connecting. Fix: The code in the Timer thread was changed to defer the actual processing to another thread later so it could not conflict. Result: The deadlock no longer occurs.
Clone Of:
Environment:
Last Closed: 2013-03-19 16:38:31 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
pstack dump from deadlocked thread (1.45 KB, application/octet-stream)
2012-11-07 17:22 UTC, Jason Dillaman
no flags Details
Reproducer (409 bytes, text/plain)
2012-11-16 16:52 UTC, Jason Dillaman
no flags Details
Quick patch to prevent deadlock (1.49 KB, patch)
2012-11-16 18:24 UTC, Jason Dillaman
no flags Details | Diff
Proposed fix (3.59 KB, patch)
2012-11-20 00:24 UTC, Andrew Stitcher
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Apache JIRA QPID-4447 0 None None None Never

Description Jason Dillaman 2012-11-07 16:46:08 UTC
Description of problem:
If heartbeats are enabled and the client's heartbeat timeout task fires, there will be a deadlock between the timeout task invocation and the destructor for the connection which is attempting to cancel the task.

Version-Release number of selected component (if applicable):
Qpid 0.18

How reproducible:
100%

Steps to Reproduce:
1. Connect a client to a broker w/ heartbeats enabled
2. Interrupt the network connection (or pause the broker process) so that the client misses its heartbeats
  
Actual results:
Client deadlocks attempting to close the connection

Expected results:
Client closes the connection

Additional info:

Comment 1 Jason Dillaman 2012-11-07 17:22:11 UTC
Created attachment 640226 [details]
pstack dump from deadlocked thread

Comment 2 Jason Dillaman 2012-11-07 17:42:24 UTC
pstack:


 Thread 7 (Thread 0xb35fdb70 (LWP 14837)):
#0  0x00f23424 in __kernel_vsyscall ()
#1  0x00cb227c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x009f2c3d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libc.so.6
#3  0x041c88fb in qpid::sys::TimerTask::cancel() () from /usr/lib/libqpidcommon.so.8
#4  0x006917a1 in qpid::client::ConnectionImpl::~ConnectionImpl() () from /usr/lib/libqpidclient.so.8
#5  0x00691b93 in qpid::client::ConnectionImpl::~ConnectionImpl() () from /usr/lib/libqpidclient.so.8
#6  0x0068fbb7 in qpid::client::ConnectionImpl::shutdown() () from /usr/lib/libqpidclient.so.8
#7  0x006c4fda in qpid::client::TCPConnector::connectFailed(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib/libqpidclient.so.8
#8  0x006c4800 in qpid::client::TCPConnector::abort() () from /usr/lib/libqpidclient.so.8
#9  0x0068ea48 in qpid::client::ConnectionImpl::idleIn() () from /usr/lib/libqpidclient.so.8
#10 0x0068efac in ?? () from /usr/lib/libqpidclient.so.8
#11 0x041c7132 in qpid::sys::TimerTask::fireTask() () from /usr/lib/libqpidcommon.so.8
#12 0x041c754b in qpid::sys::Timer::fire(boost::intrusive_ptr<qpid::sys::TimerTask>) () from /usr/lib/libqpidcommon.so.8
#13 0x041c8faf in qpid::sys::Timer::run() () from /usr/lib/libqpidcommon.so.8
#14 0x040cab42 in ?? () from /usr/lib/libqpidcommon.so.8
#15 0x00caea09 in start_thread () from /lib/libpthread.so.0
#16 0x009e300e in clone () from /lib/libc.so.6

Comment 3 Jason Dillaman 2012-11-07 17:59:43 UTC
Issue is not 100% reproducible, but has happened often across numerous machines.

Comment 4 Andrew Stitcher 2012-11-07 18:31:03 UTC
In my testing stopping the broker isn't sufficient to cause this bug. I think that firewalling the connection will probably do it though.

Comment 5 Andrew Stitcher 2012-11-07 19:54:05 UTC
I can't replicate this bug. From code inspection it does appear there are some code paths that can lead to this deadlock, but it's going to be very hard to test that they've been eliminated unless the bug is easily replicable - perhaps a simplified version of the client code that is experiencing this is necessary.

Comment 6 mick 2012-11-13 08:47:04 UTC
10 cycles of SIGSTOP / SIGCONT to broker produce no deadlock in client.
Next, I will try firewall sabotage.

I am on an 8-core box.

here is client command script:



#! /bin/bash

QPID_RECV=/home/mick/trunk/qpid/qpid/cpp/src/tests/qpid-receive


${QPID_RECV}                    \
  --messages 1                  \
  --forever                     \
  --address 'q;{create:always}' \
  --broker  localhost:5813      \
  --connection-options '{reconnect:true,heartbeat:3}'




And here is broker script:

#! /bin/bash

export LD_LIBRARY_PATH=$TRUNK/qpid/cpp/src/.libs

QPIDD_SRC=$TRUNK/qpid/cpp/src

QPIDD=$QPIDD_SRC/.libs/qpidd

echo $QPIDD

rm -rf /tmp/mick
mkdir /tmp/mick

port=5813

$QPIDD                                      \
  --no-module-dir                           \
  -p ${port}                                \
  --data-dir /tmp/mick/data_${port}         \
  --auth=no                                 \
  --mgmt-enable=yes                         \
  --log-enable info+                        \
  --log-to-file /tmp/mick/qpidd_${port}.log \
  -d


echo "============================================"
echo "started broker from $QPIDD"
echo "============================================"

Comment 7 mick 2012-11-13 09:05:41 UTC
I can't get it to work with firewall rules either.  
( But I am not knowledgeable about fooling with firewalls...)

With above example, and using these two rules:

    iptables -A INPUT -p tcp --dport 5813 -j REJECT
    iptables -A OUTPUT -p tcp --dport 5813 -j REJECT

I cannot get client to react, nor can I get a client stack hanging in qpid::sys::TimerTask::cancel() as in comment 2, above.


I think we need more explicit info on how to reproduce,
with scripts & reproduction frequency.

Comment 8 mick 2012-11-13 15:22:44 UTC
It just occurred to me to simply try killing the broker, while the client is connected.   i.e. kill -9 .

No dice that way either -- no client hangs in 10 trials.
Hmm.

Comment 9 mick 2012-11-13 15:55:57 UTC
I will try same three methods with earlier code checkpoints on this branch. ( 0.18-mrg )

Comment 10 mick 2012-11-13 16:47:41 UTC
This Just In:  don't use a quiescent connection.  Get it under load as much as possible.

Comment 11 mick 2012-11-14 18:04:31 UTC
Still can't reproduce under load.  8-core box, 4 senders, 4 receivers.  total throughput of 100,000 messages per sec.

5 trials, no hangs.    :-(


...now looking to see if I can increase the odds....

Comment 12 Jason Dillaman 2012-11-16 16:52:16 UTC
Created attachment 646451 [details]
Reproducer

The following change should be made to Qpid to help force the race condition:

diff --git a/qpid/cpp/src/qpid/client/ConnectionImpl.cpp b/qpid/cpp/src/qpid/client/ConnectionImpl.cpp
index 0abfbe0..0fa0a73 100644
--- a/qpid/cpp/src/qpid/client/ConnectionImpl.cpp
+++ b/qpid/cpp/src/qpid/client/ConnectionImpl.cpp
@@ -361,6 +361,7 @@ void ConnectionImpl::shutdown() {
     if (!handler.isClosed()) {
         failedConnection();
     }
+    if (!shutdownComplete) sleep(30);
     bool canDelete;
     {
         Mutex::ScopedLock l(lock);

Comment 13 Jason Dillaman 2012-11-16 18:24:01 UTC
Created attachment 646510 [details]
Quick patch to prevent deadlock

Comment 14 Andrew Stitcher 2012-11-19 15:27:51 UTC
Please do not use this patch. I have a fix for this problem, but unless it can be reliably reproduced there is no point in applying fixes to code that we cannot test.

Comment 15 Jason Dillaman 2012-11-19 15:46:29 UTC
Reliable reproducer is attached.
(In reply to comment #14)
> Please do not use this patch. I have a fix for this problem, but unless it
> can be reliably reproduced there is no point in applying fixes to code that
> we cannot test.

Comment 16 Andrew Stitcher 2012-11-20 00:24:00 UTC
Created attachment 648218 [details]
Proposed fix

Comment 17 Andrew Stitcher 2012-11-20 00:26:27 UTC
Jason, can you try the patch I just attached and see if it fixes the issue for you - this is the approach that I think is the appropriate way to fix the issue.

The fix seems to fix the problem for me using your reproducer with the change to ConnectionImpl.cpp to widen the race window.

Comment 22 Andrew Stitcher 2012-11-20 18:36:09 UTC
There is now a fix for this issue on the trunk branch r1411750. This fix will go into the upstream 0.22 release.


Note You need to log in before you can comment on or make changes to this bug.