Bug 874192

Summary: Deadlock in Qpid client if connection heartbeat is missed
Product: Red Hat Enterprise MRG Reporter: Jason Dillaman <jdillama>
Component: qpid-cppAssignee: Andrew Stitcher <astitcher>
Status: CLOSED CURRENTRELEASE QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: high Docs Contact:
Priority: high    
Version: DevelopmentCC: astitcher, esammons, freznice, jross, lzhaldyb, mcressma
Target Milestone: 2.3Keywords: OtherQA, Patch
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: qpid-cpp-0.18-10 Doc Type: Bug Fix
Doc Text:
Cause: In just the wrong circumstances in the C++ client code, if a heartbeat is missed whilst still trying before a connection is actually connected, the Timer thread that processes the heartbeat failure can deadlock with the a thread that is concurrently processing the same connection. Consequence: The client deadlocks whilst trying to close a connection that has been timed out due to heartbeat failure whilst still connecting. Fix: The code in the Timer thread was changed to defer the actual processing to another thread later so it could not conflict. Result: The deadlock no longer occurs.
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-03-19 16:38:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 698367    
Attachments:
Description Flags
pstack dump from deadlocked thread
none
Reproducer
none
Quick patch to prevent deadlock
none
Proposed fix none

Description Jason Dillaman 2012-11-07 16:46:08 UTC
Description of problem:
If heartbeats are enabled and the client's heartbeat timeout task fires, there will be a deadlock between the timeout task invocation and the destructor for the connection which is attempting to cancel the task.

Version-Release number of selected component (if applicable):
Qpid 0.18

How reproducible:
100%

Steps to Reproduce:
1. Connect a client to a broker w/ heartbeats enabled
2. Interrupt the network connection (or pause the broker process) so that the client misses its heartbeats
  
Actual results:
Client deadlocks attempting to close the connection

Expected results:
Client closes the connection

Additional info:

Comment 1 Jason Dillaman 2012-11-07 17:22:11 UTC
Created attachment 640226 [details]
pstack dump from deadlocked thread

Comment 2 Jason Dillaman 2012-11-07 17:42:24 UTC
pstack:


 Thread 7 (Thread 0xb35fdb70 (LWP 14837)):
#0  0x00f23424 in __kernel_vsyscall ()
#1  0x00cb227c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x009f2c3d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libc.so.6
#3  0x041c88fb in qpid::sys::TimerTask::cancel() () from /usr/lib/libqpidcommon.so.8
#4  0x006917a1 in qpid::client::ConnectionImpl::~ConnectionImpl() () from /usr/lib/libqpidclient.so.8
#5  0x00691b93 in qpid::client::ConnectionImpl::~ConnectionImpl() () from /usr/lib/libqpidclient.so.8
#6  0x0068fbb7 in qpid::client::ConnectionImpl::shutdown() () from /usr/lib/libqpidclient.so.8
#7  0x006c4fda in qpid::client::TCPConnector::connectFailed(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib/libqpidclient.so.8
#8  0x006c4800 in qpid::client::TCPConnector::abort() () from /usr/lib/libqpidclient.so.8
#9  0x0068ea48 in qpid::client::ConnectionImpl::idleIn() () from /usr/lib/libqpidclient.so.8
#10 0x0068efac in ?? () from /usr/lib/libqpidclient.so.8
#11 0x041c7132 in qpid::sys::TimerTask::fireTask() () from /usr/lib/libqpidcommon.so.8
#12 0x041c754b in qpid::sys::Timer::fire(boost::intrusive_ptr<qpid::sys::TimerTask>) () from /usr/lib/libqpidcommon.so.8
#13 0x041c8faf in qpid::sys::Timer::run() () from /usr/lib/libqpidcommon.so.8
#14 0x040cab42 in ?? () from /usr/lib/libqpidcommon.so.8
#15 0x00caea09 in start_thread () from /lib/libpthread.so.0
#16 0x009e300e in clone () from /lib/libc.so.6

Comment 3 Jason Dillaman 2012-11-07 17:59:43 UTC
Issue is not 100% reproducible, but has happened often across numerous machines.

Comment 4 Andrew Stitcher 2012-11-07 18:31:03 UTC
In my testing stopping the broker isn't sufficient to cause this bug. I think that firewalling the connection will probably do it though.

Comment 5 Andrew Stitcher 2012-11-07 19:54:05 UTC
I can't replicate this bug. From code inspection it does appear there are some code paths that can lead to this deadlock, but it's going to be very hard to test that they've been eliminated unless the bug is easily replicable - perhaps a simplified version of the client code that is experiencing this is necessary.

Comment 6 mick 2012-11-13 08:47:04 UTC
10 cycles of SIGSTOP / SIGCONT to broker produce no deadlock in client.
Next, I will try firewall sabotage.

I am on an 8-core box.

here is client command script:



#! /bin/bash

QPID_RECV=/home/mick/trunk/qpid/qpid/cpp/src/tests/qpid-receive


${QPID_RECV}                    \
  --messages 1                  \
  --forever                     \
  --address 'q;{create:always}' \
  --broker  localhost:5813      \
  --connection-options '{reconnect:true,heartbeat:3}'




And here is broker script:

#! /bin/bash

export LD_LIBRARY_PATH=$TRUNK/qpid/cpp/src/.libs

QPIDD_SRC=$TRUNK/qpid/cpp/src

QPIDD=$QPIDD_SRC/.libs/qpidd

echo $QPIDD

rm -rf /tmp/mick
mkdir /tmp/mick

port=5813

$QPIDD                                      \
  --no-module-dir                           \
  -p ${port}                                \
  --data-dir /tmp/mick/data_${port}         \
  --auth=no                                 \
  --mgmt-enable=yes                         \
  --log-enable info+                        \
  --log-to-file /tmp/mick/qpidd_${port}.log \
  -d


echo "============================================"
echo "started broker from $QPIDD"
echo "============================================"

Comment 7 mick 2012-11-13 09:05:41 UTC
I can't get it to work with firewall rules either.  
( But I am not knowledgeable about fooling with firewalls...)

With above example, and using these two rules:

    iptables -A INPUT -p tcp --dport 5813 -j REJECT
    iptables -A OUTPUT -p tcp --dport 5813 -j REJECT

I cannot get client to react, nor can I get a client stack hanging in qpid::sys::TimerTask::cancel() as in comment 2, above.


I think we need more explicit info on how to reproduce,
with scripts & reproduction frequency.

Comment 8 mick 2012-11-13 15:22:44 UTC
It just occurred to me to simply try killing the broker, while the client is connected.   i.e. kill -9 .

No dice that way either -- no client hangs in 10 trials.
Hmm.

Comment 9 mick 2012-11-13 15:55:57 UTC
I will try same three methods with earlier code checkpoints on this branch. ( 0.18-mrg )

Comment 10 mick 2012-11-13 16:47:41 UTC
This Just In:  don't use a quiescent connection.  Get it under load as much as possible.

Comment 11 mick 2012-11-14 18:04:31 UTC
Still can't reproduce under load.  8-core box, 4 senders, 4 receivers.  total throughput of 100,000 messages per sec.

5 trials, no hangs.    :-(


...now looking to see if I can increase the odds....

Comment 12 Jason Dillaman 2012-11-16 16:52:16 UTC
Created attachment 646451 [details]
Reproducer

The following change should be made to Qpid to help force the race condition:

diff --git a/qpid/cpp/src/qpid/client/ConnectionImpl.cpp b/qpid/cpp/src/qpid/client/ConnectionImpl.cpp
index 0abfbe0..0fa0a73 100644
--- a/qpid/cpp/src/qpid/client/ConnectionImpl.cpp
+++ b/qpid/cpp/src/qpid/client/ConnectionImpl.cpp
@@ -361,6 +361,7 @@ void ConnectionImpl::shutdown() {
     if (!handler.isClosed()) {
         failedConnection();
     }
+    if (!shutdownComplete) sleep(30);
     bool canDelete;
     {
         Mutex::ScopedLock l(lock);

Comment 13 Jason Dillaman 2012-11-16 18:24:01 UTC
Created attachment 646510 [details]
Quick patch to prevent deadlock

Comment 14 Andrew Stitcher 2012-11-19 15:27:51 UTC
Please do not use this patch. I have a fix for this problem, but unless it can be reliably reproduced there is no point in applying fixes to code that we cannot test.

Comment 15 Jason Dillaman 2012-11-19 15:46:29 UTC
Reliable reproducer is attached.
(In reply to comment #14)
> Please do not use this patch. I have a fix for this problem, but unless it
> can be reliably reproduced there is no point in applying fixes to code that
> we cannot test.

Comment 16 Andrew Stitcher 2012-11-20 00:24:00 UTC
Created attachment 648218 [details]
Proposed fix

Comment 17 Andrew Stitcher 2012-11-20 00:26:27 UTC
Jason, can you try the patch I just attached and see if it fixes the issue for you - this is the approach that I think is the appropriate way to fix the issue.

The fix seems to fix the problem for me using your reproducer with the change to ConnectionImpl.cpp to widen the race window.

Comment 22 Andrew Stitcher 2012-11-20 18:36:09 UTC
There is now a fix for this issue on the trunk branch r1411750. This fix will go into the upstream 0.22 release.