Bug 874192
Summary: | Deadlock in Qpid client if connection heartbeat is missed | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Jason Dillaman <jdillama> | ||||||||||
Component: | qpid-cpp | Assignee: | Andrew Stitcher <astitcher> | ||||||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | MRG Quality Engineering <mrgqe-bugs> | ||||||||||
Severity: | high | Docs Contact: | |||||||||||
Priority: | high | ||||||||||||
Version: | Development | CC: | astitcher, esammons, freznice, jross, lzhaldyb, mcressma | ||||||||||
Target Milestone: | 2.3 | Keywords: | OtherQA, Patch | ||||||||||
Target Release: | --- | ||||||||||||
Hardware: | Unspecified | ||||||||||||
OS: | Unspecified | ||||||||||||
Whiteboard: | |||||||||||||
Fixed In Version: | qpid-cpp-0.18-10 | Doc Type: | Bug Fix | ||||||||||
Doc Text: |
Cause:
In just the wrong circumstances in the C++ client code, if a heartbeat is missed whilst still trying before a connection is actually connected, the Timer thread that processes the heartbeat failure can deadlock with the a thread that is concurrently processing the same connection.
Consequence:
The client deadlocks whilst trying to close a connection that has been timed out due to heartbeat failure whilst still connecting.
Fix:
The code in the Timer thread was changed to defer the actual processing to another thread later so it could not conflict.
Result:
The deadlock no longer occurs.
|
Story Points: | --- | ||||||||||
Clone Of: | Environment: | ||||||||||||
Last Closed: | 2013-03-19 16:38:31 UTC | Type: | Bug | ||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||
Documentation: | --- | CRM: | |||||||||||
Verified Versions: | Category: | --- | |||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
Embargoed: | |||||||||||||
Bug Depends On: | |||||||||||||
Bug Blocks: | 698367 | ||||||||||||
Attachments: |
|
Description
Jason Dillaman
2012-11-07 16:46:08 UTC
Created attachment 640226 [details]
pstack dump from deadlocked thread
pstack: Thread 7 (Thread 0xb35fdb70 (LWP 14837)): #0 0x00f23424 in __kernel_vsyscall () #1 0x00cb227c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0 #2 0x009f2c3d in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libc.so.6 #3 0x041c88fb in qpid::sys::TimerTask::cancel() () from /usr/lib/libqpidcommon.so.8 #4 0x006917a1 in qpid::client::ConnectionImpl::~ConnectionImpl() () from /usr/lib/libqpidclient.so.8 #5 0x00691b93 in qpid::client::ConnectionImpl::~ConnectionImpl() () from /usr/lib/libqpidclient.so.8 #6 0x0068fbb7 in qpid::client::ConnectionImpl::shutdown() () from /usr/lib/libqpidclient.so.8 #7 0x006c4fda in qpid::client::TCPConnector::connectFailed(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib/libqpidclient.so.8 #8 0x006c4800 in qpid::client::TCPConnector::abort() () from /usr/lib/libqpidclient.so.8 #9 0x0068ea48 in qpid::client::ConnectionImpl::idleIn() () from /usr/lib/libqpidclient.so.8 #10 0x0068efac in ?? () from /usr/lib/libqpidclient.so.8 #11 0x041c7132 in qpid::sys::TimerTask::fireTask() () from /usr/lib/libqpidcommon.so.8 #12 0x041c754b in qpid::sys::Timer::fire(boost::intrusive_ptr<qpid::sys::TimerTask>) () from /usr/lib/libqpidcommon.so.8 #13 0x041c8faf in qpid::sys::Timer::run() () from /usr/lib/libqpidcommon.so.8 #14 0x040cab42 in ?? () from /usr/lib/libqpidcommon.so.8 #15 0x00caea09 in start_thread () from /lib/libpthread.so.0 #16 0x009e300e in clone () from /lib/libc.so.6 Issue is not 100% reproducible, but has happened often across numerous machines. In my testing stopping the broker isn't sufficient to cause this bug. I think that firewalling the connection will probably do it though. I can't replicate this bug. From code inspection it does appear there are some code paths that can lead to this deadlock, but it's going to be very hard to test that they've been eliminated unless the bug is easily replicable - perhaps a simplified version of the client code that is experiencing this is necessary. 10 cycles of SIGSTOP / SIGCONT to broker produce no deadlock in client. Next, I will try firewall sabotage. I am on an 8-core box. here is client command script: #! /bin/bash QPID_RECV=/home/mick/trunk/qpid/qpid/cpp/src/tests/qpid-receive ${QPID_RECV} \ --messages 1 \ --forever \ --address 'q;{create:always}' \ --broker localhost:5813 \ --connection-options '{reconnect:true,heartbeat:3}' And here is broker script: #! /bin/bash export LD_LIBRARY_PATH=$TRUNK/qpid/cpp/src/.libs QPIDD_SRC=$TRUNK/qpid/cpp/src QPIDD=$QPIDD_SRC/.libs/qpidd echo $QPIDD rm -rf /tmp/mick mkdir /tmp/mick port=5813 $QPIDD \ --no-module-dir \ -p ${port} \ --data-dir /tmp/mick/data_${port} \ --auth=no \ --mgmt-enable=yes \ --log-enable info+ \ --log-to-file /tmp/mick/qpidd_${port}.log \ -d echo "============================================" echo "started broker from $QPIDD" echo "============================================" I can't get it to work with firewall rules either. ( But I am not knowledgeable about fooling with firewalls...) With above example, and using these two rules: iptables -A INPUT -p tcp --dport 5813 -j REJECT iptables -A OUTPUT -p tcp --dport 5813 -j REJECT I cannot get client to react, nor can I get a client stack hanging in qpid::sys::TimerTask::cancel() as in comment 2, above. I think we need more explicit info on how to reproduce, with scripts & reproduction frequency. It just occurred to me to simply try killing the broker, while the client is connected. i.e. kill -9 . No dice that way either -- no client hangs in 10 trials. Hmm. I will try same three methods with earlier code checkpoints on this branch. ( 0.18-mrg ) This Just In: don't use a quiescent connection. Get it under load as much as possible. Still can't reproduce under load. 8-core box, 4 senders, 4 receivers. total throughput of 100,000 messages per sec. 5 trials, no hangs. :-( ...now looking to see if I can increase the odds.... Created attachment 646451 [details]
Reproducer
The following change should be made to Qpid to help force the race condition:
diff --git a/qpid/cpp/src/qpid/client/ConnectionImpl.cpp b/qpid/cpp/src/qpid/client/ConnectionImpl.cpp
index 0abfbe0..0fa0a73 100644
--- a/qpid/cpp/src/qpid/client/ConnectionImpl.cpp
+++ b/qpid/cpp/src/qpid/client/ConnectionImpl.cpp
@@ -361,6 +361,7 @@ void ConnectionImpl::shutdown() {
if (!handler.isClosed()) {
failedConnection();
}
+ if (!shutdownComplete) sleep(30);
bool canDelete;
{
Mutex::ScopedLock l(lock);
Created attachment 646510 [details]
Quick patch to prevent deadlock
Please do not use this patch. I have a fix for this problem, but unless it can be reliably reproduced there is no point in applying fixes to code that we cannot test. Reliable reproducer is attached. (In reply to comment #14) > Please do not use this patch. I have a fix for this problem, but unless it > can be reliably reproduced there is no point in applying fixes to code that > we cannot test. Created attachment 648218 [details]
Proposed fix
Jason, can you try the patch I just attached and see if it fixes the issue for you - this is the approach that I think is the appropriate way to fix the issue. The fix seems to fix the problem for me using your reproducer with the change to ConnectionImpl.cpp to widen the race window. There is now a fix for this issue on the trunk branch r1411750. This fix will go into the upstream 0.22 release. |