Red Hat Bugzilla – Bug 1588015
Qpidd deadlocks due to linearstore inactivity timer
Last modified: 2018-08-22 16:07:48 EDT
Created attachment 1448327 [details]
full backtrace

Description of problem:
Qpid in production hangs and stops responding. After manually dumping core, a backtrace of all threads shows that almost all of them (#5-#26) are waiting on the same lock (0x25638b0), while Thread #4 seems to be holding it:

#15 0x00007f38119ca2cf in qpid::management::ManagementObject::ManageablePtr::ManagementMethod (this=this@entry=0x25638a8, methodId=methodId@entry=9, args=..., text="") at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/management/ManagementObject.cpp:265
        l = {mutex = @0x25638b0}

This particular thread is trying to delete a queue and is waiting in

#3  qpid::sys::TimerTask::cancel (this=0x49c88990) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/Timer.cpp:94
        l = {mutex = @0x49c889c0}

while holding this lock (taken in frame #4):

#4  0x00007f380c44b2f0 in qpid::linearstore::InactivityFireEvent::cancel (this=0x49c88990) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/linearstore/JournalImpl.cpp:85
        sl = {mutex = @0x49c88a30}

It looks like TimerTask::cancel is waiting for the task 0x49c88990, which Thread #1 is just firing:

Thread 1 (Thread 0x7f380bbf0700 (LWP 29952)):
#0  0x00007f3810dd24cd in __lll_lock_wait () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007f3810dcdde6 in _L_lock_870 () from /lib64/libpthread.so.0
No symbol table info available.
#2  0x00007f3810dcdcdf in pthread_mutex_lock () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007f380c44b013 in qpid::sys::Mutex::lock (this=this@entry=0x49c88a30) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/posix/Mutex.h:116
        e = <optimized out>
#4  0x00007f380c44bd58 in ScopedLock (l=..., this=<synthetic pointer>) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/Mutex.h:33
No locals.
#5  qpid::linearstore::InactivityFireEvent::fire (this=0x49c88990) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/linearstore/JournalImpl.cpp:70
        sl = {mutex = @0x49c88a30}
#6  0x00007f38119d55a9 in qpid::sys::Timer::fire (this=<optimized out>, t=...) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/Timer.cpp:222
        __PRETTY_FUNCTION__ = "virtual void qpid::sys::Timer::fire(boost::intrusive_ptr<qpid::sys::TimerTask>)"
#7  0x00007f38119d66da in qpid::sys::Timer::run (this=0x256a310) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/Timer.cpp:156
        u = {mutex = @0x256a318}
        end = {timepoint = 151158672631781}
        overrun = <optimized out>
        warningsEnabled = <optimized out>
        s = {tt = @0x49c88990}
        t = {px = 0x49c88990}
        start = {timepoint = 151158682729264}
        delay = {nanosecs = 3372811}
        l = {mutex = @0x256a318}
        __PRETTY_FUNCTION__ = "virtual void qpid::sys::Timer::run()"
...

This could possibly be caused by an incorrect state machine in InactivityFireEvent, such that firing and cancelling the task concurrently can get stuck, with a consequent deadlock of other threads.

Version-Release number of selected component (if applicable):
Satellite 6.3
qpid-cpp-server-1.36.0-18.el7.x86_64

How reproducible:
Unknown yet

Additional info:
Possibly could be fixed with https://issues.apache.org/jira/browse/QPID-7975
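To make the circular wait in the backtraces easier to see, below is a minimal, self-contained C++ model of the suspected lock inversion. It is illustrative only: std::mutex and the function names are simplified stand-ins, not the actual qpid classes; the point is just that the two paths acquire the same pair of locks in opposite order.

// Minimal model of the inversion (illustrative; std::mutex stands in for
// qpid::sys::Mutex, and the names below are simplified, not qpid code).
#include <mutex>
#include <thread>

std::mutex callbackLock;  // held by the timer framework around fire()
std::mutex stateLock;     // plays the role of _ifeStateLock

void timer_thread() {                                  // ~ Thread #1 above
    std::lock_guard<std::mutex> cb(callbackLock);      // Timer holds this while firing the task
    std::lock_guard<std::mutex> st(stateLock);         // InactivityFireEvent::fire wants this -> blocks
    // ... flush journal ...
}

void queue_delete_thread() {                           // ~ Thread #4 above
    std::lock_guard<std::mutex> st(stateLock);         // InactivityFireEvent::cancel takes this first
    std::lock_guard<std::mutex> cb(callbackLock);      // TimerTask::cancel waits for fire() to return -> blocks
    // ... _state = CANCELLED ...
}

int main() {
    std::thread t1(timer_thread);
    std::thread t2(queue_delete_thread);
    t1.join();   // with unlucky timing, both threads block here forever
    t2.join();
}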
Inspection of the attached dump seems to show that there is indeed a thread deadlock occurring. A call to JournalImpl::fire() from the inactivity timer occurring (almost) simultaneously with a call to JournalImpl::cancel() on another thread is deadlocking.
(In reply to Kim van der Riet from comment #3)
> Inspection of the attached dump seems to show that there is indeed a thread
> deadlock occurring. A call to JournalImpl::fire() from the inactivity timer
> occurring (almost) simultaneously with a call to JournalImpl::cancel() on
> another thread is deadlocking.

Sorry, should be InactivityFireEvent::fire() and InactivityFireEvent::cancel() respectively.
(In reply to Kim van der Riet from comment #3)
> Inspection of the attached dump seems to show that there is indeed a thread
> deadlock occurring. A call to InactivityFireEvent::fire() from the
> inactivity timer occurring (almost) simultaneously with a call to
> InactivityFireEvent::cancel() on another thread is deadlocking.

Thanks for the analysis. The deadlock really does happen on 1.36.0-18, which already contains the QPID-7975 fix, so there must be another bug (or potentially that JIRA introduced a regression? I don't expect so, but it is a valid theory).

Any idea how to reproduce it? I was trying:
- setting journal-flush-timeout=20ms or similar low values, to fire the InactivityFireEvent timer more often
- creating and deleting many journal files concurrently (to cancel such a timer)
- sending/receiving durable messages to durable queues (would that cancel the timer as well?)

But so far, no deadlock. Something else to try?
Reproducer outside Satellite:

0) rm -rf /var/lib/qpidd/.qpidd   # just to start with an absolutely clean table

1) Set journal-flush-timeout=50ms in qpidd.conf - the lower the better. With this value I always reproduce the deadlock within a minute or two, but it took ~10 minutes with the default 500ms (I think).

2) Run this script, which sporadically sends and receives a durable message to a queue that is deleted from time to time - and does so concurrently for 50 queues:

for i in $(seq 1 50); do
  while true; do
    echo "$(date): loop $i"
    qpid-send -a "DurableTest-${i}; {create:always, node:{type:queue, durable:true}}" -m 1 --content-size=1100 --durable=yes
    qpid-receive -a DurableTest-${i} --print-content=no 2> /dev/null
    sleep 0.2
  done &
done

for i in $(seq 1 50); do
  while true; do
    echo "$(date): deleting queue $i"
    qpid-config del queue DurableTest-${i} --force
    sleep $((RANDOM%5)).$((RANDOM%10))
  done &
done

3) Check whether qpidd still responds, e.g. via:

while true; do date; qpid-stat -g; sleep 5; done

4) Within a minute or so, the deadlock comes to you and your broker.
Kim, when thinking about a workaround: from some experiments, I feel the time the broker survives scales roughly linearly with the journal-flush-timeout value. What is the performance/throughput penalty of increasing it? Can it delay (durable) message delivery by up to this value in the worst case?
I cannot reproduce using this reproducer on my hardware.

*But* if I add a sleep of 10ms into InactivityFireEvent::cancel() just after taking the lock, then the deadlock occurs quite quickly for me - after a minute or so of running Pavel's reproducer.

diff --git a/src/qpid/linearstore/JournalImpl.cpp b/src/qpid/linearstore/JournalImpl.cpp
index fe919fbd3..e512e9a05 100644
--- a/src/qpid/linearstore/JournalImpl.cpp
+++ b/src/qpid/linearstore/JournalImpl.cpp
@@ -82,6 +82,7 @@ void InactivityFireEvent::fire() {
 
 void InactivityFireEvent::cancel() {
     ::qpid::sys::Mutex::ScopedLock sl(_ifeStateLock);
+    ::usleep(10 * 1000); // <-- *** DEBUG ***
     ::qpid::sys::TimerTask::cancel();
     _state = CANCELLED;
 }

After making the following fix, but leaving the sleep in place:

diff --git a/src/qpid/linearstore/JournalImpl.cpp b/src/qpid/linearstore/JournalImpl.cpp
index fe919fbd3..55f33bb80 100644
--- a/src/qpid/linearstore/JournalImpl.cpp
+++ b/src/qpid/linearstore/JournalImpl.cpp
@@ -81,9 +81,12 @@ void InactivityFireEvent::fire() {
 }
 
 void InactivityFireEvent::cancel() {
-    ::qpid::sys::Mutex::ScopedLock sl(_ifeStateLock);
     ::qpid::sys::TimerTask::cancel();
-    _state = CANCELLED;
+    {
+        ::qpid::sys::Mutex::ScopedLock sl(_ifeStateLock);
+        ::usleep(10 * 1000); // <-- *** DEBUG ***
+        _state = CANCELLED;
+    }
 }
 
 GetEventsFireEvent::GetEventsFireEvent(JournalImpl* p,

I cannot get the reproducer to fail. I am hopeful that this will be a valid fix. I will continue to test for a while.

In essence, the fix allows the underlying timer to *first* cancel (which includes waiting for any flush fires that may have occurred on another thread) before taking the lock to change the local state to CANCELLED.

To be sure we are not introducing a regression, we should probably also check the reproducers for QPID-7975.
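Read against the simplified two-lock model sketched under the original description, the effect of the fix is that both code paths now acquire the locks in the same order (callback lock before state lock), which removes the circular wait. A hedged sketch of the corrected ordering, again using stand-in names rather than the real qpid types:

// Same stand-in locks (callbackLock, stateLock) as in the earlier model.
void queue_delete_thread_fixed() {
    {
        // ~ ::qpid::sys::TimerTask::cancel(): wait for any in-flight fire(),
        // which holds the callback lock, to finish before proceeding.
        std::lock_guard<std::mutex> cb(callbackLock);
    }
    // Only now take the state lock; no thread can be holding the callback
    // lock while waiting on us, so the circular wait is gone.
    std::lock_guard<std::mutex> st(stateLock);
    // ... _state = CANCELLED ...
}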
(In reply to Pavel Moravec from comment #7)
> Kim, when thinking about a workaround: from some experiments, I feel the
> time the broker survives scales roughly linearly with the
> journal-flush-timeout value. What is the performance/throughput penalty of
> increasing it? Can it delay (durable) message delivery by up to this value
> in the worst case?

This will be a throughput vs. latency tradeoff. The lower the flush timer, the lower the latency will be, but at the cost of throughput. Higher values of the timer have the opposite effect. The default has been chosen as a compromise between these two concerns: if a user has throughput as the top priority, they can increase the timer value, accepting that some messages may see a higher latency; low-latency users may do the opposite.

I hope that answers the question, but I don't have hard numbers for you. We would need to establish performance tests on standardized hardware to answer that.
(In reply to Kim van der Riet from comment #8)
> I cannot reproduce using this reproducer on my hardware.
>
> *But* if I add a sleep of 10ms into InactivityFireEvent::cancel() just after
> taking the lock, then the deadlock occurs quite quickly for me - after a
> minute or so of running Pavel's reproducer.
>
> diff --git a/src/qpid/linearstore/JournalImpl.cpp
> b/src/qpid/linearstore/JournalImpl.cpp
> index fe919fbd3..e512e9a05 100644
> --- a/src/qpid/linearstore/JournalImpl.cpp
> +++ b/src/qpid/linearstore/JournalImpl.cpp
> @@ -82,6 +82,7 @@ void InactivityFireEvent::fire() {
>
>  void InactivityFireEvent::cancel() {
>      ::qpid::sys::Mutex::ScopedLock sl(_ifeStateLock);
> +    ::usleep(10 * 1000); // <-- *** DEBUG ***
>      ::qpid::sys::TimerTask::cancel();
>      _state = CANCELLED;
>  }
>
> After making the following fix, but leaving the sleep in place:
>
> diff --git a/src/qpid/linearstore/JournalImpl.cpp
> b/src/qpid/linearstore/JournalImpl.cpp
> index fe919fbd3..55f33bb80 100644
> --- a/src/qpid/linearstore/JournalImpl.cpp
> +++ b/src/qpid/linearstore/JournalImpl.cpp
> @@ -81,9 +81,12 @@ void InactivityFireEvent::fire() {
>  }
>
>  void InactivityFireEvent::cancel() {
> -    ::qpid::sys::Mutex::ScopedLock sl(_ifeStateLock);
>      ::qpid::sys::TimerTask::cancel();
> -    _state = CANCELLED;
> +    {
> +        ::qpid::sys::Mutex::ScopedLock sl(_ifeStateLock);
> +        ::usleep(10 * 1000); // <-- *** DEBUG ***
> +        _state = CANCELLED;
> +    }
>  }
>
>  GetEventsFireEvent::GetEventsFireEvent(JournalImpl* p,
>
> I cannot get the reproducer to fail. I am hopeful that this will be a valid
> fix. I will continue to test for a while.
>
> In essence, the fix allows the underlying timer to *first* cancel (which
> includes waiting for any flush fires that may have occurred on another
> thread) before taking the lock to change the local state to CANCELLED.
>
> To be sure we are not introducing a regression, we should probably also
> check the reproducers for QPID-7975.

Thanks for such a prompt patch. I checked that:

(*) 1.36.0-9 does not lead to the deadlock (though I have run the test only for a short time so far) - so I think bz1502855/QPID-7975 introduced this deadlock as a regression.

(*) 1.36.0-18 locally built with this patch does not deadlock, so the patch works for me as well.

I would vote for this to be added to Sat 6.3.z since:
- the deadlock has a serious impact on Satellite
- it is nontrivial to identify it from the symptoms a user notices (like "pulp repo sync stuck")
- given its probabilistic nature and the number of Satellite deployments, it is just a matter of time before (many) more customers hit it
Fix checked in on master; see upstream https://issues.apache.org/jira/browse/QPID-8206 for commit details.
I tested this on sat6.3.3-2:

# rpm -qa qpid-cpp-server*
qpid-cpp-server-linearstore-1.36.0-19.el7.x86_64
qpid-cpp-server-1.36.0-19.el7.x86_64

- Tested the reproducer from Comment #6 and did not experience any deadlocks.
- No regressions detected in our automation.

Thus, moving this to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2550