Bug 1588015 - Qpidd deadlocks due to linearstore inactivity timer
Summary: Qpidd deadlocks due to linearstore inactivity timer
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Qpid
Version: 6.3.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: Unspecified
Assignee: Kim van der Riet
QA Contact: Roman Plevka
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-06-06 13:24 UTC by Barbora Vassova
Modified: 2019-06-13 21:26 UTC
CC List: 11 users

Fixed In Version: qpid-cpp-1.36.0-19
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1596361
Environment:
Last Closed: 2018-08-22 20:07:12 UTC
Target Upstream Version:


Attachments
full backtrace (attachment 1448327)


Links
Apache JIRA QPID-8206 - 2018-06-07 17:10:45 UTC
Red Hat Bugzilla 1502855 (urgent, CLOSED) - sending durable messages concurrently from 0-10 and 1.0 producer causes high latency on ACKing the 1.0 ones - 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHBA-2018:2550 - 2018-08-22 20:07:47 UTC

Internal Links: 1502855

Description Barbora Vassova 2018-06-06 13:24:38 UTC
Created attachment 1448327 [details]
full backtrace

Description of problem:
Qpid in production hangs and stops responding. After manually dumping a core, a backtrace of all threads shows that almost all of them (threads #5-#26) are waiting on the same lock (0x25638b0), while thread #4 appears to be holding it:

#15 0x00007f38119ca2cf in qpid::management::ManagementObject::ManageablePtr::ManagementMethod (this=this@entry=0x25638a8, methodId=methodId@entry=9, args=..., text="")
    at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/management/ManagementObject.cpp:265
        l = {mutex = @0x25638b0}

This particular thread is trying to delete a queue and is stuck waiting in

#3  qpid::sys::TimerTask::cancel (this=0x49c88990) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/Timer.cpp:94
        l = {mutex = @0x49c889c0}

and owns this lock: 

#4  0x00007f380c44b2f0 in qpid::linearstore::InactivityFireEvent::cancel (this=0x49c88990) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/linearstore/JournalImpl.cpp:85
        sl = {mutex = @0x49c88a30}

It looks like TimerTask::cancel is waiting for the timer task 0x49c88990 to finish firing, which is exactly what Thread #1 is doing:

Thread 1 (Thread 0x7f380bbf0700 (LWP 29952)):
#0  0x00007f3810dd24cd in __lll_lock_wait () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007f3810dcdde6 in _L_lock_870 () from /lib64/libpthread.so.0
No symbol table info available.
#2  0x00007f3810dcdcdf in pthread_mutex_lock () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007f380c44b013 in qpid::sys::Mutex::lock (this=this@entry=0x49c88a30) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/posix/Mutex.h:116
        e = <optimized out>
#4  0x00007f380c44bd58 in ScopedLock (l=..., this=<synthetic pointer>) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/Mutex.h:33
No locals.
#5  qpid::linearstore::InactivityFireEvent::fire (this=0x49c88990) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/linearstore/JournalImpl.cpp:70
        sl = {mutex = @0x49c88a30}
#6  0x00007f38119d55a9 in qpid::sys::Timer::fire (this=<optimized out>, t=...) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/Timer.cpp:222
        __PRETTY_FUNCTION__ = "virtual void qpid::sys::Timer::fire(boost::intrusive_ptr<qpid::sys::TimerTask>)"
#7  0x00007f38119d66da in qpid::sys::Timer::run (this=0x256a310) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/Timer.cpp:156
        u = {mutex = @0x256a318}
        end = {timepoint = 151158672631781}
        overrun = <optimized out>
        warningsEnabled = <optimized out>
        s = {tt = @0x49c88990}
        t = {px = 0x49c88990}
        start = {timepoint = 151158682729264}
        delay = {nanosecs = 3372811}
        l = {mutex = @0x256a318}
        __PRETTY_FUNCTION__ = "virtual void qpid::sys::Timer::run()"
...

This could possibly be caused by an incorrect state machine in InactivityFireEvent: firing and cancelling that task concurrently can get stuck, with a consequent deadlock of the other threads.
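
For illustration, here is a minimal, self-contained model of the suspected interaction (the class and member names below are invented and this is not the actual qpid-cpp code; it only mimics the lock ordering the backtraces suggest): the cancelling thread takes the event's state lock and then waits for the timer task to finish firing, while the timer thread is busy firing and tries to take the same state lock. Running it is expected to hang.

// deadlock_model.cpp - simplified model of the suspected fire()/cancel() race
// (illustrative only; names are invented, this is not the qpid-cpp code)
#include <chrono>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>

struct TimerTaskModel {
    std::mutex taskLock;                 // stands in for the TimerTask-internal lock
    std::condition_variable fireDone;
    bool firing = false;

    void fire(const std::function<void()>& callback) {   // timer thread
        { std::lock_guard<std::mutex> l(taskLock); firing = true; }
        std::this_thread::sleep_for(std::chrono::milliseconds(100)); // widen the race window
        callback();                      // the InactivityFireEvent::fire() analogue runs here
        { std::lock_guard<std::mutex> l(taskLock); firing = false; }
        fireDone.notify_all();
    }

    void cancel() {                      // must wait for a concurrent fire() to finish
        std::unique_lock<std::mutex> l(taskLock);
        fireDone.wait(l, [this] { return !firing; });
    }
};

struct InactivityEventModel {
    std::mutex stateLock;                // stands in for _ifeStateLock
    TimerTaskModel task;

    void fire() {                        // called from the timer thread
        std::lock_guard<std::mutex> l(stateLock);  // blocks: the cancelling thread holds it
        // ... flush the journal ...
    }

    void cancel() {                      // pre-fix ordering: state lock first, then timer
        std::lock_guard<std::mutex> l(stateLock);
        task.cancel();                   // waits forever: fire() cannot finish without stateLock
    }
};

int main() {
    InactivityEventModel ev;
    std::thread timer([&] { ev.task.fire([&] { ev.fire(); }); });
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::thread deleter([&] { ev.cancel(); });   // the queue-deletion path
    deleter.join();                      // never returns: classic lock-order inversion
    timer.join();
    std::cout << "no deadlock" << std::endl;
    return 0;
}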


Version-Release number of selected component (if applicable):
Satellite 6.3
qpid-cpp-server-1.36.0-18.el7.x86_64 

How reproducible:
Not yet known

Additional info:
This could possibly be fixed by https://issues.apache.org/jira/browse/QPID-7975

Comment 3 Kim van der Riet 2018-06-06 17:03:03 UTC
Inspection of the attached dump seems to show that there is indeed a thread deadlock occurring. A call to JournalImpl::fire() from the inactivity timer occurring (almost) simultaneously with a call to JournalImpl::cancel() on another thread is deadlocking.

Comment 4 Kim van der Riet 2018-06-06 17:08:19 UTC
(In reply to Kim van der Riet from comment #3)
> Inspection of the attached dump seems to show that there is indeed a thread
> deadlock occurring. A call to JournalImpl::fire() from the inactivity timer
> occurring (almost) simultaneously with a call to JournalImpl::cancel() on
> another thread is deadlocking.

Sorry, should be InactivityFireEvent::fire() and InactivityFireEvent::cancel() respectively.

Comment 5 Pavel Moravec 2018-06-07 07:17:11 UTC
(In reply to Kim van der Riet from comment #3)
> Inspection of the attached dump seems to show that there is indeed a thread
> deadlock occurring. A call to InactivityFireEvent::fire() from the inactivity timer
> occurring (almost) simultaneously with a call to InactivityFireEvent::cancel() on
> another thread is deadlocking.

Thanks for the analysis. The deadlock really does happen on 1.36.0-18, which already contains the QPID-7975 fix, so there must be another bug (or perhaps that JIRA introduced a regression? I don't expect so, but it is a valid theory).

Any idea how to reproduce it? I was trying:
- setting journal-flush-timeout=20ms or similar low values, to fire the InactivityFireEvent timer more often
- creating and deleting many journal files concurrently (to cancel such a timer)
- sending/receiving durable messages to durable queues (would that cancel the timer as well?)

But so far, no deadlock.

Something else to try?

Comment 6 Pavel Moravec 2018-06-07 12:10:31 UTC
Reproducer outside Satellite:

0) rm -rf /var/lib/qpidd/.qpidd   # just to start with an absolutely clean slate


1) set 

journal-flush-timeout=50ms

in qpidd.conf - the lower the better. I always reproduce the deadlock within a minute or two with this value, but it took ~10 minutes with the default of 500ms (I think).


2) run this script, which sporadically sends and receives a durable message to a queue that is deleted from time to time - and it does so concurrently for 50 queues:

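# 50 background loops, one per queue: each repeatedly sends and then receives a single durable message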
for i in $(seq 1 50); do
	while true; do
		echo "$(date): loop $i"
		qpid-send -a "DurableTest-${i}; {create:always, node:{type:queue, durable:true}}" -m 1 --content-size=1100 --durable=yes
		qpid-receive -a DurableTest-${i} --print-content=no 2> /dev/null
		sleep 0.2
	done &
done

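# 50 more background loops that keep force-deleting the corresponding queues at random intervals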
for i in $(seq 1 50); do
	while true; do
		echo "$(date): deleting queue $i"
		qpid-config del queue DurableTest-${i} --force
		sleep $((RANDOM%5)).$((RANDOM%10))
	done &
done


3) check if qpidd responds, via e.g.:

while true; do date; qpid-stat -g; sleep 5; done

4) within a minute or so, deadlock comes to you and your broker.

Comment 7 Pavel Moravec 2018-06-07 12:22:55 UTC
Kim,
when thinking about a workaround: from some experiments, I feel that the time the broker survives is roughly proportional to the journal-flush-timeout value. What is the performance/throughput penalty of increasing it? Can it delay (durable) message delivery by up to this value in the worst case?

Comment 8 Kim van der Riet 2018-06-07 16:44:07 UTC
I cannot reproduce using this reproducer on my hardware.

*But* if I add a sleep of 10ms into InactivityFireEvent::cancel() just after taking the lock, then the deadlock occurs quite quickly for me - after a minute or so of running Pavel's reproducer.

diff --git a/src/qpid/linearstore/JournalImpl.cpp b/src/qpid/linearstore/JournalImpl.cpp
index fe919fbd3..e512e9a05 100644
--- a/src/qpid/linearstore/JournalImpl.cpp
+++ b/src/qpid/linearstore/JournalImpl.cpp
@@ -82,6 +82,7 @@ void InactivityFireEvent::fire() {

 void InactivityFireEvent::cancel() {
     ::qpid::sys::Mutex::ScopedLock sl(_ifeStateLock);
+    ::usleep(10 * 1000); // <-- *** DEBUG ***
     ::qpid::sys::TimerTask::cancel();
     _state = CANCELLED;
 }


After making the following fix, but leaving the sleep in place:

diff --git a/src/qpid/linearstore/JournalImpl.cpp b/src/qpid/linearstore/JournalImpl.cpp
index fe919fbd3..55f33bb80 100644
--- a/src/qpid/linearstore/JournalImpl.cpp
+++ b/src/qpid/linearstore/JournalImpl.cpp
@@ -81,9 +81,12 @@ void InactivityFireEvent::fire() {
 }

 void InactivityFireEvent::cancel() {
-    ::qpid::sys::Mutex::ScopedLock sl(_ifeStateLock);
     ::qpid::sys::TimerTask::cancel();
-    _state = CANCELLED;
+    {
+        ::qpid::sys::Mutex::ScopedLock sl(_ifeStateLock);
+        ::usleep(10 * 1000); // <-- *** DEBUG ***
+        _state = CANCELLED;
+    }
 }

 GetEventsFireEvent::GetEventsFireEvent(JournalImpl* p,

I cannot get the reproducer to fail. I am hopeful that this will make a valid fix. I will continue to test for a while.

In essence, the fix allows the underlying timer to *first* cancel (which includes waiting for any flush fires that may have occurred on another thread) before taking the lock to change the local state to CANCELLED.
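
For clarity, cancel() with the reordering applied and the DEBUG sleep removed would look roughly like this (a sketch of the patch above, not necessarily the exact code that will be committed upstream):

void InactivityFireEvent::cancel() {
    // Cancel the underlying timer task first; this may block until a
    // concurrent fire() on the timer thread has finished.
    ::qpid::sys::TimerTask::cancel();
    // Only then take the state lock and record the cancellation.
    ::qpid::sys::Mutex::ScopedLock sl(_ifeStateLock);
    _state = CANCELLED;
}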

To be sure we are not introducing a regression, we should probably also check the reproducers for QPID-7975.

Comment 9 Kim van der Riet 2018-06-07 16:55:35 UTC
(In reply to Pavel Moravec from comment #7)
> Kim,
> when thinking about a workaround: from some experiments, I feel that the
> time the broker survives is roughly proportional to the journal-flush-timeout
> value. What is the performance/throughput penalty of increasing it? Can it
> delay (durable) message delivery by up to this value in the worst case?

This will be a throughput vs. latency tradeoff. The lower the flush timer, the lower the latency, but at the cost of throughput; higher values of the timer have the opposite effect. The default was chosen as a compromise between the two: a user whose top concern is throughput can increase the timer value, at the cost of some messages possibly seeing higher latency, while low-latency users may do the opposite.

I hope that answers the question, but I don't have hard numbers for you. We need to establish some performance tests on standardized hardware to answer that.

Comment 10 Pavel Moravec 2018-06-08 07:14:56 UTC
(In reply to Kim van der Riet from comment #8)
> I cannot reproduce using this reproducer on my hardware.
> 
> *But* if I add a sleep of 10ms into InactivityFireEvent::cancel() just after
> taking the lock, then the deadlock occurs quite quickly for me - after a
> minute or so of running Pavel's reproducer.
> 
> diff --git a/src/qpid/linearstore/JournalImpl.cpp
> b/src/qpid/linearstore/JournalImpl.cpp
> index fe919fbd3..e512e9a05 100644
> --- a/src/qpid/linearstore/JournalImpl.cpp
> +++ b/src/qpid/linearstore/JournalImpl.cpp
> @@ -82,6 +82,7 @@ void InactivityFireEvent::fire() {
> 
>  void InactivityFireEvent::cancel() {
>      ::qpid::sys::Mutex::ScopedLock sl(_ifeStateLock);
> +    ::usleep(10 * 1000); // <-- *** DEBUG ***
>      ::qpid::sys::TimerTask::cancel();
>      _state = CANCELLED;
>  }
> 
> 
> After making the following fix, but leaving the sleep in place:
> 
> diff --git a/src/qpid/linearstore/JournalImpl.cpp
> b/src/qpid/linearstore/JournalImpl.cpp
> index fe919fbd3..55f33bb80 100644
> --- a/src/qpid/linearstore/JournalImpl.cpp
> +++ b/src/qpid/linearstore/JournalImpl.cpp
> @@ -81,9 +81,12 @@ void InactivityFireEvent::fire() {
>  }
> 
>  void InactivityFireEvent::cancel() {
> -    ::qpid::sys::Mutex::ScopedLock sl(_ifeStateLock);
>      ::qpid::sys::TimerTask::cancel();
> -    _state = CANCELLED;
> +    {
> +        ::qpid::sys::Mutex::ScopedLock sl(_ifeStateLock);
> +        ::usleep(10 * 1000); // <-- *** DEBUG ***
> +        _state = CANCELLED;
> +    }
>  }
> 
>  GetEventsFireEvent::GetEventsFireEvent(JournalImpl* p,
> 
> I cannot get the reproducer to fail. I am hopeful that this will make a
> valid fix. I will continue to test for a while.
> 
> In essence, the fix allows the underlying timer to *first* cancel (which
> includes waiting for any flush fires that may have occurred on another
> thread) before taking the lock to change the local state to CANCELLED.
> 
> To be sure we are not introducing a regression, we should probably also
> check the reproducers for QPID-7975.


Thanks for such a prompt patch. I checked the following:

(*) 1.36.0-9 does not lead to the deadlock (though I have only run the test for a short time so far) - so I think bz1502855/QPID-7975 introduced this deadlock as a regression

(*) 1.36.0-18 locally built with this patch does not deadlock, so the patch seems to work for me as well

I would vote for this to be added to Sat 6.3.z since:
- the deadlock has a serious impact on Satellite
- it is nontrivial to identify it from the symptoms a user notices (like "pulp repo sync stuck")
- given its probabilistic nature and the number of Sat deployments, it is just a matter of time before (many) more customers hit it

Comment 13 Kim van der Riet 2018-06-08 14:14:06 UTC
Checked in fix on master, see upstream https://issues.apache.org/jira/browse/QPID-8206 for commit details.

Comment 19 Roman Plevka 2018-08-03 12:12:03 UTC
I tested this on sat6.3.3-2

# rpm -qa qpid-cpp-server*
qpid-cpp-server-linearstore-1.36.0-19.el7.x86_64
qpid-cpp-server-1.36.0-19.el7.x86_64

- Tested the reproducer from Comment #6 and did not experience any deadlocks.
Also, no regressions were detected in our automation.

Thus, moving this to VERIFIED.

Comment 21 errata-xmlrpc 2018-08-22 20:07:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2550

