Red Hat Bugzilla – Bug 1588015
Qpidd deadlocks due to linearstore inactivity timer
Last modified: 2018-08-22 16:07:48 EDT
Created attachment 1448327 [details]
full backtrace

Description of problem:
Qpid in production hangs and stops responding. After manually dumping core, a backtrace of all threads shows that almost all of them (#5-#26) are waiting on the same lock (0x25638b0), while Thread #4 seems to be holding it:

#15 0x00007f38119ca2cf in qpid::management::ManagementObject::ManageablePtr::ManagementMethod (this=this@entry=0x25638a8, methodId=methodId@entry=9, args=..., text="") at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/management/ManagementObject.cpp:265
        l = {mutex = @0x25638b0}

This particular thread is trying to delete a queue and is waiting in

#3  qpid::sys::TimerTask::cancel (this=0x49c88990) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/Timer.cpp:94
        l = {mutex = @0x49c889c0}

while holding this lock (taken in frame #4):

#4  0x00007f380c44b2f0 in qpid::linearstore::InactivityFireEvent::cancel (this=0x49c88990) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/linearstore/JournalImpl.cpp:85
        sl = {mutex = @0x49c88a30}

It looks like TimerTask::cancel is waiting for the task 0x49c88990, which Thread #1 is just firing:

Thread 1 (Thread 0x7f380bbf0700 (LWP 29952)):
#0  0x00007f3810dd24cd in __lll_lock_wait () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007f3810dcdde6 in _L_lock_870 () from /lib64/libpthread.so.0
No symbol table info available.
#2  0x00007f3810dcdcdf in pthread_mutex_lock () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x00007f380c44b013 in qpid::sys::Mutex::lock (this=this@entry=0x49c88a30) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/posix/Mutex.h:116
        e = <optimized out>
#4  0x00007f380c44bd58 in ScopedLock (l=..., this=<synthetic pointer>) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/Mutex.h:33
No locals.
#5  qpid::linearstore::InactivityFireEvent::fire (this=0x49c88990) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/linearstore/JournalImpl.cpp:70
        sl = {mutex = @0x49c88a30}
#6  0x00007f38119d55a9 in qpid::sys::Timer::fire (this=<optimized out>, t=...) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/Timer.cpp:222
        __PRETTY_FUNCTION__ = "virtual void qpid::sys::Timer::fire(boost::intrusive_ptr<qpid::sys::TimerTask>)"
#7  0x00007f38119d66da in qpid::sys::Timer::run (this=0x256a310) at /usr/src/debug/qpid-cpp-1.36.0/src/qpid/sys/Timer.cpp:156
        u = {mutex = @0x256a318}
        end = {timepoint = 151158672631781}
        overrun = <optimized out>
        warningsEnabled = <optimized out>
        s = {tt = @0x49c88990}
        t = {px = 0x49c88990}
        start = {timepoint = 151158682729264}
        delay = {nanosecs = 3372811}
        l = {mutex = @0x256a318}
        __PRETTY_FUNCTION__ = "virtual void qpid::sys::Timer::run()"
...

This could possibly be caused by an incorrect state machine in InactivityFireEvent, such that firing and cancelling the task concurrently can get stuck, with a consequent deadlock of other threads.

Version-Release number of selected component (if applicable):
Satellite 6.3
qpid-cpp-server-1.36.0-18.el7.x86_64

How reproducible:
Unknown yet

Additional info:
Possibly could be fixed with https://issues.apache.org/jira/browse/QPID-7975
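To make the circular wait in the backtraces easier to see, below is a minimal, self-contained C++ model of the suspected lock inversion. It is illustrative only: std::mutex and the function names are simplified stand-ins, not the actual qpid classes; the point is just that the two paths acquire the same pair of locks in opposite order.

// Minimal model of the inversion (illustrative; std::mutex stands in for
// qpid::sys::Mutex, and the names below are simplified, not qpid code).
#include <mutex>
#include <thread>

std::mutex callbackLock;  // held by the timer framework around fire()
std::mutex stateLock;     // plays the role of _ifeStateLock

void timer_thread() {                                  // ~ Thread #1 above
    std::lock_guard<std::mutex> cb(callbackLock);      // Timer holds this while firing the task
    std::lock_guard<std::mutex> st(stateLock);         // InactivityFireEvent::fire wants this -> blocks
    // ... flush journal ...
}

void queue_delete_thread() {                           // ~ Thread #4 above
    std::lock_guard<std::mutex> st(stateLock);         // InactivityFireEvent::cancel takes this first
    std::lock_guard<std::mutex> cb(callbackLock);      // TimerTask::cancel waits for fire() to return -> blocks
    // ... _state = CANCELLED ...
}

int main() {
    std::thread t1(timer_thread);
    std::thread t2(queue_delete_thread);
    t1.join();   // with unlucky timing, both threads block here forever
    t2.join();
}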
Inspection of the attached dump seems to show that there is indeed a thread deadlock occurring. A call to JournalImpl::fire() from the inactivity timer occurring (almost) simultaneously with a call to JournalImpl::cancel() on another thread is deadlocking.
(In reply to Kim van der Riet from comment #3)
> Inspection of the attached dump seems to show that there is indeed a thread
> deadlock occurring. A call to JournalImpl::fire() from the inactivity timer
> occurring (almost) simultaneously with a call to JournalImpl::cancel() on
> another thread is deadlocking.

Sorry, should be InactivityFireEvent::fire() and InactivityFireEvent::cancel() respectively.
(In reply to Kim van der Riet from comment #3)
> Inspection of the attached dump seems to show that there is indeed a thread
> deadlock occurring. A call to InactivityFireEvent::fire() from the
> inactivity timer occurring (almost) simultaneously with a call to
> InactivityFireEvent::cancel() on another thread is deadlocking.

Thanks for the analysis. The deadlock really does happen on 1.36.0-18, which already contains the QPID-7975 fix, so there must be another bug (or potentially that JIRA introduced a regression? I don't expect so, but it is a valid theory).

Any idea how to reproduce it? I was trying:
- setting journal-flush-timeout=20ms or similar low values, to fire the InactivityFireEvent timer more often
- creating and deleting many journal files concurrently (to cancel such a timer)
- sending/receiving durable messages to durable queues (would that cancel the timer as well?)

But so far, no deadlock. Something else to try?
Reproducer outside Satellite:

0) rm -rf /var/lib/qpidd/.qpidd   # just to start with an absolutely clean table

1) Set journal-flush-timeout=50ms in qpidd.conf - the lower the better. With this value I always reproduce the deadlock within a minute or two, but it took ~10 minutes with the default 500ms (I think).

2) Run this script, which sporadically sends and receives a durable message to a queue that is deleted from time to time - and does so concurrently for 50 queues:

for i in $(seq 1 50); do
  while true; do
    echo "$(date): loop $i"
    qpid-send -a "DurableTest-${i}; {create:always, node:{type:queue, durable:true}}" -m 1 --content-size=1100 --durable=yes
    qpid-receive -a DurableTest-${i} --print-content=no 2> /dev/null
    sleep 0.2
  done &
done

for i in $(seq 1 50); do
  while true; do
    echo "$(date): deleting queue $i"
    qpid-config del queue DurableTest-${i} --force
    sleep $((RANDOM%5)).$((RANDOM%10))
  done &
done

3) Check whether qpidd still responds, e.g. via:

while true; do date; qpid-stat -g; sleep 5; done

4) Within a minute or so, the deadlock comes to you and your broker.
Kim, when thinking about a workaround: from some experiments, I feel the time the broker survives scales roughly linearly with the journal-flush-timeout value. What is the performance/throughput penalty of increasing it? Can it delay (durable) message delivery by up to this value in the worst case?
I cannot reproduce using this reproducer on my hardware.

*But* if I add a sleep of 10ms into InactivityFireEvent::cancel() just after taking the lock, then the deadlock occurs quite quickly for me - after a minute or so of running Pavel's reproducer.

diff --git a/src/qpid/linearstore/JournalImpl.cpp b/src/qpid/linearstore/JournalImpl.cpp
index fe919fbd3..e512e9a05 100644
--- a/src/qpid/linearstore/JournalImpl.cpp
+++ b/src/qpid/linearstore/JournalImpl.cpp
@@ -82,6 +82,7 @@ void InactivityFireEvent::fire() {
 
 void InactivityFireEvent::cancel() {
     ::qpid::sys::Mutex::ScopedLock sl(_ifeStateLock);
+    ::usleep(10 * 1000); // <-- *** DEBUG ***
     ::qpid::sys::TimerTask::cancel();
     _state = CANCELLED;
 }

After making the following fix, but leaving the sleep in place:

diff --git a/src/qpid/linearstore/JournalImpl.cpp b/src/qpid/linearstore/JournalImpl.cpp
index fe919fbd3..55f33bb80 100644
--- a/src/qpid/linearstore/JournalImpl.cpp
+++ b/src/qpid/linearstore/JournalImpl.cpp
@@ -81,9 +81,12 @@ void InactivityFireEvent::fire() {
 }
 
 void InactivityFireEvent::cancel() {
-    ::qpid::sys::Mutex::ScopedLock sl(_ifeStateLock);
     ::qpid::sys::TimerTask::cancel();
-    _state = CANCELLED;
+    {
+        ::qpid::sys::Mutex::ScopedLock sl(_ifeStateLock);
+        ::usleep(10 * 1000); // <-- *** DEBUG ***
+        _state = CANCELLED;
+    }
 }
 
 GetEventsFireEvent::GetEventsFireEvent(JournalImpl* p,

I cannot get the reproducer to fail. I am hopeful that this will be a valid fix. I will continue to test for a while.

In essence, the fix allows the underlying timer to *first* cancel (which includes waiting for any flush fires that may have occurred on another thread) before taking the lock to change the local state to CANCELLED.

To be sure we are not introducing a regression, we should probably also check the reproducers for QPID-7975.
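Read against the simplified two-lock model sketched under the original description, the effect of the fix is that both code paths now acquire the locks in the same order (callback lock before state lock), which removes the circular wait. A hedged sketch of the corrected ordering, again using stand-in names rather than the real qpid types:

// Same stand-in locks (callbackLock, stateLock) as in the earlier model.
void queue_delete_thread_fixed() {
    {
        // ~ ::qpid::sys::TimerTask::cancel(): wait for any in-flight fire(),
        // which holds the callback lock, to finish before proceeding.
        std::lock_guard<std::mutex> cb(callbackLock);
    }
    // Only now take the state lock; no thread can be holding the callback
    // lock while waiting on us, so the circular wait is gone.
    std::lock_guard<std::mutex> st(stateLock);
    // ... _state = CANCELLED ...
}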
(In reply to Pavel Moravec from comment #7)
> Kim, when thinking about a workaround: from some experiments, I feel the
> time the broker survives scales roughly linearly with the
> journal-flush-timeout value. What is the performance/throughput penalty of
> increasing it? Can it delay (durable) message delivery by up to this value
> in the worst case?

This will be a throughput vs. latency tradeoff. The lower the flush timer, the lower the latency will be, but at the cost of throughput. Higher values of the timer have the opposite effect. The default has been chosen as a compromise between these two concerns: if a user has throughput as the top priority, they can increase the timer value, accepting that some messages may see a higher latency; low-latency users may do the opposite.

I hope that answers the question, but I don't have hard numbers for you. We would need to establish performance tests on standardized hardware to answer that.
(In reply to Kim van der Riet from comment #8)
> I cannot reproduce using this reproducer on my hardware.
>
> *But* if I add a sleep of 10ms into InactivityFireEvent::cancel() just after
> taking the lock, then the deadlock occurs quite quickly for me - after a
> minute or so of running Pavel's reproducer.
>
> diff --git a/src/qpid/linearstore/JournalImpl.cpp
> b/src/qpid/linearstore/JournalImpl.cpp
> index fe919fbd3..e512e9a05 100644
> --- a/src/qpid/linearstore/JournalImpl.cpp
> +++ b/src/qpid/linearstore/JournalImpl.cpp
> @@ -82,6 +82,7 @@ void InactivityFireEvent::fire() {
>
>  void InactivityFireEvent::cancel() {
>      ::qpid::sys::Mutex::ScopedLock sl(_ifeStateLock);
> +    ::usleep(10 * 1000); // <-- *** DEBUG ***
>      ::qpid::sys::TimerTask::cancel();
>      _state = CANCELLED;
>  }
>
> After making the following fix, but leaving the sleep in place:
>
> diff --git a/src/qpid/linearstore/JournalImpl.cpp
> b/src/qpid/linearstore/JournalImpl.cpp
> index fe919fbd3..55f33bb80 100644
> --- a/src/qpid/linearstore/JournalImpl.cpp
> +++ b/src/qpid/linearstore/JournalImpl.cpp
> @@ -81,9 +81,12 @@ void InactivityFireEvent::fire() {
>  }
>
>  void InactivityFireEvent::cancel() {
> -    ::qpid::sys::Mutex::ScopedLock sl(_ifeStateLock);
>      ::qpid::sys::TimerTask::cancel();
> -    _state = CANCELLED;
> +    {
> +        ::qpid::sys::Mutex::ScopedLock sl(_ifeStateLock);
> +        ::usleep(10 * 1000); // <-- *** DEBUG ***
> +        _state = CANCELLED;
> +    }
>  }
>
>  GetEventsFireEvent::GetEventsFireEvent(JournalImpl* p,
>
> I cannot get the reproducer to fail. I am hopeful that this will be a valid
> fix. I will continue to test for a while.
>
> In essence, the fix allows the underlying timer to *first* cancel (which
> includes waiting for any flush fires that may have occurred on another
> thread) before taking the lock to change the local state to CANCELLED.
>
> To be sure we are not introducing a regression, we should probably also
> check the reproducers for QPID-7975.

Thanks for such a prompt patch. I checked that:

(*) 1.36.0-9 does not lead to the deadlock (though I have run the test only for a short time so far) - so I think bz1502855/QPID-7975 introduced this deadlock as a regression.

(*) 1.36.0-18 locally built with this patch does not deadlock, so the patch works for me as well.

I would vote for this to be added to Sat 6.3.z since:
- the deadlock has a serious impact on Satellite
- it is nontrivial to identify it from the symptoms a user notices (like "pulp repo sync stuck")
- given its probabilistic nature and the number of Satellite deployments, it is just a matter of time before (many) more customers hit it
Fix checked in on master; see upstream https://issues.apache.org/jira/browse/QPID-8206 for commit details.
I tested this on sat6.3.3-2:

# rpm -qa qpid-cpp-server*
qpid-cpp-server-linearstore-1.36.0-19.el7.x86_64
qpid-cpp-server-1.36.0-19.el7.x86_64

- Tested the reproducer from Comment #6 and did not experience any deadlocks.
- No regressions detected in our automation.

Thus, moving this to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2550