Red Hat Bugzilla – Bug 146549
Deadlock in rpmsq.c due to believed incorrect use of pthread_cond*
Last modified: 2009-05-21 14:06:09 EDT
Description of problem:
Every so often, with scriptlets that return rather quickly (i.e. the
path of execution in the script is short), rpm will deadlock waiting
for the script to return. I have been seeing this with my rpm test
harness for about a year now, and have recently reproduced it on RHEL 3.
Version-Release number of selected component (if applicable):
How reproducible:
Very rare, but it is reproducible if you wait long enough (no pun
intended).
Steps to Reproduce:
1. Create two small spec files that contain %pre, %post, %preun and
%postun scriptlets that create a semaphore file and then exit 0. The
first should be named something like x-1-1 and the second x-1-2.
2. Build the packages from the spec files.
3. Now install the first, upgrade to the second, erase them and
start all over in a loop:
rpm -Uvh x-1-1.noarch.rpm
rpm -Uvh x-1-2.noarch.rpm
rpm -e x
Actual results:
Eventually, after perhaps thousands of passes, it will lock up. When
it does, strace will reveal it is hanging on a futex, and gdb will
show it is running pthread_cond_wait().
Expected results:
That it would never hang.
I spent a lot of time trying to figure out what was going on
(probably more than I should have). The first thing I found was that
rpm was using pthread_cond_signal() in its signal handler and
pthread_cond_wait() in rpmsqWaitUnregister() without proper locking
around both the signaler and the waiter (it was using the condition's
mutex with the wait, but not with the signal). I initially created a
patch to fix this, only to find that the signal handler (and thus the
caller of pthread_cond_signal()) was in the same thread of execution
as the waiter, so my initial patch actually created a deadlock that
was far easier to hit.
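To make the race concrete, here is a minimal sketch of the pattern
(hypothetical stand-in code with names of my own choosing, not the
actual rpmsq.c source):

/* broken.c: illustrates the lost-wakeup race described above. */
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t mtx  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static volatile sig_atomic_t reaped = 0;

/* SIGCHLD handler; note it runs in the SAME thread that blocks below. */
static void on_sigchld(int sig)
{
    (void)sig;
    reaped = 1;
    /* BUG: the condition is signaled without holding mtx (the handler
     * could not take it without self-deadlocking anyway), and
     * pthread_cond_signal() is not async-signal-safe.  If the signal
     * lands after the predicate check but before pthread_cond_wait()
     * blocks, this wakeup is lost and the waiter sleeps forever. */
    pthread_cond_signal(&cond);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGCHLD, &sa, NULL);

    pid_t pid = fork();
    if (pid == 0)
        _exit(0);                  /* stand-in for a fast scriptlet */

    pthread_mutex_lock(&mtx);
    while (!reaped)                /* <-- lost-wakeup window is here */
        pthread_cond_wait(&cond, &mtx);
    pthread_mutex_unlock(&mtx);

    waitpid(pid, NULL, 0);
    puts("child reaped");
    return 0;
}

The faster the scriptlet exits, the more often SIGCHLD lands inside
that window, which matches seeing the hang mainly with short scripts.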
Once I realized that pthread_cond_signal() and pthread_cond_wait()
were in the same thread of execution, I arrived at what I think is
a better solution. Ultimately, you can't use the pthread_cond calls
within a single thread of execution (a quick read of the man page
will show this), so instead I thought through a way of using a single
mutex to accomplish the synchronization that the pthread_conds were
being used to provide. It's pretty simple (see the sketch after these
steps):
Before the fork: create and lock the rpmsq object's mutex.
In the signal handler: unlock the mutex.
In rpmsqWaitUnregister(): try to lock the mutex.
Since the mutex starts off locked, rpmsqWaitUnregister() will not be
able to move forward (i.e. will block) until the signal handler
unlocks the mutex.
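Here is an equally minimal sketch of that scheme (again hypothetical
stand-in code, not the attached patch; it relies on Linux/NPTL
normal-mutex behavior, where relocking a held mutex blocks rather
than returning an error):

/* fixed.c: a single-mutex "gate" replacing the condition variable. */
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t gate = PTHREAD_MUTEX_INITIALIZER;

/* SIGCHLD handler: just unlock the gate. */
static void on_sigchld(int sig)
{
    (void)sig;
    pthread_mutex_unlock(&gate);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGCHLD, &sa, NULL);

    /* Before the fork: take the gate so the wait below starts blocked. */
    pthread_mutex_lock(&gate);

    pid_t pid = fork();
    if (pid == 0)
        _exit(0);                  /* stand-in for a fast scriptlet */

    /* Stand-in for rpmsqWaitUnregister(): this second lock blocks
     * until the SIGCHLD handler (running in this same thread) unlocks
     * the gate, then succeeds when the interrupted futex wait retries. */
    pthread_mutex_lock(&gate);
    pthread_mutex_unlock(&gate);   /* restore the unlocked state */

    waitpid(pid, NULL, 0);
    puts("child reaped");
    return 0;
}

Because the handler's "post" operation is a plain unlock, there is no
signal/wait ordering to lose: if the handler has already run, the
second lock succeeds immediately; if not, it blocks until the handler
runs. (Strictly, pthread_mutex_unlock() is not on POSIX's
async-signal-safe list either, but it is a far smaller dependency
than the condition-variable machinery.)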
Also, I would understand if this was not fixed in RHEL 3; I am merely
logging this as a bug with a possible patch. What you guys do with it
is up to you (I would do nothing until it has been tested out in
Fedora land if I were you).
Created attachment 110381
Patch to fix the problem in rpmsq.c
*** Bug 117620 has been marked as a duplicate of this bug. ***
Fixed (by applying patch) in rpm-4.4.3-0.20.
(In reply to comment #3)
> Fixed (by applying patch) in rpm-4.4.3-0.20.
RPM 4.4.3-0.20 doesn't seem to be out yet - is there a typo here?
Also, will the patch be applied to RHEL3 U7?
rpm-4.4.5 was released several weeks ago. Ask RHEL support what is in U7.
There clearly was a bug here, which was fixed both upstream and in
the three current RHEL releases, so the NOTABUG resolution is
incorrect here. This bug number appears to cover the upstream patch,
so it should seemingly be resolved as UPSTREAM?
According to the SRPM, this is fixed in RHEL 3u8 (apparently as part of Bug 185322, which is not visible externally).
According to the SRPM, this is fixed in RHEL 4u4 (apparently as part of Bug 185324, which is not visible externally).
According to the SRPM, this is fixed in RHEL 5u0 (which only references this bug number as the place for the fix).