Bug 146549 - Deadlock in rpmsq.c due to believed incorrect use of pthread_cond*
Summary: Deadlock in rpmsq.c due to believed incorrect use of pthread_cond*
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: rpm
Version: 3.0
Hardware: All
OS: Linux
Priority: medium
Severity: low
Target Milestone: ---
Assignee: Paul Nasrat
QA Contact: Mike McLean
URL:
Whiteboard:
Duplicates: 117620 (view as bug list)
Depends On:
Blocks:
 
Reported: 2005-01-29 14:46 UTC by James Olin Oden
Modified: 2009-05-21 18:06 UTC (History)
3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-03-14 12:18:29 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Patch to fix the problem in rpmsq.c (3.24 KB, patch)
2005-01-29 14:49 UTC, James Olin Oden
no flags Details | Diff

Description James Olin Oden 2005-01-29 14:46:55 UTC
Description of problem:
Every so often, with scriptlets that return rather quickly (i.e. the
path of execution in the script is short), rpm will deadlock waiting
for the script to return.  I have been seeing this with my rpm test
harness for about a year now, but recently reproduced it on RHEL 3.

Version-Release number of selected component (if applicable):
4.2.3-10

How reproducible:
Very rare, but it is reproducible if you wait long enough (no pun
intended).

Steps to Reproduce:
1. Create two small spec files, each containing %pre, %post, %preun
and %postun scriptlets that create a semaphore file and then exit 0.
Name the first something like x-1-1 and the second x-1-2.
2.  Build the packages from the spec files.
3.  Now install the first, upgrade to the second, erase them, and
start all over in a loop:

    while :
    do
      rpm -Uvh x-1-1.noarch.rpm
      rpm -Uvh x-1-2.noarch.rpm
      rpm -e x
    done
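The spec files described in step 1 might look something like the
following sketch (the package name, tags, and semaphore path are
illustrative, not taken from the original report; bump Release to 2
for the second package):

```spec
Name:      x
Version:   1
Release:   1
Summary:   Scriptlet deadlock reproducer
Group:     Development/Tools
License:   GPL
BuildArch: noarch

%description
Empty package whose scriptlets exit almost immediately.

%pre
touch /tmp/x-sem; exit 0

%post
touch /tmp/x-sem; exit 0

%preun
touch /tmp/x-sem; exit 0

%postun
touch /tmp/x-sem; exit 0

%files
```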

  
Actual results:
Eventually, after perhaps thousands of passes, it will lock up.  When
it does, strace will reveal it is hanging on a futex, and gdb will
show it is running pthread_cond_wait().

Expected results:
That it never hangs.

Additional info:
I spent a lot of time trying to figure out what was going on
(probably more than I should have).  The first thing I found was that
rpm was using pthread_cond_signal() in its signal handler and
pthread_cond_wait() in rpmsqWaitUnregister() without proper locking
between the signaler and the waiter (it was using the condition's
mutex for the wait, but not for the signaler).  I initially created a
patch to fix this, only to find that the signal handler (and thus
the caller of pthread_cond_signal()) was in the same thread of
execution, so my initial patch actually created a far
easier-to-trigger deadlock.

Once I realized that pthread_cond_signal() and pthread_cond_wait()
were in the same thread of execution, I arrived at what I think is
a better solution.  Ultimately, you can't use the pthread_cond calls
within a single thread of execution (a quick read of the man page
shows this), so instead I thought through a way of using a single
mutex to accomplish the synchronization that the pthread_cond's were
being used to provide.  It's pretty simple:

    Before the fork, create and lock the rpmsq object's mutex.
    In the signal handler:
        unlock the mutex.
    In rpmsqWaitUnregister():
        try to lock the mutex.

Since it starts off locked, rpmsqWaitUnregister() will not be able to
move forward (i.e. it will block) until the signal handler unlocks the
mutex.

Also, I would understand if this was not fixed in RHEL 3; I am merely
logging this as a bug with a possible patch.  What you guys do with
it is up to you (I would do nothing until it has been tested out in
Fedora land, if I were you).

Comment 1 James Olin Oden 2005-01-29 14:49:12 UTC
Created attachment 110381 [details]
Patch to fix the problem in rpmsq.c

Comment 2 Jeff Johnson 2005-02-07 23:09:31 UTC
*** Bug 117620 has been marked as a duplicate of this bug. ***

Comment 3 Jeff Johnson 2005-10-26 01:08:10 UTC
Fixed (by applying patch) in rpm-4.4.3-0.20.

Comment 4 Evan Chan 2006-01-31 01:50:01 UTC
(In reply to comment #3)
> Fixed (by applying patch) in rpm-4.4.3-0.20.

RPM 4.4.3-0.20 doesn't seem to be out yet - is there a typo here?

Also, will the patch be applied to RHEL3 U7?

thanks!

Comment 5 Jeff Johnson 2006-03-14 12:18:29 UTC
rpm-4.4.5 was released several weeks ago. Ask RHEL support what is in U7.

Comment 13 app 2009-05-21 18:06:09 UTC
There clearly was a bug here, which was fixed both upstream and in the 3 current RHEL releases, so the NOTABUG resolution is incorrect here.  This bug # appears to cover the upstream patch, so it should seemingly be resolved "UPSTREAM"?

According to the SRPM, this is fixed in RHEL 3u8 (apparently as part of Bug 185322, which is not visible externally).

According to the SRPM, this is fixed in RHEL 4u4 (apparently as part of Bug 185324, which is not visible externally).

According to the SRPM, this is fixed in RHEL 5u0 (which only references this bug number as the place for the fix).

