117620 – rpm hangs waiting for scriptlet every so often (yet reproducibly)

Status:	CLOSED DUPLICATE of bug 146549
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	rpm
Version:	rawhide
Hardware:	i686
OS:	Linux
Priority:	low
Severity:	high
Assignee:	Jeff Johnson
QA Contact:	Mike McLean

Reported:	2004-03-05 21:33 UTC by James Olin Oden
Modified:	2007-11-30 22:10 UTC
Last Closed:	2006-02-21 19:01:50 UTC

Attachments
Test harness for rpm. (377.79 KB, application/octet-stream) 2004-03-05 21:37 UTC, James Olin Oden

Description James Olin Oden 2004-03-05 21:33:18 UTC

Description of problem:
Sometimes, when running rpm transactions with rpms containing scriptlets,
rpm will hang while waiting for the scriptlet to exit.

Version-Release number of selected component (if applicable):
rpm-4.3 (and in HEAD of cvs)

How reproducible:
Fairly hard.  I have written a test harness for rpm, and it takes
anywhere from 400-2000 transactions before it hangs.


Steps to Reproduce:
1. Run transactions with scriptlets over and over again (thousands
of transactions at least)
  
Actual results:
Eventually rpm will hang waiting for a scriptlet to return.


Expected results:
rpm should never hang waiting for a scriptlet to return.

Additional info:

Here is a backtrace from one such event:

0xffffe002 in ?? ()
(gdb) bt
#0  0xffffe002 in ?? ()
#1  0x4017724f in rpmsqWait (sq=0x0) at rpmsq.c:499
#2  0x40033064 in psmWait (psm=0x8081558) at psm.c:472
#3  0x40033766 in runScript (psm=0x8081558, h=0x80848b8, sln=0x40059a26 "%pre",
    progArgc=1, progArgv=0xbfffe160,
    script=0x8084625 "echo \"works-1-2:%pre:Instances: $1\"\nexit 0", arg1=2,
    arg2=-1) at psm.c:852
#4  0x40033d6f in runInstScript (psm=0x8081558) at psm.c:921
#5  0x4003619f in rpmpsmStage (psm=0x8081558, stage=PSM_SCRIPT) at psm.c:1955
#6  0x4003464a in rpmpsmNext (psm=0x0, nstage=PSM_UNKNOWN) at psm.c:1270
#7  0x40035296 in rpmpsmStage (psm=0x8081558, stage=134704680) at psm.c:1468
#8  0x4003464a in rpmpsmNext (psm=0x0, nstage=PSM_UNKNOWN) at psm.c:1270
#9  0x40035f80 in rpmpsmStage (psm=0x8081558, stage=PSM_PKGINSTALL) at psm.c:1891
#10 0x40053df4 in rpmtsRun (ts=0x8076e28, okProbs=0x0,
    ignoreSet=RPMPROB_FILTER_NONE) at transaction.c:1619
#11 0x40042756 in rpmInstall (ts=0x8076e28, ia=0x4006d4c0, fileArgv=0x0)
    at rpminstall.c:694
#12 0x0804b538 in main (argc=8, argv=0xbfffe834) at rpmqv.c:781
#13 0x42015704 in __libc_start_main () from /lib/tls/libc.so.6

I am going to attach my test harness so that you can more easily
reproduce the problem.  The test harness will require some jiggering
because all the test cases are designed to test autorollback
functionality in rpm.

Oh, and rpm is of course waiting on a futex, but in this case it
isn't for the database.

Comment 1 James Olin Oden 2004-03-05 21:37:15 UTC

Created attachment 98329
Test harness for rpm.

Here is the email I sent to Jeff Johnson concerning this test harness:

=== email ===
I have attached my little test harness.  It's nowhere near complete,
and is pretty much geared towards testing my autorollback stuff.
After you untar it you will need to edit conf.sh and set which rpm
you want to use (it has a var USE_BUILD_RPM, which when set to 1 will
use the rpm in your build directory; set BUILDDIR to tell it where that
is).  Next, simply type ./init.  This will clean out the repackaged
package directory (its own), and copy the local system rpmdb into
the ./db directory (and remove the __db* files).

Once you have done this you are ready to test.  There are seven tests
numbered 1 - 7, and to run one you just type:

	./test 1

or some other test number.   Normally, to catch db handle leaks I run:

	./re-test 1 2 3 4 5 6 7

which will run all the tests one at a time in a loop.  The problem with
this for you is that you won't have the autorollback patch, so it will
fail.  So you just need to change the re-test script to ignore test
errors, because in this case you're only looking for the lockup.

Concerning the lockup, you hit it right on the head.  It was in
rpmsq.c while waiting for a child to return, and it sure looked
like it was hung on the mutex.  I will add the sleeps on my
side, and re-run, and then try the sq->reaper = 0, turn on debug
and so on.  I really want to see, though, if this happens on a more
sane system than mine.  If it does, it may be a real problem.

If you want the backtrace, I would be happy to send it, but I don't
know how much more it can tell you than we have already discussed
(which is why I am omitting it for now).

Also, if you feel we might get better results of testing for this by
sending this to some list (rpm, rpm-devel), then feel free to forward 
this on (or tell me to).

====

Comment 2 Jeff Johnson 2004-03-07 13:05:18 UTC

Fixed sufficiently to pass >2000 execs of ./re-test on i686.

Comment 3 Paul Winder 2005-02-03 11:45:03 UTC

I have come across the same bug and have been able to produce a patch
which seems to fix it.

This is based on the source from rpm-4.2.3-13.src.rpm and applies to
the file rpmio/rpmsq.c

bash-3.00# diff -u Orpmsq.c rpmsq.c
--- Orpmsq.c    Thu Feb  3 09:55:41 2005
+++ rpmsq.c     Wed Feb  2 16:05:46 2005
@@ -462,15 +462,15 @@
 
     /* Wait for handler to receive SIGCHLD. */
     /*@-infloops@*/
-    while (ret == 0 && sq->reaped != sq->child) {
-       if (nothreads)
+    if (nothreads) {
+       while (ret == 0 && sq->reaped != sq->child)
            /* Note that sigpause re-enables SIGCHLD. */
            ret = sigpause(SIGCHLD);
-       else {
-           xx = sigrelse(SIGCHLD);
+    } else {
+       xx = sigrelse(SIGCHLD);
+       while (ret == 0 && sq->reaped != sq->child)
            ret = pthread_cond_wait(&sq->cond, &sq->mutex);
-           xx = sighold(SIGCHLD);
-       }
+       xx = sighold(SIGCHLD);
     }
     /*@=infloops@*/

Tracing showed that the SIGCLD from the terminating scriptlet fires
on return from sigrelse(). This is where the race condition exists.
The SIGCLD handler sets sq->reaped and calls pthread_cond_signal() to
wake up a sleeping parent. In the parent, the while loop *outside* of
the sigrelse() uses sq->reaped as its termination condition. Because
we get the SIGCLD after the while (condition) check but before the
pthread_cond_wait (futex), we neither catch the termination condition
of the while() nor wake up the parent, because it is not yet asleep.
So the parent goes to sleep....

The patch moves the sigrelse() outside the while loop, which seems to
close the race window.

Comment 4 James Olin Oden 2005-02-03 12:28:12 UTC

The head is patched with a different patch that fixes this.
Ultimately, as I read the man pages for pthread_cond*, using
pthread_cond* in the same thread of execution is wrong and is the
real source of the deadlock.  See bugzilla:

   https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=146549

Your patch does not completely remove the race condition, but it does
make it less likely to occur (and, whether or not it fixes the race
condition, it may be the right thing to do).

Cheers...james

Comment 5 Jeff Johnson 2005-02-07 23:09:13 UTC


*** This bug has been marked as a duplicate of 146549 ***

Comment 6 Red Hat Bugzilla 2006-02-21 19:01:50 UTC

Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.
