Description of problem:
Sometimes, when running rpm transactions with rpms containing scriptlets, rpm will hang while waiting for a scriptlet to exit.

Version-Release number of selected component (if applicable):
rpm-4.3 (and HEAD of CVS)

How reproducible:
Fairly hard. I have written a test harness for rpm, and it takes anywhere from 400 to 2000 transactions before it hangs.

Steps to Reproduce:
1. Run transactions with scriptlets over and over again (at least thousands of transactions).

Actual results:
Eventually rpm hangs waiting for a scriptlet to return.

Expected results:
rpm never hangs waiting for a scriptlet to return.

Additional info:
Here is a backtrace from one such event:

0xffffe002 in ?? ()
(gdb) bt
#0  0xffffe002 in ?? ()
#1  0x4017724f in rpmsqWait (sq=0x0) at rpmsq.c:499
#2  0x40033064 in psmWait (psm=0x8081558) at psm.c:472
#3  0x40033766 in runScript (psm=0x8081558, h=0x80848b8, sln=0x40059a26 "%pre", progArgc=1, progArgv=0xbfffe160, script=0x8084625 "echo \"works-1-2:%pre:Instances: $1\"\nexit 0", arg1=2, arg2=-1) at psm.c:852
#4  0x40033d6f in runInstScript (psm=0x8081558) at psm.c:921
#5  0x4003619f in rpmpsmStage (psm=0x8081558, stage=PSM_SCRIPT) at psm.c:1955
#6  0x4003464a in rpmpsmNext (psm=0x0, nstage=PSM_UNKNOWN) at psm.c:1270
#7  0x40035296 in rpmpsmStage (psm=0x8081558, stage=134704680) at psm.c:1468
#8  0x4003464a in rpmpsmNext (psm=0x0, nstage=PSM_UNKNOWN) at psm.c:1270
#9  0x40035f80 in rpmpsmStage (psm=0x8081558, stage=PSM_PKGINSTALL) at psm.c:1891
#10 0x40053df4 in rpmtsRun (ts=0x8076e28, okProbs=0x0, ignoreSet=RPMPROB_FILTER_NONE) at transaction.c:1619
#11 0x40042756 in rpmInstall (ts=0x8076e28, ia=0x4006d4c0, fileArgv=0x0) at rpminstall.c:694
#12 0x0804b538 in main (argc=8, argv=0xbfffe834) at rpmqv.c:781
#13 0x42015704 in __libc_start_main () from /lib/tls/libc.so.6

I am going to attach my test harness so that you can more easily recreate the problem.
The test harness will require some jiggering because all the test cases are designed to test autorollback functionality in rpm. Oh, yeah, rpm is of course waiting on a futex, but in this case it isn't for the database.
Created attachment 98329
Test harness for rpm.

Here is the email I sent to Jeff Johnson concerning this test harness:

=== email ===
I have attached my little test harness. It's nowhere near complete, and pretty much geared towards testing my autorollback stuff.

After you untar it you will need to edit conf.sh and set which rpm you want to use (it has a var USE_BUILD_RPM, which when set to 1 will use the rpm in your build directory; set BUILDDIR to tell it where that is). Next, simply type ./init. This will clean out the repackaged package directory (its own), and copy the local system rpmdb into the ./db directory (and remove the __db* files). Once you have done this you are ready to test.

There are seven tests numbered 1 - 7, and to run one you just type:

./test 1

or some other test number. Normally, to catch db handle leaks I run:

./re-test 1 2 3 4 5 6 7

which will run all the tests one at a time in a loop. The problem with this for you is that you won't have the autorollback patch, so it will fail. So you just need to change the re-test script to ignore test errors, because in this case you're only looking for the lockup.

Concerning the lockup, you hit it right on the head. It was in rpmsq.c while waiting for a child to return, and it sure looked like it was hung on the mutex. I will add the sleeps on my side, and re-run, and then try the sq->reaper = 0, turn on debug and so on. I really want to see, though, if this happens on a more sane system than mine. If it does, it may be a real problem.

If you want the backtrace, I would be happy to send it, but I don't know how much more it can tell you than we have already discussed (which is why I am omitting it for now). Also, if you feel we might get better results of testing for this by sending this to some list (rpm, rpm-devel), then feel free to forward this on (or tell me to).
====
Fixed sufficiently to pass >2000 executions of ./re-test on i686.
I have come across the same bug and been able to produce a patch which seems to fix it. This is based on the source from rpm-4.2.3-13.src.rpm and applies to the file rpmio/rpmsq.c:

bash-3.00# diff -u Orpmsq.c rpmsq.c
--- Orpmsq.c Thu Feb 3 09:55:41 2005
+++ rpmsq.c Wed Feb 2 16:05:46 2005
@@ -462,15 +462,15 @@
     /* Wait for handler to receive SIGCHLD. */
     /*@-infloops@*/
-    while (ret == 0 && sq->reaped != sq->child) {
-        if (nothreads)
+    if (nothreads) {
+        while (ret == 0 && sq->reaped != sq->child)
             /* Note that sigpause re-enables SIGCHLD. */
             ret = sigpause(SIGCHLD);
-        else {
-            xx = sigrelse(SIGCHLD);
+    } else {
+        xx = sigrelse(SIGCHLD);
+        while (ret == 0 && sq->reaped != sq->child)
             ret = pthread_cond_wait(&sq->cond, &sq->mutex);
-            xx = sighold(SIGCHLD);
-        }
+        xx = sighold(SIGCHLD);
     }
     /*@=infloops@*/

Tracing showed that the SIGCLD from the terminating scriptlet is delivered on return from sigrelse(). This is where the race condition exists. The SIGCLD handler sets sq->reaped and calls pthread_cond_signal() to wake up a sleeping parent. In the parent, the while loop *outside* of the sigrelse() uses sq->reaped as its termination condition. Because we get the SIGCLD after the while (condition) has been evaluated but before the pthread_cond_wait (futex), we neither catch the termination condition of the while() nor wake up the parent, because it is not yet asleep. So the parent goes to sleep and is never woken.

The patch moves the sigrelse() outside the while loop, which seems to close the race window.
HEAD is patched with a different patch that fixes this. Ultimately, as I read the man pages for pthread_cond*, using pthread_cond* within the same thread of execution is wrong and is the real source of the deadlock. See bugzilla:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=146549

Your patch does not completely remove the race condition, but it does make it less likely to occur (and whether or not it fixes the race condition, it may be the right thing to do).

Cheers...james
*** This bug has been marked as a duplicate of 146549 ***
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.