Bug 117620
Field | Value
---|---
Summary | rpm hangs waiting for scriplet ever so often (yet reproducibly)
Product | [Fedora] Fedora
Component | rpm
Version | rawhide
Hardware | i686
OS | Linux
Status | CLOSED DUPLICATE
Severity | high
Priority | low
Reporter | James Olin Oden <james.oden>
Assignee | Jeff Johnson <jbj>
QA Contact | Mike McLean <mikem>
CC | herrold, nobody+pnasrat, paul.winder
Doc Type | Bug Fix
Last Closed | 2006-02-21 19:01:50 UTC
Description
James Olin Oden
2004-03-05 21:33:18 UTC
Created attachment 98329
Test harness for rpm.
Here is the email I sent to Jeff Johnson concerning this test harness:
=== email ===
I have attached my little test harness. It's nowhere near complete,
and pretty much geared towards testing my autorollback stuff.
After you untar it you will need to edit conf.sh and set which rpm
you want to use (it has a var USE_BUILD_RPM, which when set to 1 will
use the rpm in your build directory; set BUILDDIR to tell it where that
is). Next, simply type ./init. This will clean out the repackaged
package directory (its own), and copy the local system rpmdb into
the ./db directory (and remove the __db* files).
Once you have done this you are ready to test. There are seven tests
numbered 1 - 7, and to run one you just type:
./test 1
or some other test number. Normally, to catch db handle leaks I run:
./re-test 1 2 3 4 5 6 7
which will run all the tests one at a time in a loop. The problem with
this for you is you won't have the autorollback patch so it will fail.
So you just need to change the re-test script to ignore test errors,
because in this case you're only looking for the lockup.
Concerning the lockup, you hit it right on the head. It was in
rpmsq.c while waiting for a child to return, and it sure looked
like it was hung on the mutex. I will add the sleeps on my
side, and re-run, and then try the sq->reaper = 0, turn on debug
and so on. I really want to see though, if this happens on a more
sane system than mine. If it does it may be a real problem.
If you want the backtrace, I would be happy to send it, but I don't
know how much more it can tell you than we have already discussed
(which is why I am omitting it for now).
Also, if you feel we might get better results of testing for this by
sending this to some list (rpm, rpm-devel), then feel free to forward
this on (or tell me to).
====
Fixed sufficiently to pass >2000 executions of ./re-test on i686.

I have come across the same bug and been able to produce a patch which seems to fix it. This is based on the source out of rpm-4.2.3-13.src.rpm and is to file rpmio/rpmsq.c

bash-3.00# diff -u Orpmsq.c rpmsq.c
--- Orpmsq.c Thu Feb 3 09:55:41 2005
+++ rpmsq.c Wed Feb 2 16:05:46 2005
@@ -462,15 +462,15 @@
     /* Wait for handler to receive SIGCHLD. */
     /*@-infloops@*/
-    while (ret == 0 && sq->reaped != sq->child) {
-        if (nothreads)
+    if (nothreads) {
+        while (ret == 0 && sq->reaped != sq->child)
             /* Note that sigpause re-enables SIGCHLD. */
             ret = sigpause(SIGCHLD);
-        else {
-            xx = sigrelse(SIGCHLD);
+    } else {
+        xx = sigrelse(SIGCHLD);
+        while (ret == 0 && sq->reaped != sq->child)
             ret = pthread_cond_wait(&sq->cond, &sq->mutex);
-            xx = sighold(SIGCHLD);
-        }
+        xx = sighold(SIGCHLD);
     }
     /*@=infloops@*/

Tracing showed that the SIGCLD from the terminating scriptlet fires on return from sigrelse(). This is where the race condition exists. The SIGCLD handler sets sq->reaped and calls pthread_cond_signal() to wake up a sleeping parent. In the parent, the while loop *outside* of the sigrelse() uses sq->reaped as its termination condition. Because we get the SIGCLD after the while (condition) test but before the pthread_cond_wait (futex), we neither catch the termination condition of the while() nor wake up the parent, because it is not yet asleep. So the parent goes to sleep and is never woken. The patch moves the sigrelse() outside the while loop, which seems to close the race window.

The head is patched with a different patch that fixes this. Ultimately, as I read the man pages for pthread_cond*, the use of pthread_cond* in the same thread of execution is wrong and is the real source of the deadlock.
See bugzilla: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=146549

Your patch does not completely remove the race condition, but it does make it less likely to occur (and whether it fixes the race condition or not, it may be the right thing to do).
Cheers...james

*** This bug has been marked as a duplicate of 146549 ***

Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.