Bug 73134
Field | Value
---|---
Summary | rpm-4.1 hangs: SIGCHLD missed
Product | [Retired] Red Hat Raw Hide
Reporter | Enrico Scholz <rh-bugzilla>
Component | rpm
Assignee | Jeff Johnson <jbj>
Status | CLOSED RAWHIDE
Severity | medium
Priority | medium
Version | 1.0
CC | barryn, dengen40, gczarcinski, gt, herrold, hugh, rdieter, redhat-bugzilla, redhat, rivenburgh, sh4d0wstr1f3, wtogami, yaneti
Hardware | i386
OS | Linux
Doc Type | Bug Fix
Last Closed | 2002-12-17 19:32:48 UTC

Description
Enrico Scholz
2002-08-31 00:17:27 UTC
Hmmm, this is a missing SIGCHLD while running a scriptlet. You can (or could, I haven't checked recently) use waitpid instead if you recompile rpm. At lib/psm.c:988, change

    psm->reaper = 1;

to 0. Otherwise, there has to be a hole in the SIGCHLD handler code. If you could take a look at the code to see if you can spot the problem, I'd be grateful. Missing signals are very painful to fix ...

I've got a fix (I believe), but meanwhile I'm gonna use this bug as one of several "rpm hangs" categories.

I was hoping sleep(2) would avoid the need for this patch. Opinions (and real world experience) appreciated, it's gonna take a bit more time to believe there are no races here. Patch will be in rpm-4.1-1.04 when built.

Created attachment 74324 [details]
Avoid SIGCHLD race patch.
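
The `psm->reaper = 0` workaround mentioned above amounts to reaping the scriptlet synchronously with waitpid() instead of relying on a SIGCHLD handler. A minimal sketch of that idea, assuming a single child run through /bin/sh; `run_scriptlet_sync` is a hypothetical helper for illustration, not rpm code:

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of rpm): run `script` with /bin/sh and
 * reap it with a blocking waitpid(), so no SIGCHLD handler is involved.
 * Returns the child's exit status, or -1 on error or abnormal exit. */
int run_scriptlet_sync(const char *script)
{
    int status = 0;
    pid_t child, reaped;

    assert(script != NULL);

    child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        execl("/bin/sh", "sh", "-c", script, (char *)NULL);
        _exit(127);                 /* exec failed */
    }

    /* The exit status stays available until it is collected, so it does
     * not matter whether the child finishes before or after this call. */
    do {
        reaped = waitpid(child, &status, 0);
    } while (reaped < 0 && errno == EINTR);

    if (reaped != child)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `run_scriptlet_sync("exit 3")` returns 3. The cost is that the parent does nothing else while the scriptlet runs.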
I can confirm this problem. I just upgraded a bunch of rpms from rawhide and it happened two times. One time was when upgrading the single package timeconfig-3.2.8-2 -> timeconfig-3.2.8-3; again there is a dead child process

    root 20131 0.1 3.5 7772 3336 pts/1 S 14:41 0:00 /bin/rpm -U /usr/src/redhat-7.3.94/updates/timeconfig-3.2.8-3.i386.rpm
    root 20132 0.0 0.0 0 0 pts/1 Z 14:41 0:00 [sh <defunct>]

and "killall -CHLD rpm" recovers.

OK, I think I finally found the SIGCHLD problem: Ya gotta register the handler before doing the fork(), duh. I'm gonna put rpm-4.1-1.06 packages with the fix on ftp://people.redhat.com/jbj/test-4.1 I'd be grateful for any testing.

Comment on the "I was hoping sleep(2) would [work]" remark: If it is a resource race, a 'jittery' retry interval rather than a static wait time may clear some cases; if it is a true resource-unavailable deadly embrace, there is no joy to be found there. If all that was wrong was the grammar/usage error and fix articulated 2002-09-04 13:05:53, you are out of the woods, anyway. My $0.02 -- I use jitter elsewhere in RH Linux work to get past resource unavailability friction in this fashion.

This seems to be the problem I've been experiencing. I see the zombie shell, and the kill -CHLD trick kicks it. I've downloaded those patched packages, will let you know if I experience it again...

*** Bug 73651 has been marked as a duplicate of this bug. ***

I am still seeing hangs with rpm-4.1-1.06; but there are no child-zombies and strace shows that rpm is waiting in pause(). After looking into your sighandler I see a race condition when multiple child processes exit before the sighandler is executed. Therefore, I suggest the classical version which calls waitpid() in a loop:

--- lib/psm.c ----
| case SIGCHLD:
| - { int status = 0;
| + while (1) {
| + int status = 0;
| pid_t reaped = waitpid(0, &status, WNOHANG);
| int i;
| + if (reaped==-1) break;
|

Unfortunately, this does not explain the pause() hangs... :(

Thanks for the patch. I'm not sure that it's a fix, as there should (ATM) only be a single child. BTW, the simple fix for this problem is

      psm->child = 0;
      psm->reaped = 0;
      psm->status = 0;
    - psm->reaper = 1;
    + psm->reaper = 0;

but there are important speedups to an install from overlapping scriptlet execution with rpm execution if the race can be identified, and your patch will definitely be useful then.

Hmmm, SIGCHLD received between accessing the reaped pid and pausing is a race...

Although already sent per email, I am entering it again for completeness and to test if I can add comments. Yesterday I got a strange message about an invalid username...

----------

Another (IMO more) critical race is in

| if ((psm->child = psmRegisterFork(psm)) == 0) {

When the child-fork() in psmRegisterFork() is faster than the parent, 'psm->child' is still unset when the signal handler runs, and 'psm->reaped' will therefore not be set. Over time the following happens:

    parent                           child
    ------------------------------------------
    fork()
                                     exec()
                                     exit()
    <<< SIGCHLD
    handler searches child-pid in
    psm list but can not find it
    psm->child=pid
    while (!psm->reaped) pause();

Unless you are running a realtime system, sleep() will not be a solution ;) Instead, I suggest the classical way -- the synchronisation through pipes:

| pipe(psm->pipe);
| psm->child = fork();
|
| if (psm->child==0) { // child
|     close(psm->pipe[1]);
|     read(psm->pipe[0],...); // blocks until EOF
|     close(psm->pipe[0]);
|     ...
|     exec(...)
| }
| else { // parent
|     close(psm->pipe[0]);
|     close(psm->pipe[1]);
|
|     psmWait(psm);
| }

The race in the

| while (psmGetReaped(psm) == 0)
|     (void) pause();

loop can be removed by triggering a periodic SIGALRM (e.g. every 0.5 seconds).

Created attachment 86412 [details]
shows problem
Created attachment 86413 [details]
shows potential solution
Enrico is right. This occurs when the child runs before the parent, which happens more often in 2.4 than it did in 2.2. See the included test programs for a demonstration. The second test program shows how to avoid the problem by keeping SIGCHLD blocked until the child pid is set, which is what the rpm patch implements.

Created attachment 86414 [details]
Patch against rpm 4.1-1.06
Proposed fix is in rpm-4.1-9 packages at ftp://people.redhat.com/jbj/test-4.1 What remains is to do s/pause()/sleep(1)/ in lib/psm.c (which obscures the problem, but avoids the race mentioned above).

FYI: rpm-4.1-9 still caused one hang here. But it is *much* better than with the previous version.

gt.at, are you an apt-get user by chance? If so, do you recall using CTRL-C to abort apt-get sometime before your last rpm-4.1-9 lockup?
https://listman.redhat.com/mailman/private/rpm-list/2002-November/017047.html
https://listman.redhat.com/mailman/private/rpm-list/2002-November/017073.html
https://listman.redhat.com/mailman/private/rpm-list/2002-November/017083.html

Not an apt-get user. Using RPM in an install script under a scheduler that (almost always) lets the child run first. Playing with the 4.1.9 version.

No, I don't use apt-get, I use my own script autoupdate. It uses perl-RPM2, but it only accesses the rpmdb in ro mode (perl-RPM2 doesn't support write operations). I'll try some stress testing to see if it happens again.

I think there is still a small window for a missed wake-up in psm.c, right between where the signal is unblocked with sigprocmask(SIG_SETMASK) and actually doing the pause(). Since reaped was already tested against child in the while() above, if the signal handler runs at this point (just before calling pause()), we'll call pause() and not wake up till we get another signal. One solution could be to set an itimer so we get a SIGALRM, but that seems a bit ugly.

This problem appears resolved since rpm-4.1-9 with the associated s/pause()/sleep(1)/ or (even better) sigsuspend.

As you can see from comment #19, this is not fixed in 4.1-9.

I no longer see this with 4.1.1. Good work. Thanks!

I have hit a bug with RPM in this class.
- RHL 8.0
- rpm-4.1-1.06 (original).
If this is a known RHL8 problem, why is there not an updated rpm available from updates.redhat.com? ftp://people.redhat.com/jbj/test-4.1 no longer exists. I wonder if test-4.2.1's rpms apply to 8.0. I wonder if I can use the new rpm with old popt etc.

Hung in pause() while doing an rpm -Fv of all updates.

    #0 0x08184847 in __libc_pause ()
    #1 0x0814267f in pause ()
    #2 0x0805fccc in psmWait ()
    #3 0x08060236 in runScript ()
    #4 0x08060838 in runInstScript ()
    #5 0x08062a83 in rpmpsmStage ()
    #6 0x0806240b in rpmpsmStage ()
    #7 0x080628d5 in rpmpsmStage ()
    #8 0x0807d085 in rpmtsRun ()
    #9 0x0806dbf1 in rpmInstall ()
    #10 0x08048e4d in main ()
    #11 0x0815ad62 in __libc_start_main ()

ps shows that there are no children, not even zombies. kill -CHLD awakens only momentarily. Here is an strace:

    pause() = ? ERESTARTNOHAND (To be restarted)
    --- SIGCHLD (Child exited) ---
    wait4(0, 0xbfff8bc8, WNOHANG, NULL) = -1 ECHILD (No child processes)
    sigreturn() = ? (mask now [RTMIN])
    rt_sigprocmask(SIG_BLOCK, ~[], [RTMIN], 8) = 0
    rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0
    pause(^C <unfinished ...>