73134 – rpm-4.1 hangs: SIGCHLD missed

Summary: rpm-4.1 hangs: SIGCHLD missed

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Red Hat Raw Hide
Classification:	Retired
Component:	rpm
Sub Component:
Version:	1.0
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Jeff Johnson
QA Contact:
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	73651
Depends On:
Blocks:

Reported:	2002-08-31 00:17 UTC by Enrico Scholz
Modified:	2007-04-18 16:46 UTC
CC List:	13 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2002-12-17 19:32:48 UTC
Embargoed:

Attachments
Avoid SIGCHLD race patch. (2.08 KB, patch) 2002-08-31 18:20 UTC, Jeff Johnson
shows problem (724 bytes, text/plain) 2002-11-25 23:07 UTC, S. A. Hutchins
shows potential solution (943 bytes, text/plain) 2002-11-25 23:08 UTC, S. A. Hutchins
Patch against rpm 4.1-1.06 (1.07 KB, patch) 2002-11-25 23:12 UTC, S. A. Hutchins

Description Enrico Scholz 2002-08-31 00:17:27 UTC

Description of Problem:

While upgrading packages, the rpm process randomly stalls: the progress bar
reaches 100% but nothing happens afterwards (I waited up to 2 hours). 'ps' shows
that there is a zombie child process:

| # ps axfu
| root 16916 17.2  3.1 19180 14352 pts/3 S Aug27 2:27 \_ rpm -Uvh --oldpackage -
| root 17919  0.0  0.0     0     0 pts/3 Z Aug27 0:00     \_ [sh <defunct>]


I can completely recover by executing 'killall -CHLD rpm'.


Version-Release number of selected component (if applicable):

rpm-4.1-1.02  (first seen somewhere at 4.1-0.80 with the withdrawn 
               glibc-2.2.90-24)
glibc-2.2.90-26
bash-2.05b-5
vanilla linux-2.4.19


How Reproducible:

randomly

Comment 1 Jeff Johnson 2002-08-31 13:31:16 UTC

Hmmm this is a missing SIGCHLD while running
a scriptlet.

You can (or could, haven't checked recently)
use waitpid if you recompile rpm.
Change to 0 at lib/psm.c:988
	    psm->reaper = 1;

Otherwise, there has to be a hole in the
SIGCHLD handler code. If you could take
a look at the code to see if you can spot
the problem, I'd be grateful. Missing
signals are very painful to fix ...

Comment 2 Jeff Johnson 2002-08-31 17:46:44 UTC

I've got a fix (I believe), but meanwhile I'm gonna use this
bug as one of several "rpm hangs" categories.

Comment 3 Jeff Johnson 2002-08-31 18:19:04 UTC

I was hoping sleep(2) would avoid the need
for this patch. Opinions (and real world experience)
appreciated, it's gonna take a bit more time to
believe there are no races here.

Patch will be in rpm-4.1-1.04 when built.

Comment 4 Jeff Johnson 2002-08-31 18:20:19 UTC

Created attachment 74324
Avoid SIGCHLD race patch.

Comment 5 Gerald Teschl 2002-09-04 12:57:17 UTC

I can confirm this problem. I just upgraded a bunch of rpms from rawhide
and it happened twice.

One time was when upgrading the single package

timeconfig-3.2.8-2 -> timeconfig-3.2.8-3

Again, there is a dead (zombie) child process:

root     20131  0.1  3.5  7772 3336 pts/1    S    14:41   0:00 /bin/rpm -U
/usr/src/redhat-7.3.94/updates/timeconfig-3.2.8-3.i386.rpm
root     20132  0.0  0.0     0    0 pts/1    Z    14:41   0:00 [sh <defunct>]

and "killall -CHLD rpm" recovers.

Comment 6 Jeff Johnson 2002-09-04 17:05:53 UTC

OK, I think I finally found the SIGCHLD problem:

  Ya gotta register the handler before doing the fork(), duh.

I'm gonna put rpm-4.1-1.06 packages with the fix on
        ftp://people.redhat.com/jbj/test-4.1

I'd be grateful for any testing.
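
A minimal sketch of the ordering described in this comment, not the code from lib/psm.c: the handler is installed with sigaction() before any fork(), because the default disposition of SIGCHLD is to discard the signal, so a child that exits before the handler is registered produces no notification. The names `sigchld_seen`, `on_sigchld` and `install_sigchld_handler` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical flag; rpm's real handler updates its own psm bookkeeping. */
static volatile sig_atomic_t sigchld_seen;

static void on_sigchld(int signo)
{
    (void)signo;
    sigchld_seen = 1;
}

/* Install the SIGCHLD handler. Call this BEFORE fork(). Returns 0 on success. */
int install_sigchld_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    /* Restart interrupted syscalls; ignore stop/continue notifications. */
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}
```

A caller would run install_sigchld_handler() once and only then fork() the scriptlet. Registering the handler early does not by itself close the later races discussed in this bug.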

Comment 7 R P Herrold 2002-09-05 02:29:51 UTC

Comment on:  "I was hoping sleep(2) would [work]" remark

If it is a resource race, a 'jittery' retry interval rather than a static wait
time may clear some cases. If it is a true resource-unavailable deadlock (a
'deadly embrace'), there is no joy to be found there. If all that was wrong was
the usage error and fix articulated 2002-09-04 13:05:53, you are out of the
woods, anyway.

My $0.02 -- I use jitter elsewhere in RH Linux work to get past resource
unavailability friction in this fashion.

Comment 8 Scott Lamb 2002-09-07 19:03:06 UTC

This seems to be the problem I've been experiencing. I see the zombie shell, and
the kill -CHLD trick kicks it. I've downloaded those patched packages, will let
you know if I experience it again...

Comment 9 Jeff Johnson 2002-09-17 11:46:31 UTC

*** Bug 73651 has been marked as a duplicate of this bug. ***

Comment 10 Enrico Scholz 2002-10-01 18:17:16 UTC

I am still seeing hangs with rpm-4.1-1.06, but now there are no zombie children
and strace shows that rpm is waiting in pause().

After looking at your signal handler, I see a race condition when multiple
child processes exit before the handler is executed: several exits can be
reported by a single SIGCHLD. Therefore, I suggest the classical version, which
calls waitpid() in a loop:

 --- lib/psm.c ----
|          case SIGCHLD:
| -        {   int status = 0;
| +        while (1) {
| +            int status = 0;
|              pid_t reaped = waitpid(0, &status, WNOHANG);
|              int i;
|
| +            if (reaped==-1) break;
|

Unfortunately, this does not explain the pause() hangs... :(
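
A self-contained sketch of the "classical" reaping loop suggested above, under two assumptions that differ from the quoted diff: it waits for any child (`waitpid(-1, ...)` rather than `waitpid(0, ...)`), and it also stops when waitpid() returns 0, which WNOHANG reports when children exist but none has exited yet. The function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reap every child that has already exited; returns how many were reaped. */
int reap_all_children(void)
{
    int count = 0;
    int status;

    for (;;) {
        pid_t reaped = waitpid(-1, &status, WNOHANG);

        if (reaped <= 0)
            break;  /* 0: children still running; -1: none left or error */
        count++;
    }
    return count;
}

/* SIGCHLD handler: one signal may stand for several exits, so drain them all. */
void on_sigchld(int signo)
{
    int saved_errno = errno;   /* waitpid() may clobber errno */

    (void)signo;
    (void)reap_all_children();
    errno = saved_errno;
}
```

A real handler would also record each reaped pid and status (as psm does) instead of only counting.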

Comment 11 Jeff Johnson 2002-10-01 18:35:57 UTC

Thanks for the patch.

I'm not sure that it's a fix, as there should
(ATM) only be a single child.

BTW, the simple fix for this problem is
    psm->child = 0;
    psm->reaped = 0;
    psm->status = 0;
-   psm->reaper = 1;
+   psm->reaper = 0;

but there are important speedups to an install
by overlapping scriptlet execution with rpm
execution if the race can be identified, and your
patch will definitely be useful then.

Comment 12 Jeff Johnson 2002-10-01 19:01:23 UTC

Hmmm, SIGCHLD received between accessing the reaped
pid and pausing is a race...

Comment 13 Enrico Scholz 2002-10-07 19:38:20 UTC

Although already sent per email, I am entering it again for completeness and to
test whether I can add comments. Yesterday I got a strange message about an
invalid username...

----------


Another (IMO more critical) race is in

|    if ((psm->child = psmRegisterFork(psm)) == 0) {

When the child created by fork() in psmRegisterFork() runs and exits before the
parent continues, 'psm->child' is still unset when the signal handler runs, so
'psm->reaped' is never set.

In chronological order, this is what happens:


      parent                            child
    ------------------------------------------
                      fork()                      
                                        exec()
                                        exit()
      <<< SIGCHLD
          handler searches
          child-pid in psm list
          but can not find it

     psm->child=pid
     while (!psm->reaped) pause();


Unless you are running a realtime system, sleep() will not be a
solution ;)  Instead, I suggest the classical approach, 
synchronisation through pipes:

| pipe(psm->pipe);
| psm->child = fork();
|
| if (psm->child==0) { // child
|    close(psm->pipe[1]);
|    read(psm->pipe[0],...); // blocks until EOF
|    close(psm->pipe[0]);
|    ...
|    exec(...)
| }
| else { // parent
|    close(psm->pipe[0]);
|    close(psm->pipe[1]);
|
|    psmWait(psm);
| }
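
The pseudocode above can be fleshed out roughly as follows. This is a sketch under stated assumptions, not rpm code: `spawn_synchronized`, `slot` and `exit_code` are hypothetical names, and the child calls `_exit()` where a real caller would exec() the scriptlet.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that cannot proceed until the parent has stored its pid in
 * *slot. Returns 0 on success, -1 on error.
 */
int spawn_synchronized(pid_t *slot, int exit_code)
{
    int fds[2];
    pid_t pid;

    if (pipe(fds) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        char byte;
        ssize_t n;

        close(fds[1]);                  /* keep only the read end */
        do {
            n = read(fds[0], &byte, 1); /* 0 (EOF) once the parent closes */
        } while (n > 0 || (n < 0 && errno == EINTR));
        close(fds[0]);
        _exit(exit_code);               /* placeholder for exec() */
    }

    *slot = pid;    /* bookkeeping is complete before the child is released */
    close(fds[0]);
    close(fds[1]);  /* last write end closed: the child's read() sees EOF */
    return 0;
}
```

The child only sees EOF once every copy of the write end is closed, so this relies on no other process having inherited that descriptor.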



The race in the 

|        while (psmGetReaped(psm) == 0)
|            (void) pause();

loop can be removed by triggering a periodic SIGALRM (e.g. every 0.5 seconds).
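
A sketch of the periodic SIGALRM idea using setitimer(); the helper names and the interval parameter are hypothetical. The SIGALRM handler does nothing, it only makes pause() return so the loop condition is re-evaluated at least once per tick.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* No-op handler: its only job is to interrupt pause(). */
static void on_sigalrm(int signo)
{
    (void)signo;
}

/* Start a repeating SIGALRM every interval_usec microseconds (must be > 0). */
int start_wakeup_timer(long interval_usec)
{
    struct sigaction sa;
    struct itimerval tv;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigalrm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    tv.it_interval.tv_sec = interval_usec / 1000000L;
    tv.it_interval.tv_usec = interval_usec % 1000000L;
    tv.it_value = tv.it_interval;
    return setitimer(ITIMER_REAL, &tv, NULL);
}

/* Disarm the timer. */
int stop_wakeup_timer(void)
{
    struct itimerval off;

    memset(&off, 0, sizeof(off));
    return setitimer(ITIMER_REAL, &off, NULL);
}

/* Wait until *flag is nonzero, rechecking at least once per timer tick. */
void wait_for_flag(const volatile sig_atomic_t *flag)
{
    while (!*flag)
        (void)pause();
}
```

With a 0.5 second interval this bounds the delay caused by a missed SIGCHLD to about half a second; it hides the race rather than removing it.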

Comment 14 S. A. Hutchins 2002-11-25 23:07:20 UTC

Created attachment 86412
shows problem

Comment 15 S. A. Hutchins 2002-11-25 23:08:10 UTC

Created attachment 86413
shows potential solution

Comment 16 S. A. Hutchins 2002-11-25 23:08:54 UTC

Enrico is right. This occurs when the child runs before the parent. This happens
more often in 2.4 than it did in 2.2. See the included test programs for a
demonstration. The second test program shows how to avoid the problem by keeping
SIGCHLD blocked until the child pid is set, which is what the rpm patch implements.
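
The attached test programs are not reproduced here. The following is an independent sketch of the same idea: SIGCHLD stays blocked across fork() until the child pid has been recorded, so the handler can never run while the pid is still unknown. `child_pid` and `child_reaped` are hypothetical stand-ins for psm->child and psm->reaped, and the sketch assumes a pid fits in sig_atomic_t (true on Linux, where both are int).

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t child_pid;     /* stand-in for psm->child */
static volatile sig_atomic_t child_reaped;  /* stand-in for psm->reaped */

static void on_sigchld(int signo)
{
    int saved_errno = errno;
    int status;
    pid_t want = (pid_t)child_pid;

    (void)signo;
    if (want > 0 && waitpid(want, &status, WNOHANG) == want)
        child_reaped = 1;
    errno = saved_errno;
}

/* Fork with SIGCHLD blocked; returns the child pid, or -1 on error. */
pid_t fork_recorded(int exit_code)
{
    struct sigaction sa;
    sigset_t block, old;
    pid_t pid;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGCHLD, &sa, NULL) != 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;

    child_reaped = 0;
    pid = fork();
    if (pid == 0) {
        (void)sigprocmask(SIG_SETMASK, &old, NULL);
        _exit(exit_code);       /* placeholder for the scriptlet */
    }
    if (pid > 0)
        child_pid = pid;        /* recorded while SIGCHLD is still blocked */

    /* Restoring the mask delivers any SIGCHLD that arrived in the meantime. */
    (void)sigprocmask(SIG_SETMASK, &old, NULL);
    return pid;
}
```

This closes only the "child exits before the pid is stored" race; how the parent then waits for `child_reaped` is a separate question.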

Comment 17 S. A. Hutchins 2002-11-25 23:12:55 UTC

Created attachment 86414
Patch against rpm 4.1-1.06

Comment 18 Jeff Johnson 2002-11-25 23:19:52 UTC

Proposed fix is in rpm-4.1-9 packages at
	ftp://people.redhat.com/jbj/test-4.1
What remains is to do
	s/pause()/sleep(1)/
in lib/psm.c (which obscures the problem, but avoids
the race mentioned above).

Comment 19 Gerald Teschl 2002-11-26 10:05:46 UTC

FYI: rpm-4.1-9 still caused one hang here. But it is *much* better than with the
previous version.

Comment 20 Warren Togami 2002-11-26 17:19:56 UTC

gt.at, are you an apt-get user by chance?  If so do you recall using
CTRL-C to abort apt-get sometime before your last rpm-4.1-9 lockup?

https://listman.redhat.com/mailman/private/rpm-list/2002-November/017047.html
https://listman.redhat.com/mailman/private/rpm-list/2002-November/017073.html
https://listman.redhat.com/mailman/private/rpm-list/2002-November/017083.html

Comment 21 S. A. Hutchins 2002-11-26 20:14:45 UTC

Not an apt-get user. Using RPM in an install script under a scheduler that
(almost always) lets the child run first. Playing with the 4.1.9 version.

Comment 22 Gerald Teschl 2002-11-26 20:19:52 UTC

No, I don't use apt-get, I use my own script autoupdate. It uses perl-RPM2,
but it does only access the rpmdb in ro mode (perl-RPM2 doesn't support
write operations). I'll try some stress testing to see if it happens again.

Comment 23 Dwight Engen 2002-12-11 18:15:52 UTC

I think there is still a small window for a missed wake up in psm.c right
between where the signal is unblocked with sigprocmask(SIG_SETMASK) and actually
doing the pause(). Since reaped was already tested against child in the while()
above, if the signal handler runs at this point (just before calling pause()),
we'll call pause and not wake up till we get another signal.

One solution could be to set an itimer so we get a SIGALRM but that seems a bit
ugly.

Comment 24 Jeff Johnson 2002-12-17 19:32:48 UTC

This problem appears resolved since rpm-4.1-9 with the
associated
    s/pause()/sleep(1)/
or (even better) sigsuspend.
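
A sketch of the sigsuspend variant mentioned here, assuming a simple flag set by the handler (the names are hypothetical, not taken from lib/psm.c). SIGCHLD is blocked while the flag is tested, and sigsuspend() unblocks it and sleeps in one atomic step, so a signal arriving between the test and the sleep is not lost the way it can be with pause().

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t got_sigchld;

static void on_sigchld(int signo)
{
    (void)signo;
    got_sigchld = 1;
}

/* Install the flag-setting SIGCHLD handler. Returns 0 on success. */
int install_sigchld_flag_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Block until the handler has run at least once. Returns 0 on success. */
int wait_for_sigchld(void)
{
    sigset_t block, old, wait_mask;

    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;

    /* Mask used while sleeping: the caller's mask with SIGCHLD unblocked. */
    wait_mask = old;
    sigdelset(&wait_mask, SIGCHLD);

    while (!got_sigchld)
        (void)sigsuspend(&wait_mask);  /* unblock and sleep atomically */
    got_sigchld = 0;

    return sigprocmask(SIG_SETMASK, &old, NULL);
}
```

Even if SIGCHLD is already pending when wait_for_sigchld() is entered, the call returns promptly instead of sleeping until the next signal.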

Comment 25 Gerald Teschl 2003-03-12 10:20:10 UTC

As you can see from comment #19, this is not fixed in 4.1-9.

Comment 26 Gerald Teschl 2003-03-24 17:15:05 UTC

I no longer see this with 4.1.1. Good work. Thanks!

Comment 27 D. Hugh Redelmeier 2003-08-22 17:24:18 UTC

I have hit a bug with RPM in this class.
- RHL 8.0
- rpm-4.1-1.06 (original).

If this is a known RHL8 problem, why is there not an updated rpm available from
updates.redhat.com?

ftp://people.redhat.com/jbj/test-4.1 no longer exists.  I wonder if test-4.2.1's
rpms apply to 8.0.  I wonder if I can use the new rpm with old popt etc.

Hung in pause() while doing an rpm -Fv of all updates.
#0  0x08184847 in __libc_pause ()
#1  0x0814267f in pause ()
#2  0x0805fccc in psmWait ()
#3  0x08060236 in runScript ()
#4  0x08060838 in runInstScript ()
#5  0x08062a83 in rpmpsmStage ()
#6  0x0806240b in rpmpsmStage ()
#7  0x080628d5 in rpmpsmStage ()
#8  0x0807d085 in rpmtsRun ()
#9  0x0806dbf1 in rpmInstall ()
#10 0x08048e4d in main ()
#11 0x0815ad62 in __libc_start_main ()

ps shows that there are no children, not even zombies.

kill -CHLD awakens only momentarily.  Here is an strace:
pause()                                 = ? ERESTARTNOHAND (To be restarted)
--- SIGCHLD (Child exited) ---
wait4(0, 0xbfff8bc8, WNOHANG, NULL)     = -1 ECHILD (No child processes)
sigreturn()                             = ? (mask now [RTMIN])
rt_sigprocmask(SIG_BLOCK, ~[], [RTMIN], 8) = 0
rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0
pause(^C <unfinished ...>
