Bug 100439 - RHL9: reboot loop causes hangs on PE650
Product: Red Hat Linux
Classification: Retired
Component: kernel
Hardware: i686 Linux
Priority: medium  Severity: high
Assigned To: Dave Jones
QA Contact: Brian Brock
Reported: 2003-07-22 09:20 EDT by Larry Troan
Modified: 2016-04-18 05:41 EDT (History)
Doc Type: Bug Fix
Last Closed: 2003-08-18 15:03:29 EDT

Attachments
taroon patch to fix this issue (6.96 KB, patch)
2003-08-07 13:16 EDT, Michael K. Johnson

Description Larry Troan 2003-07-22 09:20:52 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0rc1) Gecko/20020424

Description of problem:
We have seen sporadic lockups on reboot with Red Hat Linux 9.

We created a small script, started from the rc scripts, that waits 30 seconds
after startup and then reboots the machine.  On average the machine reboots
about 200 times before it hangs during shutdown.  SysRq remains responsive and
produced the trace attached to this bug.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. reboot machine

Actual Results:  The machine hangs while sending the kill signal to all
processes.  A few times it has hung at other points in the shutdown sequence,
but I have only captured a trace after the kill signal was sent.

Expected Results:  No hang; the system reboots without intervention.

Additional info:

This is rated high severity because the drives are not unmounted when the
hang occurs, which risks data loss.  Otherwise it would be rated "normal".

Comment 1 Larry Troan 2003-07-22 09:23:51 EDT
------- Additional Comment #31 From Ernie Petrides on 2003-07-14 21:00 -------
I've been able to reproduce a SIGSTOP/SIGCONT signal racing problem,
which I believe is identical to the S01reboot/killall5 problem that
is reported above, using the following trivial program:

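#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        int p, s;

        p = getpid();
        signal(SIGTERM, SIG_IGN);       /* parent ignores SIGTERM */
        if (fork() == 0) {
                kill(p, SIGSTOP);       /* child stops the parent, */
                kill(p, SIGTERM);       /* sends the ignored signal, */
                kill(p, SIGCONT);       /* then resumes the parent */
        } else {
                /* The parent branch is cut off in the original comment;
                   wait(&s) is assumed from the otherwise unused `s`. */
                wait(&s);
        }
        return 0;
}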

If this program is run in an infinite loop from a shell, every once
in a while the shell reports that the parent process of the program
has become suspended due to a signal (instead of it terminating as
normal).  If this were to happen to the shell running the S01reboot
script, then a hang would occur during the reboot.
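
For anyone who wants to repeat this without watching a shell loop, the following is a minimal sketch of a harness around the same STOP/TERM/CONT sequence. It is not part of the original comment. The names run_once and count_stuck, the timeout parameter, and the assumption that the target simply waits for the sender and exits are all made up for illustration. A target that is still alive after the timeout is counted as stuck (presumably left stopped), then killed and reaped.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: run the STOP/TERM/CONT sequence once.
 * Returns 0 if the target exited normally, 1 if it was still alive
 * after timeout_ms (presumably left stopped), -1 on any other error. */
int run_once(int timeout_ms)
{
    pid_t target = fork();
    if (target < 0)
        return -1;

    if (target == 0) {
        /* Target: same signal sequence as the quoted test program. */
        pid_t p = getpid();
        int s;

        signal(SIGTERM, SIG_IGN);
        pid_t sender = fork();
        if (sender < 0)
            _exit(2);
        if (sender == 0) {
            kill(p, SIGSTOP);
            kill(p, SIGTERM);
            kill(p, SIGCONT);
            _exit(0);
        }
        wait(&s);               /* assumption: target just reaps the sender */
        _exit(0);
    }

    /* Harness: poll with WNOHANG so a stopped target cannot hang us. */
    struct timespec tick = { 0, 1000000L };     /* 1 ms */
    for (int waited = 0; waited < timeout_ms; waited++) {
        int status;
        pid_t r = waitpid(target, &status, WNOHANG);
        if (r == target)
            return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
        if (r < 0)
            return -1;
        nanosleep(&tick, NULL);
    }

    /* Still alive after the timeout: clean up and report it as stuck. */
    kill(target, SIGKILL);
    waitpid(target, NULL, 0);
    return 1;
}

/* Hypothetical helper: repeat the experiment and count stuck targets. */
int count_stuck(int iterations, int timeout_ms)
{
    int stuck = 0;
    for (int i = 0; i < iterations; i++)
        if (run_once(timeout_ms) == 1)
            stuck++;
    return stuck;
}
```

Calling count_stuck(100000, 2000) from a small main() on an SMP machine would be one way to use it. On a kernel with correct signal ordering the count should stay at zero, since SIGCONT always follows SIGSTOP.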

My theory is that the SIGSTOP and SIGCONT signals are being handled
out of order due to an SMP race condition in the kernel's signal
handling code.  But I need to investigate this further.

Thanks go to Robert Hentosh for the valuable debugging clues.

------- Additional Comment #32 From Ernie Petrides on 2003-07-18 20:36 -------
This problem was caused by an SMP race condition in the kernel's handling
of SIGSTOP and SIGCONT signals, which are used by /sbin/killall5, which is
invoked by the S01reboot script.  A kernel patch has been created to fix the
locking in the kernel routines get_signal_to_deliver() and do_signal_stop(),
and this patch is currently under review.

Once this patch (or any potentially revised version) has been committed to
the B2 patch pool for RHEL 3, a follow-up comment will be posted here and
this bug's status will be changed to "modified".
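
To make the killall5 connection concrete, here is a rough sketch of the stop, signal, continue pattern described in this comment. It is an illustration only, not the killall5 source. As far as I understand, killall5 freezes and thaws every process with kill(-1, ...), whereas this sketch restricts itself to a few children it forks itself so that it is safe to run. The function name stop_signal_continue and the NPROC constant are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC 3     /* hypothetical number of demo processes */

/* Freeze a set of children, send each one `sig`, then thaw them.
 * Returns how many children were terminated by `sig`. */
int stop_signal_continue(int sig)
{
    pid_t pids[NPROC];
    int n = 0;

    /* Create idle children that do nothing but wait for signals. */
    for (int i = 0; i < NPROC; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0)
            for (;;)
                pause();
        pids[n++] = pid;
    }

    /* 1. Freeze everything so no process can react or fork mid-sweep. */
    for (int i = 0; i < n; i++)
        kill(pids[i], SIGSTOP);

    /* 2. Queue the real signal; it stays pending while the target is stopped. */
    for (int i = 0; i < n; i++)
        kill(pids[i], sig);

    /* 3. Thaw; each target now acts on the pending signal. */
    for (int i = 0; i < n; i++)
        kill(pids[i], SIGCONT);

    /* Reap the children and count those terminated by `sig`. */
    int killed = 0;
    for (int i = 0; i < n; i++) {
        int status;
        if (waitpid(pids[i], &status, 0) == pids[i] &&
            WIFSIGNALED(status) && WTERMSIG(status) == sig)
            killed++;
    }
    return killed;
}
```

With a correct kernel, stop_signal_continue(SIGTERM) should report all NPROC children terminated. The race described above corresponds to step 3 failing to undo step 1, which would leave a target stopped and the final waitpid() blocked.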

------- Additional Comment #33 From Larry Troan on 2003-07-22 09:04 -------
Event posted 07-21-2003 05:45pm by rhentosh with duration of 0.00        
Bugzilla 90509 mentions that there is a fix for taroon.  Please make sure
that this fix is in the GinGin branch as well.

Comment 2 Michael K. Johnson 2003-08-07 13:16:43 EDT
Created attachment 93487
taroon patch to fix this issue

Since this patch does not appear separately in Bugzilla, I am attaching it here.

Comment 3 Dave Jones 2003-08-07 16:28:19 EDT
Can you please test the kernel at
http://people.redhat.com/davej/.kernels/2.4.20-19.9.3/, which includes this
patch, and let me know how that works out?

Comment 4 Michael K. Johnson 2003-08-11 13:40:50 EDT
What was the result of the weekend testing on this issue?

Comment 5 Dale Kaisner 2003-08-12 09:34:11 EDT
This fix was not tested over the weekend due to the focus on Bugzilla #100739.
Testing was started on 8/11/03.  Expect high-confidence results by EOB on
8/12/03 and final approval by 8/14/03.

Comment 6 Michael K. Johnson 2003-08-18 15:03:29 EDT
Following up on our phone conversation: this appears to be fixed and will be
in the next official errata kernel.

Comment 7 Matt Domsch 2003-08-18 15:24:58 EDT
Yes, we confirm it is fixed in 2.4.20-19.3 and above, thanks.

Comment 8 Larry Troan 2003-08-20 17:02:03 EDT
Opening up per Dell request (rh)
