Bug 102504

Summary: cannot reboot on Dell 6450 with RHEL 3
Product: Red Hat Enterprise Linux 3 Reporter: Suhua Ding <suhua.ding>
Component: kernel Assignee: Norm Murray <nmurray>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: high    
Version: 3.0 CC: anderson, bmaly, cogel, coughlan, dledford, greg.marsden, john, lwoodman, o.zaplinski, peterm, petrides, riel, tao, tburke, ttsig, van.okamura, wwlinuxengineering, zachary_reneau
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: RHSA-2006-0437 Doc Type: Bug Fix
Last Closed: 2006-07-20 13:12:57 UTC Type: ---
Bug Blocks: 181405, 186960    
Attachments:
Description | Flags
perf33 configuration (perf lab machine used by Brian Brock) | none
output from dmidecode | none
dmesg from 2.4.21-3.EL smp | none
console output from 2.4.21-3.ELsmp | none
console output from 2.4.21-3.EL (UP) | none
console output from 2.4.9-e.3smp | none
console output from 2.4.9-e.24smp | none
console output from 2.4.21-3.ELphro (panic on mount /) | none
console output from 2.4.21-3.ELsmp (rpm package built by jmoyer) | none
dmiscan.patch | none

Description Suhua Ding 2003-08-16 00:36:57 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.75 [en] (WinNT; U)

Description of problem:
I cannot reboot the Dell 6450 when RH 3 AS Beta 1 is installed. With RH 2.1AS I can reboot properly.

It displays the message "Restarting system" and hangs after that. I have to manually do a hard boot (switch off and on) to restart the server.

I also tried 2.4.21-1.1931.2.349.entsmp

Here is lsmod output

-bash-2.05b$ /sbin/lsmod
Module                  Size  Used by    Not tainted
parport_pc             19204   1 (autoclean)
lp                      9252   0 (autoclean)
parport                39104   1 (autoclean) [parport_pc lp]
ide-cd                 35776   0 (autoclean)
cdrom                  34176   0 (autoclean) [ide-cd]
autofs                 13780   0 (autoclean) (unused)
e100                   59140   1
floppy                 59120   0 (autoclean)
microcode               5248   0 (autoclean)
keybdev                 2976   0 (unused)
mousedev                5688   0 (unused)
hid                    22436   0 (unused)
input                   6208   0 [keybdev mousedev hid]
usb-ohci               23720   0 (unused)
usbcore                83168   1 [hid usb-ohci]
megaraid               31404  12
aic7xxx               165616   3
sd_mod                 13744  30
scsi_mod              116776   3 [megaraid aic7xxx sd_mod]
-bash-2.05b$ 


Version-Release number of selected component (if applicable):
2.4.21-1.1931.2.389.entsmp

How reproducible:
Always

Steps to Reproduce:
1. Install RHEL 3 on Dell 6450.
2. After successful startup, reboot machine (machine will hang)
    

Additional info

Comment 2 Tim Burke 2003-08-20 19:46:14 UTC
On the weekly Oracle call, they stated that this occurs on 2 different 6450s.


Comment 5 Brian Brock 2003-08-21 16:02:17 UTC
Apologies if I've typo'ed... didn't have console logging set up on that box yet,
so log is prone to manual mistakes.


This is what I see on console with shutdown of a pe6450 running Taroon-B1-i386-AS:


Please stand by while rebooting the system...
md: stopping all md devices.
flushing ide devices: hda
GDT: Flushing all host drives .. invalid kernel-mode pagefault 2!
[addr:00000000, eip:f880e565]
 
Pid/TGid: 2141/2141, comm:               reboot
EIP: 0060:[<f880e565>] CPU: 2
EIP is at scsi_build_commandblocks [scsi_mod] 0x25
(2.4.21-1.1931.2.349.2.2.entsmp)
 ESP: 0000:00000000 EFLAGS: 00010002    Not tainted
EAX: 00000000 EBX: 39fc2e00 ECX: 00000000 EDX: 39fc2e18
ESI: 39fc2e00 EDI: 00000000 EBP: 0c843b98 DS: 0068 ES: 0068 FS: 0000 GS: 0033
CR0: 8005003b CR2: 00000000 CR3: 00101000 CR4: 000006f0
Call Trace:   [<f88102f5>] scsi_get_host_dev_Rsmp_7d186429 [scsi_mod] 0x65
(0xc843b68)
[<f886cfae>] gdth_flush [gdth] 0x3e (0xc843b80)
[<021becd3>] poke_blanked_console [kernel] 0x53 (0xc843c58)
[<021bdec6>] vt_console_print [kernel] 0x226 (0xc843c64)
[<021281e3>] __call_console_drivers [kernel] 0x63 (0xc843c94)
[<021282e3>] call_console_drivers [kernel] 0x63 (0xc843cb0)
[<02128603>] printk [kernel] 0x143 (0xc843ce8)
[<f886d941>] .rodata.str1.32 [gdth] 0xe1 (0xc843cf4)
[<f88711b8>] gdth_notifier [gdth] 0x0 (0xc843cf4)
[<f886d0b8>] gdth_halt [gdth] 0x58 (0xc843d08)
[<021b43b0>] extract_entropy [kernel] 0x1e9 (0xc843d28)
[<f88c8f42>] rh_send_irq [usb-ohci] 0x82 (0xc843d3c)
[<021b9a71>] scrup [kernel] 0x121 (0xc843d80)
[<021becd3>] poke_blanked_console [kernel] 0x53 (0xc843dc0)
[<021bdec6>] vt_console_print [kernel] 0x226 (0xc943dcc)
[<0214fa6b>] kmem_cache_free_one [kernel] 0xfb (0xc843de0)
[<021281e3>] __call_console_drivers [kernel] 0x63 (0xc843dfc)
[<021282e3>] call_console_drivers [kernel] 0x63 (0xc843e18)
[<02128603>] printk [kernel] 0x143 (0xc843e50)
[<021e991c>] ide_notify_reboot [kernel] 0x7c (0xc843e70)
[<f88711b8>] gdth_notifier [gdth] 0x0 (0xc843e7c)
[<02137ec8>] notifier_call_chain [kernel] 0x2d (0xc843e8c)
[<f88711b9>] gdth_notifier [gdth] 0x0 (0xc843e90)
[<02137ec8>] sys_reboot [kernel] 0x118 (0xc843ea8)
[<02140b63>] handle_mm_fault [kernel] 0xf3 (0xc843ec0)
[<0211f5cc>] do_page_fault [kernel] 0x1bc (0xc843ef4)
[<02179110>] dput [kernel] 0x30 (0xc843f64)
[<0216176b>] __fput [kernel] 0xbb (0xc843f94)
[<0215fa9e>] filp_close [kernel] 0x8e (0xc843f94)
[<0215fb46>] sys_close [kernel] 0x66 (0xc843fb0)
 
invalid operand: 0000
parport_pc lp parport ide-cd cdrom autofs acenic e1000 e100 floppy microcode
keybdev mousedev hid input usb-ohci usbcore ext3 jbd gdth aic7xxx sd_mod scsi_mod
CPU:    2
EIP:    0060:[<0211f488>]    Not tainted
EFLAGS: 00010002
 
EIP is at do_page_fault [kernel] 0x78 (2.4.21-1.1931.2.349.2.2.entsmp)
eax: 00000001   ebx: 00000000   ecx: 00000001   edx: 02375e14
esi: 00000002   edi: 0211f410   ebp: 00000002   esp: 9c843a40
ds: 00068  es: 0068   ss: 0068
Process reboot (pid: 2141, stackpage-0c843000)
Stack: 0c844000 00000002 00000000 f880e565 00000000 00000000 00000000 00000000
       00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
       00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Call Trace:   [<f880e565>] scsi_build_commandblocks [scsi_mod] 0x25 (0xc843a4c)
[<021b40a6>] SHATransform [kernel] 0x26 (0xc843ab4)
[<021b3f79>] add_timer_randomness [kernel] 0xd9 (0xc843ad4)
[<0211f410>] do_page_fault [kernel] 0x0 (0xc843af8)
[<f880e565>] scsi_build_commandblocks [scsi_mod] 0x25 (0xc843b34)
[<f88102f5>] scsi_get_host_dev_Rsmp_7d186429 [scsi_mod] 0x65 (0xc843b68)
...
rest of stack trace appears identical
 
Code:  Bad EIP value.
INIT: no more processes left in this runlevel


Comment 6 Tom Coughlan 2003-08-21 20:02:26 UTC
Based on the call trace, this looks like the "gdth oops" problem that was fixed
in AS2.1. Note that this patch is only needed if the iorl patch is present, and
Taroon started out life without the iorl patch.

The patch from the Pensacola stream is linux-2.4.9-gdthoops.patch. The patch is
not in Taroon.

Brian, I don't have this hardware.  Can I send you a driver and have you test
it? Let me know the kernel version and type. 


Comment 7 Brian Brock 2003-08-21 20:21:57 UTC
I'll be glad to test drivers.  URL or mail will work.

I've tested: 
2.4.21-1.1931.2.349.2.2.ent and .entsmp
2.4.21-1.1931.2.399.ent and .entsmp

the .ent kernels work fine (no oops, reboot as expected).  the .entsmp kernels
oops on shutdown.  I've got full console logs from each test run now, too. 
Please let me know if those are useful and how you'd like them (in-line,
attachment, mail).

Comment 8 Doug Ledford 2003-08-21 20:39:24 UTC
Fix for this oops (same as gdth-oops fix in as2.1) sent to Rik.

Comment 9 Arjan van de Ven 2003-08-21 22:01:18 UTC
I don't think Oracle had a gdth card in their box....
this was just blocking our testing

Comment 10 Brian Brock 2003-08-21 22:53:43 UTC
2.4.21-1.1931.2.405.entsmp hangs immediately after:

md: stopping all md devices.
flushing ide devices: hda
GDT: Flushing all host drives .. Starting timer : 0 0
[end of console output]

No call trace, num-lock key still responds, so system isn't hard frozen.

version 2.4.21-1.1931.2.405.ent:
flushing ide devices: hda
GDT: Flushing all host drives .. Starting timer : 0 0
Starting timer : 0 0
Done.
Restarting system.
[and then the system restarts]

Comment 12 Brian Brock 2003-08-25 17:32:16 UTC
The affected system that I'm looking at is running aic7xxx and gdth, although
the gdth driver is for an adapter with no drives.

Running without gdth loaded, and with aic7xxx loaded and used in
2.4.21-1.1931.2.405.entsmp results in a hang here:

md: stopping all md devices.
flushing ide devices: hda
Restarting system.
[hang]

system has to be manually power cycled.

Comment 14 Tim Burke 2003-08-25 23:59:28 UTC
Added capability for Dell to look at this bugzilla.


Comment 15 Arjan van de Ven 2003-08-26 09:39:11 UTC
Brian, can you try kernel commandline options like reboot=c  reboot=w etc to see
if they make any difference ? 

Comment 16 Larry Troan 2003-08-26 13:39:43 UTC
Created attachment 93933 [details]
perf33 configuration (perf lab machine used by Brian Brock)

Comment 17 Larry Troan 2003-08-26 14:01:39 UTC
John Hull (Dell), list of reboot options requested by Arjan above.

The following should be used on the kernel line of the active kernel (grub.conf)
in the order given -- MKJ suggested this for another IHV bug with reboot that
turned out to be a BIOS problem.

reboot=w
reboot=c
reboot=b
reboot=h
reboot=s (first CPU)
reboot=s1 (second CPU on multi-CPU machine)
reboot=s2 (third CPU)
 :
reboot=sx (x=n-1 CPU)
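
For reference, a grub.conf stanza with one of these options appended might look like the sketch below. The kernel version and the root= value are borrowed from the uname and /proc/cmdline output in comment 43; the title, the (hd0,0) device and the file paths are assumptions and will differ from system to system.

```
title Red Hat Enterprise Linux AS (2.4.21-11.ELsmp)
        root (hd0,0)
        kernel /vmlinuz-2.4.21-11.ELsmp ro root=/dev/sda5 reboot=w
        initrd /initrd-2.4.21-11.ELsmp.img
```

Only the reboot= token changes between test runs; the rest of the kernel line stays as it was.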


Comment 18 Brian Brock 2003-08-26 14:20:21 UTC
4 hard drives are attached to the first aic7829 controller, each looks like:

blk: queue f7fc4c18, I/O limit 524287Mb (mask 0x7fffffffff)
  Vendor: QUANTUM   Model: ATLAS10K3_18_SCA  Rev: 120G
  Type:   Direct-Access                      ANSI SCSI revision: 03

CD-ROM drive is connected to the IDE controller
eth1 is the only configured physical network interface.




Comment 19 Larry Troan 2003-08-26 15:54:17 UTC
FROM JOHN HULL AT DELL (PROBLEMS WRITING TO BUGZILLA):
We haven't tried yet, mainly because I didn't understand what we were
supposed to be looking for, but also because our lab has been down. We'll
look at this, but my guess is that it's a BIOS problem.

I tried to update the Bugzilla to request info, but I didn't have
permission. Could someone find out from Oracle what BIOS level they're
running, and if they've tried older/newer BIOSes? 

	John

BRIAN, SAME QUESTION ON BIOSes for you..... what level are you running?

Comment 20 Brian Brock 2003-08-26 16:25:59 UTC
checking on BIOS level now.

adding 'reboot=b' to the kernel command line causes the system to properly
restart on reboot.

Comment 21 Brian Brock 2003-08-26 16:31:11 UTC
BIOS revision A02

Comment 22 Arjan van de Ven 2003-08-26 16:39:15 UTC
Can you attach the dmidecode output? That way we can list this box as "needs reboot=b".

Comment 23 Larry Troan 2003-08-26 21:10:26 UTC
Just heard from John Hull at Dell. The current BIOS level for the 6450 is A12
(Brian is at A02) and John believes this is why we're having a problem. Passed
the info on to Brian Brock and he is going to download the later BIOS version
from Dell's web site and try to recreate the problem again.

Comment 24 Brian Brock 2003-08-26 22:36:10 UTC
I can't reproduce a working setup with the machine... 'reboot=b' on the kernel
command line is insufficient, I made a mistake.   I'll post dmidecode output in
case it's useful and also am grabbing the BIOS update from Dell.

Comment 25 Brian Brock 2003-08-26 22:38:05 UTC
Created attachment 93966 [details]
output from dmidecode

Comment 26 Brian Brock 2003-08-27 15:50:00 UTC
updating the BIOS to A12 does not immediately help.

Comment 27 Brian Brock 2003-08-27 16:04:13 UTC
Which of the files on Dell's ftp site contain the firmware updates?

I've applied the BIOS updates but the system is giving warnings on boot:

Embedded server management firmware revision 5.25
!!***** Warning: Firmware is out-of-date, please update... *****
Primary system backplane controller firmware revision 1.16
!!***** Warning: Firmware is out-of-date, please update... *****
Power supply paralleling board firmware revision 2.37
!!***** Warning: Firmware is out-of-date, please update... *****


Comment 28 Brian Brock 2003-08-27 18:44:00 UTC
updating the firmware doesn't help, either.

With BIOS A12 and recent firmware, this system is behaving identically, and
hanging on shutdown.  no kernel command line options of the form 'reboot=X' make
a difference.

Comment 31 Brian Brock 2003-09-25 15:04:16 UTC
retested with kernel-2.4.21-3.EL

with no options, the UP kernel reboots fine, but the smp and hugemem kernels
fail to reboot upon shutdown.  didn't try 'reboot=' options (that takes about 2
hours of testing to detect a partial or complete failure), but I'll be glad to
do so if it's relevant testing.

Comment 32 Brian Brock 2003-09-25 15:39:21 UTC
Created attachment 94720 [details]
dmesg from 2.4.21-3.EL smp

can't get a complete dmesg output... note that the top is cropped off.

Comment 33 Brian Brock 2003-09-25 16:34:55 UTC
Created attachment 94723 [details]
console output from 2.4.21-3.ELsmp

output is complete, not cropped.

system doesn't reboot after shutdown.

Comment 34 Brian Brock 2003-09-25 16:36:15 UTC
Created attachment 94724 [details]
console output from 2.4.21-3.EL (UP)

system reboots properly after shutdown.

Comment 35 Brian Brock 2003-09-25 16:39:08 UTC
Created attachment 94725 [details]
console output from 2.4.9-e.3smp

system reboots properly on shutdown.

Comment 36 Brian Brock 2003-09-25 16:40:19 UTC
Created attachment 94726 [details]
console output from 2.4.9-e.24smp

system reboots properly on shutdown.

Comment 37 Brian Brock 2003-09-25 21:51:16 UTC
Created attachment 94743 [details]
console output from 2.4.21-3.ELphro (panic on mount /)

doesn't boot, looking for real problem in panic.

Comment 38 Brian Brock 2003-09-26 14:08:48 UTC
Created attachment 94757 [details]
console output from 2.4.21-3.ELsmp (rpm package built by jmoyer)

hangs on shutdown.

Comment 42 Doug Ledford 2004-01-14 23:52:06 UTC
There are two different problem reports here.  One is an original bug
report from Oracle, all the internal testing though has a gdth
controller.  We know we fixed the gdth problem already.  So, if this
still happens with the RHEL3 U1 beta kernel, then we need to know that
in order to work on finding out what the problem is.  Otherwise, the
problem should be fixed.  Setting bug to NEEDINFO until Oracle can
either confirm or deny that the issue is fixed with the U1 beta kernel.

Comment 43 Van Okamura 2004-03-22 20:32:24 UTC
Problem is reproducible on a Dell 6450 running RHEL 3 QU 2. System
gets to "Restarting system" and hangs, no oops output or system dumps.  

[root@palnx3 root]# uname -a
Linux palnx3 2.4.21-11.ELsmp #1 SMP Mon Mar 8 23:32:56 EST 2004 i686
i686 i386 GNU/Linux
[root@palnx3 root]# lsmod
Module                  Size  Used by    Not tainted
parport_pc             18852   1  (autoclean)
lp                      9124   0  (autoclean)
parport                38816   1  (autoclean) [parport_pc lp]
autofs                 13620   0  (autoclean) (unused)
e100                   58468   1
floppy                 57488   0  (autoclean)
microcode               6848   0  (autoclean)
ext3                   89960   4
jbd                    55060   4  [ext3]
megaraid               30604   0  (unused)
aic7xxx               162064   5
sd_mod                 13360  10
scsi_mod              112552   3  [megaraid aic7xxx sd_mod]
[root@palnx3 root]# cat /proc/cmdline
ro root=/dev/sda5



Comment 45 Bob Johnson 2004-06-07 17:47:29 UTC
Van, is this still an active issue ?

Comment 46 Greg Marsden 2004-06-07 23:17:03 UTC
This bug is affecting our test systems running on 6450s, which must
be power cycled manually after it hangs at the "Rebooting system" prompt. 

As was noted above, this hang happens only when running the -smp
kernel and the system will reboot when using the -up kernel.

The hang is not related to the gdth module (as dell does not require
this particular module).

Comment 47 Doug Ledford 2004-06-07 23:50:00 UTC
Can you boot the machine with the kernel command line option
nmi_watchdog=1 and then try to reboot the machine?  If there is an SMP
deadlock of some sort, the nmi watchdog should catch it and the oops
would tell us what lock in particular it is spinning on.

Comment 48 Greg Marsden 2004-06-08 00:49:30 UTC
Tried with 15.ELsmp and nmi_watchdog=1; system froze on 
"Restarting system." with no oops.

I've tried instrumenting (as in printk debugging) the reboot code, and
it gets as far as sending the right codes to the bios, but nothing
happens. 

Comment 49 Greg Marsden 2004-06-08 00:57:15 UTC
More specifically, in arch/i386/kernel/process.c the kernel gets past
the SMP specific code in machine_restart before freezing

Comment 51 Doug Ledford 2004-06-09 16:11:38 UTC
Well, I found something suspicious in the reboot code.  Specifically,
in machine_restart, we try to verify the reboot_cpu value to make sure
it's a valid processor, but I think the test has a thinko that keeps
it from working properly.  Specifically, we do this:

        int cpuid;

        cpuid = GET_APIC_ID(apic_read(APIC_ID));

        if (reboot_smp) {

                /* check to see if reboot_cpu is valid 
                   if its not, default to the BSP */
                if ((reboot_cpu == -1) ||
                      (reboot_cpu > (NR_CPUS -1))  ||
                      !(phys_cpu_present_map & (1<<cpuid)))
                        reboot_cpu = boot_cpu_physical_apicid;

The problem I see here is that we are checking phys_cpu_present_map
against 1<<cpuid which is whatever CPU this code gets run on and is
always true and which doesn't do what we want which is make sure that
the reboot_cpu is valid.  I suspect that the test above should be
rewritten to something like this:

if ((reboot_cpu < 0) ||
    (reboot_cpu > (NR_CPUS - 1)) ||
    !(phys_cpu_present_map & (1<<reboot_cpu)))
        reboot_cpu = boot_cpu_physical_apicid;

Greg, could you try making that change in your kernel sources there
and see if that makes any difference to whether or not the machine
reboots properly?  (Since you said you were instrumenting the reboot
code already I figured this would be a 10 minute test for you ;-)
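
The difference between the two tests can be shown outside the kernel. The following is a minimal user-space sketch, not the kernel code: the function names, SKETCH_NR_CPUS and the plain 32-bit bitmap are hypothetical stand-ins, and it assumes that CPU numbers and bit positions in the present map are the same thing.

```c
#include <stdint.h>

#define SKETCH_NR_CPUS 32  /* assumed CPU limit, for illustration only */

/* Original test: looks at the bit of the CPU running the code (cpuid),
 * so a bogus reboot_cpu passes whenever the current CPU is present. */
static int pick_reboot_cpu_original(int reboot_cpu, int cpuid,
                                    uint32_t present_map, int boot_cpu)
{
    if ((reboot_cpu == -1) ||
        (reboot_cpu > (SKETCH_NR_CPUS - 1)) ||
        !(present_map & (UINT32_C(1) << cpuid)))
        return boot_cpu;
    return reboot_cpu;
}

/* Proposed test: looks at the bit of the requested CPU instead,
 * and falls back to the boot CPU if that CPU is not present. */
static int pick_reboot_cpu_proposed(int reboot_cpu,
                                    uint32_t present_map, int boot_cpu)
{
    if ((reboot_cpu < 0) ||
        (reboot_cpu > (SKETCH_NR_CPUS - 1)) ||
        !(present_map & (UINT32_C(1) << reboot_cpu)))
        return boot_cpu;
    return reboot_cpu;
}
```

With a present map of 0x0F (four CPUs) and a request for CPU 7 made while running on CPU 0, the original test keeps CPU 7, while the proposed test falls back to the boot CPU.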


Comment 52 Greg Marsden 2004-06-09 23:57:54 UTC
That sounds very plausible (one of the things I was checking was
whether that test passed, which of course it did...)

Building in that patch now...

Comment 53 Doug Ledford 2004-06-10 00:01:23 UTC
Note: that change didn't solve things here.

I'm looking for some data on this issue.  Specifically, I need to know
what machines it does and does not happen on, how much ram those
machines have, and which kernel specifically is failing.  I suspect
that this *might* be related to the kernel and the RAM size of the
machine in question.  It also might be related to the e820 RAM map. 
If I can get data on both a working and a failing system to look for
differences, that would be very helpful.

Comment 54 Greg Marsden 2004-06-10 00:26:02 UTC
The machine in question is a standard Dell 6450, which is a 4-way PIII
system with 4 GB of ram.  

Tried booting with mem=512, but that didn't help. 

Comment 55 Doug Ledford 2004-06-10 19:32:02 UTC
OK, I've been able to resolve the problem here.  First, the suggested
change in comment 51 is correct, but not required to solve the problem
(it is however required to keep people from passing a bad cpu number
as part of the reboot=s<number> command line option, just plain
reboot=s means to use whatever CPU is the boot CPU, but you do have
the option of giving it a specific CPU number instead and the change
in comment 51 makes sure that the passed in CPU number is valid). 
What solved the problem here is to use the kernel command line option
reboot=s,b (aka, SMP reboot, switch to boot processor before
proceeding, then proceed with a BIOS reboot).  Neither the s nor the b
option is sufficient by itself; it has to be both in order for
it to reboot reliably.  If people can try this on their affected
machines and verify that it solves the problem on all the broken
hardware and that this isn't just a case of "Oops, we got lucky it
worked on ours but it doesn't solve yours", then I'll code up a DMI
blacklist patch for U3 that should make the problem go away without
special command line options as of U3 or later.
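
To make the option syntax concrete, here is a user-space sketch of how a comma-separated reboot= string such as "s,b" could be turned into flags. It is an illustration under stated assumptions, not the kernel's parser: struct reboot_opts and parse_reboot_opts() are hypothetical names, the letter meanings follow this report (w = warm, c = cold, b = BIOS, s with an optional CPU number), h as "hard" is an assumption, and so is the idea that only the first character of each token is examined.

```c
#include <ctype.h>
#include <string.h>

struct reboot_opts {
    int mode;       /* 0 = cold, 0x1234 = warm */
    int thru_bios;  /* 1 = reboot by going through the BIOS */
    int smp;        /* 1 = switch to a chosen CPU before rebooting */
    int cpu;        /* requested CPU number, or -1 for "use the boot CPU" */
};

/* Parse a string such as "s,b" or "s2"; unknown letters are ignored. */
static struct reboot_opts parse_reboot_opts(const char *str)
{
    struct reboot_opts o = { 0, 0, 0, -1 };

    for (;;) {
        switch (*str) {
        case 'w': o.mode = 0x1234; break;
        case 'c': o.mode = 0; break;
        case 'b': o.thru_bios = 1; break;
        case 'h': o.thru_bios = 0; break;
        case 's':
            o.smp = 1;
            /* optional one- or two-digit CPU number directly after 's' */
            if (isdigit((unsigned char)str[1])) {
                o.cpu = str[1] - '0';
                if (isdigit((unsigned char)str[2]))
                    o.cpu = o.cpu * 10 + (str[2] - '0');
            }
            break;
        }
        /* advance to the token after the next comma, if there is one */
        str = strchr(str, ',');
        if (str == NULL)
            break;
        str++;
    }
    return o;
}
```

Under these assumptions "s,b" sets both the SMP and BIOS flags and leaves the CPU at -1, and a value such as "bios" is treated like "b" because only its first letter is looked at.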



Comment 56 Greg Marsden 2004-06-10 21:07:37 UTC
reboot=s,b does not solve the problem on my machine (this was with the
fix from comment 51 as well).  I had tried forcing the bios reset in
the past, but that did not seem to resolve the problem...

Comment 57 Dave Anderson 2004-07-14 19:41:48 UTC
*** Bug 127689 has been marked as a duplicate of this bug. ***

Comment 58 Ernie Petrides 2004-07-14 21:01:40 UTC
Greg and/or Suhua, do you wish to keep this bug report restricted
to the Oracle group?  It would be useful if dups of this problem
could be coalesced into a single report.  But if you prefer to keep
this bug private, then we could continue the investigation under
the other bug id.

Just let me know what you prefer.  Thanks.  -ernie


Comment 59 Dave Anderson 2004-07-27 14:00:05 UTC

Forwarded message from duplicate case 127609:

------- Additional Comments From ttsig  2004-07-23 15:37
-------
I still am unable to post comments on Bug 102504, presumably because
it is for the Beta (I get the message "You are not permitted to edit
bugs in product Red Hat Enterprise Linux Beta").

I am interested to know what steps I should take next to assist with
resolving this issue.  We are upgrading two of our 6450's from 2 to 4
CPU's tonight.  Currently both of these systems will reboot with the
"reboot=s,b" parameter but our 4 CPU system will not.  We are
anticipating that after the upgrade we will then have 3 systems that
fail to reboot.

Is there a debug kernel we need to try?

Thanks,
Tom



Comment 60 Greg Marsden 2004-07-27 17:59:44 UTC
Changed product to Red Hat Enterprise Linux.

Any progress on this issue?

Comment 61 Tom Sightler 2004-07-27 19:26:05 UTC
We proceeded with upgrading both of our 2 CPU 6450's to 4 CPU's last
week, and, as predicted, these systems now both experience the "no
reboot" issue.  They worked fine when they had only 2 CPU's.  We now
have a total of three systems with this problem.

As a workaround I have discovered that the Dell Server Administrator
can be installed and you can use the "Auto Recovery" feature, which is
designed to detect a hung OS and restart the computer automatically. 
The DSA detects a system sitting at the "Restarting system..." prompt
as a hung OS and uses the embedded system management processor to power
cycle the system.  It's crude, but currently the only workaround.

I'm waiting for suggestions on how I can assist in gathering
information for this.  I posted a reasonable amount of information
about kernels that work/don't work in Bug 127609.  I've been
attempting to compare RH9/FC1 kernels since the RH9 kernel fails and
the FC1 kernel works, but they're actually pretty different.

My next plan is to try vanilla 2.4.21 and then start applying patches.

Later,
Tom


Comment 62 Greg Marsden 2004-08-05 00:25:10 UTC
> >    Can't you boot with maxcpus set to 2 instead of pulling out the
cpus?


The maxcpus=2 trick does not resolve the reboot issue.

The maxcpus=2 limitation corresponds only to the number of processors
seen by the linux scheduler, but in the boot procedure it's clear that
the kernel still sees all four CPUs and will not reboot. 

Comment 64 Ernie Petrides 2004-10-04 23:21:30 UTC
*** Bug 134555 has been marked as a duplicate of this bug. ***

Comment 65 Berthold Cogel 2004-11-16 17:01:33 UTC
RedHat Support told us to set 'reboot=bios' as kernel parameter.
For our 2- and 4-processor machines this works fine. 


Comment 66 Doug Ledford 2004-12-03 11:27:14 UTC
I've seen different magic incantations of the reboot= boot parameter
work for different machines.  The real problem here is that when the
machine locks up, I have no way of knowing *where* it's locking up at.
 I don't know if we are still in the linux code or if we have returned
to the BIOS code already or what.  That makes debugging very
difficult.  I'm putting this on the blocker list for the next RHEL3
update, but it's still iffy whether or not I'll be able to find the
true root cause and get a fix that works for everyone.

Comment 68 Larry Woodman 2005-02-02 03:31:48 UTC
I now have a Dell 6450 in-house so I will be able to debug this
problem now.

Larry Woodman

Comment 69 Zachary Reneau 2005-02-17 02:15:28 UTC
I am on hand to assist on this issue from Dell's end, but as yet I have had no 
response to my email from either the RedHat techs working this issue or our 
RedHat rep. Our customer has requested we investigate RedHat's responsiveness 
on this matter and assist as needed.

Comment 70 Dave Anderson 2005-03-01 18:39:00 UTC
This is a copy of my last update to IT #50767:

This is the latest information concerning this case:

The 4-cpu 6450 we have reboots successfully, with no special "reboot"
command line arguments, with these kernels:

 AS2.1 smp
 RHEL3 uniprocessor
 upstream 2.4.29 smp kernel

So it appears to be specific to the RHEL3 smp kernel.

This is the latest debug status.

When no special "reboot=" option is given, the last thing done by both
the UP and SMP kernels is the following code sequence in
machine_restart(), where the reboot_mode code is written to c0000472,
later followed by the "pulse reset low":

       if(!reboot_thru_bios) {
               /* rebooting needs to touch the page at absolute addr 0 */
               *((unsigned short *)__va(0x472)) = reboot_mode;
               for (;;) {
                       int i;
                       for (i=0; i<100; i++) {
                               kb_wait();
                               udelay(50);
                               outb(0xfe,0x64);         /* pulse reset low */
                               udelay(50);
                       }
                       /* That didn't work - force a triple fault.. */
                       __asm__ __volatile__("lidt %0": :"m" (no_idt));
                       __asm__ __volatile__("int3");
               }
       }

The reboot_mode defaults to 0 (cold) or can be configured to 0x1234
(warm) using reboot=w.  All debugging has been done without changing
it from 0.
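
As a small worked example of the address arithmetic: the c0000472 figure is the kernel virtual address of physical address 0x472. The sketch below assumes the usual i386 PAGE_OFFSET of 0xC0000000, which is consistent with the address quoted above; sketch_va() and the other names are hypothetical and only mimic what __va() does for low memory.

```c
#include <stdint.h>

#define SKETCH_PAGE_OFFSET UINT32_C(0xC0000000) /* assumed 3G/1G split */
#define BIOS_RESET_FLAG_PA UINT32_C(0x472)      /* physical address written before the reset */
#define REBOOT_MODE_COLD   0x0000               /* default */
#define REBOOT_MODE_WARM   0x1234               /* selected with reboot=w */

/* Physical-to-virtual translation for the kernel's direct mapping. */
static uint32_t sketch_va(uint32_t phys)
{
    return phys + SKETCH_PAGE_OFFSET;
}

/* Value stored at 0x472 for a given option letter ('w' or anything else). */
static uint16_t reboot_mode_for(char opt)
{
    return (opt == 'w') ? REBOOT_MODE_WARM : REBOOT_MODE_COLD;
}
```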

Given that the RHEL3 up kernel works using the code path above, I've
been trying to narrow down the possible reason for the RHEL3 smp
kernel failing by injecting debug code that prematurely calls the
machine_restart() function during init-time.

First data point: by calling the machine_restart() function before and
after smp_init() is called during boot-time, the RHEL3 smp kernel
reboots OK *before* smp_init() gets called, but hangs if called
*after* smp_init() was run.  Trying to narrow it down further, I
applied the machine_restart() calls at various points in the smp
initialization sequence, specifically in smp_boot_cpus() which does
all the work.

The second data point of interest is that this assignment would cause
the quick machine_restart() call to fail when called just after the
assignment:

 boot_cpu_logical_apicid = logical_smp_processor_id();

which equates to:

 static __inline int logical_smp_processor_id(void)
 {
         /* we don't want to mark this access volatile - bad code generation */
         return GET_APIC_LOGICAL_ID(*(unsigned long *)(APIC_BASE+APIC_LDR));
 }

This is the very first access to the APIC.  Secondly, if I
avoided the read of APIC_BASE+APIC_LDR and just assigned a 0 to
boot_cpu_logical_apicid, I could then immediately call the
machine_restart() function, and it rebooted OK.  So simply reading
from APIC_BASE+APIC_LDR a single time is enough to make the
reboot sequence fail.
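
For readers unfamiliar with the register involved: the logical APIC ID sits in the top byte (bits 31:24) of the Logical Destination Register, so extracting it is a shift and a mask. The helper below is a user-space sketch with a hypothetical name, operating on a plain value rather than on the memory-mapped register, so it performs no APIC access itself.

```c
#include <stdint.h>

/* Extract the logical APIC ID (bits 31:24) from a raw LDR value. */
static unsigned int sketch_apic_logical_id(uint32_t ldr)
{
    return (ldr >> 24) & 0xFFu;
}
```

The extraction itself is trivial; the observation in this comment is about the side effect of the register read, not about the value that comes back.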

However, if I continue injecting calls to machine_restart() after:

(1) kludging boot_cpu_logical_apicid to 0 (which is what it always
    would come back as from the register read) and therefore avoiding
    the APIC read.
(2) and let the kernel run a bit farther in smp_boot_cpus(),

it again starts failing the quick reboot call as soon as this code
was run:

 verify_local_APIC();

at which time it would start hanging again.  This is not surprising
since verify_local_APIC() does a bunch of APIC reads and writes to
APIC_BASE+<whatever>:

int __init verify_local_APIC(void)
{
       unsigned int reg0, reg1;

       /*
        * The version register is read-only in a real APIC.
        */
       reg0 = apic_read(APIC_LVR);
       Dprintk("Getting VERSION: %x\n", reg0);
       apic_write(APIC_LVR, reg0 ^ APIC_LVR_MASK);
       reg1 = apic_read(APIC_LVR);
       Dprintk("Getting VERSION: %x\n", reg1);
...

and again, I verified that when I kludged boot_cpu_logical_apicid to
0, and the "reg0" apic_read() above then became the very first APIC
read, that first APIC_LVR read would cause a quick call to
machine_restart() to hang.

So, for whatever reason, as soon as the SMP kernel reads the
APIC, machine_restart() will hang from that point on.  But that
obviously doesn't solve anything, or point to a bug AFAICT.

So, I've started looking at the differences between the RHEL3
kernel and 2.4.29 in the smp_init() path as well as the
machine_restart() function.  There was some discussion about
machine_restart(), but replacing the RHEL3 version with the 2.4.29
does not help, although the changes were minimal.  There are
significant changes in the smp_boot_cpus() function, and basically
grasping at straws, I'd thought it might be worth testing out the
changes in the 2.4.29 tree.  But, to be honest here, I'm
not sure whether that's the way to go -- but have no other ideas to
work with.

Comment 80 Issue Tracker 2005-08-02 15:28:48 UTC
From User-Agent: XML-RPC

Dell L3 confirmed that engineering will not support this system and so this
case can be closed at this time.


Internal Status set to 'Resolved'
Status set to: Closed by Client

Resolution set to: 'Closed by Client'

This event sent from IssueTracker by sbenjamin
 issue 66146

Comment 92 Samuel Benjamin 2006-03-29 15:51:52 UTC
Created attachment 126997 [details]
dmiscan.patch

Dell tells me they have provided this patch to address the reboot problem. I do
not think RH engineering has reviewed it. Please review and add to U8. This
bugzilla is linked to multiple issue trackers reported by customers and by Dell
engineering.

Comment 96 Ernie Petrides 2006-04-22 09:02:37 UTC
A fix for this problem has just been committed to the RHEL3 U8
patch pool this evening (in kernel version 2.4.21-40.9.EL).


Comment 97 Tom Sightler 2006-04-23 03:26:52 UTC
What fix was added?  Was it simply the patch that is posted in this Bugzilla?  If 
so I suspect that this will not fully fix the problem.  The patch appears to do 
nothing more than automatically set the "set_bios_reboot" flag which I think is 
the equivalent to "reboot=b" which does work for some configurations but not 
for others.  In our case our 2-CPU systems would reboot after setting 
"reboot=b,s" but our 4-CPU systems would still hang.

Is the actual patch more involved than that, or am I misinterpreting what 
the patch does?

Later,
Tom


Comment 98 Ernie Petrides 2006-04-23 04:28:04 UTC
Hi, Tom.  The patch in comment #92 is what was committed to U8, which
as you guessed, simply sets the "reboot_thru_bios" flag via set_bios_reboot().
This is equivalent to using the "reboot=b" boot option, which as far as
we know works for Dell PowerEdge 6400 and 6450 systems.
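
The general shape of such a DMI-based default can be sketched in user space as follows. This is not the committed patch: the struct, the table and needs_bios_reboot() are hypothetical, and the vendor and product strings are assumptions about what dmidecode reports on these machines rather than values copied from dmiscan.patch.

```c
#include <string.h>

struct dmi_quirk {
    const char *vendor;   /* substring expected in the DMI system vendor */
    const char *product;  /* substring expected in the DMI product name */
};

/* Hypothetical entries; the strings are assumptions, not taken from the patch. */
static const struct dmi_quirk bios_reboot_quirks[] = {
    { "Dell Computer Corporation", "PowerEdge 6400" },
    { "Dell Computer Corporation", "PowerEdge 6450" },
};

/* Returns 1 if the machine should default to rebooting through the BIOS. */
static int needs_bios_reboot(const char *vendor, const char *product)
{
    size_t n = sizeof(bios_reboot_quirks) / sizeof(bios_reboot_quirks[0]);

    for (size_t i = 0; i < n; i++) {
        if (strstr(vendor, bios_reboot_quirks[i].vendor) != NULL &&
            strstr(product, bios_reboot_quirks[i].product) != NULL)
            return 1;
    }
    return 0;
}
```

A match only changes the default, so the effect is the same as booting a listed machine with reboot=b; it does not help a system on which reboot=b itself is insufficient.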

If you have a system with a different model name/number, please have
Customer Support file a new Issue Tracker.  If you know that one of the
two systems I've listed above won't reboot successfully using "reboot=b",
please try to supply more details in this BZ and we'll try to address it
during U8 beta.

Thanks in advance.


Comment 99 Tom Sightler 2006-04-24 03:23:08 UTC
Well, almost two years ago I opened Bug 127689 which was eventually closed as a 
duplicate of this bug (this bug was marked private at the time).  In that Bug I 
documented that our 6450's with 2-CPU's would reboot with "reboot=b,s" but that 
our 4-CPU system still hung.

We currently have only one 6450 left in production and it runs RHEL4, which also 
has the hang problem.  We still have one 6450 left in the lab that runs RHEL3 and 
is currently running U7.  While in the office for maintenance today I double 
checked and can say 100% for sure that reboot=b does not correct the problem on 
this system.  The system is a Dell 6450, 4 700Mhz PIII processors with 4GB of 
RAM.  I do think it's one BIOS revision behind and I will test this tomorrow, but 
I suspect that since reboot=b doesn't work on this system or the 6450 running 
RHEL4 then I don't hold out much hope for this patch working on my systems.

Later,
Tom


Comment 100 Ernie Petrides 2006-04-27 20:50:01 UTC
*** Bug 175759 has been marked as a duplicate of this bug. ***

Comment 104 Red Hat Bugzilla 2006-07-20 13:12:57 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0437.html