Bug 102504
Summary: cannot reboot on Dell 6450 with RHEL 3
Product: Red Hat Enterprise Linux 3
Component: kernel
Version: 3.0
Hardware: i386
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: high
Reporter: Suhua Ding <suhua.ding>
Assignee: Norm Murray <nmurray>
QA Contact: Brian Brock <bbrock>
CC: anderson, bmaly, cogel, coughlan, dledford, greg.marsden, john, lwoodman, o.zaplinski, peterm, petrides, riel, tao, tburke, ttsig, van.okamura, wwlinuxengineering, zachary_reneau
Fixed In Version: RHSA-2006-0437
Doc Type: Bug Fix
Last Closed: 2006-07-20 13:12:57 UTC
Bug Blocks: 181405, 186960
Description
Suhua Ding
2003-08-16 00:36:57 UTC
On the weekly Oracle call, they stated that this occurs on 2 different 6450s.
Apologies if I've typo'ed... didn't have console logging set up on that box yet, so the log is prone to manual mistakes.
This is what I see on console with shutdown of a pe6450 running Taroon-B1-i386-AS:
Please stand by while rebooting the system...
md: stopping all md devices.
flushing ide devices: hda
GDT: Flushing all host drives ..
invalid kernel-mode pagefault 2! [addr:00000000, eip:f880e565]
Pdi/TGid: 2141/2141, comm: reboot
EIP: 0060:[<f880e565>] CPU: 2
EIP is at scsi_build_commandblocks [scsi_mod] 0x25 (2.4.21-1.1931.2.349.2.2.entsmp)
ESP: 0000:00000000 EFLAGS: 00010002 Not tainted
EAX: 00000000 EBX: 39fc2e00 ECX: 00000000 EDX: 39fc2e18
ESI: 39fc2e00 EDI: 00000000 EBP: 0c843b98
DS: 0068 ES: 0068 FS: 0000 GS: 0033
CR0: 8005003b CR2: 00000000 CR3: 00101000 CR4: 000006f0
Call Trace:
[<f88102f5>] scsi_get_host_dev_Rsmp_7d186429 [scsi_mod] 0x65 (0xc843b68)
[<f886cfae>] gdth_flush [gdth] 0x3e (0cx843b80)
[<021becd3>] poke_blanked_console [kernel] 0x53 (0xc843c58)
[<021bdec6>] vt_console_print [kernel] 0x226 (0xc843c64)
[<021281e3>] __call_console_drivers [kernel] 0x63 (0xc843c94)
[<021282e3>] call_console_drvers [kernel] 0x63 (0xc843cb0)
[<02128603>] printk [kernel] 0x143 (0xc843ce8)
[<f886d941>] .rodata.str1.32 [gdth] 0xe1 (0xc843cf4)
[<f88711b8>] gdth_notifier [gdth] 0x0 (0xc843cf4)
[<f886d0b8>] gdth_halt [gdth] 0x58 (0xc843d08)
[<021b43b0>] extract_entropy [kernel] 0x1e9 (0xc843d28)
[<f88c8f42>] rh_send_irq [usb-ohci] 0x82 (0xc843d3c)
[<021b9a71>] scrup [kernel] 0x121 (0xc843d80)
[<021becd3>] poke_blanked_console [kernel] 0x53 (0xc843dc0)
[<021bdec6>] vt_console_print [kernel] 0x226 (0xc943dcc)
[<0214fa6b>] kmem_cache_free_one [kerenl] 0xfb (0cx843de0)
[<021281e3>] __call_console_drivers [kernel] 0x63 (0xc843dfc)
[<021282e3>] call_console_drivers [kernel] 0x63 (0xc843e18)
[<02128603>] printk [kernel] 0x143 (0cx843e50)
[<021e991c>] ide_notify_reboot [kernel] 0x7c (0xc843e70)
[<f88711b8>] gdth_notifier [gdth] 0x0 (0xc843e7c)
[<02137ec8>] notifier_call_chain [kernel] 0x2d (0xc843e8c)
[<f88711b9>] gdth_notifier [gdth] 0x0 (0xc843e90)
[<02137ec8>] sys_reboot [kernel] 0x118 (0xc843ea8)
[<02140b63>] handle_mm_fault [kernel] 0xf3 (0xc843ec0)
[<0211f5cc>] do_page_fault [kernel] 0x1bc (0xc843ef4)
[<02179110>] dput [kernel] 0x30 (0xc843f64)
[<0216176b>] __fput [kernel] 0xbb (0xc843f94)
[<0215fa9e>] filp_close [kernel] 0x8e (0xc843f94)
[<0215fb46>] sys_close [kernel] 0x66 (0xc843fb0)
invalid operand: 0000
parport_pc lp parport ide-cd cdrom autofs acenic e1000 e100 floppy microcode keybdev mousedev hid input usb-ohci usbcore ext3 jbd gdth aic7xxx sd_mod scsi_mod
CPU: 2
EIP: 0060:[<0211f488>] Not tainted
EFLAGS: 00010002
EIP is at do_page_fault [kernel] 0x78 (2.4.21-1.1931.2.349.2.2.entsmp)
eax: 00000001 ebx: 00000000 ecx: 00000001 edx: 02375e14
esi: 00000002 edi: 0211f410 ebp: 00000002 esp: 9c843a40
ds: 00068 es: 0068 ss: 0068
Process reboot (pid: 2141, stackpage-0c843000)
Stack: 0c844000 00000002 00000000 f880e565 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Call Trace:
[<f880e565>] scsi_build_commandblocks [scsi_mod] 0x25 (0xc843a4c)
[<021b40a6>] SHATransform [kernel] 0x26 (0xc843ab4)
[<021b3f79>] add_timer_randomness [kernel] 0xd9 (0xc843ad4)
[<0211f410>] do_page_fault [kernel] 0x0 (0xc843af8)
[<f880e565>] scsi_build_commandblocks [scsi_mod] 0x25 (0xc843b34)
[<f88102f5>] scsi_get_host_dev_Rsmp_7d186429 [scsi_mod] 0x65 (0xc843b68)
... rest of stack trace appears identical
Code: Bad EIP value.
INIT: no more processes left in this runlevel
Based on the call trace, this looks like the "gdth oops" problem that was fixed in AS2.1. Note that this patch is only needed if the iorl patch is present, and Taroon started out life without the iorl patch. The patch from the Pensacola stream is linux-2.4.9-gdthoops.patch.
The patch is not in Taroon.
Brian, I don't have this hardware. Can I send you a driver and have you test it? Let me know the kernel version and type.
I'll be glad to test drivers. URL or mail will work. I've tested:
2.4.21-1.1931.2.349.2.2.ent and .entsmp
2.4.21-1.1931.2.399.ent and .entsmp
The .ent kernels work fine (no oops, reboot as expected). The .entsmp kernels oops on shutdown. I've got full console logs from each test run now, too. Please let me know if those are useful and how you'd like them (in-line, attachment, mail).
Fix for this oops (same as gdth-oops fix in as2.1) sent to Rik.
I don't think Oracle had a gdth card in their box.... this was just blocking our testing
2.4.21-1.1931.2.405.entsmp hangs immediately after:
md: stopping all md devices.
flushing ide devices: hda
GDT: Flushing all host drives ..
Starting timer : 0 0
[end of console output]
No call trace, num-lock key still responds, so system isn't hard frozen.
version 2.4.21-1.1931.2.405.ent:
flushing ide devices: hda
GDT: Flushing all host drives ..
Starting timer : 0 0
Starting timer : 0 0
Done.
Restarting system.
[and then the system restarts]
The affected system that I'm looking at is running aic7xxx and gdth, although the gdth driver is for an adapter with no drives. Running without gdth loaded, and with aic7xxx loaded and used in 2.4.21-1.1931.2.405.entsmp results in a hang here:
md: stopping all md devices.
flushing ide devices: hda
Restarting system.
[hang] system has to be manually power cycled.
Added capability for Dell to look at this bugzilla.
Brian, can you try kernel commandline options like reboot=c reboot=w etc to see if they make any difference ?
Created attachment 93933 [details]
perf33 configuration (perf lab machine used by Brian Brock)
John Hull (Dell), list of reboot options requested by Arjan above. The following should be used on the kernel line of the active kernel (grub.conf) in the order given -- MKJ suggested this for another IHV bug with reboot that turned out to be a BIOS problem.
reboot=w
reboot=c
reboot=b
reboot=h
reboot=s (first CPU)
reboot=s1 (second CPU on multi-CPU machine)
reboot=s2 (third CPU)
:
reboot=sx (x=n-1 CPU)
4 hard drives are attached to the first aic7829 controller, each looks like:
blk: queue f7fc4c18, I/O limit 524287Mb (mask 0x7fffffffff)
Vendor: QUANTUM Model: ATLAS10K3_18_SCA Rev: 120G
Type: Direct-Access ANSI SCSI revision: 03
CD-ROM drive is connected to the IDE controller
eth1 is the only configured physical network interface.
FROM JOHN HULL AT DELL (PROBLEMS WRITING TO BUGZILLA): We haven't tried yet, mainly because I didn't understand what we were supposed to be looking for, but also because our lab has been down. We'll look at this, but my guess is that it's a BIOS problem. I tried to update the Bugzilla to request info, but I didn't have permission. Could someone find out from Oracle what BIOS level they're running, and if they've tried older/newer BIOSes? John
BRIAN, SAME QUESTION ON BIOSes for you..... what level are you running?
checking on BIOS level now. adding 'reboot=b' to the kernel command line causes the system to properly restart on reboot.
BIOS revision A02
can you attach dmidecode output? that way we can list this box as "needs reboot=b"
Just heard from John Hull at Dell. The current BIOS level for the 6450 is A12 (Brian is at A02) and John believes this is why we're having a problem. Passed the info on to Brian Brock and he is going to download the later BIOS version from Dell's web site and try to recreate the problem again.
I can't reproduce a working setup with the machine... 'reboot=b' on the kernel command line is insufficient, I made a mistake. I'll post dmidecode output in case it's useful and also am grabbing the BIOS update from Dell.
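The reboot= letters listed above can be combined with commas, as in the reboot=s,b form tried later in this report. The following is a minimal user-space sketch of how such a string could be decoded. It is an illustration only, not the kernel's own parser: the struct reboot_opts type and the parse_reboot_opts() name are hypothetical, and the meaning given to each letter is taken from how the options are described in this thread (w = warm, c = cold, b = BIOS, h = hard, s or sN = SMP with an optional CPU number).

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical container for the decoded settings (illustration only). */
struct reboot_opts {
    unsigned short mode; /* 0x0000 = cold reboot, 0x1234 = warm reboot */
    int thru_bios;       /* 1 = reboot by calling into the BIOS */
    int smp;             /* 1 = switch to a chosen CPU before rebooting */
    int cpu;             /* CPU number given as sN, or -1 for "boot CPU" */
};

/* Decode a comma-separated option string such as "w", "s,b" or "s2". */
void parse_reboot_opts(const char *str, struct reboot_opts *o)
{
    assert(str != NULL && o != NULL);

    o->mode = 0x0000;
    o->thru_bios = 0;
    o->smp = 0;
    o->cpu = -1;

    for (;;) {
        /* Only the first character of each comma-separated item matters,
         * so "b" and "bios" select the same behaviour. */
        switch (*str) {
        case 'w': o->mode = 0x1234; break;
        case 'c': o->mode = 0x0000; break;
        case 'b': o->thru_bios = 1; break;
        case 'h': o->thru_bios = 0; break;
        case 's':
            o->smp = 1;
            if (isdigit((unsigned char)str[1])) {
                o->cpu = str[1] - '0';
                if (isdigit((unsigned char)str[2]))
                    o->cpu = o->cpu * 10 + (str[2] - '0');
            }
            break;
        default:
            break;
        }
        str = strchr(str, ',');
        if (str == NULL)
            break;
        str++; /* skip the comma and decode the next item */
    }
}
```

Calling parse_reboot_opts("s,b", &opts) in this sketch sets both the smp and thru_bios flags and leaves cpu at -1, which corresponds to "switch to the boot processor, then reboot through the BIOS".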
Created attachment 93966 [details]
output from dmidecode
updating the BIOS to A12 does not immediately help. Which of the files on Dell's ftp site contain the firmware updates?
I've applied the BIOS updates but the system is giving warnings on boot:
Embedded server management firmware revision 5.25
!!***** Warning: Firmware is out-of-date, please update... *****
Primary system backplane controller firmware revision 1.16
!!***** Warning: Firmware is out-of-date, please update... *****
Power supply paralleling board firmware revision 2.37
!!***** Warning: Firmware is out-of-date, please update... *****
updating the firmware doesn't help, either. With BIOS A12 and recent firmware, this system is behaving identically, and hanging on shutdown. no kernel command line options of the form 'reboot=X' make a difference.
retested with kernel-2.4.21-3.EL with no options, the UP kernel reboots fine, but the smp and hugemem kernels fail to reboot upon shutdown. didn't try 'reboot=' options (that takes about 2 hours of testing to detect a partial or complete failure), but I'll be glad to do so if it's relevant testing.
Created attachment 94720 [details]
dmesg from 2.4.21-3.EL smp
can't get a complete dmesg output... note that the top is cropped off.
Created attachment 94723 [details]
console output from 2.4.21-3.ELsmp
output is complete, not cropped.
system doesn't reboot after shutdown.
Created attachment 94724 [details]
console output from 2.4.21-3.EL (UP)
system reboots properly after shutdown.
Created attachment 94725 [details]
console output from 2.4.9-e.3smp
system reboots properly on shutdown.
Created attachment 94726 [details]
console output from 2.4.9-e.24smp
system reboots properly on shutdown.
Created attachment 94743 [details]
console output from 2.4.21-3.ELphro (panic on mount /)
doesn't boot, looking for real problem in panic.
Created attachment 94757 [details]
console output from 2.4.21-3.ELsmp (rpm package built by jmoyer)
hangs on shutdown.
There are two different problem reports here. One is an original bug report from Oracle; all the internal testing, though, has a gdth controller. We know we fixed the gdth problem already. So, if this still happens with the RHEL3 U1 beta kernel, then we need to know that in order to work on finding out what the problem is. Otherwise, the problem should be fixed. Setting bug to NEEDINFO until Oracle can either confirm or deny that the issue is fixed with the U1 beta kernel.
Problem is reproducible on a Dell 6450 running RHEL 3 QU 2. System gets to "Restarting system" and hangs, no oops output or system dumps.
[root@palnx3 root]# uname -a
Linux palnx3 2.4.21-11.ELsmp #1 SMP Mon Mar 8 23:32:56 EST 2004 i686 i686 i386 GNU/Linux
[root@palnx3 root]# lsmod
Module                  Size  Used by    Not tainted
parport_pc             18852   1  (autoclean)
lp                      9124   0  (autoclean)
parport                38816   1  (autoclean) [parport_pc lp]
autofs                 13620   0  (autoclean) (unused)
e100                   58468   1
floppy                 57488   0  (autoclean)
microcode               6848   0  (autoclean)
ext3                   89960   4
jbd                    55060   4  [ext3]
megaraid               30604   0  (unused)
aic7xxx               162064   5
sd_mod                 13360  10
scsi_mod              112552   3  [megaraid aic7xxx sd_mod]
[root@palnx3 root]# cat /proc/cmdline
ro root=/dev/sda5
Van, is this still an active issue ?
This bug is affecting our test systems running on 6450s, which must be power cycled manually after they hang at the "Rebooting system" prompt. As was noted above, this hang happens only when running the -smp kernel and the system will reboot when using the -up kernel. The hang is not related to the gdth module (as dell does not require this particular module).
Can you boot the machine with the kernel command line option nmi_watchdog=1 and then try to reboot the machine? If there is an SMP deadlock of some sort, the nmi watchdog should catch it and the oops would tell us what lock in particular it is spinning on.
Tried with 15.ELsmp and nmi_watchdog=1; system froze on "Restarting system." with no oops.
I've tried instrumenting (as in printk debugging) the reboot code, and it gets as far as sending the right codes to the bios, but nothing happens. More specifically, in arch/i386/kernel/process.c the kernel gets past the SMP specific code in machine_restart before freezing
Well, I found something suspicious in the reboot code. Specifically, in machine_restart, we try to verify the reboot_cpu value to make sure it's a valid processor, but I think the test has a thinko that keeps it from working properly. Specifically, we do this:
    int cpuid;
    cpuid = GET_APIC_ID(apic_read(APIC_ID));
    if (reboot_smp) {
        /* check to see if reboot_cpu is valid
           if its not, default to the BSP */
        if ((reboot_cpu == -1) ||
            (reboot_cpu > (NR_CPUS -1)) ||
            !(phys_cpu_present_map & (1<<cpuid)))
            reboot_cpu = boot_cpu_physical_apicid;
The problem I see here is that we are checking phys_cpu_present_map against 1<<cpuid, which is whatever CPU this code gets run on. That test is always true, and it doesn't do what we want, which is to make sure that the reboot_cpu is valid. I suspect that the test above should be rewritten to something like this:
        if ((reboot_cpu < 0) ||
            (reboot_cpu > (NR_CPUS - 1)) ||
            !(phys_cpu_present_map & (1<<reboot_cpu)))
            reboot_cpu = boot_cpu_physical_apicid;
Greg, could you try making that change in your kernel sources there and see if that makes any difference to whether or not the machine reboots properly? (Since you said you were instrumenting the reboot code already I figured this would be a 10 minute test for you ;-)
That sounds very plausible (one of the things I was checking was whether that test passed, which of course it did...) Building in that patch now...
Note: that change didn't solve things here.
I'm looking for some data on this issue. Specifically, I need to know what machines it does and does not happen on, how much ram those machines have, and which kernel specifically is failing.
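To make the difference between the two reboot_cpu tests quoted above concrete, here is a small user-space sketch that mirrors both conditions as plain functions. It is an illustration under stated assumptions, not kernel code: the function names are hypothetical, NR_CPUS is set to an arbitrary 32, and the present map, the running CPU, and the boot CPU are passed in as ordinary arguments instead of being read from the APIC.

```c
#include <assert.h>

#define NR_CPUS 32 /* assumed value, for illustration only */

/*
 * Mirrors the original test: the present-map bit that is checked belongs
 * to cpuid (the CPU running this code), not to the requested reboot_cpu.
 */
int pick_reboot_cpu_original(int reboot_cpu, int cpuid,
                             unsigned long phys_cpu_present_map,
                             int boot_cpu)
{
    assert(cpuid >= 0 && cpuid < NR_CPUS);

    if ((reboot_cpu == -1) ||
        (reboot_cpu > (NR_CPUS - 1)) ||
        !(phys_cpu_present_map & (1UL << cpuid)))
        return boot_cpu;
    return reboot_cpu;
}

/*
 * Mirrors the proposed test: the present-map bit that is checked belongs
 * to reboot_cpu itself, and any negative value falls back to the boot CPU.
 */
int pick_reboot_cpu_proposed(int reboot_cpu,
                             unsigned long phys_cpu_present_map,
                             int boot_cpu)
{
    if ((reboot_cpu < 0) ||
        (reboot_cpu > (NR_CPUS - 1)) ||
        !(phys_cpu_present_map & (1UL << reboot_cpu)))
        return boot_cpu;
    return reboot_cpu;
}
```

With a present map of 0xF (CPUs 0 to 3) and the code running on CPU 0, a request for CPU 7 passes through the original-style check unchanged, while the proposed-style check falls back to the boot CPU. As noted in the thread, this change guards against a bad reboot=s<number> value and did not by itself cure the hang.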
I suspect that this *might* be related to the kernel and the RAM size of the machine in question. It also might be related to the e820 RAM map. If I can get data on both a working and a failing system to look for differences, that would be very helpful.
The machine in question is a standard Dell 6450, which is a 4-way PIII system with 4 GB of ram. Tried booting with mem=512, but that didn't help.
OK, I've been able to resolve the problem here. First, the suggested change in comment 51 is correct, but not required to solve the problem. It is, however, required to keep people from passing a bad cpu number as part of the reboot=s<number> command line option: plain reboot=s means to use whatever CPU is the boot CPU, but you do have the option of giving it a specific CPU number instead, and the change in comment 51 makes sure that the passed in CPU number is valid. What solved the problem here is to use the kernel command line option reboot=s,b (aka, SMP reboot, switch to boot processor before proceeding, then proceed with a BIOS reboot). Neither the s nor the b option is sufficient by itself; it has to be both in order for it to reboot reliably. If people can try this on their affected machines and verify that it solves the problem on all the broken hardware and that this isn't just a case of "Oops, we got lucky it worked on ours but it doesn't solve yours", then I'll code up a DMI blacklist patch for U3 that should make the problem go away without special command line options as of U3 or later.
reboot=s,b does not solve the problem on my machine (this was with the fix from comment 51 as well). I had tried forcing the bios reset in the past, but that did not seem to resolve the problem...
*** Bug 127689 has been marked as a duplicate of this bug. ***
Greg and/or Suhua, do you wish to keep this bug report restricted to the Oracle group? It would be useful if dups of this problem could be coalesced into a single report. But if you prefer to keep this bug private, then we could continue the investigation under the other bug id. Just let me know what you prefer. Thanks. -ernie
Forwarded message from duplicate case 127609:
------- Additional Comments From ttsig 2004-07-23 15:37 -------
I still am unable to post comments on Bug 102504, presumably because it is for the Beta (I get the message "You are not permitted to edit bugs in product Red Hat Enterprise Linux Beta"). I am interested to know what steps I should take next to assist with resolving this issue. We are upgrading two of our 6450's from 2 to 4 CPU's tonight. Currently both of these systems will reboot with the "reboot=s,b" parameter but our 4 CPU system will not. We are anticipating that after the upgrade we will then have 3 systems that fail to reboot. Is there a debug kernel we need to try? Thanks, Tom
Changed product to Red Hat Enterprise Linux.
Any progress on this issue? We proceeded with upgrading both of our 2 CPU 6450's to 4 CPU's last week, and, as predicted, these systems now both experience the "no reboot" issue. They worked fine when they had only 2 CPU's. We now have a total of three systems with this problem. As a workaround I have discovered that the Dell Server Administrator can be installed and you can use the "Auto Recovery" feature, which is designed to detect a hung OS and restart the computer automatically. The DSA detects a system sitting at the "Restarting system..." prompt as a hung OS and uses the embedded system management processor to power cycle the system. It's crude, but currently the only workaround. I'm waiting for suggestions on how I can assist in gathering information for this. I posted a reasonable amount of information about kernels that work/don't work in Bug 127609. I've been attempting to compare RH9/FC1 kernels since the RH9 kernel fails and the FC1 kernel works, but they're actually pretty different. My next plan is to try vanilla 2.4.21 and then start applying patches.
Later, Tom
> > Can't you boot with maxcpus set to 2 instead of pulling out the cpus?
The maxcpus=2 trick does not resolve the reboot issue.
The maxcpus=2 limitation corresponds only to the number of processors
seen by the linux scheduler, but in the boot procedure it's clear that
the kernel still sees all four CPUs and will not reboot.
*** Bug 134555 has been marked as a duplicate of this bug. ***
RedHat Support told us to set 'reboot=bios' as kernel parameter. For our 2- and 4-processor machines this works fine.
I've seen different magic incantations of the reboot= boot parameter work for different machines. The real problem here is that when the machine locks up, I have no way of knowing *where* it's locking up at. I don't know if we are still in the linux code or if we have returned to the BIOS code already or what. That makes debugging very difficult. I'm putting this on the blocker list for the next RHEL3 update, but it's still iffy whether or not I'll be able to find the true root cause and get a fix that works for everyone.
I now have a Dell 6450 in-house so I will be able to debug this problem now. Larry Woodman
I am on hand to assist on this issue from Dell's end, but as yet I have had no response to my email to either the RedHat techs working this issue or from our RedHat rep. Our customer has requested we investigate RedHat's responsiveness on this matter and assist as needed.
This is a copy of my last update to IT #50767:
This is the latest information concerning this case: The 4-cpu 6450 we have reboots successfully, with no special "reboot" command line arguments, with these kernels:
AS2.1 smp
RHEL3 uniprocessor
upstream 2.4.29 smp kernel
So it appears to be specific to the RHEL3 smp kernel.
This is the latest debug status. When no special "reboot=" is done, the last thing done by both the up and smp kernels is the following code sequence in machine_restart(), where the reboot_mode code is written to c0000472, later followed by the "pulse reset low":
    if(!reboot_thru_bios) {
        /* rebooting needs to touch the page at absolute addr 0 */
        *((unsigned short *)__va(0x472)) = reboot_mode;
        for (;;) {
            int i;
            for (i=0; i<100; i++) {
                kb_wait();
                udelay(50);
                outb(0xfe,0x64); /* pulse reset low */
                udelay(50);
            }
            /* That didn't work - force a triple fault.. */
            __asm__ __volatile__("lidt %0": :"m" (no_idt));
            __asm__ __volatile__("int3");
        }
    }
The reboot_mode defaults to 0 (cold) or can be configured to 0x1234 (warm) using reboot=w. All debugging has been done without changing it from 0.
Given that the RHEL3 up kernel works using the code path above, I've been trying to narrow down the possible reason for the RHEL3 smp kernel failing by injecting debug code that prematurely calls the machine_restart() function during init-time.
First data point: by calling the machine_restart() function before and after smp_init() is called during boot-time, the RHEL3 smp kernel reboots OK *before* smp_init() gets called, but hangs if called *after* smp_init() was run.
Trying to narrow it down further, I applied the machine_restart() calls at various points in the smp initialization sequence, specifically in smp_boot_cpus() which does all the work. The second data point of interest is that this assignment would cause the quick machine_restart() call to fail when called just after the assignment:
    boot_cpu_logical_apicid = logical_smp_processor_id();
which equates to:
    static __inline int logical_smp_processor_id(void)
    {
        /* we don't want to mark this access volatile - bad code generation */
        return GET_APIC_LOGICAL_ID(*(unsigned long *)(APIC_BASE+APIC_LDR));
    }
This is the very first access to the APIC. Secondly, if I avoided the read of APIC_BASE+APIC_LDR and just assigned a 0 to boot_cpu_logical_apicid, I could then immediately call the machine_restart() function, and it rebooted OK. So simply reading from APIC_BASE+APIC_LDR a single time is enough to make the reboot sequence fail.
However, if I continue injecting calls to machine_restart() after:
(1) kludging boot_cpu_logical_apicid to 0 (which is what it always would come back as from the register read) and therefore avoiding the APIC read.
(2) and let the kernel run a bit farther in smp_boot_cpus(),
it again starts failing the quick reboot call as soon as this code was run:
    verify_local_APIC();
at which time it would start hanging again. This is not surprising since verify_local_APIC() does a bunch of APIC reads and writes to APIC_BASE+<whatever>:
    int __init verify_local_APIC(void)
    {
        unsigned int reg0, reg1;
        /*
         * The version register is read-only in a real APIC.
         */
        reg0 = apic_read(APIC_LVR);
        Dprintk("Getting VERSION: %x\n", reg0);
        apic_write(APIC_LVR, reg0 ^ APIC_LVR_MASK);
        reg1 = apic_read(APIC_LVR);
        Dprintk("Getting VERSION: %x\n", reg1);
        ...
and again, I verified that when I kludged boot_cpu_logical_apicid to 0, and the "reg0" apic_read() above then became the very first APIC read, that first APIC_LVR read would cause a quick call to machine_restart() to hang.
So, for whatever reason, as soon as the SMP kernel reads the APIC, machine_restart() will hang from that point on. But that obviously doesn't solve anything, or point to a bug AFAICT.
So, I've started looking at the differences between the RHEL3 kernel and 2.4.29 in the smp_init() path as well as the machine_restart() function. There was some discussion about machine_restart(), but replacing the RHEL3 version with the 2.4.29 does not help, although the changes were minimal. There are significant changes in the smp_boot_cpus() function, and basically grasping at straws, I'd thought it might be worth testing out the changes in the 2.4.29 tree. But, to be honest here, I'm not sure whether that's the way to go -- but have no other ideas to work with.
From User-Agent: XML-RPC
Dell L3 confirmed that engineering will not support this system and so this case can be closed at this time.
Internal Status set to 'Resolved'
Status set to: Closed by Client
Resolution set to: 'Closed by Client'
This event sent from IssueTracker by sbenjamin issue 66146
Created attachment 126997 [details]
dmiscan.patch
Dell tells me they have provided this patch to address the reboot problem. I do
not think RH engineering has reviewed it. Please review and add to U8. This
bugzilla is linked to multiple issue trackers reported by customers and by Dell
engineering.
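A DMI blacklist of the kind proposed earlier in this report selects the BIOS reboot path automatically when the machine identifies itself as an affected model. The following user-space sketch shows the general shape of such a lookup. It is not the attached dmiscan.patch and does not use the kernel's DMI interfaces: the struct, table, and function names are hypothetical, and the vendor string is an assumption about what the firmware reports. The two model names are the ones mentioned in this report (PowerEdge 6400 and 6450).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical quirk entry: substrings to look for in the DMI strings. */
struct bios_reboot_quirk {
    const char *vendor;  /* substring of the system vendor string */
    const char *product; /* substring of the product name string */
};

/* Assumed table contents; the vendor string is a guess for illustration. */
static const struct bios_reboot_quirk bios_reboot_quirks[] = {
    { "Dell Computer Corporation", "PowerEdge 6400" },
    { "Dell Computer Corporation", "PowerEdge 6450" },
};

/*
 * Return 1 if the given DMI strings match a table entry, meaning the
 * BIOS reboot path (the effect of reboot=b) should be selected; else 0.
 */
int needs_bios_reboot(const char *vendor, const char *product)
{
    size_t i;
    size_t n = sizeof(bios_reboot_quirks) / sizeof(bios_reboot_quirks[0]);

    assert(vendor != NULL && product != NULL);

    for (i = 0; i < n; i++) {
        if (strstr(vendor, bios_reboot_quirks[i].vendor) != NULL &&
            strstr(product, bios_reboot_quirks[i].product) != NULL)
            return 1;
    }
    return 0;
}
```

A lookup like this only automates the reboot=b setting. It cannot help on machines where reboot=b itself fails to restart the system, which is the concern raised about the 4-CPU 6450 in this report.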
A fix for this problem has just been committed to the RHEL3 U8 patch pool this evening (in kernel version 2.4.21-40.9.EL).
What fix was added? Was it simply the patch that is posted in this Bugzilla? If so I suspect that this will not fully fix the problem. The patch appears to do nothing more than automatically set the "set_bios_reboot" flag, which I think is the equivalent of "reboot=b", which works for some configurations but not for others. In our case our 2-CPU systems would reboot after setting "reboot=b,s" but our 4-CPU systems would still hang. Is the actual patch more involved than that, or am I misinterpreting what the patch does? Later, Tom
Hi, Tom. The patch in comment #92 is what was committed to U8, which as you guessed, simply sets the "reboot_thru_bios" via set_bios_reboot(). This is equivalent to using the "reboot=b" boot option, which as far as we know works for Dell PowerEdge 6400 and 6450 systems. If you have a system with a different model name/number, please have Customer Support file a new Issue Tracker. If you know that one of the two systems I've listed above won't reboot successfully using "reboot=b", please try to supply more details in this BZ and we'll try to address it during U8 beta. Thanks in advance.
Well, almost two years ago I opened Bug 127689 which was eventually closed as a duplicate of this bug (this bug was marked private at the time). In that Bug I documented that our 6450's with 2-CPU's would reboot with "reboot=b,s" but that our 4-CPU system still hung. We currently have only one 6450 left in production and it runs RHEL4, which also has the hang problem. We still have one 6450 left in the lab that runs RHEL3 and is currently running U7. While in the office for maintenance today I double checked and can say 100% for sure that reboot=b does not correct the problem on this system. The system is a Dell 6450, 4 700MHz PIII processors with 4GB of RAM. I do think it's one BIOS revision behind and I will test this tomorrow, but since reboot=b doesn't work on this system or on the 6450 running RHEL4, I don't hold out much hope for this patch working on my systems. Later, Tom
*** Bug 175759 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2006-0437.html