127689 – Reboot fails on Dell PowerEdge 6450

Summary: Reboot fails on Dell PowerEdge 6450

Keywords:
Status:	CLOSED DUPLICATE of bug 102504
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.0
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Dave Anderson
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	RHEL3U8CanFix

Reported:	2004-07-12 16:37 UTC by Tom Sightler
Modified:	2007-11-30 22:07 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-02-21 19:04:25 UTC
Target Upstream Version:
Embargoed:

Attachments
fix committed to RHEL3 U8 for this bug (835 bytes, patch) 2006-04-23 04:14 UTC, Ernie Petrides

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2006:0437	0	normal	SHIPPED_LIVE	Important: Updated kernel packages for Red Hat Enterprise Linux 3 Update 8	2006-07-20 13:11:00 UTC

Description Tom Sightler 2004-07-12 16:37:15 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; Linux i686; U) Opera 7.51  [en]

Description of problem:
We have three Dell PowerEdge 6450 servers that were upgraded from RHEL 
AS 2.1 to 3.0 several months ago.  After the upgrade it was 
discovered that all of these servers would hang when attempting to 
reboot.  After some research we discovered several reports on the web 
about the same issue, and the fix seemed to be to add "reboot=b,s" to 
the boot command line.  This did indeed fix the issue for two of the 
three servers; however, the third server continued to fail to reboot.

The only difference between the servers that would reboot and the 
server that wouldn't is that the one server that fails has 4 CPUs 
while the others have only 2 CPUs.

I continued to try several different variations of the "reboot=" 
option such as "reboot=b,s0", "reboot=b,s1", etc., hoping that 
perhaps Linux was simply selecting the incorrect processor to perform 
the reboot; however, no option that I tried corrected this issue.  We 
also tried several other combinations with other "reboot=" options 
such as w, c, and h.  Nothing has succeeded in getting this issue 
resolved.
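
As a rough illustration of how a comma-separated "reboot=" string like the ones above can be interpreted, here is a small Python sketch. It is not the kernel's code: the letter meanings (b = via BIOS, h = hard, w = warm, c = cold, s = SMP with an optional CPU number) are assumed from how the i386 option is commonly described, and the names RebootOptions and parse_reboot are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RebootOptions:
    """Hypothetical container for the parsed settings."""
    mode: Optional[str] = None   # "warm" or "cold"; None if not specified
    thru_bios: bool = False      # True for "b", False for "h"
    smp: bool = False            # True when an "s" field was given
    cpu: Optional[int] = None    # CPU number following "s", if any


def parse_reboot(value: str) -> RebootOptions:
    """Parse a string such as "b,s1"; only the first letter of a field selects the option."""
    opts = RebootOptions()
    for field in value.split(","):
        if not field:
            continue
        letter, rest = field[0], field[1:]
        if letter == "w":
            opts.mode = "warm"
        elif letter == "c":
            opts.mode = "cold"
        elif letter == "b":
            opts.thru_bios = True
        elif letter == "h":
            opts.thru_bios = False
        elif letter == "s":
            opts.smp = True
            # Collect the leading digits after "s" as the CPU number.
            digits = ""
            for ch in rest:
                if not ch.isdigit():
                    break
                digits += ch
            if digits:
                opts.cpu = int(digits)
        # Unknown letters are ignored in this sketch.
    return opts
```

Under these assumptions, parse_reboot("b,s1") requests a BIOS reboot performed on CPU 1, while parse_reboot("b,s") leaves the CPU unspecified.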

For additional testing I tried the following kernels and list their 
success or failure:

Redhat AS 2.1 -- 2.4.9-e.38         -- Works
Redhat 9      -- 2.4.20-31.9        -- Fails
Fedora Core 1 -- 2.4.22-1.2197.nptl -- Works
Redhat AS 3   -- 2.4.21-15.EL UP    -- Works

I tested several variants of the Red Hat AS kernels; all SMP versions 
failed, from 2.4.21-4.EL through the latest 2.4.21-15.0.3.EL, 
whereas all UP kernels rebooted without issues.

There are other reports of the issue that can be turned up with a 
quick search on Google; some have success with "reboot=b,s", others do 
not.  I'm very suspicious that the people who do not have success are 
people with 4 CPUs.

Please let me know what other information needs to be provided.
 


Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-15.EL

How reproducible:
Always

Steps to Reproduce:
1. Boot Dell PowerEdge 6450 with four processors with any AS 3 kernel
2. Type 'reboot' at command line

    

Actual Results:  System will hang at "System Rebooting..."

Expected Results:  System should reboot

Additional info:

We have worked around this issue by installing Dell Server 
Administrator, which can detect a hung OS and use the system's embedded 
service processor to power cycle the system.  Interestingly, it 
detects this state as a hung OS and performs the recovery.  It's a 
crude workaround that shouldn't be required and adds an extra five 
minutes to an already long reboot process (these systems POST very 
slowly), but at least it allows us to reboot the server remotely even 
with this kernel bug.

Comment 1 Dave Anderson 2004-07-12 18:23:09 UTC

Tom,

I don't have one of these machines to work with, so I'll have to
work through you.

One question re: the Dell Server Administrator.  Is it possible for
it to report the PC of each processor?  If this is a kernel-specific
problem, I would first like to rule out the possibility that the
IPI sent out by the rebooting cpu is not being received by one of
the other cpus.  If any of the processors for whatever reason are
sitting in a spin_lock_irq(), then they won't ever respond to the IPI,
and the rebooting system would block forever in machine_restart()
and act as you describe.  If you can get the PC of each cpu, it's
possible that one of the cpus may show that it is operating in an
address range that can be identified as a spin lock text area.
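
The failure mode described above can be illustrated with a toy model. This is a Python sketch, not kernel code: the Cpu class, its irqs_enabled flag, and stop_other_cpus are hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cpu:
    ident: int
    irqs_enabled: bool = True   # False models a CPU spinning with interrupts off
    stopped: bool = False

    def deliver_ipi(self) -> None:
        # A CPU with interrupts disabled never services the IPI.
        if self.irqs_enabled:
            self.stopped = True


def stop_other_cpus(cpus: List[Cpu], rebooting: int) -> List[int]:
    """Send a stop IPI to every other CPU and return the ids that never respond."""
    others = [c for c in cpus if c.ident != rebooting]
    for cpu in others:
        cpu.deliver_ipi()
    # In the scenario described, the rebooting CPU would wait on these forever.
    return [c.ident for c in others if not c.stopped]


# Four CPUs, with CPU 2 stuck with interrupts disabled.
cpus = [Cpu(0), Cpu(1), Cpu(2, irqs_enabled=False), Cpu(3)]
print(stop_other_cpus(cpus, rebooting=0))
```

Any id returned by stop_other_cpus corresponds to a processor whose PC would be worth inspecting.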

If not, will you be able to run debug RHEL3 kernels that I create? 
I'd like to add a bunch of printk's in the machine_restart() function
to figure out what's going on.

Dave Anderson

Comment 2 Tom Sightler 2004-07-13 01:01:39 UTC

Unfortunately I don't think that Dell Server Admin can get at that 
level of information, at least via any user accessible method that I 
can find.

I guess that leaves us with the option of running a debug kernel, 
which I can do, but only during limited times as the system is a 
production Oracle box.  That being said, we plan to upgrade the other 
two systems to 4 CPUs this week, and I'm anticipating that after we do 
that they will experience the same issue.  If that turns out to be 
the case I can probably move the services of one of the servers to 
one of our lab servers temporarily which would free up a system to 
test with.  In the meantime I can schedule times to test the reboot 
functionality on the existing server, but that probably means only 
one good test a day.  

I'm almost sure that the original beta kernels for RHEL 3 didn't have 
this problem.  I may see if I still have one of those lying around  
just to test the reboot functionality as it might give us another 
data point that is closer to the current kernel than the RH9 or FC1 
kernels.  Then we could run some diff to see what changed.

Later,
Tom

Comment 3 Dave Anderson 2004-07-13 17:09:02 UTC

Ok -- if you want to test an earlier RHEL3 kernel version, I can
make it available for you.

Comment 4 Greg Marsden 2004-07-14 19:27:59 UTC

This is a duplicate of bug 102504
(haven't tried with the betas)

Comment 5 Dave Anderson 2004-07-14 19:41:45 UTC

Thanks, Greg -- closing this as a duplicate.

*** This bug has been marked as a duplicate of 102504 ***

Comment 6 Tom Sightler 2004-07-14 20:43:17 UTC

How do I get access to that bug?  I can view it but cannot add
comments or add myself to the CC: list.  It appears to be restricted
to group members.

I missed it during my search because it was filed against the Beta. 
Sorry.

Thanks,
Tom

Comment 7 Dave Anderson 2004-07-14 20:55:36 UTC

You are already on its cc: list, so you'll receive all
subsequent input into the case.

As to the restriction, it does appear to be restricted to
Red Hat development, but since you are now on the cc: list,
you are allowed to view it.  I don't personally know how
to change that behavior, but I can add your comments.

Comment 8 Tom Sightler 2004-07-23 19:37:38 UTC

I still am unable to post comments on Bug 102504, presumably because
it is for the Beta (I get the message "You are not permitted to edit
bugs in product Red Hat Enterprise Linux Beta").

I am interested to know what steps I should take next to assist with
resolving this issue.  We are upgrading two of our 6450s from 2 to 4
CPUs tonight.  Currently both of these systems will reboot with the
"reboot=s,b" parameter but our 4 CPU system will not.  We are
anticipating that after the upgrade we will then have 3 systems that
fail to reboot.

Is there a debug kernel we need to try?

Thanks,
Tom

Comment 9 Lance A. Brown 2004-08-21 15:07:51 UTC

I, too, am experiencing this problem.  I have several Dell 6450s with
4 processors in each that fail to recycle after outputting the
'restarting system' message.  They are running RH Enterprise Linux AS
3 Update 2.  Is there a solution for this problem, perhaps in bug
#102504, which I cannot at present access?

Comment 10 Dave Anderson 2004-08-23 12:28:28 UTC

Not yet.

Comment 11 Marc Deslauriers 2004-09-16 22:57:41 UTC

Has this been resolved in update 3?

I have to install a 6450 with 4 cpus at a customer location soon. If
this is still an issue, I'll just install RHEL2.1...

Comment 12 Tom Sightler 2004-09-17 02:17:54 UTC

I don't believe the issue is resolved; it certainly doesn't seem to 
be for me.  On top of this, I've had random lockups on multiple 
servers after upgrading to the 2.4.21-20.EL kernels in U3 and am in 
the process of reverting to the previous kernels.

You can easily work around the reboot issue with the Dell Server 
Administrator Auto Recovery feature, but I can't argue with running 
RHEL 2.1 unless you really need some of the RHEL 3 features.  I ran 
2.1 for quite a while on my 6450s and they were solid.  Since 
upgrading to RHEL 3 over nine months ago we've had nothing but 
trouble, with every kernel release having some bug that seems to make 
it worse than the last one.  I sometimes wish it was easy to go back.

Later,
Tom

Comment 13 Dan Arnold 2004-10-21 15:34:16 UTC

I took a drive that had AS 3.2 installed in a Dell PE 1650 and 
installed in my 6450. The 6450 would reboot with the 1650 drive. The 
1650 would NOT reboot with the 6450 drive.

Comment 14 Joe Beiter 2004-12-07 16:48:22 UTC

In addition to seeing the previously reported behavior on 6450s, I'm
also seeing this on our 1600s. All with 4 processors.

This is a "resolved duplicate"? Someone please tell Red Hat's support
staff so they can tell me the fix.

Comment 15 Dave Anderson 2004-12-07 16:58:46 UTC

This bugzilla was closed as a duplicate of another open bugzilla.
Unfortunately the problem at hand is not resolved.

Comment 16 nathan r. hruby 2005-02-09 16:54:37 UTC

PING: metoo.  Please unclassify the tracker for this.

Comment 18 Greg 2006-01-24 04:17:54 UTC

Have you guys fixed this problem?  I am running CentOS 3.6 
and it is doing the same thing to me with 2 CPUs, and I just added 4.

Comment 19 Red Hat Bugzilla 2006-02-21 19:04:25 UTC

Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.

Comment 22 Ernie Petrides 2006-04-22 09:03:01 UTC

A fix for this problem has just been committed to the RHEL3 U8
patch pool this evening (in kernel version 2.4.21-40.9.EL).

Comment 23 Greg 2006-04-22 16:08:32 UTC

(In reply to comment #22)
> A fix for this problem has just been committed to the RHEL3 U8
> patch pool this evening (in kernel version 2.4.21-40.9.EL).
> 
How did you fix it?

Comment 24 Ernie Petrides 2006-04-23 04:14:31 UTC

Created attachment 128122
fix committed to RHEL3 U8 for this bug

Hi, Greg.  The attached patch is what was committed to U8.  It simply
adds "black list" entries for the Dell PowerEdge 6400 and 6450 systems
that make reboots go through the BIOS (via setting "reboot_thru_bios").
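
The idea of such a blacklist can be sketched as a table lookup keyed on system identification strings. The Python below is only an illustration: the actual change is the attached kernel patch, and the table layout, the match strings, the dictionary keys, and the function name here are assumptions, not its contents.

```python
from typing import Dict, List, Tuple

# Hypothetical blacklist: (vendor substring, product substring) pairs for
# systems that should reboot through the BIOS. The strings are illustrative.
REBOOT_THRU_BIOS_BLACKLIST: List[Tuple[str, str]] = [
    ("Dell", "PowerEdge 6400"),
    ("Dell", "PowerEdge 6450"),
]


def needs_bios_reboot(dmi: Dict[str, str]) -> bool:
    """Return True if the identification strings match a blacklist entry."""
    vendor = dmi.get("sys_vendor", "")
    product = dmi.get("product_name", "")
    return any(v in vendor and p in product
               for v, p in REBOOT_THRU_BIOS_BLACKLIST)


# A matching system would get the equivalent of "reboot_thru_bios" set
# without the administrator passing anything on the boot command line.
machine = {"sys_vendor": "Dell Computer Corporation",
           "product_name": "PowerEdge 6450"}
reboot_thru_bios = needs_bios_reboot(machine)
```

A table like this keeps the quirk tied to the hardware model, so systems that are not listed keep the default reboot path.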

Comment 25 Ernie Petrides 2006-04-28 21:43:26 UTC

Adding a couple dozen bugs to CanFix list so I can complete the stupid advisory.
