200421 – Reboot and Halt fails on Dell PowerEdge 6450

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	5
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:	MassClosed
Depends On:
Blocks:

Reported:	2006-07-27 16:15 UTC by Jason Fertig
Modified:	2008-01-20 04:37 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-01-20 04:37:25 UTC
Type:	---
Embargoed:
Dependent Products:

Description Jason Fertig 2006-07-27 16:15:24 UTC

I'm experiencing the exact same symptoms described in Bug 127689, except that
I'm running Fedora 5.  My system is a PowerEdge 6450 with 4 x Xeon 900MHz
processors, 2GB RAM, and a PERC 2/DC card attached to a JBOD.  The running
kernel is kernel-smp-2.6.17-1.2157_FC5.

As described in the RHEL 3 bug, adding "reboot=b,s" HAS fixed the problem.  But
I thought this should be brought out into the open anyway.
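
To confirm that the workaround is actually in effect on a running system, one
can look for "reboot=" on the kernel command line.  The following is only a
minimal sketch; the function name and the sample command line are
illustrative assumptions, not taken from this report.

```python
def reboot_options(cmdline):
    """Return every reboot= value found on a kernel command line, in order."""
    prefix = "reboot="
    values = []
    for token in cmdline.split():
        if token.startswith(prefix):
            values.append(token[len(prefix):])
    return values


# Hypothetical command line; only the reboot=b,s part comes from this report.
sample = "ro root=LABEL=/ reboot=b,s"
print(reboot_options(sample))
```

On a live system the input string would typically be the contents of
/proc/cmdline.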

If any other information is needed to verify and monkey with this, just let me
know.  

Thanks.
Jason



+++ This bug was initially created as a clone of Bug #127689 +++

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; Linux i686; U) Opera 7.51  [en]

Description of problem:
We have three Dell PowerEdge 6450 servers that were upgraded from RHEL 
AS 2.1 to 3.0 several months ago.  After the upgrade it was 
discovered that all of these servers would hang when attempting to 
reboot.  After some research we discovered several reports on the web 
about the same issue and the fix seemed to be to add "reboot=b,s" to 
the boot command line.  This indeed did fix the issue for two of the 
three servers, however, the third server continued to fail to reboot.

The only difference between the servers that would reboot and the 
server that wouldn't is that the one server that fails has 4 CPUs 
while the others only have 2 CPUs.

I continued to try several different variations of the "reboot=" 
option such as "reboot=b,s0", "reboot=b,s1", etc., hoping that 
perhaps Linux was simply selecting the incorrect processor to perform 
the reboot, however, no option that I tried corrected this issue.  We 
also tried several other combinations with other "reboot=" options 
such as w, c, and h.  Nothing has succeeded in getting this issue 
resolved.
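
For readers unfamiliar with the option letters, the sketch below shows how a
value such as "b,s1" could be read one comma-separated flag at a time.  The
letter meanings (w = warm, c = cold, b = through the BIOS, h = hard reset,
s = SMP with an optional CPU number) reflect my understanding of the i386
kernels of that era and should be treated as an assumption; the function and
key names are illustrative, and this is not the kernel's own code.

```python
def parse_reboot(value):
    """Interpret a reboot= value such as "b,s1" flag by flag."""
    settings = {"mode": None, "thru_bios": None, "smp": False, "cpu": None}
    for flag in value.split(","):
        if not flag:
            continue
        letter = flag[0]
        if letter == "w":
            settings["mode"] = "warm"
        elif letter == "c":
            settings["mode"] = "cold"
        elif letter == "b":
            settings["thru_bios"] = True
        elif letter == "h":
            settings["thru_bios"] = False
        elif letter == "s":
            settings["smp"] = True
            # Up to two decimal digits after "s" select the CPU number.
            digits = ""
            for ch in flag[1:3]:
                if ch not in "0123456789":
                    break
                digits += ch
            if digits:
                settings["cpu"] = int(digits)
    return settings


print(parse_reboot("b,s1"))
```

Under this reading, "reboot=b,s0" and "reboot=b,s1" differ only in which CPU
is asked to carry out the reset.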

For additional testing I tried the following kernels and list their 
success or failure:

Redhat AS 2.1 -- 2.4.9-e.38         -- Works
Redhat 9      -- 2.4.20-31.9        -- Fails
Fedora Core 1 -- 2.4.22-1.2197.nptl -- Works
Redhat AS 3   -- 2.4.21-15.EL UP    -- Works

I tested several variants of the Redhat AS kernels.  All SMP versions 
failed, from 2.4.21-4.EL through the latest 2.4.21-15.0.3.EL, 
however, all UP kernels rebooted without issues.

There are other reports of the issue that can be turned up with a 
quick search on Google, some have success with "reboot=b,s" others do 
not.  I'm very suspicious that the people who do not have success are 
people with 4 CPUs.

Please let me know what other information needs to be provided.
 


Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-15.EL

How reproducible:
Always

Steps to Reproduce:
1. Boot Dell PowerEdge 6450 with four processors with any AS 3 kernel
2. Type 'reboot' at command line

    

Actual Results:  System will hang at "System Rebooting..."

Expected Results:  System should reboot

Additional info:

We have worked around this issue by installing Dell Server 
Administrator which can detect a hung OS and use the system's embedded 
service processor to power cycle the system.  Interestingly it 
detects this state as a hung OS and performs the recovery.  It's a 
crude workaround that shouldn't be required and adds an extra five 
minutes to an already long reboot process (these systems POST very 
slowly) but at least it allows us to reboot the server remotely even 
with this kernel bug.

-- Additional comment from anderson on 2004-07-12 14:23 EST --

Tom,

I don't have one of these machines to work with, so I'll have to
work through you.

One question re: the Dell Server Administrator.  Is it possible for
it to report the PC of each processor?  If this is a kernel-specific
problem, I would first like to rule out the possibility that the
IPI sent out by the rebooting cpu is not being received by one of
the other cpus.  If any of the processors for whatever reason are
sitting in a spin_lock_irq(), then they won't ever respond to the IPI,
and the rebooting system would block forever in machine_restart()
and act as you describe.  If you can get the PC of each cpu, it's
possible that one of the cpus may show that it is operating in an
address range that can be identified as a spin lock text area.
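
The failure mode described above can be illustrated with a toy model (not
kernel code): the rebooting CPU asks every other CPU to stop and waits until
all of them acknowledge, so a single CPU that cannot take interrupts keeps
the wait from ever finishing.  The function name, the dictionary layout and
the bounded poll count are assumptions added so that the sketch terminates.

```python
def cpus_not_stopped(irqs_enabled, max_polls=1000):
    """Return the CPUs that never acknowledge a stop request.

    irqs_enabled maps a CPU number to True if that CPU can take interrupts.
    """
    pending = set(irqs_enabled)
    for _ in range(max_polls):
        # Only a CPU that can receive the interrupt ever acknowledges it.
        pending = {cpu for cpu in pending if not irqs_enabled[cpu]}
        if not pending:
            break
    return sorted(pending)


# CPU 3 is modelled as spinning with interrupts disabled.
print(cpus_not_stopped({1: True, 2: True, 3: False}))
```

In this model a non-empty result corresponds to the hang: without the poll
limit, the waiting CPU would loop forever.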

If not, will you be able to run debug RHEL3 kernels that I create? 
I'd like to add a bunch of printk's in the machine_restart() function
to figure out what's going on.

Dave Anderson




-- Additional comment from ttsig on 2004-07-12 21:01 EST --
Unfortunately I don't think that Dell Server Admin can get at that 
level of information, at least via any user accessible method that I 
can find.

I guess that leaves us with the option of running a debug kernel, 
which I can do, but only during limited times as the system is a 
production Oracle box.  That being said, we plan to upgrade the other 
two systems to 4 CPUs this week and I'm anticipating that after we do 
that they will experience the same issue.  If that turns out to be 
the case I can probably move the services of one of the servers to 
one of our lab servers temporarily which would free up a system to 
test with.  In the meantime I can schedule times to test the reboot 
functionality on the existing server, but that probably means only 
one good test a day.  

I'm almost sure that the original beta kernels for RHEL 3 didn't have 
this problem.  I may see if I still have one of those lying around  
just to test the reboot functionality as it might give us another 
data point that is closer to the current kernel than the RH9 or FC1 
kernels.  Then we could run some diff to see what changed.

Later,
Tom


-- Additional comment from anderson on 2004-07-13 13:09 EST --

Ok -- if you want to test an earlier RHEL3 kernel version, I can
make it available for you.

-- Additional comment from greg.marsden on 2004-07-14 15:27 EST --
This is a duplicate of bug 102504
(haven't tried with the betas)

-- Additional comment from anderson on 2004-07-14 15:41 EST --

Thanks, Greg -- closing this as a duplicate.

*** This bug has been marked as a duplicate of 102504 ***

-- Additional comment from ttsig on 2004-07-14 16:43 EST --
How do I get access to that bug?  I can view it but cannot add
comments or add myself to the CC: list.  It appears to be restricted
to group members.

I missed it during my search because it was filed against the Beta. 
Sorry.

Thanks,
Tom


-- Additional comment from anderson on 2004-07-14 16:55 EST --

You are already on its cc: list, so you'll receive all
subsequent input into the case.

As to the restriction, it does appear to be restricted to
Red Hat development, but since you are now on the cc: list,
you are allowed to view it.  I don't personally know how
to change that behavior, but I can add your comments.

-- Additional comment from ttsig on 2004-07-23 15:37 EST --
I still am unable to post comments on Bug 102504, presumably because
it is for the Beta (I get the message "You are not permitted to edit
bugs in product Red Hat Enterprise Linux Beta").

I am interested to know what steps I should take next to assist with
resolving this issue.  We are upgrading two of our 6450s from 2 to 4
CPUs tonight.  Currently both of these systems will reboot with the
"reboot=s,b" parameter but our 4 CPU system will not.  We are
anticipating that after the upgrade we will then have 3 systems that
fail to reboot.

Is there a debug kernel we need to try?

Thanks,
Tom


-- Additional comment from lance on 2004-08-21 11:07 EST --
I, too, am experiencing this problem.  I have several Dell 6450s with
4 processors in each that fail to recycle after outputting the
'restarting system' message.  They are running RH Enterprise Linux AS
3 Update 2.  Is there a solution for this problem, perhaps in bug
#102504, which I cannot access at present?

-- Additional comment from anderson on 2004-08-23 08:28 EST --
Not yet.

-- Additional comment from marcdeslauriers on 2004-09-16 18:57 EST --
Has this been resolved in update 3?

I have to install a 6450 with 4 cpus at a customer location soon. If
this is still an issue, I'll just install RHEL2.1...

-- Additional comment from ttsig on 2004-09-16 22:17 EST --
I don't believe the issue is resolved; it certainly doesn't seem to 
be for me.  On top of this, I've had random lockups on multiple 
servers after upgrading to the 2.4.21-20.EL kernels in U3 and am in 
the process of reverting to the previous kernels.

You can easily work around the reboot issue with the Dell Server 
Administrator Auto Recovery feature, but I can't argue with running 
RHEL 2.1 unless you really need some of the RHEL 3 features.  I ran 
2.1 for quite a while on my 6450's and they were solid.  Since 
upgrading to RHEL 3 over nine months ago we've had nothing but 
trouble with every kernel release having some bug that seems to make 
it worse than the last one, I sometimes wish it was easy to go back.

Later,
Tom


-- Additional comment from darnold on 2004-10-21 11:34 EST --
I took a drive that had AS 3.2 installed in a Dell PE 1650 and 
installed it in my 6450. The 6450 would reboot with the 1650 drive. The 
1650 would NOT reboot with the 6450 drive.

-- Additional comment from joe.beiter on 2004-12-07 11:48 EST --
In addition to seeing the previously reported behavior on 6450s, I'm
also seeing this on our 1600s. All with 4 processors.

This is a "resolved duplicate"? Someone please tell Red Hat's support
staff so they can tell me the fix.



-- Additional comment from anderson on 2004-12-07 11:58 EST --

This bugzilla was closed as a duplicate of another open bugzilla.
Unfortunately the problem at hand is not resolved.  


-- Additional comment from nhruby.edu on 2005-02-09 11:54 EST --
PING: metoo.  Please unclassify the tracker for this.

-- Additional comment from sales on 2006-01-23 23:17 EST --
Have you guys fixed this problem?  I am running CentOS 3.6 
and it is doing the same thing to me with 2 CPUs and I just added 4.


-- Additional comment from bugzilla on 2006-02-21 14:04 EST --
Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.

-- Additional comment from petrides on 2006-04-22 05:03 EST --
A fix for this problem has just been committed to the RHEL3 U8
patch pool this evening (in kernel version 2.4.21-40.9.EL).


-- Additional comment from sales on 2006-04-22 12:08 EST --
(In reply to comment #22)
> A fix for this problem has just been committed to the RHEL3 U8
> patch pool this evening (in kernel version 2.4.21-40.9.EL).
> 
How did you fix it?

-- Additional comment from petrides on 2006-04-23 00:14 EST --
Created an attachment (id=128122)
fix committed to RHEL3 U8 for this bug

Hi, Greg.  The attached patch is what was committed to U8.  It simply
adds "black list" entries for the Dell PowerEdge 6400 and 6450 systems
that make reboots go through the BIOS (via setting "reboot_thru_bios").
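
The idea behind such a "black list" can be sketched as a lookup keyed on the
machine's DMI identification.  This is not the attached patch, which is not
reproduced in this report; the table layout, the function name and the
vendor string used in the example are assumptions for illustration only.

```python
# Hypothetical quirk table: (vendor substring, product substring).
# The exact DMI strings matched by the real patch are not shown in this report.
REBOOT_THRU_BIOS_QUIRKS = [
    ("Dell", "PowerEdge 6400"),
    ("Dell", "PowerEdge 6450"),
]


def reboot_thru_bios(sys_vendor, product_name, default=False):
    """Return True if the machine matches a quirk entry, else the default."""
    for vendor, product in REBOOT_THRU_BIOS_QUIRKS:
        if vendor in sys_vendor and product in product_name:
            return True
    return default


# Illustrative vendor string; a real system reports its own.
print(reboot_thru_bios("Dell Computer Corporation", "PowerEdge 6450"))
```

The effect is comparable to passing "reboot=b" by hand, except that matching
machines get it automatically.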

-- Additional comment from petrides on 2006-04-28 17:43 EST --
Adding a couple dozen bugs to CanFix list so I can complete the stupid advisory.

Comment 1 Dave Jones 2006-10-16 20:21:11 UTC

A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.
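
One way to perform that check is to list the installed versions of both
packages and flag any name that appears more than once.  The sketch below
assumes its input was produced by a query along the lines of
rpm -q --qf '%{NAME} %{VERSION}-%{RELEASE}\n' device-mapper lvm2, giving one
"name version" pair per line; the function name and the version numbers in
the example are made up for illustration.

```python
from collections import defaultdict


def duplicated_packages(query_output):
    """Map each package name with more than one installed version to its versions."""
    versions = defaultdict(set)
    for line in query_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank lines and anything not in "name version" form
        name, version = parts
        versions[name].add(version)
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}


# Made-up version numbers, for illustration only.
sample = "device-mapper 1.02.02-3.2\ndevice-mapper 1.02.07-2.0\nlvm2 2.02.01-1.2.1\n"
print(duplicated_packages(sample))
```

An empty result means each package is present in a single version.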

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 2 Jon Stanley 2008-01-20 04:37:25 UTC

(this is a mass-close to kernel bugs in NEEDINFO state)

As indicated previously, there has been no update on the progress of this bug,
so I am closing it as INSUFFICIENT_DATA. Please re-open if the issue
still occurs for you and I will try to assist in its resolution. Thank you for
taking the time to report the initial bug.

If you believe that this bug was closed in error, please feel free to reopen
this bug.
