Bug 438741 - kernel panic due to HP Watchdog firing (hpwdt)
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: All Linux
Priority: high Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Prarit Bhargava
QA Contact: Martin Jenner
Keywords: Regression
Reported: 2008-03-24 15:46 EDT by Mike Miller (OS Dev)
Modified: 2008-05-21 11:12 EDT (History)
Fixed In Version: RHBA-2008-0314
Doc Type: Bug Fix
Last Closed: 2008-05-21 11:12:25 EDT
Attachments
Complete boot log prior to panic. (25.46 KB, application/octet-stream)
2008-03-24 15:46 EDT, Mike Miller (OS Dev)
RHEL5 fix for this issue (1.22 KB, patch)
2008-03-25 11:37 EDT, Prarit Bhargava
Turn off hpwdt compile in RHEL5 (550 bytes, patch)
2008-04-01 15:09 EDT, Prarit Bhargava

Description Mike Miller (OS Dev) 2008-03-24 15:46:07 EDT
Description of problem:
Kernel panic when booting a non-Xen kernel under RHEL 5.2 snapshot 1.

Version-Release number of selected component (if applicable):
kernel version 2.6.18-85.el5.

How reproducible:
Every time.

Steps to Reproduce:
1. Install RHEL 5.2 snapshot 1.
2. Boot the non-Xen kernel from GRUB.

Actual results:
Kernel panic - not syncing: An NMI occurred, please see the Integrated
Management Log for details.

Kernel panic - not syncing: An NMI occurred, please see the Integrated
Management Log for details.

Kernel panic - not syncing: An NMI occurred, please see the Integrated
Management Log for details.

  <0>Kernel panic - not syncing: An NMI occurred, please see the Integrated
Management Log for details.

Kernel panic - not syncing: An NMI occurred, please see the Integrated
Management Log for details.

Kernel panic - not syncing: An NMI occurred, please see the Integrated
Management Log for details.

   BUG: warning at drivers/char/vt.c:3361/do_unblank_screen() (Not tainted)

Call Trace:
 <NMI>  [<ffffffff80199f9e>] do_unblank_screen+0x56/0x132
 [<ffffffff80080573>] bust_spinlocks+0x1c/0x46
 [<ffffffff8008f3d9>] panic+0x88/0x1eb
 [<ffffffff8821243d>] :hpwdt:hpwdt_pretimeout+0x85/0x8c
 [<ffffffff80066b91>] notifier_call_chain+0x20/0x32
 [<ffffffff80065567>] default_do_nmi+0x67/0x214
 [<ffffffff800659d8>] do_nmi+0x43/0x61
 [<ffffffff80064e47>] nmi+0x7f/0x88
 [<ffffffff80056bd7>] mwait_idle+0x0/0x4a
 [<ffffffff80056c0d>] mwait_idle+0x36/0x4a
 <<EOE>>  [<ffffffff80048a90>] cpu_idle+0x95/0xb8
 [<ffffffff803d9801>] start_kernel+0x220/0x225
 [<ffffffff803d922f>] _sinittext+0x22f/0x236


Expected results:
Kernel should boot without any panics.

Additional info:
I suspect this is caused by the hpwdt driver; I have had similar problems
when using upstream kernels. System info:

Proliant ML570 P60 (07/28/2006)
4 XEON @ 3.00GHz/800MHz (Dual-Core, 2x2MB L2)
P400 boot controller in x8 slot
Comment 1 Mike Miller (OS Dev) 2008-03-24 15:46:07 EDT
Created attachment 298930 [details]
Complete boot log prior to panic.
Comment 2 Prarit Bhargava 2008-03-25 08:38:38 EDT
Tony, I've seen unknown NMI messages on a few HP systems in the past few weeks.
It's almost as if the NMI was randomly firing and no event code was passed along.

Any ideas on what could be causing this?

P.
Comment 3 Prarit Bhargava 2008-03-25 09:06:40 EDT
Tony, for example from hp-dl360g5-01.rhts.boston.redhat.com:

[root@hp-dl360g5-01 ~]# dmesg | grep NMI
ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
[   64.272201] testing NMI watchdog ... OK.
[   76.815676] hpwdt: An NMI occurred, but unable to determine source.
[   76.815681] hpwdt: An NMI occurred, but unable to determine source.
[   84.192152] hpwdt: An NMI occurred, but unable to determine source.
[   86.178343] hpwdt: An NMI occurred, but unable to determine source.
[   98.718973] hpwdt: An NMI occurred, but unable to determine source.
[  108.080338] hpwdt: An NMI occurred, but unable to determine source.
[  146.458958] hpwdt: An NMI occurred, but unable to determine source.
[  196.274314] hpwdt: An NMI occurred, but unable to determine source.

Like I said, it seems like the NMI is randomly firing on this system ...
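
To quantify how often the "unable to determine source" message fires, the dmesg output above can be scanned programmatically. The following is a minimal sketch, not part of the driver or of any fix: the function name `collect_unknown_nmi` and the fixed 256-byte line buffer are assumptions made for illustration, and only the message text is taken from the output above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Message text copied from the dmesg output above. */
#define UNKNOWN_NMI_MSG "hpwdt: An NMI occurred, but unable to determine source."

/*
 * Hypothetical helper: scan a dmesg-style buffer and record the timestamp
 * of every line that contains UNKNOWN_NMI_MSG. Returns the number of
 * matching lines; at most `max` timestamps are written to `out`.
 */
size_t collect_unknown_nmi(const char *log, double *out, size_t max)
{
    size_t found = 0;
    const char *line = log;

    assert(log != NULL && out != NULL);

    while (line && *line) {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);
        char buf[256];

        /* Copy one line into a bounded buffer, truncating if needed. */
        if (len >= sizeof(buf))
            len = sizeof(buf) - 1;
        memcpy(buf, line, len);
        buf[len] = '\0';

        if (strstr(buf, UNKNOWN_NMI_MSG)) {
            double ts;

            /* Lines look like "[   76.815676] hpwdt: ..." */
            if (sscanf(buf, "[ %lf]", &ts) == 1) {
                if (found < max)
                    out[found] = ts;
                found++;
            }
        }
        line = end ? end + 1 : NULL;
    }
    return found;
}
```

Feeding the function the captured dmesg text and comparing consecutive timestamps gives the gaps between events, which makes it easier to check whether the NMIs follow any fixed period.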

P.
Comment 4 Prarit Bhargava 2008-03-25 09:12:40 EDT
Mike (Miller) -- is there anything useful in the ILO log?

Just curious...

P.
Comment 5 Prarit Bhargava 2008-03-25 11:29:40 EDT
After examining this code with dzickus we came to the following conclusion --
the hpwdt code is busted.

Currently, the code does the following:

Is this interrupt mine?

Yes -- okay, panic.

No -- Print out a message that this NMI isn't mine and stop all future NMIs from
occurring.

The code should actually do:

Is this interrupt mine?

Yes -- okay, panic.

No.  Do nothing and return so that the next registered NMI handler can look at it.
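
The difference between the two behaviours described above can be sketched as a small userspace model of a handler chain. This is not the hpwdt source and not the attached patch: the return codes, the source identifiers, the counters standing in for panic(), and every function name below are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Return codes for this sketch only; not the kernel's notifier constants. */
enum nmi_result {
    NMI_NOT_MINE,   /* handler did not recognise the NMI; try the next one */
    NMI_HANDLED     /* handler claimed the NMI; stop walking the chain */
};

/* Hypothetical NMI source identifiers. */
enum { SRC_HP_WATCHDOG = 1, SRC_OTHER = 2 };

typedef enum nmi_result (*nmi_handler_fn)(int source);

int panics;         /* stands in for a call to panic() */
int other_handled;  /* counts NMIs that reached the second handler */

/* Behaviour described as broken: an NMI that is not ours is swallowed. */
enum nmi_result hpwdt_buggy(int source)
{
    if (source == SRC_HP_WATCHDOG) {
        panics++;
        return NMI_HANDLED;
    }
    return NMI_HANDLED; /* bug: later handlers never see this NMI */
}

/* Proposed behaviour: do nothing and let the next handler look at it. */
enum nmi_result hpwdt_fixed(int source)
{
    if (source == SRC_HP_WATCHDOG) {
        panics++;
        return NMI_HANDLED;
    }
    return NMI_NOT_MINE;
}

/* A second, unrelated handler registered after the watchdog. */
enum nmi_result other_handler(int source)
{
    if (source == SRC_OTHER) {
        other_handled++;
        return NMI_HANDLED;
    }
    return NMI_NOT_MINE;
}

/* Walk the chain in order until one handler claims the NMI. */
int dispatch_nmi(const nmi_handler_fn *chain, size_t n, int source)
{
    size_t i;

    assert(chain != NULL);
    for (i = 0; i < n; i++)
        if (chain[i](source) == NMI_HANDLED)
            return 1;
    return 0;
}
```

With the chain `{ hpwdt_buggy, other_handler }`, dispatching `SRC_OTHER` never reaches `other_handler`; with `{ hpwdt_fixed, other_handler }` it does, while a genuine `SRC_HP_WATCHDOG` event still ends in the panic path.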

[Tested] patch coming soon,

P.
Comment 7 Prarit Bhargava 2008-03-25 11:37:56 EDT
Created attachment 299040 [details]
RHEL5 fix for this issue

Tony, please review ASAP.  As it stands now, NMI is broken on all HP systems.
Comment 8 RHEL Product and Program Management 2008-03-25 11:53:03 EDT
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.
Comment 10 Mike Miller (OS Dev) 2008-03-26 14:33:36 EDT
Actually, I think the problem is related to pre-G5 systems and hpwdt. hpwdt
should not run on any HP system unless it's at least a G5. I've notified the maintainer.
Comment 11 Mike Miller (OS Dev) 2008-03-26 14:34:51 EDT
Prarit, nothing useful in the logs, since hpwdt should not even run on this G4
server.
Comment 12 Prarit Bhargava 2008-03-27 08:21:00 EDT
(In reply to comment #10)
> Actually, I think the problem is related to pre-G5 systems and hpwdt. hpwdt
> should not run on any HP system unless it's at least a G5. I've notified the maintainer.

Thanks Mike -- I've been speaking with Tom over private email myself.  Hopefully
we can get a quick patch in to resolve this issue.

Tom is basically saying that the driver should only load if nmi_watchdog = 0. 
I'm not 100% convinced this is the right thing to do -- OTOH, it's HP's driver,
so you get to do what you want with it ;)
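
The two load conditions raised in this thread (G5 or later hardware, and nmi_watchdog set to 0) could be combined into a single check at driver load time. The sketch below only illustrates that idea: `struct platform_info`, its fields, and `hpwdt_may_load` are hypothetical names, and how a real driver would detect the server generation or read the boot parameter is not covered here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of the machine the driver is loading on. */
struct platform_info {
    int proliant_generation; /* e.g. 4 for a G4 server, 5 for a G5 */
    int nmi_watchdog;        /* value of the nmi_watchdog boot parameter */
};

/*
 * Return true only when both conditions discussed in the thread hold:
 * the server is at least a G5, and the kernel's own NMI watchdog is off.
 */
bool hpwdt_may_load(const struct platform_info *p)
{
    assert(p != NULL);

    if (p->proliant_generation < 5)
        return false; /* pre-G5 hardware: the driver should not run */
    if (p->nmi_watchdog != 0)
        return false; /* the NMI watchdog is already using NMIs */
    return true;
}
```

Under these assumptions, the G4 ML570 from the original report would be rejected regardless of the nmi_watchdog setting.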

P.
Comment 16 Don Zickus 2008-04-01 14:17:31 EDT
Per discussions with management, we are going to disable compiling the hpwdt
driver.  Hopefully the issues will be worked out and well tested in time for 5.3.

The code will remain in our source tree, so HP will be able to compile it
out of tree and ship it to customers if they feel inclined to do so.

Comment 17 Prarit Bhargava 2008-04-01 15:09:56 EDT
Created attachment 299947 [details]
Turn off hpwdt compile in RHEL5
Comment 18 Peter Martuccelli 2008-04-07 12:25:36 EDT
This is not a RHEL 5.2 blocker.  Moved the issue out to R5.3; depending on the
stability of the driver, we can review it for inclusion then.
Comment 19 RHEL Product and Program Management 2008-04-07 12:33:24 EDT
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.
Comment 22 Don Zickus 2008-04-09 14:44:34 EDT
in kernel-2.6.18-89.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 25 errata-xmlrpc 2008-05-21 11:12:25 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html
