Description of problem:
Kernel panic when booting a non-Xen kernel under RHEL 5.2 Snapshot 1.

Version-Release number of selected component (if applicable):
kernel 2.6.18-85.el5

How reproducible:
Every time.

Steps to Reproduce:
1. Install Snapshot 1.
2. Boot the non-Xen kernel from GRUB.

Actual results:
Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for details.
Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for details.
Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for details.
<0>Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for details.
Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for details.
Kernel panic - not syncing: An NMI occurred, please see the Integrated Management Log for details.
BUG: warning at drivers/char/vt.c:3361/do_unblank_screen() (Not tainted)

Call Trace:
 <NMI>  [<ffffffff80199f9e>] do_unblank_screen+0x56/0x132
 [<ffffffff80080573>] bust_spinlocks+0x1c/0x46
 [<ffffffff8008f3d9>] panic+0x88/0x1eb
 [<ffffffff8821243d>] :hpwdt:hpwdt_pretimeout+0x85/0x8c
 [<ffffffff80066b91>] notifier_call_chain+0x20/0x32
 [<ffffffff80065567>] default_do_nmi+0x67/0x214
 [<ffffffff800659d8>] do_nmi+0x43/0x61
 [<ffffffff80064e47>] nmi+0x7f/0x88
 [<ffffffff80056bd7>] mwait_idle+0x0/0x4a
 [<ffffffff80056c0d>] mwait_idle+0x36/0x4a
 <<EOE>>  [<ffffffff80048a90>] cpu_idle+0x95/0xb8
 [<ffffffff803d9801>] start_kernel+0x220/0x225
 [<ffffffff803d922f>] _sinittext+0x22f/0x236

Expected results:
The kernel should boot without panicking.

Additional info:
I suspect this is caused by the hpwdt driver. I've had similar problems with upstream kernels.

System info:
ProLiant ML570 P60 (07/28/2006)
4x Xeon @ 3.00GHz/800MHz (Dual-Core, 2x2MB L2)
P400 boot controller in x8 slot
Created attachment 298930: complete boot log prior to panic.
Tony, I've seen unknown NMI messages on a few HP systems in the past few weeks. It's almost as if the NMI were firing randomly with no event code passed along. Any ideas on what could be causing this?

P.
Tony, for example, from hp-dl360g5-01.rhts.boston.redhat.com:

[root@hp-dl360g5-01 ~]# dmesg | grep NMI
ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
[ 64.272201] testing NMI watchdog ... OK.
[ 76.815676] hpwdt: An NMI occurred, but unable to determine source.
[ 76.815681] hpwdt: An NMI occurred, but unable to determine source.
[ 84.192152] hpwdt: An NMI occurred, but unable to determine source.
[ 86.178343] hpwdt: An NMI occurred, but unable to determine source.
[ 98.718973] hpwdt: An NMI occurred, but unable to determine source.
[ 108.080338] hpwdt: An NMI occurred, but unable to determine source.
[ 146.458958] hpwdt: An NMI occurred, but unable to determine source.
[ 196.274314] hpwdt: An NMI occurred, but unable to determine source.

As I said, the NMI seems to be firing randomly on this system...

P.
Mike (Miller), is there anything useful in the iLO log? Just curious...

P.
After examining this code with dzickus, we came to the following conclusion: the hpwdt code is busted.

Currently, the code does the following:

Is this interrupt mine?
  Yes -- okay, panic.
  No  -- print a message that this NMI isn't mine and stop all future NMIs from occurring.

What the code should actually do:

Is this interrupt mine?
  Yes -- okay, panic.
  No  -- do nothing and return, so that the next registered NMI handler can look at it.

[Tested] patch coming soon,
P.
Created attachment 299040: RHEL5 fix for this issue.

Tony, please review ASAP. As it stands now, NMI is broken on all HP systems.
This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.
Actually, I think the problem is related to pre-G5 systems and hpwdt. hpwdt should not run on any HP system older than G5. I've notified the maintainer.
Prarit, there's nothing useful in the logs, since hpwdt should not even run on this G4 server.
(In reply to comment #10)
> Actually I think the problem is related to pre-G5 system and the hpwdt. hpdwt
> should not on any HP system unless it's at least G5. I've notified the maintainer.

Thanks Mike -- I've been speaking with Tom over private email myself. Hopefully we can get a quick patch in to resolve this issue.

Tom is basically saying that the driver should only load if nmi_watchdog=0. I'm not 100% convinced this is the right thing to do; OTOH, it's HP's driver, so you get to do what you want with it ;)

P.
Per discussions with management, we are going to disable compiling the hpwdt driver. Hopefully the issues will be worked out and well tested in time for 5.3. The code will remain in our code base, so HP can still compile it out of tree and ship it to customers if they choose to.
Created attachment 299947: turn off hpwdt compile in RHEL5.
This is not a RHEL 5.2 blocker. Moving the issue out to RHEL 5.3; depending on the stability of the driver, we can review it for inclusion then.
in kernel-2.6.18-89.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html