Bug 461671 - [RHEL5] nmi: crash during kdump kernel boot
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All  OS: Linux
Priority: medium  Severity: medium
Target Milestone: beta
Assigned To: Aristeu Rozanski
QA Contact: Martin Jenner
Reported: 2008-09-09 15:45 EDT by Aristeu Rozanski
Modified: 2009-01-20 15:07 EST
Doc Type: Bug Fix
Last Closed: 2009-01-20 15:07:56 EST


Attachments
backported patch (1.73 KB, patch)
2008-09-22 17:18 EDT, Aristeu Rozanski

Description Aristeu Rozanski 2008-09-09 15:45:15 EDT
Description of problem:
While a kdump kernel was being loaded, Dave Anderson's burn machine hit another
problem with the NMI watchdog patches. The problem is also present upstream.

dump copied from the screen:
<NMI> [<ffffffff8009008d>] panic+0x1da/0x1eb
[<ffffffff8019e996>] do_unblank_screen+0x1b/0x132
[<ffffffff80064fe2>] oops_end+0x51/0x53
[<ffffffff8006b903>] die+0x3a/0x44
[<ffffffff80065587>] do_general_protection+0xfe/0x107
[<ffffffff8005dde9>] error_exit+0x0/0x84
[<ffffffff8007ef75>] write_watchdog_counter+0x2d/0x31
[<ffffffff8007f1f9>] lapic_wd_event+0x30/0x3f
[<ffffffff80065a5e>] nmi_watchdog_tick+0x1c2/0x1d3
[<ffffffff80065611>] default_do_nmi+0x81/0x225
[<ffffffff8006587e>] do_nmi+0x43/0x61
[<ffffffff8007f648>] setup_p4_watchdog+0xca/0xe9
<<EOE>> [<ffffffff8007f30c>] lapic_nmi_watchdog_init+0x1b/0x3c
[<ffffffff80077518>] setup_apic_nmi_watchdog+0x42/0x8a
[<ffffffff80076d84>] setup_local_APIC+0x17b/0x187
[<ffffffff803f93b2>] smp_prepare_cpus+0x34b/0x361
[<ffffffff803ef8bf>] init+0x62/0x2f7
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff801703ca>] acpi_ds_init_one_object+0x0/0x80
[<ffffffff803ef85d>] init+0x0/0x2f7
[<ffffffff8005dfa7>] child_rip+0x0/0x11

(original photo attached)
Comment 1 Aristeu Rozanski 2008-09-09 17:03:20 EDT
The theory behind this is that an NMI from the first kernel's NMI watchdog was
delivered right when the NMI watchdog of the kdump kernel was being initialized:

[<ffffffff8006587e>] do_nmi+0x43/0x61
[<ffffffff8007f648>] setup_p4_watchdog+0xca/0xe9
<<EOE>> [<ffffffff8007f30c>] lapic_nmi_watchdog_init+0x1b/0x3c
[<ffffffff80077518>] setup_apic_nmi_watchdog+0x42/0x8a

The problem is that:
void setup_apic_nmi_watchdog(void)
{       
        if (__get_cpu_var(wd_enabled) == 1)
                return;

        switch (nmi_watchdog) {
        case NMI_LOCAL_APIC:
                __get_cpu_var(wd_enabled) = 1;
                if (lapic_watchdog_init(nmi_hz) < 0) {
                        __get_cpu_var(wd_enabled) = 0;
                        return;
                }

and
int __kprobes nmi_watchdog_tick(struct pt_regs * regs, unsigned reason)
{       
        int sum, touched = 0, rc = 0;           
(...)
        /* see if the nmi watchdog went off */
        if (!__get_cpu_var(wd_enabled))
                return rc;
        switch (nmi_watchdog) {
        case NMI_LOCAL_APIC:
                rc |= lapic_wd_event(nmi_hz);
                break;

So the NMI arrived between setting wd_enabled to 1 for that processor and
actually configuring the watchdog. It was therefore treated as an NMI watchdog
NMI, and p4_rearm() ended up being called, causing the crash in
write_watchdog_counter().

Another possibility is that setup_p4_watchdog() itself triggered the NMI while
setting the registers.
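
The window described above can be modelled in user space. The sketch below is an illustration only: struct wd_model, model_nmi_tick() and the two setup functions are hypothetical stand-ins, not kernel code, and the second ordering is one possible way to close the window, not necessarily what the backported patch does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-CPU watchdog state (not the kernel's struct). */
struct wd_model {
    int wd_enabled;           /* models __get_cpu_var(wd_enabled) */
    unsigned int perfctr_msr; /* models wd->perfctr_msr; 0 until setup finishes */
};

enum tick_result { TICK_IGNORED, TICK_REARMED, TICK_WOULD_FAULT };

/* Placeholder non-zero value standing in for a configured counter MSR. */
#define MODEL_PERFCTR_MSR 1u

/* Models the check in nmi_watchdog_tick(): rearm only if marked enabled. */
enum tick_result model_nmi_tick(const struct wd_model *wd)
{
    assert(wd != NULL);
    if (!wd->wd_enabled)
        return TICK_IGNORED;
    /* Rearming with an unconfigured (zero) MSR is the case seen in the oops. */
    return wd->perfctr_msr ? TICK_REARMED : TICK_WOULD_FAULT;
}

/* Ordering from the excerpt: mark enabled first, configure afterwards. */
enum tick_result setup_enable_first(struct wd_model *wd)
{
    enum tick_result r;

    wd->wd_enabled = 1;
    r = model_nmi_tick(wd);              /* stray NMI lands inside the window */
    wd->perfctr_msr = MODEL_PERFCTR_MSR; /* configuration completes too late */
    return r;
}

/* Alternative ordering: configure first, mark enabled last. */
enum tick_result setup_enable_last(struct wd_model *wd)
{
    enum tick_result r;

    r = model_nmi_tick(wd);              /* same stray NMI, watchdog not enabled yet */
    wd->perfctr_msr = MODEL_PERFCTR_MSR;
    wd->wd_enabled = 1;
    return r;
}
```

Calling setup_enable_first() on a zeroed struct wd_model returns TICK_WOULD_FAULT, while setup_enable_last() returns TICK_IGNORED for the same stray NMI. Once either setup has finished, model_nmi_tick() returns TICK_REARMED.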
Comment 2 Aristeu Rozanski 2008-09-10 17:33:19 EDT
Managed to reproduce it. Full oops:
general protection fault: 0000 [1] SMP
last sysfs file:
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.18-109.el5.nmi2 #1
RIP: 0010:[<ffffffff8007f394>]  [<ffffffff8007f394>] write_watchdog_counter+0x64/0x6a
RSP: 0000:ffffffff8042fe78  EFLAGS: 00010093
RAX: 00000000ffd56125 RBX: 00000000ffffffff RCX: 0000000000000000
RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff802fbbdc
RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000003e
R10: ffffffff803e1520 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000030 R14: 000000000000030c R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff803b4000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b8838c85000 CR3: 0000000001001000 CR4: 00000000000006e0
Process swapper (pid: 1, threadinfo ffff81003fe68000, task ffff81003fe5d7a0)
Stack:  0000000000000000 0000000000000030 ffffffff8042ff58 ffffffff8007f0d6
 0000000000000001 ffffffff80065a5e ffffffff8042ff58 ffffffff8029b53b
 0000000000000030 0000000200000002 0000000000000000 0000000000000030
Call Trace:
 <NMI>  [<ffffffff8007f0d6>] lapic_wd_event+0x38/0x3f
 [<ffffffff80065a5e>] nmi_watchdog_tick+0x1c2/0x1d3
 [<ffffffff80065611>] default_do_nmi+0x81/0x225
 [<ffffffff8006587e>] do_nmi+0x43/0x61
 [<ffffffff80064ed7>] nmi+0x7f/0x88
 [<ffffffff8007f6c2>] setup_p4_watchdog+0xca/0xe9
 <<EOE>>  [<ffffffff8007f1e9>] lapic_watchdog_init+0x1b/0x3c
 [<ffffffff80077518>] setup_apic_nmi_watchdog+0x42/0x8a
 [<ffffffff80076d84>] setup_local_APIC+0x17b/0x187
 [<ffffffff803f93b2>] smp_prepare_cpus+0x34b/0x361
 [<ffffffff803ef8bf>] init+0x62/0x2f7
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff80170444>] acpi_ds_init_one_object+0x0/0x80
 [<ffffffff803ef85d>] init+0x0/0x2f7
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

The fault is caused by wd->perfctr_msr being zero when write_watchdog_counter() runs.
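
To illustrate why a zero perfctr_msr matters, here is a minimal user-space sketch of a rearm path that refuses to write when the counter has not been configured. The struct wd_state and model_rearm() names are hypothetical, the MSR write is replaced by a plain store, and this defensive check is only an assumption for illustration, not the change made in the attached patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the watchdog state used when rearming. */
struct wd_state {
    uint32_t perfctr_msr;  /* 0 means "not configured yet" in this model */
    uint64_t last_written; /* stands in for the value written to the MSR */
    int writes;            /* number of simulated MSR writes */
};

/*
 * Models a rearm that skips the write when no counter MSR is configured.
 * Returns 1 if the counter was "written", 0 if the rearm was skipped.
 */
int model_rearm(struct wd_state *wd, uint64_t count)
{
    assert(wd != NULL);
    if (wd->perfctr_msr == 0)
        return 0;             /* nothing valid to write to yet */
    wd->last_written = count; /* a real kernel would write the MSR here */
    wd->writes++;
    return 1;
}
```

With a zeroed struct wd_state, model_rearm() returns 0 and performs no write. After perfctr_msr is set to a non-zero value, it records the count and returns 1.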

Another thing worth noting: the system may have N CPUs, all of them with
performance counters in use and generating NMIs.

Easily reproducible on Dave's machine.
Comment 4 RHEL Product and Program Management 2008-09-22 16:53:41 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 5 Aristeu Rozanski 2008-09-22 17:18:49 EDT
Created attachment 317419
backported patch
Comment 7 Don Zickus 2008-09-30 12:01:52 EDT
Included in kernel-2.6.18-117.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 12 errata-xmlrpc 2009-01-20 15:07:56 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html
