Bug 461671 - [RHEL5] nmi: crash during kdump kernel boot
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: beta
Target Release: ---
Assigned To: Aristeu Rozanski
Martin Jenner
Reported: 2008-09-09 15:45 EDT by Aristeu Rozanski
Modified: 2009-01-20 15:07 EST
Doc Type: Bug Fix
Last Closed: 2009-01-20 15:07:56 EST

Attachments
backported patch (1.73 KB, patch)
2008-09-22 17:18 EDT, Aristeu Rozanski

Description Aristeu Rozanski 2008-09-09 15:45:15 EDT
Description of problem:
While a kdump kernel was booting, Dave Anderson's burn machine hit another
problem with the NMI watchdog patches. The problem is also present upstream.

dump copied from the screen:
<NMI> [<ffffffff8009008d>] panic+0x1da/0x1eb
[<ffffffff8019e996>] do_unblank_screen+0x1b/0x132
[<ffffffff80064fe2>] oops_end+0x51/0x53
[<ffffffff8006b903>] die+0x3a/0x44
[<ffffffff80065587>] do_general_protection+0xfe/0x107
[<ffffffff8005dde9>] error_exit+0x0/0x84
[<ffffffff8007ef75>] write_watchdog_counter+0x2d/0x31
[<ffffffff8007f1f9>] lapic_wd_event+0x30/0x3f
[<ffffffff80065a5e>] nmi_watchdog_tick+0x1c2/0x1d3
[<ffffffff80065611>] default_do_nmi+0x81/0x225
[<ffffffff8006587e>] do_nmi+0x43/0x61
[<ffffffff8007f648>] setup_p4_watchdog+0xca/0xe9
<<EOE>> [<ffffffff8007f30c>] lapic_nmi_watchdog_init+0x1b/0x3c
[<ffffffff80077518>] setup_apic_nmi_watchdog+0x42/0x8a
[<ffffffff80076d84>] setup_local_APIC+0x17b/0x187
[<ffffffff803f93b2>] smp_prepare_cpus+0x34b/0x361
[<ffffffff803ef8bf>] init+0x62/0x2f7
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff801703ca>] acpi_ds_init_one_object+0x0/0x80
[<ffffffff803ef85d>] init+0x0/0x2f7
[<ffffffff8005dfa7>] child_rip+0x0/0x11

(original photo attached)
Comment 1 Aristeu Rozanski 2008-09-09 17:03:20 EDT
The theory behind this is that an NMI from the first kernel's NMI watchdog was
delivered right when the NMI watchdog of the kdump kernel was being initialized:

[<ffffffff8006587e>] do_nmi+0x43/0x61
[<ffffffff8007f648>] setup_p4_watchdog+0xca/0xe9
<<EOE>> [<ffffffff8007f30c>] lapic_nmi_watchdog_init+0x1b/0x3c
[<ffffffff80077518>] setup_apic_nmi_watchdog+0x42/0x8a

The problem is that:
void setup_apic_nmi_watchdog(void)
        if (__get_cpu_var(wd_enabled) == 1)

        switch (nmi_watchdog) {
        case NMI_LOCAL_APIC:
                __get_cpu_var(wd_enabled) = 1;
                if (lapic_watchdog_init(nmi_hz) < 0) {
                        __get_cpu_var(wd_enabled) = 0;

int __kprobes nmi_watchdog_tick(struct pt_regs * regs, unsigned reason)
        int sum, touched = 0, rc = 0;           
        /* see if the nmi watchdog went off */
        if (!__get_cpu_var(wd_enabled))
                return rc;
        switch (nmi_watchdog) {
        case NMI_LOCAL_APIC:
                rc |= lapic_wd_event(nmi_hz);

So the NMI arrived between setting wd_enabled to 1 for that processor and
actually configuring the watchdog. It was therefore treated as an NMI watchdog
NMI, and p4_rearm() ended up being called, causing the crash in
write_watchdog_counter().

Another possibility is that setup_p4_watchdog() itself triggered the NMI while
setting the registers.
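
The window described above can be modelled in plain user-space C. This is only
a sketch of the ordering problem, not the kernel code and not the attached
patch: struct wd_state, the model_* functions, the tick_result values and the
placeholder MSR number are all hypothetical names made up for illustration.
Only wd_enabled and perfctr_msr correspond to fields mentioned in this report.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU watchdog state discussed above. */
struct wd_state {
    bool     wd_enabled;   /* models the per-CPU wd_enabled flag */
    uint32_t perfctr_msr;  /* stays 0 until the counter is configured */
};

enum tick_result { TICK_IGNORED, TICK_REARMED, TICK_BAD_MSR };

/*
 * Models the check at the top of nmi_watchdog_tick(): the NMI is only
 * treated as a watchdog NMI when wd_enabled is set. A rearm with
 * perfctr_msr == 0 is the faulting case seen in the trace.
 */
enum tick_result model_watchdog_tick(const struct wd_state *wd)
{
    if (!wd->wd_enabled)
        return TICK_IGNORED;
    return wd->perfctr_msr ? TICK_REARMED : TICK_BAD_MSR;
}

/* Ordering from the excerpt: the flag is set first, the MSR afterwards. */
enum tick_result model_setup_flag_first(struct wd_state *wd, uint32_t msr)
{
    enum tick_result during;

    wd->wd_enabled = true;
    during = model_watchdog_tick(wd);  /* stray NMI lands in the window */
    wd->perfctr_msr = msr;
    return during;
}

/* Alternative ordering, shown only for contrast: MSR first, flag last. */
enum tick_result model_setup_msr_first(struct wd_state *wd, uint32_t msr)
{
    enum tick_result during;

    wd->perfctr_msr = msr;
    during = model_watchdog_tick(wd);  /* stray NMI lands in the window */
    wd->wd_enabled = true;
    return during;
}
```

With a zero-initialised state, model_setup_flag_first() reports TICK_BAD_MSR
for the NMI that lands in the window, while model_setup_msr_first() reports
TICK_IGNORED. The second ordering is not necessarily a valid fix in the real
kernel, since an NMI ignored by the watchdog code still has to be accounted
for elsewhere; the model only shows why the flag-first order exposes an
unconfigured perfctr_msr.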
Comment 2 Aristeu Rozanski 2008-09-10 17:33:19 EDT
Managed to reproduce it. Full oops:
general protection fault: 0000 [1] SMP
last sysfs file:
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.18-109.el5.nmi2 #1
RIP: 0010:[<ffffffff8007f394>]  [<ffffffff8007f394>] write_watchdog_counter+0x64/0x6a
RSP: 0000:ffffffff8042fe78  EFLAGS: 00010093
RAX: 00000000ffd56125 RBX: 00000000ffffffff RCX: 0000000000000000
RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff802fbbdc
RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000003e
R10: ffffffff803e1520 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000030 R14: 000000000000030c R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff803b4000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b8838c85000 CR3: 0000000001001000 CR4: 00000000000006e0
Process swapper (pid: 1, threadinfo ffff81003fe68000, task ffff81003fe5d7a0)
Stack:  0000000000000000 0000000000000030 ffffffff8042ff58 ffffffff8007f0d6
 0000000000000001 ffffffff80065a5e ffffffff8042ff58 ffffffff8029b53b
 0000000000000030 0000000200000002 0000000000000000 0000000000000030
Call Trace:
 <NMI>  [<ffffffff8007f0d6>] lapic_wd_event+0x38/0x3f
 [<ffffffff80065a5e>] nmi_watchdog_tick+0x1c2/0x1d3
 [<ffffffff80065611>] default_do_nmi+0x81/0x225
 [<ffffffff8006587e>] do_nmi+0x43/0x61
 [<ffffffff80064ed7>] nmi+0x7f/0x88
 [<ffffffff8007f6c2>] setup_p4_watchdog+0xca/0xe9
 <<EOE>>  [<ffffffff8007f1e9>] lapic_watchdog_init+0x1b/0x3c
 [<ffffffff80077518>] setup_apic_nmi_watchdog+0x42/0x8a
 [<ffffffff80076d84>] setup_local_APIC+0x17b/0x187
 [<ffffffff803f93b2>] smp_prepare_cpus+0x34b/0x361
 [<ffffffff803ef8bf>] init+0x62/0x2f7
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff80170444>] acpi_ds_init_one_object+0x0/0x80
 [<ffffffff803ef85d>] init+0x0/0x2f7
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

The fault is caused by wd->perfctr_msr being zero (RCX, which holds the MSR
address for wrmsr, is 0 in the register dump above).
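
One defensive option is to refuse to rearm while perfctr_msr is still zero.
The following is a minimal user-space sketch of that idea under stated
assumptions. It is not the backported patch: struct wd_model, model_wrmsr(),
model_rearm_guarded() and model_wrmsr_count are hypothetical names, and the
real wrmsr instruction is replaced by a function that simply reports failure
for MSR address 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model; only perfctr_msr is a name taken from the report. */
struct wd_model {
    uint32_t perfctr_msr;  /* 0 means "not configured yet" */
};

/* Number of simulated MSR writes that were accepted. */
unsigned int model_wrmsr_count;

/*
 * Stand-in for wrmsr. In the oops above the write with address 0 ended
 * in a general protection fault; here it just returns false.
 */
bool model_wrmsr(uint32_t msr, uint64_t value)
{
    (void)value;
    if (msr == 0)
        return false;
    model_wrmsr_count++;
    return true;
}

/* Rearm only when the counter MSR has actually been set up. */
bool model_rearm_guarded(const struct wd_model *wd, uint64_t reload)
{
    if (wd->perfctr_msr == 0)
        return false;  /* watchdog not configured: skip the write */
    return model_wrmsr(wd->perfctr_msr, reload);
}
```

In this model an unconfigured descriptor never reaches model_wrmsr(), so the
failing write cannot happen. Whether a check like this belongs in the rearm
path or earlier in the NMI handler is a design decision this sketch does not
settle.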

Another thing worth noting: the system may have N CPUs, all of them with
performance counters in use and generating NMIs.

Easily reproducible on Dave's machine.
Comment 4 RHEL Product and Program Management 2008-09-22 16:53:41 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 5 Aristeu Rozanski 2008-09-22 17:18:49 EDT
Created attachment 317419
backported patch
Comment 7 Don Zickus 2008-09-30 12:01:52 EDT
The patch is in kernel-2.6.18-117.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 12 errata-xmlrpc 2009-01-20 15:07:56 EST
An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

