Bug 461671

Summary:

[RHEL5] nmi: crash during kdump kernel boot

Product:

Red Hat Enterprise Linux 5

Reporter:

Aristeu Rozanski <arozansk>

Component:

kernel

Assignee:

Aristeu Rozanski <arozansk>

Status:

CLOSED ERRATA

QA Contact:

Martin Jenner <mjenner>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

5.4

CC:

anderson, dzickus, lwang, qcai

Target Milestone:

beta

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Last Closed:

2009-01-20 20:07:56 UTC

Attachments:

Description	Flags
backported patch	none

Description Aristeu Rozanski 2008-09-09 19:45:15 UTC

Description of problem:
While a kdump kernel was booting, Dave Anderson's burn machine hit another
problem with the NMI watchdog patches. The problem is also present upstream.

Dump copied from the screen:
<NMI> [<ffffffff8009008d>] panic+0x1da/0x1eb
[<ffffffff8019e996>] do_unblank_screen+0x1b/0x132
[<ffffffff80064fe2>] oops_end+0x51/0x53
[<ffffffff8006b903>] die+0x3a/0x44
[<ffffffff80065587>] do_general_protection+0xfe/0x107
[<ffffffff8005dde9>] error_exit+0x0/0x84
[<ffffffff8007ef75>] write_watchdog_counter+0x2d/0x31
[<ffffffff8007f1f9>] lapic_wd_event+0x30/0x3f
[<ffffffff80065a5e>] nmi_watchdog_tick+0x1c2/0x1d3
[<ffffffff80065611>] default_do_nmi+0x81/0x225
[<ffffffff8006587e>] do_nmi+0x43/0x61
[<ffffffff8007f648>] setup_p4_watchdog+0xca/0xe9
<<EOE>> [<ffffffff8007f30c>] lapic_nmi_watchdog_init+0x1b/0x3c
[<ffffffff80077518>] setup_apic_nmi_watchdog+0x42/0x8a
[<ffffffff80076d84>] setup_local_APIC+0x17b/0x187
[<ffffffff803f93b2>] smp_prepare_cpus+0x34b/0x361
[<ffffffff803ef8bf>] init+0x62/0x2f7
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff801703ca>] acpi_ds_init_one_object+0x0/0x80
[<ffffffff803ef85d>] init+0x0/0x2f7
[<ffffffff8005dfa7>] child_rip+0x0/0x11

(original photo attached)

Comment 1 Aristeu Rozanski 2008-09-09 21:03:20 UTC

The theory behind this is that an NMI from the first kernel's NMI watchdog was
delivered right when the NMI watchdog of the kdump kernel was being initialized:

[<ffffffff8006587e>] do_nmi+0x43/0x61
[<ffffffff8007f648>] setup_p4_watchdog+0xca/0xe9
<<EOE>> [<ffffffff8007f30c>] lapic_nmi_watchdog_init+0x1b/0x3c
[<ffffffff80077518>] setup_apic_nmi_watchdog+0x42/0x8a

The problem is that:
void setup_apic_nmi_watchdog(void)
{       
        if (__get_cpu_var(wd_enabled) == 1)
                return;

        switch (nmi_watchdog) {
        case NMI_LOCAL_APIC:
                __get_cpu_var(wd_enabled) = 1;
                if (lapic_watchdog_init(nmi_hz) < 0) {
                        __get_cpu_var(wd_enabled) = 0;
                        return;
                }

and
int __kprobes nmi_watchdog_tick(struct pt_regs * regs, unsigned reason)
{       
        int sum, touched = 0, rc = 0;           
(...)
        /* see if the nmi watchdog went off */
        if (!__get_cpu_var(wd_enabled))
                return rc;
        switch (nmi_watchdog) {
        case NMI_LOCAL_APIC:
                rc |= lapic_wd_event(nmi_hz);
                break;

So the NMI arrived between setting wd_enabled to 1 for that processor and
actually configuring the watchdog. It was therefore treated as an NMI watchdog
NMI, and p4_rearm() ended up being called, causing the crash in
write_watchdog_counter().

Another possibility is that setup_p4_watchdog() itself triggered the NMI while
setting the registers.
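
To make the window easier to see, here is a minimal user-space sketch of the ordering problem. It is not the kernel code and not the backported patch: struct wd_state, the sim_* functions, and FAKE_PERFCTR_MSR are hypothetical names invented for illustration, and a write to MSR 0 is simply counted as a fault to mirror the oops in this report. The model is single-threaded, so it ignores the compiler and CPU ordering concerns a real fix would have to consider.

```c
#include <assert.h>
#include <stdbool.h>

/* Arbitrary non-zero placeholder; not a real MSR address. */
#define FAKE_PERFCTR_MSR 0x1000u

/* Hypothetical, simplified stand-in for the per-CPU watchdog state. */
struct wd_state {
    bool wd_enabled;          /* models the per-CPU wd_enabled flag */
    unsigned int perfctr_msr; /* stays 0 until the counter is configured */
    int faults;               /* simulated general protection faults */
    int rearms;               /* successful simulated counter writes */
};

/* Models write_watchdog_counter(): a write to MSR 0 counts as a fault. */
static void sim_write_counter(struct wd_state *wd)
{
    if (wd->perfctr_msr == 0)
        wd->faults++;
    else
        wd->rearms++;
}

/* Models the watchdog branch of nmi_watchdog_tick(). */
static int sim_nmi_tick(struct wd_state *wd)
{
    if (!wd->wd_enabled)
        return 0;             /* not treated as a watchdog NMI */
    sim_write_counter(wd);
    return 1;
}

/* Ordering from the excerpt above: set the flag, then configure. */
static void sim_setup_flag_first(struct wd_state *wd, bool stray_nmi)
{
    wd->wd_enabled = true;
    if (stray_nmi)
        sim_nmi_tick(wd);     /* NMI lands inside the window */
    wd->perfctr_msr = FAKE_PERFCTR_MSR;
}

/* Assumed alternative ordering: configure, then set the flag. */
static void sim_setup_flag_last(struct wd_state *wd, bool stray_nmi)
{
    if (stray_nmi)
        sim_nmi_tick(wd);     /* ignored: watchdog not enabled yet */
    wd->perfctr_msr = FAKE_PERFCTR_MSR;
    wd->wd_enabled = true;
}

/* Self-check: the stray NMI faults only with the flag-first ordering. */
static void wd_selfcheck(void)
{
    struct wd_state a = {0}, b = {0};

    sim_setup_flag_first(&a, true);
    sim_setup_flag_last(&b, true);
    assert(a.faults == 1);
    assert(b.faults == 0 && b.wd_enabled);
}
```

Calling wd_selfcheck() from a test program exercises both orderings. In this model the stray NMI is harmless whenever the flag is set only after perfctr_msr holds a non-zero value. Whether the real fix reorders the setup or guards the rearm path is not stated in this report.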

Comment 2 Aristeu Rozanski 2008-09-10 21:33:19 UTC

Managed to reproduce it. Full oops:
general protection fault: 0000 [1] SMP
last sysfs file:
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.18-109.el5.nmi2 #1
RIP: 0010:[<ffffffff8007f394>]  [<ffffffff8007f394>] write_watchdog_counter+0x64/0x6a
RSP: 0000:ffffffff8042fe78  EFLAGS: 00010093
RAX: 00000000ffd56125 RBX: 00000000ffffffff RCX: 0000000000000000
RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff802fbbdc
RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000003e
R10: ffffffff803e1520 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000030 R14: 000000000000030c R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff803b4000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b8838c85000 CR3: 0000000001001000 CR4: 00000000000006e0
Process swapper (pid: 1, threadinfo ffff81003fe68000, task ffff81003fe5d7a0)
Stack:  0000000000000000 0000000000000030 ffffffff8042ff58 ffffffff8007f0d6
 0000000000000001 ffffffff80065a5e ffffffff8042ff58 ffffffff8029b53b
 0000000000000030 0000000200000002 0000000000000000 0000000000000030
Call Trace:
 <NMI>  [<ffffffff8007f0d6>] lapic_wd_event+0x38/0x3f
 [<ffffffff80065a5e>] nmi_watchdog_tick+0x1c2/0x1d3
 [<ffffffff80065611>] default_do_nmi+0x81/0x225
 [<ffffffff8006587e>] do_nmi+0x43/0x61
 [<ffffffff80064ed7>] nmi+0x7f/0x88
 [<ffffffff8007f6c2>] setup_p4_watchdog+0xca/0xe9
 <<EOE>>  [<ffffffff8007f1e9>] lapic_watchdog_init+0x1b/0x3c
 [<ffffffff80077518>] setup_apic_nmi_watchdog+0x42/0x8a
 [<ffffffff80076d84>] setup_local_APIC+0x17b/0x187
 [<ffffffff803f93b2>] smp_prepare_cpus+0x34b/0x361
 [<ffffffff803ef8bf>] init+0x62/0x2f7
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff80170444>] acpi_ds_init_one_object+0x0/0x80
 [<ffffffff803ef85d>] init+0x0/0x2f7
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

The fault is caused by wd->perfctr_msr being zero.

Another thing worth noting: the system may have N CPUs, all of them with
their performance counters in use and generating NMIs.

Easily reproducible on Dave's machine.

Comment 4 RHEL Program Management 2008-09-22 20:53:41 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 5 Aristeu Rozanski 2008-09-22 21:18:49 UTC

Created attachment 317419 [details]
backported patch

Comment 7 Don Zickus 2008-09-30 16:01:52 UTC

in kernel-2.6.18-117.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 12 errata-xmlrpc 2009-01-20 20:07:56 UTC

An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html