Bug 461671
Summary: | [RHEL5] nmi: crash during kdump kernel boot
---|---
Product: | Red Hat Enterprise Linux 5
Component: | kernel
Version: | 5.4
Status: | CLOSED ERRATA
Severity: | medium
Priority: | medium
Target Milestone: | beta
Hardware: | All
OS: | Linux
Doc Type: | Bug Fix
Reporter: | Aristeu Rozanski <arozansk>
Assignee: | Aristeu Rozanski <arozansk>
QA Contact: | Martin Jenner <mjenner>
CC: | anderson, dzickus, lwang, qcai
Last Closed: | 2009-01-20 20:07:56 UTC

Description
Aristeu Rozanski
2008-09-09 19:45:15 UTC
The theory behind this is that an NMI from the first kernel's NMI watchdog was delivered right when the NMI watchdog of the kdump kernel was being initialized:

 [<ffffffff8006587e>] do_nmi+0x43/0x61
 [<ffffffff8007f648>] setup_p4_watchdog+0xca/0xe9
 <<EOE>> [<ffffffff8007f30c>] lapic_nmi_watchdog_init+0x1b/0x3c
 [<ffffffff80077518>] setup_apic_nmi_watchdog+0x42/0x8a

The problem is that:

void setup_apic_nmi_watchdog(void)
{
	if (__get_cpu_var(wd_enabled) == 1)
		return;

	switch (nmi_watchdog) {
	case NMI_LOCAL_APIC:
		__get_cpu_var(wd_enabled) = 1;
		if (lapic_watchdog_init(nmi_hz) < 0) {
			__get_cpu_var(wd_enabled) = 0;
			return;
		}

and

int __kprobes nmi_watchdog_tick(struct pt_regs * regs, unsigned reason)
{
	int sum, touched = 0, rc = 0;
(...)
	/* see if the nmi watchdog went off */
	if (!__get_cpu_var(wd_enabled))
		return rc;
	switch (nmi_watchdog) {
	case NMI_LOCAL_APIC:
		rc |= lapic_wd_event(nmi_hz);
		break;

So, the NMI arrived between setting wd_enabled to 1 for that processor and actually configuring the watchdog. It was therefore detected as an NMI watchdog NMI, and p4_rearm() ended up being called, causing the crash in write_watchdog_counter(). Another possibility is that setup_p4_watchdog triggered the NMI itself while setting the registers.
Managed to reproduce it, full oops:

general protection fault: 0000 [1] SMP
last sysfs file:
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.18-109.el5.nmi2 #1
RIP: 0010:[<ffffffff8007f394>] [<ffffffff8007f394>] write_watchdog_counter+0x64/0x6a
RSP: 0000:ffffffff8042fe78 EFLAGS: 00010093
RAX: 00000000ffd56125 RBX: 00000000ffffffff RCX: 0000000000000000
RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff802fbbdc
RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000003e
R10: ffffffff803e1520 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000030 R14: 000000000000030c R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffffffff803b4000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b8838c85000 CR3: 0000000001001000 CR4: 00000000000006e0
Process swapper (pid: 1, threadinfo ffff81003fe68000, task ffff81003fe5d7a0)
Stack: 0000000000000000 0000000000000030 ffffffff8042ff58 ffffffff8007f0d6
 0000000000000001 ffffffff80065a5e ffffffff8042ff58 ffffffff8029b53b
 0000000000000030 0000000200000002 0000000000000000 0000000000000030
Call Trace:
 <NMI> [<ffffffff8007f0d6>] lapic_wd_event+0x38/0x3f
 [<ffffffff80065a5e>] nmi_watchdog_tick+0x1c2/0x1d3
 [<ffffffff80065611>] default_do_nmi+0x81/0x225
 [<ffffffff8006587e>] do_nmi+0x43/0x61
 [<ffffffff80064ed7>] nmi+0x7f/0x88
 [<ffffffff8007f6c2>] setup_p4_watchdog+0xca/0xe9
 <<EOE>> [<ffffffff8007f1e9>] lapic_watchdog_init+0x1b/0x3c
 [<ffffffff80077518>] setup_apic_nmi_watchdog+0x42/0x8a
 [<ffffffff80076d84>] setup_local_APIC+0x17b/0x187
 [<ffffffff803f93b2>] smp_prepare_cpus+0x34b/0x361
 [<ffffffff803ef8bf>] init+0x62/0x2f7
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff80170444>] acpi_ds_init_one_object+0x0/0x80
 [<ffffffff803ef85d>] init+0x0/0x2f7
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

The fault is caused by wd->perfctr_msr being zero.

Another thing worth noting: the system may have N CPUs, all of them with the performance counters in use and generating NMIs.
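The ordering problem described in this report can be modelled in user space. The following is a minimal single-threaded sketch, not kernel code and not necessarily what the backported patch does: struct wd_state, fake_nmi(), setup_flag_first(), setup_flag_last() and check_orderings() are all hypothetical names, the stray NMI is simulated by a plain function call in the middle of setup, and 0x30c is just a placeholder non-zero MSR number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the per-CPU watchdog state (not the kernel struct). */
struct wd_state {
	bool wd_enabled;          /* models __get_cpu_var(wd_enabled) */
	unsigned int perfctr_msr; /* stays 0 until the watchdog is configured */
};

enum nmi_result { NMI_IGNORED, NMI_REARMED, NMI_FAULT };

/* Models nmi_watchdog_tick(): only rearm if the watchdog claims to be enabled. */
static enum nmi_result fake_nmi(const struct wd_state *wd)
{
	if (!wd->wd_enabled)
		return NMI_IGNORED;
	/* Rearming with perfctr_msr == 0 is the write that faults in the oops. */
	return wd->perfctr_msr == 0 ? NMI_FAULT : NMI_REARMED;
}

/* Ordering from the report: flag set first, counter configured afterwards. */
static enum nmi_result setup_flag_first(struct wd_state *wd)
{
	wd->wd_enabled = true;
	enum nmi_result r = fake_nmi(wd); /* simulated stray NMI lands here */
	wd->perfctr_msr = 0x30c;          /* placeholder non-zero MSR number */
	return r;
}

/* Assumed alternative: configure the counter first, publish the flag last. */
static enum nmi_result setup_flag_last(struct wd_state *wd)
{
	enum nmi_result r = fake_nmi(wd); /* simulated stray NMI lands here */
	wd->perfctr_msr = 0x30c;          /* placeholder non-zero MSR number */
	wd->wd_enabled = true;
	return r;
}

/* Runs both orderings; call this from main() to try the model. */
static void check_orderings(void)
{
	struct wd_state a = { false, 0 };
	struct wd_state b = { false, 0 };

	assert(setup_flag_first(&a) == NMI_FAULT);  /* the reported crash */
	assert(setup_flag_last(&b) == NMI_IGNORED); /* stray NMI not rearmed */
	assert(fake_nmi(&b) == NMI_REARMED);        /* later NMIs work normally */
	puts("model behaves as described");
}
```

In this model the second ordering only keeps the stray NMI from reaching the rearm path; the NMI would then have to be handled as a non-watchdog NMI. It also does not cover the other possibility raised above, that setup_p4_watchdog itself triggers the NMI while setting the registers, and real kernel code would need the stores ordered explicitly rather than relying on program order.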
Easily reproducible on Dave's machine.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Created attachment 317419
backported patch

in kernel-2.6.18-117.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html