+++ This bug was initially created as a clone of Bug #461671 +++

Description of problem:
While a kdump kernel was being loaded, Dave Anderson's burn machine hit another
problem with the NMI watchdog patches, which is also present upstream.

Dump copied from the screen:
 <NMI>  [<ffffffff8009008d>] panic+0x1da/0x1eb
 [<ffffffff8019e996>] do_unblank_screen+0x1b/0x132
 [<ffffffff80064fe2>] oops_end+0x51/0x53
 [<ffffffff8006b903>] die+0x3a/0x44
 [<ffffffff80065587>] do_general_protection+0xfe/0x107
 [<ffffffff8005dde9>] error_exit+0x0/0x84
 [<ffffffff8007ef75>] write_watchdog_counter+0x2d/0x31
 [<ffffffff8007f1f9>] lapic_wd_event+0x30/0x3f
 [<ffffffff80065a5e>] nmi_watchdog_tick+0x1c2/0x1d3
 [<ffffffff80065611>] default_do_nmi+0x81/0x225
 [<ffffffff8006587e>] do_nmi+0x43/0x61
 [<ffffffff8007f648>] setup_p4_watchdog+0xca/0xe9
 <<EOE>>  [<ffffffff8007f30c>] lapic_nmi_watchdog_init+0x1b/0x3c
 [<ffffffff80077518>] setup_apic_nmi_watchdog+0x42/0x8a
 [<ffffffff80076d84>] setup_local_APIC+0x17b/0x187
 [<ffffffff803f93b2>] smp_prepare_cpus+0x34b/0x361
 [<ffffffff803ef8bf>] init+0x62/0x2f7
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff801703ca>] acpi_ds_init_one_object+0x0/0x80
 [<ffffffff803ef85d>] init+0x0/0x2f7
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

(original photo attached)

--- Additional comment from arozansk on 2008-09-09 17:03:20 EDT ---

The theory behind this is that an NMI from the first kernel's NMI watchdog was
delivered right when the NMI watchdog of the kdump kernel was being initialized:

 [<ffffffff8006587e>] do_nmi+0x43/0x61
 [<ffffffff8007f648>] setup_p4_watchdog+0xca/0xe9
 <<EOE>>  [<ffffffff8007f30c>] lapic_nmi_watchdog_init+0x1b/0x3c
 [<ffffffff80077518>] setup_apic_nmi_watchdog+0x42/0x8a

The problem is that:

void setup_apic_nmi_watchdog(void)
{
        if (__get_cpu_var(wd_enabled) == 1)
                return;

        switch (nmi_watchdog) {
        case NMI_LOCAL_APIC:
                __get_cpu_var(wd_enabled) = 1;
                if (lapic_watchdog_init(nmi_hz) < 0) {
                        __get_cpu_var(wd_enabled) = 0;
                        return;
                }

and

int __kprobes nmi_watchdog_tick(struct pt_regs * regs, unsigned reason)
{
        int sum, touched = 0, rc = 0;
        (...)
        /* see if the nmi watchdog went off */
        if (!__get_cpu_var(wd_enabled))
                return rc;
        switch (nmi_watchdog) {
        case NMI_LOCAL_APIC:
                rc |= lapic_wd_event(nmi_hz);
                break;

So the NMI arrived between setting wd_enabled to 1 for that processor and
actually configuring the watchdog. It was therefore detected as an NMI watchdog
NMI, and p4_rearm() ended up being called, causing the crash in
write_watchdog_counter().

Another possibility is that setup_p4_watchdog() itself triggered the NMI while
setting the registers.

--- Additional comment from arozansk on 2008-09-10 17:33:19 EDT ---

Managed to reproduce it. Full oops:

general protection fault: 0000 [1] SMP
last sysfs file:
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.18-109.el5.nmi2 #1
RIP: 0010:[<ffffffff8007f394>]  [<ffffffff8007f394>] write_watchdog_counter+0x64/0x6a
RSP: 0000:ffffffff8042fe78  EFLAGS: 00010093
RAX: 00000000ffd56125 RBX: 00000000ffffffff RCX: 0000000000000000
RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff802fbbdc
RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000003e
R10: ffffffff803e1520 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000030 R14: 000000000000030c R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff803b4000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b8838c85000 CR3: 0000000001001000 CR4: 00000000000006e0
Process swapper (pid: 1, threadinfo ffff81003fe68000, task ffff81003fe5d7a0)
Stack:  0000000000000000 0000000000000030 ffffffff8042ff58 ffffffff8007f0d6
 0000000000000001 ffffffff80065a5e ffffffff8042ff58 ffffffff8029b53b
 0000000000000030 0000000200000002 0000000000000000 0000000000000030
Call Trace:
 <NMI>  [<ffffffff8007f0d6>] lapic_wd_event+0x38/0x3f
 [<ffffffff80065a5e>] nmi_watchdog_tick+0x1c2/0x1d3
 [<ffffffff80065611>] default_do_nmi+0x81/0x225
 [<ffffffff8006587e>] do_nmi+0x43/0x61
 [<ffffffff80064ed7>] nmi+0x7f/0x88
 [<ffffffff8007f6c2>] setup_p4_watchdog+0xca/0xe9
 <<EOE>>  [<ffffffff8007f1e9>] lapic_watchdog_init+0x1b/0x3c
 [<ffffffff80077518>] setup_apic_nmi_watchdog+0x42/0x8a
 [<ffffffff80076d84>] setup_local_APIC+0x17b/0x187
 [<ffffffff803f93b2>] smp_prepare_cpus+0x34b/0x361
 [<ffffffff803ef8bf>] init+0x62/0x2f7
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff80170444>] acpi_ds_init_one_object+0x0/0x80
 [<ffffffff803ef85d>] init+0x0/0x2f7
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

caused by wd->perfctr_msr being zero.

Another thing worth noting: the system may have N CPUs, all of them with
performance counters in use and generating NMIs.

Easily reproducible on Dave's machine.

----

Since the code is almost the same on RHEL-4, this needs to be fixed on RHEL-4
too.
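
The ordering problem described above can be illustrated with a small user-space model. This is only a sketch and not the attached patch: struct wd_model, model_nmi_tick(), setup_flag_first() and setup_configure_first() are hypothetical names made up for illustration, and a stray NMI is simulated by calling the handler directly at the point where it would land. It assumes, as the oops suggests, that rearming with perfctr_msr still zero is what faults.

```c
#include <stdbool.h>

/* Hypothetical model of the per-CPU watchdog state; not the kernel's types. */
struct wd_model {
    bool wd_enabled;          /* models the per-CPU wd_enabled flag */
    unsigned int perfctr_msr; /* models wd->perfctr_msr; 0 = not configured */
};

/*
 * Models nmi_watchdog_tick(): returns 1 if the handler would try to rearm
 * the counter while perfctr_msr is still 0 (the faulting case), else 0.
 */
static int model_nmi_tick(const struct wd_model *wd)
{
    if (!wd->wd_enabled)
        return 0;                 /* not treated as a watchdog NMI */
    return wd->perfctr_msr == 0;  /* rearm would write to MSR 0 */
}

/* Ordering from the quoted code: flag set first, counter configured later. */
static int setup_flag_first(struct wd_model *wd, unsigned int msr)
{
    int faulted;

    wd->wd_enabled = true;
    faulted = model_nmi_tick(wd); /* stray NMI lands inside the window */
    wd->perfctr_msr = msr;
    return faulted;
}

/* Assumed alternative ordering: configure first, publish the flag last. */
static int setup_configure_first(struct wd_model *wd, unsigned int msr)
{
    int faulted = 0;

    wd->perfctr_msr = msr;
    faulted |= model_nmi_tick(wd); /* NMI before the flag: ignored */
    wd->wd_enabled = true;
    faulted |= model_nmi_tick(wd); /* NMI after the flag: MSR is valid */
    return faulted;
}
```

With any non-zero placeholder MSR number, setup_flag_first() reports the faulting case and setup_configure_first() does not. A real fix would also have to deal with the failure path and with the possibility that programming the registers itself raises the NMI, which this model does not cover.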
Created attachment 317617
RHEL5 version of the patch
Since RHEL-4 won't boot another kernel, this is not an issue. Closing.