Description of problem:
Repeated warnings like this in dmesg output, always on cpu 5 or 17 (which are the two hyperthread siblings of core 2 on processor 1). Seems to correlate with processes hanging (users complaining of "locking up" and similar).

[140242.459997] ------------[ cut here ]------------
[140242.460008] WARNING: at kernel/watchdog.c:242 watchdog_overflow_callback+0x9a/0xc0()
[140242.460009] Hardware name: ProLiant DL380 G7
[140242.460010] Watchdog detected hard LOCKUP on cpu 5
[140242.460011] Modules linked in: nfsv3 nfsv4 nfs dns_resolver fscache bnep bluetooth rfkill ip6t_REJECT xt_pkttype nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ip6table_filter xt_addrtype ip6_tables binfmt_misc sfc iTCO_wdt iTCO_vendor_support bnx2 i7core_edac edac_core hpilo lpc_ich mfd_core mdio coretemp kvm_intel kvm acpi_power_meter microcode serio_raw hpwdt uinput nfsd auth_rpcgss nfs_acl lockd sunrpc crc32c_intel ghash_clmulni_intel hpsa radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core
[140242.460040] Pid: 6559, comm: sort Tainted: G W 3.6.9-2.fc17.x86_64 #1
[140242.460041] Call Trace:
[140242.460043]  <NMI>  [<ffffffff8105c8ef>] warn_slowpath_common+0x7f/0xc0
[140242.460051]  [<ffffffff8105c9e6>] warn_slowpath_fmt+0x46/0x50
[140242.460053]  [<ffffffff810e7a80>] ? touch_nmi_watchdog+0x80/0x80
[140242.460055]  [<ffffffff810e7b1a>] watchdog_overflow_callback+0x9a/0xc0
[140242.460059]  [<ffffffff81122c1d>] __perf_event_overflow+0x9d/0x230
[140242.460063]  [<ffffffff81025057>] ? x86_perf_event_set_period+0xd7/0x160
[140242.460065]  [<ffffffff811237f4>] perf_event_overflow+0x14/0x20
[140242.460068]  [<ffffffff8102a603>] intel_pmu_handle_irq+0x193/0x310
[140242.460073]  [<ffffffff81620aed>] perf_event_nmi_handler+0x1d/0x20
[140242.460075]  [<ffffffff816202a1>] nmi_handle.isra.0+0x51/0x80
[140242.460077]  [<ffffffff816203b0>] do_nmi+0xe0/0x350
[140242.460078]  [<ffffffff8161f8c0>] end_repeat_nmi+0x1e/0x2e
[140242.460080]  <<EOE>>
[140242.460081] ---[ end trace 64b7725d35e6a9e9 ]---

Version-Release number of selected component (if applicable):
Running kernel-3.6.9-2.fc17.x86_64, but this may be a BIOS or hardware issue for all I know.

How reproducible:
Always on one machine, never on another almost identical one (different RAM and network cards).

Additional info:
Processors are Xeon X5660 @ 2.8GHz, "stepping 2, microcode 0x14" sayeth /proc/cpuinfo.
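For reference, the cpu 5/17 pairing can be double-checked with something like this (just a sanity check; the paste trick assumes each cpu block lists processor, physical id, and core id in that order, which it does here):

# one row per logical cpu: processor number, physical package, core id
grep -E '^processor|^physical id|^core id' /proc/cpuinfo | paste - - -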
Created attachment 660438 [details] dmesg log
Note this machine has a Solarflare card and 64 lockups so far, whereas the similar one has no such card and zero lockups. Perhaps the driver module is the problem?
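One quick way to test that theory would be to see whether the card's interrupts are being serviced on cpus 5/17. Note the lines in /proc/interrupts are usually named after the interface rather than the module, so eth2 below is a stand-in for the Solarflare port's real name:

# show the CPU header row plus the Solarflare queues' per-cpu counts
grep -E 'CPU|eth2' /proc/interrupts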
Sfc module definitely a suspect. Did 'rmmod sfc' yesterday and no more lock-ups since then:

# dmesg | tail -32
[223779.374854] ---[ end trace 64b7725d35e6a9ee ]---
[239091.928539] ------------[ cut here ]------------
[239091.928550] WARNING: at kernel/watchdog.c:242 watchdog_overflow_callback+0x9a/0xc0()
[239091.928551] Hardware name: ProLiant DL380 G7
[239091.928552] Watchdog detected hard LOCKUP on cpu 17
[239091.928553] Modules linked in: nfsv3 nfsv4 nfs dns_resolver fscache bnep bluetooth rfkill ip6t_REJECT xt_pkttype nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ip6table_filter xt_addrtype ip6_tables binfmt_misc sfc iTCO_wdt iTCO_vendor_support bnx2 i7core_edac edac_core hpilo lpc_ich mfd_core mdio coretemp kvm_intel kvm acpi_power_meter microcode serio_raw hpwdt uinput nfsd auth_rpcgss nfs_acl lockd sunrpc crc32c_intel ghash_clmulni_intel hpsa radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core
[239091.928581] Pid: 0, comm: swapper/17 Tainted: G W 3.6.9-2.fc17.x86_64 #1
[239091.928582] Call Trace:
[239091.928584]  <NMI>  [<ffffffff8105c8ef>] warn_slowpath_common+0x7f/0xc0
[239091.928592]  [<ffffffff8105c9e6>] warn_slowpath_fmt+0x46/0x50
[239091.928594]  [<ffffffff810e7a80>] ? touch_nmi_watchdog+0x80/0x80
[239091.928596]  [<ffffffff810e7b1a>] watchdog_overflow_callback+0x9a/0xc0
[239091.928600]  [<ffffffff81122c1d>] __perf_event_overflow+0x9d/0x230
[239091.928604]  [<ffffffff81025057>] ? x86_perf_event_set_period+0xd7/0x160
[239091.928606]  [<ffffffff811237f4>] perf_event_overflow+0x14/0x20
[239091.928609]  [<ffffffff8102a603>] intel_pmu_handle_irq+0x193/0x310
[239091.928614]  [<ffffffff81620aed>] perf_event_nmi_handler+0x1d/0x20
[239091.928616]  [<ffffffff816202a1>] nmi_handle.isra.0+0x51/0x80
[239091.928618]  [<ffffffff816203b0>] do_nmi+0xe0/0x350
[239091.928620]  [<ffffffff8161f8c0>] end_repeat_nmi+0x1e/0x2e
[239091.928625]  [<ffffffff81040dd6>] ? read_hpet+0x16/0x20
[239091.928627]  [<ffffffff81040dd6>] ? read_hpet+0x16/0x20
[239091.928628]  [<ffffffff81040dd6>] ? read_hpet+0x16/0x20
[239091.928629]  <<EOE>>  [<ffffffff810ac14c>] getnstimeofday+0x4c/0xe0
[239091.928635]  [<ffffffff810ac1f6>] ktime_get_real+0x16/0x50
[239091.928639]  [<ffffffff81338953>] intel_idle+0xc3/0x150
[239091.928642]  [<ffffffff814cbb69>] cpuidle_enter+0x19/0x20
[239091.928644]  [<ffffffff814cc1f9>] cpuidle_idle_call+0xa9/0x250
[239091.928646]  [<ffffffff8101d52f>] cpu_idle+0xaf/0x120
[239091.928650]  [<ffffffff8160dbb4>] start_secondary+0x23e/0x240
[239091.928651] ---[ end trace 64b7725d35e6a9ef ]---
[249579.085720] Solarflare NET driver unloading
Dunno how to boost Priority or Severity in bugzilla, but this is fairly serious for us, at least.
Two more lockups today, even with sfc unloaded. :(
Can you attach the output of the NMI backtraces without the sfc module loaded? Also, both of your backtraces show there was some other warning that happened before the NMI. Do you have a full dmesg showing all of them?
What seems strange is there is no backtrace beyond the NMI exception frame. Usually there is a stack which was interrupted by the NMI hard lockup. I get the impression this might be a nested NMI that just repeats over and over until the lockup occurs.

Can you provide the output of 'cat /proc/interrupts | grep NMI'? Provide a couple of those outputs over time, hopefully before a lockup and after (a small loop like the sketch below would do).

As for your other comment about still seeing lockups with sfc unloaded, can you blacklist the driver instead and reboot? I am worried that unloading the driver left the firmware in a funny state and it still might fire random NMIs due to some pci walking or some other accidental poking of the device.

Cheers,
Don
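P.S. A throwaway loop like this would capture the counters over time (a rough sketch; the one-minute interval and the log path are arbitrary choices):

# append a timestamped snapshot of the per-cpu NMI counters every minute
while true; do
    { date; grep NMI /proc/interrupts; } >> /tmp/nmi-counts.log
    sleep 60
done &

And for the blacklist, the standard modprobe mechanism should do (the file name here is arbitrary):

echo "blacklist sfc" > /etc/modprobe.d/sfc-blacklist.conf    # then reboot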
Hi,

The full dmesg log is attached above and the first lockup shows the kernel "Not tainted". We swapped around some hardware and now I can't reproduce the problem, so I'm going to mark this as closed. Thank you.
Created attachment 676529 [details]
dmesg output showing LOCKUP

Lockup happened again, with the machine under decent load (load average 80+). No Solarflare card or driver in sight.
Reopened due to recurrence; new dmesg log attached.
Hi,

I am wondering if your system is corrupted somehow, because we should be able to see backtraces on all the cpus; only 2 cpus seem to have valid backtraces.

Do you know how to set up kdump? That might help for this case (rough steps below).

Cheers,
Don
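P.S. In case it helps, the usual Fedora steps are roughly (a sketch; details vary a little by release):

yum install kexec-tools           # provides the kdump capture service
systemctl enable kdump.service    # arm it on future boots

plus a crashkernel= memory reservation on the kernel command line.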
Created attachment 676593 [details]
Screenshot

I installed kexec-tools and this happened soon after (/var/crash is empty).
The hpwdt timer went off and panic'd the system. That driver still has some quirks associated with it that are fixed in 3.7/3.8, I believe. Either that or an unknown NMI happened.

I believe you still have to add 'crashkernel=128M' to the kernel command line and reboot the machine for kdump to start working (it needs to allocate memory for the second kernel). One way to do that is sketched below.

Cheers,
Don
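P.S. A sketch using grubby, so you don't have to edit the grub config by hand (bump the size if the capture kernel fails to load):

grubby --update-kernel=ALL --args="crashkernel=128M"    # then reboot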
I got a crash dump (kernel-3.7.2-204.fc18.x86_64). Now what?

$ ls -lh /var/crash/127.0.0.1-2013.01.21-20:50:38/
total 1.2G
-rw-------. 1 root root    6 Jan 22 04:33 analyzer
-rw-------. 1 root root    6 Jan 22 04:33 component
-rw-------. 1 root root   10 Jan 22 04:33 time
-rw-------. 1 root root    6 Jan 22 04:33 type
-rw-------. 1 root root    1 Jan 22 04:33 uid
-rw-------. 1 root root   40 Jan 22 04:33 uuid
-rw-------. 1 root root 1.2G Jan 21 15:52 vmcore
Hi,

Can you add the following to /etc/kdump.conf:

core_collector makedumpfile -c --message-level 1 -d 31

That 1.2G vmcore file is huge. The above setting should shrink it to less than 100MB, I believe (unless you have a TB of memory??).

Then you need to install the kernel-debuginfo and crash packages; that brings in the vmlinux which is used as input to the crash utility (concrete commands below). Once you have done all that, run the following:

# crash /usr/lib/debug/lib/modules/`uname -r`/vmlinux <path to vmcore>

Inside crash, run 'bt -a' and attach that output. Let's see what crash thinks the backtrace is.

Cheers,
Don
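P.S. The rough sequence, assuming stock Fedora packaging (debuginfo-install ships in yum-utils):

yum install crash yum-utils
debuginfo-install kernel            # pulls kernel-debuginfo with the matching vmlinux
systemctl restart kdump.service     # rebuild the kdump initrd with the new core_collector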
Created attachment 688611 [details] Non stop lockups
Created attachment 688612 [details]
Log of crash just now.

# ls -lh /var/crash/127.0.0.1-2013.01.27-19:57:27/
total 236M
-rw-------. 1 root root    6 Jan 27 15:02 analyzer
-rw-------. 1 root root    6 Jan 27 15:02 component
-rw-------. 1 root root   10 Jan 27 15:02 time
-rw-------. 1 root root    6 Jan 27 15:02 type
-rw-------. 1 root root    1 Jan 27 15:02 uid
-rw-------. 1 root root   40 Jan 27 15:02 uuid
-rw-------. 1 root root 236M Jan 27 14:58 vmcore
Created attachment 688729 [details] Another crash log.
Created attachment 688730 [details] And another...
Hi,

Thanks for the logs. It seems like the hard lockups are coming while executing in userspace, which is a little odd.

Can you run the 'log' command under crash and attach that output here?

Thanks,
Don
Also, do you have kernel-tools installed? I want to make sure cpuspeed isn't running. If kernel-tools is installed, you might need to do a 'systemctl stop cpupower.service' or something like that (exact commands below). This is to make sure the watchdog kicks in after 10 seconds instead of minutes (because of cpu speed throttling).

Cheers,
Don
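P.S. Concretely, something like this (assuming the Fedora unit name; frequency-info is just to confirm nothing is throttling afterwards):

systemctl stop cpupower.service
systemctl disable cpupower.service
cpupower frequency-info    # check the active governor and current frequency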
Created attachment 689121 [details]
Crash 'log' output

Kernel-tools is installed but cpupower doesn't seem to be running:

# systemctl status cpupower.service
cpupower.service - Configure CPU power related settings
   Loaded: loaded (/usr/lib/systemd/system/cpupower.service; disabled)
   Active: inactive (dead)

I 'stop'ped the service and 'remove'd the package now. Crash 'log' output attached, generated with:

( echo log; echo quit ) | crash /usr/lib/debug/lib/modules/3.7.4-204.fc18.x86_64/vmlinux /var/crash/127.0.0.1-2013.01.27-19:57:27/vmcore | tee /tmp/log1.txt
Created attachment 689122 [details]
Crash 'log' output from second vmcore

Output from:

( echo log; echo quit ) | crash /usr/lib/debug/lib/modules/3.7.4-204.fc18.x86_64/vmlinux /var/crash/127.0.0.1-2013.01.27-22:19:44/vmcore | tee /tmp/log2.txt
Created attachment 689123 [details]
Crash 'log' output from third vmcore

Output from:

( echo log; echo quit ) | crash /usr/lib/debug/lib/modules/3.7.4-204.fc18.x86_64/vmlinux /var/crash/127.0.0.1-2013.01.28-03:14:42/vmcore | tee /tmp/log3.txt
Are you still seeing this with the 3.9.4 kernel in updates-testing?
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 2 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.