Bug 885351 - Watchdog detected hard LOCKUP
Summary: Watchdog detected hard LOCKUP
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 17
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2012-12-08 19:49 UTC by r3obh
Modified: 2013-06-17 18:29 UTC (History)
7 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2013-06-17 18:29:40 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
dmesg log (114.82 KB, text/x-log)
2012-12-09 21:05 UTC, r3obh
no flags Details
dmesg output showing LOCKUP (150.86 KB, text/x-log)
2013-01-10 20:31 UTC, r3obh
no flags Details
Screenshot (422.17 KB, image/jpeg)
2013-01-11 01:16 UTC, r3obh
no flags Details
Non stop lockups (250.35 KB, text/x-log)
2013-01-27 20:47 UTC, r3obh
no flags Details
Log of crash just now. (20.92 KB, text/x-log)
2013-01-27 20:49 UTC, r3obh
no flags Details
Another crash log. (20.92 KB, text/x-log)
2013-01-28 03:32 UTC, r3obh
no flags Details
And another... (20.91 KB, text/x-log)
2013-01-28 03:33 UTC, r3obh
no flags Details
Crash 'log' output (247.96 KB, text/plain)
2013-01-28 16:55 UTC, r3obh
no flags Details
Crash 'log' output from second vmcore (117.53 KB, text/plain)
2013-01-28 16:56 UTC, r3obh
no flags Details
Crash 'log' output from third vmcore (210.23 KB, text/plain)
2013-01-28 16:57 UTC, r3obh
no flags Details

Description r3obh 2012-12-08 19:49:06 UTC
Description of problem:
Repeated warnings like this in dmesg output, always on cpu 5 or 17 (which are the same core 2 of processor 1).  Seems to correlate with processes hanging (users complaining of "locking up" and similar).

[140242.459997] ------------[ cut here ]------------
[140242.460008] WARNING: at kernel/watchdog.c:242 watchdog_overflow_callback+0x9a/0xc0()
[140242.460009] Hardware name: ProLiant DL380 G7
[140242.460010] Watchdog detected hard LOCKUP on cpu 5
[140242.460011] Modules linked in: nfsv3 nfsv4 nfs dns_resolver fscache bnep bluetooth rfkill ip6t_REJECT xt_pkttype nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ip6table_filter xt_addrtype ip6_tables binfmt_misc sfc iTCO_wdt iTCO_vendor_support bnx2 i7core_edac edac_core hpilo lpc_ich mfd_core mdio coretemp kvm_intel kvm acpi_power_meter microcode serio_raw hpwdt uinput nfsd auth_rpcgss nfs_acl lockd sunrpc crc32c_intel ghash_clmulni_intel hpsa radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core
[140242.460040] Pid: 6559, comm: sort Tainted: G        W    3.6.9-2.fc17.x86_64 #1
[140242.460041] Call Trace:
[140242.460043]  <NMI>  [<ffffffff8105c8ef>] warn_slowpath_common+0x7f/0xc0
[140242.460051]  [<ffffffff8105c9e6>] warn_slowpath_fmt+0x46/0x50
[140242.460053]  [<ffffffff810e7a80>] ? touch_nmi_watchdog+0x80/0x80
[140242.460055]  [<ffffffff810e7b1a>] watchdog_overflow_callback+0x9a/0xc0
[140242.460059]  [<ffffffff81122c1d>] __perf_event_overflow+0x9d/0x230
[140242.460063]  [<ffffffff81025057>] ? x86_perf_event_set_period+0xd7/0x160
[140242.460065]  [<ffffffff811237f4>] perf_event_overflow+0x14/0x20
[140242.460068]  [<ffffffff8102a603>] intel_pmu_handle_irq+0x193/0x310
[140242.460073]  [<ffffffff81620aed>] perf_event_nmi_handler+0x1d/0x20
[140242.460075]  [<ffffffff816202a1>] nmi_handle.isra.0+0x51/0x80
[140242.460077]  [<ffffffff816203b0>] do_nmi+0xe0/0x350
[140242.460078]  [<ffffffff8161f8c0>] end_repeat_nmi+0x1e/0x2e
[140242.460080]  <<EOE>> 
[140242.460081] ---[ end trace 64b7725d35e6a9e9 ]---


Version-Release number of selected component (if applicable):
Running kernel-3.6.9-2.fc17.x86_64, but this may be a BIOS or hardware issue for all I know.

How reproducible:
Always on one machine, never on another almost identical one (different RAM and network cards).

Additional info:
Processors are Xeon X5660 @ 2.8GHz, "stepping 2, microcode 0x14" sayeth /proc/cpuinfo.

Comment 1 r3obh 2012-12-09 21:05:10 UTC
Created attachment 660438 [details]
dmesg log

Comment 2 r3obh 2012-12-09 21:15:00 UTC
Note this machine has a Solarflare card and 64 lockups so far, whereas the similar one has no such card and zero lockups.  Perhaps the driver module is the problem?

Comment 3 r3obh 2012-12-10 13:36:58 UTC
Sfc module definitely a suspect.  Did 'rmmod sfc' yesterday and no more lock-ups since then:

# dmesg | tail -32
[223779.374854] ---[ end trace 64b7725d35e6a9ee ]---
[239091.928539] ------------[ cut here ]------------
[239091.928550] WARNING: at kernel/watchdog.c:242 watchdog_overflow_callback+0x9a/0xc0()
[239091.928551] Hardware name: ProLiant DL380 G7
[239091.928552] Watchdog detected hard LOCKUP on cpu 17
[239091.928553] Modules linked in: nfsv3 nfsv4 nfs dns_resolver fscache bnep bluetooth rfkill ip6t_REJECT xt_pkttype nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ip6table_filter xt_addrtype ip6_tables binfmt_misc sfc iTCO_wdt iTCO_vendor_support bnx2 i7core_edac edac_core hpilo lpc_ich mfd_core mdio coretemp kvm_intel kvm acpi_power_meter microcode serio_raw hpwdt uinput nfsd auth_rpcgss nfs_acl lockd sunrpc crc32c_intel ghash_clmulni_intel hpsa radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core
[239091.928581] Pid: 0, comm: swapper/17 Tainted: G        W    3.6.9-2.fc17.x86_64 #1
[239091.928582] Call Trace:
[239091.928584]  <NMI>  [<ffffffff8105c8ef>] warn_slowpath_common+0x7f/0xc0
[239091.928592]  [<ffffffff8105c9e6>] warn_slowpath_fmt+0x46/0x50
[239091.928594]  [<ffffffff810e7a80>] ? touch_nmi_watchdog+0x80/0x80
[239091.928596]  [<ffffffff810e7b1a>] watchdog_overflow_callback+0x9a/0xc0
[239091.928600]  [<ffffffff81122c1d>] __perf_event_overflow+0x9d/0x230
[239091.928604]  [<ffffffff81025057>] ? x86_perf_event_set_period+0xd7/0x160
[239091.928606]  [<ffffffff811237f4>] perf_event_overflow+0x14/0x20
[239091.928609]  [<ffffffff8102a603>] intel_pmu_handle_irq+0x193/0x310
[239091.928614]  [<ffffffff81620aed>] perf_event_nmi_handler+0x1d/0x20
[239091.928616]  [<ffffffff816202a1>] nmi_handle.isra.0+0x51/0x80
[239091.928618]  [<ffffffff816203b0>] do_nmi+0xe0/0x350
[239091.928620]  [<ffffffff8161f8c0>] end_repeat_nmi+0x1e/0x2e
[239091.928625]  [<ffffffff81040dd6>] ? read_hpet+0x16/0x20
[239091.928627]  [<ffffffff81040dd6>] ? read_hpet+0x16/0x20
[239091.928628]  [<ffffffff81040dd6>] ? read_hpet+0x16/0x20
[239091.928629]  <<EOE>>  [<ffffffff810ac14c>] getnstimeofday+0x4c/0xe0
[239091.928635]  [<ffffffff810ac1f6>] ktime_get_real+0x16/0x50
[239091.928639]  [<ffffffff81338953>] intel_idle+0xc3/0x150
[239091.928642]  [<ffffffff814cbb69>] cpuidle_enter+0x19/0x20
[239091.928644]  [<ffffffff814cc1f9>] cpuidle_idle_call+0xa9/0x250
[239091.928646]  [<ffffffff8101d52f>] cpu_idle+0xaf/0x120
[239091.928650]  [<ffffffff8160dbb4>] start_secondary+0x23e/0x240
[239091.928651] ---[ end trace 64b7725d35e6a9ef ]---
[249579.085720] Solarflare NET driver unloading

Comment 4 r3obh 2012-12-10 16:10:31 UTC
Dunno how to boost Priority or Severity in bugzilla, but this is fairly serious for us, at least.

Comment 5 r3obh 2012-12-10 23:09:01 UTC
Two more lockups today, even with sfc unloaded.  :(

Comment 6 Josh Boyer 2013-01-07 20:56:28 UTC
Can you attach the output of the NMI backtraces without the sfc module loaded?  

Also, both of your backtraces show there was some other warning that happened before the NMI.  Do you have a full dmesg showing all of them?

Comment 7 Don Zickus 2013-01-08 15:33:54 UTC
What seems strange is there is no backtrace beyond the NMI exception frame.  Usually there is a stack which was interrupted by the NMI hardlock up.

I get the impression this might be a nested NMI that just repeats over and over until the lockup occurs.

Can you provide the output of 'cat /proc/interrupts|grep NMI'? Provide a couple of those outputs over time, hopefully before a lockup and after.

As for your other comment about still seeing lockups with sfc unloaded can you blacklist the driver instead and reboot.  I am worried that unloading the driver left the firmware in a funny state and it still might fire random NMIs due to some pci walking or some other accidental poking of the device.

Cheers,
Don

Comment 8 r3obh 2013-01-08 16:13:00 UTC
Hi,

The full dmesg log is attached above and the first lockup shows the kernel "Not tainted".

We swapped around some hardware and now I can't reproduce the problem, so I'm going to mark this as closed.

Thank you.

Comment 9 r3obh 2013-01-10 20:31:06 UTC
Created attachment 676529 [details]
dmesg output showing LOCKUP

Lockup happened again, with machine under decent load (80+).
No solarflare card or driver in sight.

Comment 10 r3obh 2013-01-10 20:33:49 UTC
Reopened due to re-occurence, new dmesg log attached.

Comment 11 Don Zickus 2013-01-10 21:49:51 UTC
Hi,

I am wondering if your system is corrupted somehow because we should be able to see backtraces on the cpus.  Only 2 cpus seem to have valid backtraces.

Do you know how to setup kdump?  That might help for this case.

Cheers,
Don

Comment 12 r3obh 2013-01-11 01:16:11 UTC
Created attachment 676593 [details]
Screenshot

I installed kexec-tools and this happened soon after (/var/crash is empty).

Comment 13 Don Zickus 2013-01-11 14:03:38 UTC
The hpwdt timer went off and panic'd the system.  That driver still has some quirks associated with it that are fixed in 3.7/3.8 I believe.  Either that or an unknown NMI happened.

I believe you still have to add 'crashkernel=128M' to the kernel command line and reboot the machine to pick that up for kdump to start working (it needs to allocate memory for the second kernel).

Cheers,
Don

Comment 14 r3obh 2013-01-22 09:42:24 UTC
I got a crash dump (kernel-3.7.2-204.fc18.x86_64).  Now what?

$ ls -lh /var/crash/127.0.0.1-2013.01.21-20:50:38/
total 1.2G
-rw-------. 1 root root    6 Jan 22 04:33 analyzer
-rw-------. 1 root root    6 Jan 22 04:33 component
-rw-------. 1 root root   10 Jan 22 04:33 time
-rw-------. 1 root root    6 Jan 22 04:33 type
-rw-------. 1 root root    1 Jan 22 04:33 uid
-rw-------. 1 root root   40 Jan 22 04:33 uuid
-rw-------. 1 root root 1.2G Jan 21 15:52 vmcore

Comment 15 Don Zickus 2013-01-22 16:08:21 UTC
Hi,

Can you add the following to /etc/kdump.conf

core_collector makedumpfile -c --message-level 1 -d 31

that 1.2G vmcore file size is huge.  The above command should shrink that to less than 100MB I believe (unless you have a TB of memory??).

Then you need to install the kernel-debuginfo and crash package.  This will bring in the vmlinux which is used as input into the crash utility.

Once you have done all that, run the following

#crash /usr/lib/debug/lib/modules/`uname -r`/vmlinux <path to vmcore>

inside crash run 'bt -a' and attach that output.

Let's see what crash thinks the backtrace is.

Cheers,
Don

Comment 16 r3obh 2013-01-27 20:47:57 UTC
Created attachment 688611 [details]
Non stop lockups

Comment 17 r3obh 2013-01-27 20:49:07 UTC
Created attachment 688612 [details]
Log of crash just now.

# ls -lh /var/crash/127.0.0.1-2013.01.27-19:57:27/
total 236M
-rw-------. 1 root root    6 Jan 27 15:02 analyzer
-rw-------. 1 root root    6 Jan 27 15:02 component
-rw-------. 1 root root   10 Jan 27 15:02 time
-rw-------. 1 root root    6 Jan 27 15:02 type
-rw-------. 1 root root    1 Jan 27 15:02 uid
-rw-------. 1 root root   40 Jan 27 15:02 uuid
-rw-------. 1 root root 236M Jan 27 14:58 vmcore

Comment 18 r3obh 2013-01-28 03:32:41 UTC
Created attachment 688729 [details]
Another crash log.

Comment 19 r3obh 2013-01-28 03:33:41 UTC
Created attachment 688730 [details]
And another...

Comment 20 Don Zickus 2013-01-28 15:51:35 UTC
Hi,

Thanks for the logs.  It seems like the hardlockups are coming while executing in userspace.  Which is a little odd.

Can you run the 'log' command under crash and attach that output here?

Thanks,
Don

Comment 21 Don Zickus 2013-01-28 16:15:48 UTC
Also, do you have kernel-tools installed?  I want to make sure cpuspeed isn't running.

If kernel-tools is installed, you might need to do a 'systemctl stop cpupower.service' or something like that.  This is to make sure the watchdog kicks after 10 seconds instead of minutes (because of cpu speed throttling).

Cheers,
Don

Comment 22 r3obh 2013-01-28 16:55:43 UTC
Created attachment 689121 [details]
Crash 'log' output

Kernel-tools is installed but cpupower doesn't seem to be running.

# systemctl status cpupower.service
cpupower.service - Configure CPU power related settings
	  Loaded: loaded (/usr/lib/systemd/system/cpupower.service; disabled)
	  Active: inactive (dead)


I 'stop'ped the service and 'remove'd the package now.

Crash 'log' output attached:

( echo log; echo quit ) | crash /usr/lib/debug/lib/modules/3.7.4-204.fc18.x86_64/vmlinux /var/crash/127.0.0.1-2013.01.27-19:57:27/vmcore | tee /tmp/log1.txt

Comment 23 r3obh 2013-01-28 16:56:44 UTC
Created attachment 689122 [details]
Crash 'log' output from second vmcore

Output from:

( echo log; echo quit ) | crash /usr/lib/debug/lib/modules/3.7.4-204.fc18.x86_64/vmlinux /var/crash/127.0.0.1-2013.01.27-22:19:44//vmcore | tee /tmp/log2.txt

Comment 24 r3obh 2013-01-28 16:57:43 UTC
Created attachment 689123 [details]
Crash 'log' output from third vmcore

Output from:

( echo log; echo quit ) | crash /usr/lib/debug/lib/modules/3.7.4-204.fc18.x86_64/vmlinux /var/crash/127.0.0.1-2013.01.28-03:14:42//vmcore | tee /tmp/log3.txt

Comment 25 Josh Boyer 2013-06-03 19:08:50 UTC
Are you still seeing this with the 3.9.4 kernel in updates-testing?

Comment 26 Josh Boyer 2013-06-17 18:29:40 UTC
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 2 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.


Note You need to log in before you can comment on or make changes to this bug.