Bug 837889 - Watchdog detected hard LOCKUP on cpu <any>
Summary: Watchdog detected hard LOCKUP on cpu <any>
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 17
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-07-05 17:54 UTC by joshua
Modified: 2012-07-06 17:03 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-07-06 17:03:49 UTC
Type: Bug
Embargoed:


Attachments
partial dmesg output (2.56 KB, text/plain)
2012-07-05 17:55 UTC, joshua
full dmesg output (245.74 KB, text/plain)
2012-07-05 22:00 UTC, joshua

Description joshua 2012-07-05 17:54:47 UTC
Description of problem:

We are running fahclient-7.1.52-1.x86_64.rpm downloaded from https://fah-web.stanford.edu/projects/FAHClient/wiki/BetaRelease ... it pushes the CPUs to their limit.  We are wondering, though, whether the following is expected behavior:

# dmesg | grep LOCKUP
[   93.310472] Watchdog detected hard LOCKUP on cpu 6
[  106.896147] Watchdog detected hard LOCKUP on cpu 22
[  244.927592] Watchdog detected hard LOCKUP on cpu 29
[  258.348613] Watchdog detected hard LOCKUP on cpu 13
[  432.637392] Watchdog detected hard LOCKUP on cpu 22
[  471.441122] Watchdog detected hard LOCKUP on cpu 6
[  588.473065] Watchdog detected hard LOCKUP on cpu 13
[  639.885548] Watchdog detected hard LOCKUP on cpu 29
[  771.974199] Watchdog detected hard LOCKUP on cpu 22
[  810.430084] Watchdog detected hard LOCKUP on cpu 6
[  884.192560] Watchdog detected hard LOCKUP on cpu 13
[ 1002.166988] Watchdog detected hard LOCKUP on cpu 29
[ 1100.792232] Watchdog detected hard LOCKUP on cpu 22
[ 1139.488615] Watchdog detected hard LOCKUP on cpu 6
[ 1189.393973] Watchdog detected hard LOCKUP on cpu 13
[ 1345.850268] Watchdog detected hard LOCKUP on cpu 29
[ 1456.170354] Watchdog detected hard LOCKUP on cpu 22
[ 1487.704738] Watchdog detected hard LOCKUP on cpu 13
[ 1519.448265] Watchdog detected hard LOCKUP on cpu 6
etc

We also have a lot of this:

<snip>
[ 1519.448343]  [<ffffffff815f4782>] __atomic_notifier_call_chain+0x12/0x20
[ 1519.448346]  [<ffffffff815f47a6>] atomic_notifier_call_chain+0x16/0x20
[ 1519.448348]  [<ffffffff8102b3d5>] mce_log+0x25/0x170
[ 1519.448351]  [<ffffffff8102c53b>] mce_log_therm_throt_event+0x2b/0x40
[ 1519.448354]  [<ffffffff8102daf8>] intel_thermal_interrupt+0x178/0x1a0
[ 1519.448357]  [<ffffffff8102db71>] smp_thermal_interrupt+0x21/0x40
[ 1519.448360]  [<ffffffff815f97ba>] thermal_interrupt+0x6a/0x70
[ 1519.448361]  <EOI>  [<ffffffff810cf5f5>] ? __audit_syscall_exit+0x65/0x450
[ 1519.448368]  [<ffffffff815f8a90>] sysret_audit+0x17/0x21
[ 1519.448369] ---[ end trace 28bd6fe5c8656efc ]---
[ 1528.234694] CPU6: Package temperature above threshold, cpu clock throttled (total events = 237843)
[ 1528.340327] CPU6: Package temperature/speed normal

... and then we see SATA write errors.


Is it the BIOS/motherboard or the kernel that is throttling the CPUs?  If it is the kernel, is this a bug?

Comment 1 joshua 2012-07-05 17:55:22 UTC
We are using kernel 3.4.3-1.fc17.x86_64

Comment 2 joshua 2012-07-05 17:55:54 UTC
Created attachment 596463 [details]
partial dmesg output

Comment 3 Josh Boyer 2012-07-05 18:29:51 UTC
Don, Mauro, any thoughts here?

Comment 4 Don Zickus 2012-07-05 19:52:45 UTC
Hi,

I would need to see the dmesg output (or /var/log/messages) from the beginning.  From what I can see in the dmesg snippet above, a thermal event happened on the Intel Xeon (Sandy Bridge processors).  The mce_log code kicked in to service it.  During that service, rcu_read_lock blocked inside an irq routine for a very long time and caused the watchdog to time out and panic.

The question is why rcu_read_lock is blocking.  Probably because of the multiple events happening, but that doesn't explain why the first one hung, which is why I would need to see the whole dmesg log.  These snippets seem to be residual lockups.

Cheers,
Don

Comment 5 Mauro Carvalho Chehab 2012-07-05 20:21:32 UTC
Thermal events may be generating SMI calls.  A known issue is that some BIOSes take too long to handle those events.  Maybe that is the case here: while the mcelog handler routine was written to be fast, if the BIOS adds too much delay in handling the event, the chance that a subsequent NMI IRQ happens in the meantime increases.

I suggest measuring how much time SMI spends handling such events.  While I have never used them, I know the real-time people have some tools to measure SMI latency noise.
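
A quick, hedged sanity check (not something attached to this bug) is to watch the SMI counter directly.  The sketch below assumes an Intel CPU that implements MSR_SMI_COUNT (MSR 0x34, Nehalem and later) and that the msr module has been loaded so /dev/cpu/0/msr exists; read it before and after a burst of load, and a large jump points at heavy SMI activity:

/* Hedged sketch: read MSR_SMI_COUNT (0x34) through the msr driver.
 * Assumes an Intel CPU that implements this MSR (Nehalem and later)
 * and that "modprobe msr" has been run.
 * Build with:  cc -o smi_count smi_count.c   (run as root) */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	uint64_t smi_count;
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	if (fd < 0) {
		perror("open /dev/cpu/0/msr");
		return 1;
	}
	if (pread(fd, &smi_count, sizeof(smi_count), 0x34) != sizeof(smi_count)) {
		perror("pread MSR 0x34");
		close(fd);
		return 1;
	}
	close(fd);

	/* Compare two readings taken before and after a burst of load;
	 * a large jump suggests heavy SMI activity. */
	printf("SMI count on cpu0: %llu\n", (unsigned long long)smi_count);
	return 0;
}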

Comment 6 joshua 2012-07-05 22:00:32 UTC
Created attachment 596525 [details]
full dmesg output

Comment 7 joshua 2012-07-05 22:02:03 UTC
Uploaded full dmesg output.  Not sure if this is relevant, but this machine has a boat-load of NICs in it.

Comment 8 Don Zickus 2012-07-06 03:31:39 UTC
After looking at the dmesg log, it seems that most events recovered except for a few like the 'Core temperature' ones.  And looking through the code, the handler doesn't seem to have any slow paths.  I wouldn't be surprised if SMI was involved.

The Real Time folks' tools were just guessing at where SMIs were, last I heard.  They scattered calls like rdtsc around known fastpaths, and if things took longer than expected, they assumed it was an SMI and reported the latency.

We could do the same here.  I think we could place rdtsc calls at the beginning and end of, say, intel_thermal_interrupt and report any difference greater than 1 second.  That would tell us instantly.
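
For reference, a minimal userspace sketch of the same rdtsc-delta idea is below; it only illustrates the technique, it is not the kernel patch discussed above, and the hard-coded threshold assumes a roughly 3 GHz invariant TSC:

/* Userspace illustration only, not the proposed kernel patch: spin
 * reading the TSC, and if two consecutive reads are further apart than
 * the threshold, something (an SMI, NMI, or ordinary interrupt) stole
 * the CPU in between.  Pin the process to one CPU for meaningful numbers. */
#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc_now(void)
{
	uint32_t lo, hi;

	__asm__ volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	const uint64_t threshold = 300000;	/* ~100 us at an assumed 3 GHz TSC */
	uint64_t prev = rdtsc_now();

	for (;;) {
		uint64_t now = rdtsc_now();

		if (now - prev > threshold)
			printf("gap of %llu TSC ticks (SMI/interrupt?)\n",
			       (unsigned long long)(now - prev));
		prev = now;
	}
	return 0;
}

Compiled with cc -O2 and pinned to one CPU (for example with taskset -c 6), it prints a line whenever something, most likely an SMI or an interrupt, steals the CPU for longer than the threshold.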

I'll see if I can dig up a Fedora tree to create a patch and then try and run something through koji (though I don't think I ever tried building a kernel on koji).

Cheers,
Don

Comment 9 joshua 2012-07-06 17:03:49 UTC
Hold up, guys: it seems to be a hardware issue.  We have several other identical servers that don't have this problem when put under the exact same load, and this one had its heat sink installed improperly.

Sorry about that!  So this is what it looks like when it really is a thermal issue.

