Bug 837889 - Watchdog detected hard LOCKUP on cpu <any>
Watchdog detected hard LOCKUP on cpu <any>
Status: CLOSED NOTABUG
Product: Fedora
Classification: Fedora
Component: kernel (Show other bugs)
17
Unspecified Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: Kernel Maintainer List
Fedora Extras Quality Assurance
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2012-07-05 13:54 EDT by joshua
Modified: 2012-07-06 13:03 EDT (History)
7 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-07-06 13:03:49 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
partial dmesg output (2.56 KB, text/plain)
2012-07-05 13:55 EDT, joshua
no flags Details
full dmesg output (245.74 KB, text/plain)
2012-07-05 18:00 EDT, joshua
no flags Details

  None (edit)
Description joshua 2012-07-05 13:54:47 EDT
Description of problem:

We are running fahclient-7.1.52-1.x86_64.rpm downloaded from https://fah-web.stanford.edu/projects/FAHClient/wiki/BetaRelease ... it pushes CPUs to their limit.  We are wondering if the following though is expected behavior:

# dmesg | grep LOCKUP
[   93.310472] Watchdog detected hard LOCKUP on cpu 6
[  106.896147] Watchdog detected hard LOCKUP on cpu 22
[  244.927592] Watchdog detected hard LOCKUP on cpu 29
[  258.348613] Watchdog detected hard LOCKUP on cpu 13
[  432.637392] Watchdog detected hard LOCKUP on cpu 22
[  471.441122] Watchdog detected hard LOCKUP on cpu 6
[  588.473065] Watchdog detected hard LOCKUP on cpu 13
[  639.885548] Watchdog detected hard LOCKUP on cpu 29
[  771.974199] Watchdog detected hard LOCKUP on cpu 22
[  810.430084] Watchdog detected hard LOCKUP on cpu 6
[  884.192560] Watchdog detected hard LOCKUP on cpu 13
[ 1002.166988] Watchdog detected hard LOCKUP on cpu 29
[ 1100.792232] Watchdog detected hard LOCKUP on cpu 22
[ 1139.488615] Watchdog detected hard LOCKUP on cpu 6
[ 1189.393973] Watchdog detected hard LOCKUP on cpu 13
[ 1345.850268] Watchdog detected hard LOCKUP on cpu 29
[ 1456.170354] Watchdog detected hard LOCKUP on cpu 22
[ 1487.704738] Watchdog detected hard LOCKUP on cpu 13
[ 1519.448265] Watchdog detected hard LOCKUP on cpu 6
etc

We also have a lot of this:

<snip>
[ 1519.448343]  [<ffffffff815f4782>] __atomic_notifier_call_chain+0x12/0x20
[ 1519.448346]  [<ffffffff815f47a6>] atomic_notifier_call_chain+0x16/0x20
[ 1519.448348]  [<ffffffff8102b3d5>] mce_log+0x25/0x170
[ 1519.448351]  [<ffffffff8102c53b>] mce_log_therm_throt_event+0x2b/0x40
[ 1519.448354]  [<ffffffff8102daf8>] intel_thermal_interrupt+0x178/0x1a0
[ 1519.448357]  [<ffffffff8102db71>] smp_thermal_interrupt+0x21/0x40
[ 1519.448360]  [<ffffffff815f97ba>] thermal_interrupt+0x6a/0x70
[ 1519.448361]  <EOI>  [<ffffffff810cf5f5>] ? __audit_syscall_exit+0x65/0x450
[ 1519.448368]  [<ffffffff815f8a90>] sysret_audit+0x17/0x21
[ 1519.448369] ---[ end trace 28bd6fe5c8656efc ]---
[ 1528.234694] CPU6: Package temperature above threshold, cpu clock throttled (t
otal events = 237843)
[ 1528.340327] CPU6: Package temperature/speed normal

... then see SATA write errors.


Is this BIOS/Motherboard or the kernel throttling the CPUs?  If it is the kernel, is this a bug?
Comment 1 joshua 2012-07-05 13:55:22 EDT
We are using kernel 3.4.3-1.fc17.x86_64
Comment 2 joshua 2012-07-05 13:55:54 EDT
Created attachment 596463 [details]
partial dmesg output
Comment 3 Josh Boyer 2012-07-05 14:29:51 EDT
Don, Mauro, any thoughts here?
Comment 4 Don Zickus 2012-07-05 15:52:45 EDT
Hi,

I would need to the the dmesg output (or /var/log/messages) from the beginning.  From what I can see from the dmesg snippet above is that a thermal event happened on the Intel Xeon (running Sandy Bridge processors).  The mce_log kicked in to service it.  During that service the rcu_read_lock blocked inside an irq routine for a very long time and caused the watchdog to time out and panic.

The question is why is the rcu_read_lock blocking.  Probably because of the multiple events happening.  But that doesn't explain why the first one hung, which is why I would need to see the whole dmesg log.  These snippets seem to be residual lockups.

Cheers,
Don
Comment 5 Mauro Carvalho Chehab 2012-07-05 16:21:32 EDT
Thermal events may be generating SMI calls. A known issue is that some BIOSes take too long to handle those events. Maybe this is the case: while the mcelog
handler routine were written to be fast, if BIOS is adding too much delay to
handle it, the chances that a subsequent NMI IRQ could happen increases.

I suggest to measure the amount of time that SMI is spending to handle such events. While I never used it, I know real time people have some tools to measure
the SMI latency noise.
Comment 6 joshua 2012-07-05 18:00:32 EDT
Created attachment 596525 [details]
full dmesg output
Comment 7 joshua 2012-07-05 18:02:03 EDT
Uploaded full dmesg output.  Not sure if this is relevant, but this machine has a boat-load of NICs in it.
Comment 8 Don Zickus 2012-07-05 23:31:39 EDT
After looking at the dmesg log, it seems most events recovered except for a few like the 'Core temperature' ones.  And looking through the code, it doesn't seem to have any slowpaths.  I wouldn't be surprised if SMI was involved.

The Real Time folks tools were just guessing at where SMIs were last I heard. They scattered calls like rdtsc around known fastpaths and if things were longer than expected, they assumed it was an SMI and reported the latency.

We could do the same here.  I think if we place rdtsc's at the beginning and end of say intel_thermal_interrupt and report any differences greater than 1 second.  That would instantly tell us.

I'll see if I can dig up a Fedora tree to create a patch and then try and run something through koji (though I don't think I ever tried building a kernel on koji).

Cheers,
Don
Comment 9 joshua 2012-07-06 13:03:49 EDT
Hold up guys, it seems like a hardware issue.  We have several other identical servers that don't have this problem when put under the exact same load, and this one had it's heat sink installed improperly.

Sorry about that!  So this is what it looks like when it really is a thermal issue.

Note You need to log in before you can comment on or make changes to this bug.