Red Hat Bugzilla – Bug 837889
Watchdog detected hard LOCKUP on cpu <any>
Last modified: 2012-07-06 13:03:49 EDT
Description of problem:
We are running fahclient-7.1.52-1.x86_64.rpm downloaded from https://fah-web.stanford.edu/projects/FAHClient/wiki/BetaRelease ... it pushes CPUs to their limit. We are wondering whether the following is expected behavior:
# dmesg | grep LOCKUP
[ 93.310472] Watchdog detected hard LOCKUP on cpu 6
[ 106.896147] Watchdog detected hard LOCKUP on cpu 22
[ 244.927592] Watchdog detected hard LOCKUP on cpu 29
[ 258.348613] Watchdog detected hard LOCKUP on cpu 13
[ 432.637392] Watchdog detected hard LOCKUP on cpu 22
[ 471.441122] Watchdog detected hard LOCKUP on cpu 6
[ 588.473065] Watchdog detected hard LOCKUP on cpu 13
[ 639.885548] Watchdog detected hard LOCKUP on cpu 29
[ 771.974199] Watchdog detected hard LOCKUP on cpu 22
[ 810.430084] Watchdog detected hard LOCKUP on cpu 6
[ 884.192560] Watchdog detected hard LOCKUP on cpu 13
[ 1002.166988] Watchdog detected hard LOCKUP on cpu 29
[ 1100.792232] Watchdog detected hard LOCKUP on cpu 22
[ 1139.488615] Watchdog detected hard LOCKUP on cpu 6
[ 1189.393973] Watchdog detected hard LOCKUP on cpu 13
[ 1345.850268] Watchdog detected hard LOCKUP on cpu 29
[ 1456.170354] Watchdog detected hard LOCKUP on cpu 22
[ 1487.704738] Watchdog detected hard LOCKUP on cpu 13
[ 1519.448265] Watchdog detected hard LOCKUP on cpu 6
We also have a lot of this:
[ 1519.448343] [<ffffffff815f4782>] __atomic_notifier_call_chain+0x12/0x20
[ 1519.448346] [<ffffffff815f47a6>] atomic_notifier_call_chain+0x16/0x20
[ 1519.448348] [<ffffffff8102b3d5>] mce_log+0x25/0x170
[ 1519.448351] [<ffffffff8102c53b>] mce_log_therm_throt_event+0x2b/0x40
[ 1519.448354] [<ffffffff8102daf8>] intel_thermal_interrupt+0x178/0x1a0
[ 1519.448357] [<ffffffff8102db71>] smp_thermal_interrupt+0x21/0x40
[ 1519.448360] [<ffffffff815f97ba>] thermal_interrupt+0x6a/0x70
[ 1519.448361] <EOI> [<ffffffff810cf5f5>] ? __audit_syscall_exit+0x65/0x450
[ 1519.448368] [<ffffffff815f8a90>] sysret_audit+0x17/0x21
[ 1519.448369] ---[ end trace 28bd6fe5c8656efc ]---
[ 1528.234694] CPU6: Package temperature above threshold, cpu clock throttled (total events = 237843)
[ 1528.340327] CPU6: Package temperature/speed normal
... and then we see SATA write errors.
Is this BIOS/Motherboard or the kernel throttling the CPUs? If it is the kernel, is this a bug?
We are using kernel 3.4.3-1.fc17.x86_64
Created attachment 596463 [details]
partial dmesg output
Don, Mauro, any thoughts here?
I would need to the the dmesg output (or /var/log/messages) from the beginning. From what I can see from the dmesg snippet above is that a thermal event happened on the Intel Xeon (running Sandy Bridge processors). The mce_log kicked in to service it. During that service the rcu_read_lock blocked inside an irq routine for a very long time and caused the watchdog to time out and panic.
The question is why rcu_read_lock is blocking. Probably because of the multiple events happening. But that doesn't explain why the first one hung, which is why I would need to see the whole dmesg log. These snippets seem to be residual lockups.
Thermal events may be generating SMI calls. A known issue is that some BIOSes take too long to handle those events. Maybe this is the case here: while the mcelog handler routine was written to be fast, if the BIOS is adding too much delay to handle the event, the chances that a subsequent NMI IRQ could happen increase.
I suggest measuring the amount of time that SMI is spending to handle such events. While I never used them, I know the real time people have some tools to measure the SMI latency noise.
Created attachment 596525 [details]
full dmesg output
Uploaded full dmesg output. Not sure if this is relevant, but this machine has a boat-load of NICs in it.
After looking at the dmesg log, it seems most events recovered except for a few like the 'Core temperature' ones. And looking through the code, it doesn't seem to have any slowpaths. I wouldn't be surprised if SMI was involved.
The Real Time folks' tools were just guessing at where SMIs were, last I heard. They scattered calls like rdtsc around known fastpaths, and if things took longer than expected, they assumed it was an SMI and reported the latency.
We could do the same here. I think we could place rdtsc calls at the beginning and end of, say, intel_thermal_interrupt and report any difference greater than 1 second. That would tell us instantly.
I'll see if I can dig up a Fedora tree to create a patch and then try and run something through koji (though I don't think I ever tried building a kernel on koji).
Hold up guys, it seems like a hardware issue. We have several other identical servers that don't have this problem when put under the exact same load, and this one had its heat sink installed improperly.
Sorry about that! So this is what it looks like when it really is a thermal issue.