Bug 169301
| Summary: | Kernel panic due to NMI watchdog timeout in ip while moving service IP addresses | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Henry Harris <henry.harris> |
| Component: | kernel | Assignee: | Steve Dickson <steved> |
| Status: | CLOSED DUPLICATE | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 4.0 | CC: | bmarzins, jbaron, kanderso, lhh, teigland |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2007-07-20 11:04:34 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Henry Harris
2005-09-26 18:57:18 UTC
The following was written a few weeks ago while working with Ben Marzinski to troubleshoot bug #166701. It is the same kernel panic that we are now seeing with the U2 code on two different clusters. *************************** This crash is not clear and required some investigation. Note, this is the second time this crash has occurred where the call stack was not obvious. This crash was caused by a NMI watchdog timeout that called panic after the 'ip.sh' process seemed to be hung on acquiring a spinlock. I pieced together the ip.sh stack. __write_lock_failed+7 ----| .text.lock.spinlock+117 ---| These lines spinlock waiting for a lock to clear/release do_exit+1310 sys_exit_group+0 sys_exit_group+15 system_call+126 The system panic'ed do to a NMI watchdog timeout that apparently monitors a cpu looking for a context switch to occur. If it does not after much time (> 5 seconds minimum), the system calls panic. In dmesg. ... <4>NMI Watchdog detected LOCKUP, CPU=1, registers: ⦠Since this kdb session like yesterday came from the ip.sh process thread and the stack itself was incomplete, I looked at the current amd64 stack pointer register and got the watchdog call stack that called panic. notifier_call_chain+0x1f (0x3000000008, 0x1007f04e88, 0x100e7f04dc8, 0x14, 0x0) panic+0x100 (0x0, 0x1) die_nmi+0x81 (0x100e7f04f58, 0xffffffff80328fe, 0x21, 0x200000002, 0x21) nmi_watchdog_tick+0xd2 (0x0, 0x0, 0x0, 0x0, 0100e7f04f58) default_do_nmi+0x7a (0x1) do_nmi+ paranoid_exit() I looked at the kernel code where nmi_die() is called and right before the call there is a comment, "Ayiee, looks like this CPU is stuck wait a few IRQs (5 seconds) before doing the oops â¦" So, the cpu was hungup in a spinlock call it looks like. |