Bug 169301 - Kernel panic due to NMI watchdog timeout in ip while moving service IP addresses
Status: CLOSED DUPLICATE of bug 166701
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Hardware: All Linux
Priority: medium Severity: medium
Assigned To: Steve Dickson
QA Contact: Cluster QE
Reported: 2005-09-26 14:57 EDT by Henry Harris
Modified: 2008-05-09 13:34 EDT

Doc Type: Bug Fix
Last Closed: 2007-07-20 07:04:34 EDT

Description Henry Harris 2005-09-26 14:57:18 EDT
Description of problem: While running traffic to a GFS file system over NFS 
and moving the IP address as an rgmanager service, the kernel panics due to an 
NMI watchdog timeout in ip.

Version-Release number of selected component (if applicable):

How reproducible: 

Steps to Reproduce:
1. See bug #166701
Actual results:

Expected results:

Additional info:
Comment 1 Henry Harris 2005-09-26 16:08:29 EDT
The following was written a few weeks ago while working with Ben Marzinski to 
troubleshoot bug #166701.  It is the same kernel panic that we are now seeing 
with the U2 code on two different clusters.


The cause of this crash was not obvious and required some investigation. Note
 that this is the second time this crash has occurred where the call stack
 was not clear.

 This crash was caused by an NMI watchdog timeout that called panic after the
 'ip.sh' process appeared to hang while acquiring a spinlock. I pieced together 
 the ip.sh stack:
 __write_lock_failed+7  ----|
.text.lock.spinlock+117 ---|    These lines spinlock waiting for a lock to 

 The system panicked due to an NMI watchdog timeout. The watchdog apparently
 monitors a CPU, looking for a context switch to occur. If none occurs for long
 enough (at least 5 seconds), the system calls panic.
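
The watchdog check described above can be modeled roughly as follows. This is a simplified user-space sketch, not the kernel source. It assumes the watchdog compares a per-CPU timer-interrupt count between NMI ticks (rather than context switches as such) and declares a lockup after 5 seconds' worth of ticks with no change. The names cpu_watch, watchdog_tick, NMI_HZ and LOCKUP_SECONDS are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical constants for this model; not taken from the bug report. */
#define NMI_HZ         1   /* assumed watchdog NMI ticks per second */
#define LOCKUP_SECONDS 5   /* the "5 seconds" quoted from the kernel comment */

/* Per-CPU watchdog state in this model. */
struct cpu_watch {
    unsigned long last_irq_count; /* timer-interrupt count seen at the last progress */
    unsigned int  alert_counter;  /* consecutive NMI ticks with no progress */
};

/*
 * Called once per simulated NMI tick with the CPU's current timer-interrupt
 * count. Returns true when the CPU should be declared locked up, which in
 * the real kernel would lead to die_nmi() and panic.
 */
bool watchdog_tick(struct cpu_watch *w, unsigned long irq_count)
{
    assert(w != NULL);

    if (irq_count == w->last_irq_count) {
        /* No timer interrupts since the last check: the CPU may be stuck. */
        w->alert_counter++;
        return w->alert_counter >= LOCKUP_SECONDS * NMI_HZ;
    }

    /* The CPU made progress, so reset the stall counter. */
    w->last_irq_count = irq_count;
    w->alert_counter = 0;
    return false;
}
```

In this model a CPU spinning on a lock with interrupts disabled never advances its interrupt count, so watchdog_tick returns true on the fifth consecutive stalled tick.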

 In dmesg:

 <4>NMI Watchdog detected LOCKUP, CPU=1, registers:

Since this kdb session, like yesterday's, came from the ip.sh process thread and
the stack itself was incomplete, I looked at the current amd64 stack pointer
register and got the watchdog call stack that called panic.

notifier_call_chain+0x1f (0x3000000008, 0x1007f04e88, 0x100e7f04dc8, 0x14, 0x0)
panic+0x100 (0x0, 0x1)
die_nmi+0x81 (0x100e7f04f58, 0xffffffff80328fe, 0x21, 0x200000002, 0x21)
nmi_watchdog_tick+0xd2 (0x0, 0x0, 0x0, 0x0, 0100e7f04f58)
default_do_nmi+0x7a (0x1)

I looked at the kernel code where die_nmi() is called, and right before the call
there is a comment, "Ayiee, looks like this CPU is stuck wait a few IRQs (5 
seconds) before doing the oops …"  So it looks like the CPU was hung in a
spinlock call.
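
To illustrate why a failed write-lock attempt can spin indefinitely, here is a minimal user-space model of a biased-counter reader/writer spinlock. It is a sketch under stated assumptions, not the kernel implementation: it assumes the lock is a counter that starts at a bias value, that each reader subtracts 1, and that a writer subtracts the whole bias and succeeds only if the lock was completely free. All model_* names are hypothetical, and the spin is bounded here only so that the example terminates.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define RW_LOCK_BIAS 0x01000000 /* counter value of an unlocked lock in this model */

typedef struct { atomic_int count; } model_rwlock;

void model_rwlock_init(model_rwlock *l)
{
    atomic_init(&l->count, RW_LOCK_BIAS);
}

/* A reader takes one unit; it fails if a writer holds the lock. */
bool model_read_trylock(model_rwlock *l)
{
    if (atomic_fetch_sub(&l->count, 1) >= 1)
        return true;
    atomic_fetch_add(&l->count, 1); /* undo the failed attempt */
    return false;
}

void model_read_unlock(model_rwlock *l)
{
    atomic_fetch_add(&l->count, 1);
}

/* A writer takes the whole bias; it succeeds only if the lock was free. */
bool model_write_trylock(model_rwlock *l)
{
    if (atomic_fetch_sub(&l->count, RW_LOCK_BIAS) == RW_LOCK_BIAS)
        return true;
    atomic_fetch_add(&l->count, RW_LOCK_BIAS); /* undo the failed attempt */
    return false;
}

/*
 * Retry up to max_spins times. A real spinning path has no such bound,
 * which is how a CPU can stay stuck until the watchdog fires.
 */
bool model_write_lock_bounded(model_rwlock *l, unsigned long max_spins)
{
    for (unsigned long i = 0; i < max_spins; i++) {
        if (model_write_trylock(l))
            return true;
    }
    return false;
}

void model_write_unlock(model_rwlock *l)
{
    /* The caller must hold the write lock, so the counter cannot be positive. */
    assert(atomic_load(&l->count) <= 0);
    atomic_fetch_add(&l->count, RW_LOCK_BIAS);
}
```

If a reader takes the lock and never releases it, model_write_lock_bounded exhausts its spins and returns false. An unbounded version of the same loop would simply never return.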
Comment 4 Jeff Layton 2007-07-20 07:04:34 EDT

*** This bug has been marked as a duplicate of 166701 ***
