Bug 166701

Summary: Kernel panic with NFS traffic being moved between nodes using service IP address
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.0
Hardware: x86_64
OS: Linux
Severity: high
Priority: medium
Status: CLOSED ERRATA
Reporter: Henry Harris <henry.harris>
Assignee: Steve Dickson <steved>
QA Contact: Cluster QE <mspqa-list>
CC: axel.thimm, bmarzins, jbrassow, kanderso, lhh
Fixed In Version: RHBA-2005-740
Doc Type: Bug Fix
Last Closed: 2005-10-07 16:57:22 UTC
Bug Blocks: 132823, 167257
Attachments:
  script to move service ip address (flags: none)
  Script to run iozone (flags: none)

Description Henry Harris 2005-08-24 18:59:13 UTC
Description of problem: While running NFS traffic over a service IP address, a 
cluster node hangs due to a kernel panic.


Version-Release number of selected component (if applicable):


How reproducible: Anywhere from less than one hour to five days or more, 
depending on the amount of traffic.


Steps to Reproduce:
1. Configure a service IP address on a two node cluster
2. Export a GFS filesystem over NFS
3. Mount the GFS export from an NFS client using the service IP address
4. Generate read/write traffic over the NFS mount so that the CPU load is at 
least 50%
5. Use a simple script to move the service IP address between the two nodes 
using clusvcadm -r every 10 seconds (a sketch of such a loop is shown below).
  
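For reference, a minimal sketch of the kind of relocation loop described in
step 5, written here in Python. This is not the attached script; the service
name "nfs-svc" and member names "node1"/"node2" are placeholders.

  # Illustrative relocation loop (not the attached script): bounce the
  # clustered service between two members with clusvcadm -r every 10 seconds.
  import itertools
  import subprocess
  import time

  SERVICE = "nfs-svc"           # hypothetical rgmanager service name
  MEMBERS = ["node1", "node2"]  # hypothetical cluster member names

  for target in itertools.cycle(MEMBERS):
      # clusvcadm -r <service> -m <member> relocates the service (and its
      # service IP address) to the named member.
      subprocess.call(["clusvcadm", "-r", SERVICE, "-m", target])
      time.sleep(10)
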
Actual results: Node hangs due to kernel panic


Expected results: Continuous operation


Additional info:  The following kdb info is from four different occurrences on 
three different clusters with various amounts of load over a two-day period.

1) Kernel BUG at spinlock:118
   Invalid operand: 0000 [1] SMP

   Stack trace from process clurgmgrd:

   _spin_lock_irqsave+0x28
   add_wait_queue+0x12
   __poll_wait+0xae
   do_select+0x290
   sys_select+0x334

2) Kernel BUG at spinlock:118
   Invalid operand: 0000 [1] SMP

   Stack trace from process dlm_ast:

   __lock_test_start+0x20
   [dlm]add_to_astqueue+0x8d
   [dlm]ast_routine+0x89
   [dlm]dlm_astd+0x331
   kthread+0xc8
   child_rip+0x8

3) Kernel NULL pointer at 0000000000000078 RIP:
   gfs:gfs_create+155
   PML4 d4f0c067 PGD 0
   Oops: 0000 [1] SMP

   Stack trace from process nfsd:

   [gfs]gfs_create+0x9b
   [sunrpc]svc_process+0x4c0
   [nfs]nfsd+0x238
   child_rip+0x8

4) Kernel NULL pointer at 000000000000008C RIP:
   rb_first+10
   PML4 dd3a4017 PGD dd3a5067 PMD 0
   Oops: 0000 [1] SMP

   Stack trace from process clvmd:

   mpol_free_shared_policy+0x53
   shmem_destroy_inode+0x11
   destroy_inode+0x42
   generic_delete_inode+0x12d
   iput+0x78
   sys_unlink+0x105

Comment 1 Kiersten (Kerri) Anderson 2005-08-24 19:51:42 UTC
Adding Ben Marzinski to look at the problem.

Comment 2 Henry Harris 2005-08-24 20:33:31 UTC
Created attachment 118092 [details]
script to move service ip address

 Use your service name and node name

Comment 3 Henry Harris 2005-08-24 23:03:04 UTC
Forgot to mention that this problem is occurring on a dual-processor Opteron 
server.  Also, the more NFS traffic there is and the more often the service IP 
address for the mount is moved, the quicker the problem seems to occur.  The 
attached script caused the failure to occur within one hour last night.  In that 
particular instance, the GFS file system that was being exported was also 
being accessed locally by other software running on the cluster nodes.

Comment 4 Henry Harris 2005-08-25 19:50:28 UTC
We have just discovered that the PCI slot where our quad GigE adapter is 
located is hardwired to CPU 1.  This means it could very well be a 
multiprocessor issue if CPU 0 is moving the IP addresses while the network 
traffic is being handled by CPU 1.

Comment 5 Lon Hohberger 2005-08-26 16:36:07 UTC
Given that this happens at seemingly random points in the kernel, it looks
like something subtle.

Comment 6 Ben Marzinski 2005-08-26 21:40:22 UTC
It appears that I jumped the gun when I told people that I had an explanation
for the spinlock bugs.  It turns out that GFS is also compiled with spinlock
debugging enabled.

Comment 7 Henry Harris 2005-08-29 21:52:09 UTC
Here is the output from lsmod:

Module                  Size  Used by
nfsd                  267104  9 
exportfs                8192  1 nfsd
lockd                  78896  2 nfsd
lock_dlm               45684  2 
gfs                   320652  2 
lock_harness            6960  2 lock_dlm,gfs
autofs4                24072  0 
i2c_dev                14208  0 
i2c_core               29184  1 i2c_dev
dlm                   129796  9 lock_dlm
cman                  136224  19 lock_dlm,dlm
md5                     6272  1 
ipv6                  283104  31 
sunrpc                171128  19 nfsd,lockd
button                  9504  0 
battery                11656  0 
ac                      7176  0 
ohci_hcd               24976  0 
tg3                    89476  0 
e1000                  96228  0 
bonding                64436  0 
floppy                 66512  0 
sg                     43320  0 
ext3                  137488  4 
jbd                    68784  1 ext3
dm_mod                 65984  3 
qla2300               124032  0 
qla2xxx               122080  3 qla2300
scsi_transport_fc      11136  1 qla2xxx
mptscsih               37808  0 
mptbase                50848  1 mptscsih
sd_mod                 19328  8 
scsi_mod              140240  5 sg,qla2xxx,scsi_transport_fc,mptscsih,sd_mod

Comment 8 Corey Marthaler 2005-08-30 19:47:59 UTC
What are the exact iozone command lines that you are using? I've just been
guessing and using the defaults in a loop.

Comment 9 Henry Harris 2005-08-30 20:27:49 UTC
Created attachment 118265 [details]
Script to run iozone

Here is the script that runs iozone.
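
(The attachment itself is not reproduced here.)  As a rough illustration only,
a load loop of this kind might look like the Python sketch below; the mount
point, file size, and record size are assumptions, not the reporter's actual
command lines, which are in the attachment.

  # Hypothetical sketch of an iozone load loop; the real command lines are in
  # the attachment.  /mnt/nfs is a placeholder for the NFS-mounted GFS export.
  import subprocess

  MOUNT = "/mnt/nfs"

  while True:
      # -a: full automatic mode, -s: file size, -r: record size, -f: test file
      subprocess.call(["iozone", "-a", "-s", "512m", "-r", "64k",
                       "-f", MOUNT + "/iozone.tmp"])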

Comment 10 Henry Harris 2005-08-31 20:36:05 UTC
Looking back through the /var/log/messages files, I see that in every case I 
looked at (4 or 5), the node that fails is the one that the service IP has been 
moved to.  This is also consistent with what we were seeing when the failure 
occurred in the field every 2 to 4 days.

Comment 12 Ben Marzinski 2005-09-01 21:27:21 UTC
I'm almost positive I know what is causing the panic in gfs_create.  It is a
problem in GFS/NFS interaction.  Once we changed our tests to stress this
interaction, we were able to reproduce that kernel panic.  I am currently
working on a test GFS RPM to see if it fixes these kernel panics.  Unfortunately,
this doesn't look like it has anything to do with the other panics.  But there's
always hope.

Comment 13 Ben Marzinski 2005-09-06 22:02:30 UTC
Here's an explanation of the problem, and the workaround in the modified gfs module.

When the VFS layer calls a filesystem-specific create function, it passes down
intent data.  This tells the filesystem information such as whether or not this
is an exclusive create request (the file was opened with O_CREAT | O_EXCL).
Before the kernel nfs daemon calls a filesystem-specific create function, it
checks whether the file exists.  If it does, nfsd never passes the request to
the underlying filesystem.  Because, as far as nfsd can tell, the file doesn't
exist, it doesn't matter whether or not the create is exclusive, so nfsd doesn't
pass the intent data to the underlying filesystem.  This works fine for local
filesystems.  But for cluster filesystems, the file could be getting created on
another node after nfsd checks for its existence.  GFS cannot reliably check
whether a file exists until it locks the directory.

The panic in gfs_create was happening because nfsd passed down a create request
with no intent information.  When GFS checked, the file already existed, so GFS
went to examine the intent information and found a NULL pointer.

Since there is no way to get NFS to pass the intent information for this
release, I made GFS assume that the create was not exclusive if it gets into
this situation.
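
To make the exclusive-create case concrete, here is a small client-side
illustration in Python (not part of the fix): an application opening a file on
the NFS mount with O_CREAT | O_EXCL is exactly the kind of request whose
"exclusive" intent nfsd does not forward to GFS, and which the workaround
treats as non-exclusive when the intent data is missing.  The path below is a
placeholder.

  # Client-side illustration of an exclusive create over the NFS mount.
  # With O_EXCL, the create must fail if the file already exists; this is the
  # "exclusive" intent that nfsd does not pass down to GFS.
  import errno
  import os

  path = "/mnt/nfs/lockfile"  # placeholder path on the NFS-mounted GFS export
  try:
      fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
      os.close(fd)
      print("created exclusively")
  except OSError as e:
      if e.errno == errno.EEXIST:
          print("file already exists (exclusive create refused)")
      else:
          raise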

Comment 14 Henry Harris 2005-09-09 18:12:46 UTC
Posted attachments to bug #163168 showing strace for clustat hang and 
clusvcadm hang that occurred while running tests described in this bug.

Comment 15 Ben Marzinski 2005-09-14 18:02:19 UTC
Just to log this, I compiled the RHEL4 U2 gfs code (including the NFS create
fix from above) against the current crosswalk kernel, and as far as I know, the
systems have been running this code for days without problems. Is this correct?

Comment 16 Red Hat Bugzilla 2005-10-07 16:57:22 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-740.html


Comment 17 Jeff Layton 2007-07-20 11:04:40 UTC
*** Bug 169301 has been marked as a duplicate of this bug. ***