Description of problem:
While running NFS traffic over a service IP address, a cluster node hangs due to a kernel panic.

Version-Release number of selected component (if applicable):

How reproducible:
Anywhere from less than 1 hour to 5 days or more, depending on the amount of traffic.

Steps to Reproduce:
1. Configure a service IP address on a two-node cluster
2. Export a GFS filesystem over NFS
3. Mount the GFS export from an NFS client using the service IP address
4. Generate read/write traffic over the NFS mount so that the CPU load is at least 50%
5. Use a simple script to move the service IP address between the two nodes using clusvcadm -r every 10 seconds

Actual results:
Node hangs due to kernel panic

Expected results:
Continuous operation

Additional info:
The following kdb info is from four different occurrences on three different clusters with various amounts of load over a two-day period.

1) Kernel BUG at spinlock:118
Invalid operand: 0000 [1] SMP
Stack trace from process clurgmgrd:
_spin_lock_irqsave+0x28
add_wait_queue+0x12
__poll_wait+0xae
do_select+0x290
sys_select+0x334

2) Kernel BUG at spinlock:118
Invalid operand: 0000 [1] SMP
Stack trace from process dlm_ast:
__lock_test_start+0x20
[dlm]add_to_astqueue+0x8d
[dlm]ast_routine+0x89
[dlm]dlm_astd+0x331
kthread+0xc8
child_rip+0x8

3) Kernel NULL pointer at 0000000000000078
RIP: gfs:gfs_create+155
PML4 d4f0c067 PGD 0
Oops: 0000 [1] SMP
Stack trace from process nfsd:
[gfs]gfs_create+0x9b
[sunrpc]svc_process+0x4c0
[nfs]nfsd+0x238
child_rip+0x8

4) Kernel NULL pointer at 000000000000008C
RIP: rb_first+10
PML4 dd3a4017 PGD dd3a5067 PMD 0
Oops: 0000 [1] SMP
Stack trace from process clvmd:
mpol_free_shared_policy+0x53
shmem_destroy_inode+0x11
destroy_inode+0x42
generic_delete_inode+0x12d
iput+0x78
sys_unlink+0x105
Adding Ben Marzinski to look at the problem.
Created attachment 118092 [details]
script to move service ip address

Use your service name and node name.
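The attached script itself is not reproduced here, but for illustration this is a minimal standalone sketch of the relocation loop described in step 5 of the report, driving clusvcadm -r every 10 seconds. The service name (nfs_svc) and node names (node1, node2) are placeholders, not the actual configuration.

/* Sketch only -- not the attached script.  Relocates a cluster service
 * back and forth between two nodes every 10 seconds using clusvcadm -r.
 * "nfs_svc", "node1" and "node2" are placeholder names.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        const char *nodes[] = { "node1", "node2" };
        char cmd[256];
        int i = 0;

        for (;;) {
                snprintf(cmd, sizeof(cmd),
                         "clusvcadm -r nfs_svc -m %s", nodes[i % 2]);
                if (system(cmd) != 0)
                        fprintf(stderr, "relocation to %s failed\n",
                                nodes[i % 2]);
                i++;
                sleep(10);      /* move the service IP every 10 seconds */
        }
        return 0;
}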
Forgot to mention that this problem is occurring on a dual processor Opteron server. Also, the more NFS traffic and the more often the service IP address for the mount is moved, the quicker the problem seems to occur. The attached script caused the failure to occur within one hour last night. In that particular instance, the GFS file system that was being exported was also being accessed locally by other software running on the cluster nodes.
We have just discovered that the PCI slot where our quad GigE adapter is located is hardwired to CPU 1. This means this could very well be a multiprocessor issue if CPU 0 is moving the IP addresses while network traffic is being handled by CPU 1.
Given that this happens at seemingly random points in the kernel, it looks like something subtle.
It appears that I jumped the gun when I told people that I had an explanation for the spinlock bugs. It turns out that GFS is also compiled with spinlock debugging enabled.
Here is the output from lsmod:

Module             Size    Used by
nfsd               267104  9
exportfs           8192    1 nfsd
lockd              78896   2 nfsd
lock_dlm           45684   2
gfs                320652  2
lock_harness       6960    2 lock_dlm,gfs
autofs4            24072   0
i2c_dev            14208   0
i2c_core           29184   1 i2c_dev
dlm                129796  9 lock_dlm
cman               136224  19 lock_dlm,dlm
md5                6272    1
ipv6               283104  31
sunrpc             171128  19 nfsd,lockd
button             9504    0
battery            11656   0
ac                 7176    0
ohci_hcd           24976   0
tg3                89476   0
e1000              96228   0
bonding            64436   0
floppy             66512   0
sg                 43320   0
ext3               137488  4
jbd                68784   1 ext3
dm_mod             65984   3
qla2300            124032  0
qla2xxx            122080  3 qla2300
scsi_transport_fc  11136   1 qla2xxx
mptscsih           37808   0
mptbase            50848   1 mptscsih
sd_mod             19328   8
scsi_mod           140240  5 sg,qla2xxx,scsi_transport_fc,mptscsih,sd_mod
What are the exact iozone cmdlines that you are using? I've been just guessing and using the defaults in a loop.
Created attachment 118265 [details]
Script to run iozone

Here is the script that runs iozone.
Looking back through the /var/log/messages files, I see that in every case I looked at (4 or 5) the node that fails is the one that the IP service has been moved to. This is also consistent with what we were seeing when the failure occurred in the field every 2 to 4 days.
I'm almost positive I know what is causing the panic in gfs_create. It is a problem in GFS/NFS interaction. Once we changed our tests to stress this interaction, we were able to reproduce that kernel panic. I am currently working on a test GFS rpm to see if it fixes these kernel panics. Unfortunately, this doesn't look like it has anything to do with the other panics. But, there's always hope.
Here's an explanation of the problem, and the workaround in the modified gfs module.

When the VFS layer calls a filesystem-specific create function, it passes down intent data. This tells the filesystem information like whether or not this is an exclusive create request (the file was opened with O_CREAT | O_EXCL). Before the kernel nfs daemon calls a filesystem-specific create function, it checks if the file exists. If it does, nfsd never passes the request to the underlying filesystem. Because the file doesn't exist, it doesn't matter whether or not the create is exclusive, so nfsd doesn't pass the intent data to the underlying filesystem. This works fine for local filesystems. But for cluster filesystems, the file could be getting created on another node after nfsd checks for its existence. GFS cannot reliably check whether a file exists until it locks the directory.

The panic in gfs_create was happening because nfsd passed down a create request with no intent information. When GFS checked, the file already existed, so GFS went to read the intent information and found a NULL pointer. Since there is no way to get NFS to pass the intent information for this release, I made GFS assume that the create was not exclusive if it gets into this situation.
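To make the workaround above concrete, here is an illustrative sketch of the check at the top of gfs_create(). This is not the actual GFS patch; the gfs_createi() helper and the nd->intent.open.flags field are assumptions based on the 2.6.9-era ->create() interface. The key point is simply that a NULL nameidata coming from knfsd is treated as a non-exclusive create.

#include <linux/fs.h>
#include <linux/namei.h>
#include <linux/fcntl.h>
#include <linux/errno.h>

/* Hypothetical helper: creates the inode while holding the directory's
 * cluster lock and returns -EEXIST if another node created it first. */
static int gfs_createi(struct inode *dir, struct dentry *dentry, int mode);

static int gfs_create(struct inode *dir, struct dentry *dentry,
                      int mode, struct nameidata *nd)
{
        /* knfsd calls ->create() with nd == NULL, because it has already
         * checked (without the cluster lock) that the file does not
         * exist.  On GFS another node can create the file in the
         * meantime, so the intent data must not be dereferenced blindly;
         * with no intent data, assume the create was not exclusive. */
        int exclusive = (nd != NULL) && (nd->intent.open.flags & O_EXCL);
        int error;

        error = gfs_createi(dir, dentry, mode);  /* takes the directory lock */

        if (error == -EEXIST && !exclusive) {
                /* Lost the cross-node race: treat the create as a lookup
                 * of the existing file instead of failing it (the real
                 * code would then instantiate the dentry with that
                 * existing inode). */
                error = 0;
        }

        return error;
}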
Posted attachments to bug #163168 showing strace for clustat hang and clusvcadm hang that occurred while running tests described in this bug.
Just to log this, I compiled the RHEL4 U2 gfs code (including the nfs create fix from above) against the current crosswalk kernel, and as far as I know the systems have been running this code for days without problems. Is this correct?
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2005-740.html
*** Bug 169301 has been marked as a duplicate of this bug. ***