Description of problem: While running NFS traffic over a service IP address, a
cluster node hangs due to a kernel panic.
Version-Release number of selected component (if applicable):
How reproducible: Anywhere from less than 1 hour to 5 days or more depending
on amount of traffic.
Steps to Reproduce:
1. Configure a service IP address on a two-node cluster
2. Export a GFS filesystem over NFS
3. Mount the GFS export from an NFS client using the service IP address
4. Generate read/write traffic over the NFS mount so that the cpu load is at
5. Use a simple script to move the service IP address between the two nodes
using clusvcadm -r every 10 seconds.
Actual results: Node hangs due to kernel panic
Expected results: Continuous operation
Additional info: The following kdb info is from four different occurrences on
three different clusters with various amounts of load over a two day period.
1) Kernel BUG at spinlock:118
Invalid operand: 0000  SMP
Stack trace from process clurgmgrd:
2) Kernel BUG at spinlock:118
Invalid operand: 0000  SMP
Stack trace from process dlm_ast:
3) Kernel NULL pointer at 0000000000000078 RIP:
PML4 d4f0c067 PGD 0
Oops: 0000  SMP
Stack trace from process nfsd:
4) Kernel NULL pointer at 000000000000008C RIP:
PML4 dd3a4017 PGD dd3a5067 PMD 0
Oops: 0000  SMP
Stack trace from process clvmd:
Adding Ben Marzinski to look at the problem.
Created attachment 118092 [details]
script to move service ip address
Use your service name and node name
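The attachment itself is not reproduced in this report. For readers who want
to repeat step 5, here is a minimal sketch of such a relocation loop. It is
not the attached script: the function names are hypothetical, and naming the
target node with -m is an assumption (the report only mentions clusvcadm -r).

```python
import itertools
import subprocess
import time


def relocate_command(service, member):
    """Build the clusvcadm call that relocates `service` to `member`."""
    # Assumption: the target node is named explicitly with -m.
    return ["clusvcadm", "-r", service, "-m", member]


def bounce_service(service, members, interval=10.0, rounds=None,
                   run=subprocess.call, sleep=time.sleep):
    """Relocate `service` to each member in turn, pausing between moves.

    rounds=None loops until interrupted. `run` and `sleep` are injectable
    so the loop can be exercised without a cluster. Returns the number of
    relocations issued.
    """
    targets = itertools.cycle(members)
    if rounds is not None:
        targets = itertools.islice(targets, rounds)
    issued = 0
    for member in targets:
        # The exit status is ignored so the loop keeps going on failure.
        run(relocate_command(service, member))
        issued += 1
        sleep(interval)
    return issued
```

Calling bounce_service("my_service", ["node1", "node2"]) with your own service
and node names moves the service back and forth every 10 seconds until
interrupted.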
Forgot to mention that this problem is occurring on a dual processor Opteron
server. Also, the more NFS traffic and the more often the service ip address
for the mount is moved, the quicker the problem seems to occur. The attached
script caused the failure to occur within one hour last night. In that
particular instance, the GFS file system that was being exported was also
being accessed locally by other software running on the cluster nodes.
We have just discovered that the PCI slot where our quad GigE adapter is
located is hardwired to CPU 1. This means this could very well be a
multiprocessor issue if CPU 0 is moving the IP addresses while network traffic
is being handled by CPU 1.
Given the fact that this happens in seemingly random points in the kernel, it
looks like something subtle.
It appears that I jumped the gun when I told people that I had an explanation for
the spinlock bugs. It appears that GFS is also compiled with spinlock debugging
Here is the output from lsmod:
Module                 Size  Used by
nfsd                 267104  9
exportfs               8192  1 nfsd
lockd                 78896  2 nfsd
lock_dlm              45684  2
gfs                  320652  2
lock_harness           6960  2 lock_dlm,gfs
autofs4               24072  0
i2c_dev               14208  0
i2c_core              29184  1 i2c_dev
dlm                  129796  9 lock_dlm
cman                 136224  19 lock_dlm,dlm
md5                    6272  1
ipv6                 283104  31
sunrpc               171128  19 nfsd,lockd
button                 9504  0
battery               11656  0
ac                     7176  0
ohci_hcd              24976  0
tg3                   89476  0
e1000                 96228  0
bonding               64436  0
floppy                66512  0
sg                    43320  0
ext3                 137488  4
jbd                   68784  1 ext3
dm_mod                65984  3
qla2300              124032  0
qla2xxx              122080  3 qla2300
scsi_transport_fc     11136  1 qla2xxx
mptscsih              37808  0
mptbase               50848  1 mptscsih
sd_mod                19328  8
scsi_mod             140240  5 sg,qla2xxx,scsi_transport_fc,mptscsih,sd_mod
What are the exact iozone cmdlines that you are using? I've been just guessing
and using the defaults in a loop.
Created attachment 118265 [details]
Script to run iozone
Here is the script that runs iozone.
Looking back through the /var/log/messages files, I see that in every case I
looked at (4 or 5) the node that fails is the one that the IP service has been
moved to. This is also consistent with what we were seeing when the failure
occurred in the field every 2 to 4 days.
I'm almost positive I know what is causing the panic in gfs_create. It is a
problem in GFS/NFS interaction. Once we changed our tests to stress this
interaction, we were able to reproduce that kernel panic. I am currently
working on a test GFS rpm, to see if it fixes these kernel panics. Unfortunately,
this doesn't look like it has anything to do with the other panics. But, there's
Here's an explanation of the problem, and the workaround in the modified gfs module.
When the VFS layer calls a filesystem specific create function, it passes down
intent data. This tells the filesystem information like whether or not this is
an exclusive create request (the file was opened with O_CREAT | O_EXCL). Before
the kernel nfs daemon calls a filesystem specific create function, it checks if
the file exists. If it does, nfsd never passes the request to the underlying
filesystem. Because the file doesn't exist, it doesn't matter whether or not
the create is exclusive, so nfsd doesn't pass the intent data to the underlying
filesystem. This works fine for local filesystems. But for cluster
filesystems, the file could be getting created on another node after nfs checks
for its existence. GFS cannot reliably check whether a file exists until
it locks the directory.
The panic in gfs_create was happening because nfs passed down a create request
with no intent information. When GFS checked, the file already existed, so GFS
checked the intent information, but found a NULL pointer.
Since there is no way to get NFS to pass the intent information for this
release, I made GFS assume that the create was not exclusive if it gets into
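The race and the workaround described above can be illustrated with a small
user-space model. This is only a sketch, not the GFS or nfsd code: every
class and function name is hypothetical, a Python None stands in for the NULL
intent pointer, and an AttributeError stands in for the kernel oops.

```python
class Intent:
    """Stand-in for the intent data the VFS passes to create()."""
    def __init__(self, exclusive):
        self.exclusive = exclusive


class Directory:
    """Toy cluster directory; `entries` is shared by all nodes."""
    def __init__(self):
        self.entries = set()


def gfs_create_original(directory, name, intent):
    """Model of the pre-fix behaviour: intent is dereferenced unchecked."""
    # The directory lock would be taken here; only now is the check reliable.
    if name in directory.entries:
        if intent.exclusive:  # intent is None on the nfsd path, so this fails
            raise FileExistsError(name)
        return "opened existing"
    directory.entries.add(name)
    return "created"


def gfs_create_workaround(directory, name, intent):
    """Model of the workaround: a missing intent means non-exclusive."""
    exclusive = intent.exclusive if intent is not None else False
    if name in directory.entries:
        if exclusive:
            raise FileExistsError(name)
        return "opened existing"
    directory.entries.add(name)
    return "created"


def nfsd_create(directory, name, gfs_create, racing_create=None):
    """Model of nfsd: check existence first, then create with no intent."""
    if name in directory.entries:
        return "exists, request not passed down"
    if racing_create is not None:
        racing_create()  # another node creates the file in the window
    return gfs_create(directory, name, None)
```

With a racing create in the window, nfsd_create fails in gfs_create_original
but returns normally with gfs_create_workaround. A local exclusive create
that passes Intent(True) still gets FileExistsError in both versions.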
Posted attachments to bug #163168 showing strace for clustat hang and
clusvcadm hang that occurred while running tests described in this bug.
Just to log this, I compiled the RHEL4 U2 gfs code (including the nfs creation
fix from above) against the current crosswalk kernel, and as far as I know, the
systems have been running this code for days without problems. Is this correct?
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
*** Bug 169301 has been marked as a duplicate of this bug. ***