Red Hat Bugzilla – Bug 177702
Umount hung; gfs_glockd using up 98% of cpu
Last modified: 2010-01-11 22:09:32 EST
Description of problem: Running a two node cluster with five GFS file systems
mounted. Got an I/O error on one of the nodes trying to access one of the
file systems. The other node was able to access the same file system without
error. Ran ls -l and saw that the mount point of the failed file system on
the failed node did not report any file permissions, file size, date, etc. Ran
ls -i and the mount point showed inode number 0. Tried to umount the file
system on the failed node and the umount hung. Ran top and saw that gfs_glockd
was using 98% of the CPU. Ran strace on the umount pid but saw nothing. Tried
to kill the umount process but it did not return. Noticed that ls -l now showed
valid data for the mount point and could access the file system again, but
gfs_glockd was still using up almost all of the CPU.
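When a umount hangs like this, it helps to capture which processes are stuck in uninterruptible sleep (D state) alongside the CPU hog before the state clears. A minimal sketch; the filter and its name are mine, not part of the original report:

```shell
# flag_stuck: read `ps -eo pid,stat,pcpu,comm --no-headers` output on
# stdin and flag processes in uninterruptible sleep (state D) or
# burning most of a CPU -- likely candidates here being the hung
# umount and the spinning gfs_glockd thread.
flag_stuck() {
    awk '$2 ~ /^D/   { print "D-state:", $1, $4 }
         $3 + 0 > 90 { print "cpu-hog:", $1, $4 }'
}

# Typical use on the affected node:
#   ps -eo pid,stat,pcpu,comm --no-headers | flag_stuck
```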
Version-Release number of selected component (if applicable):
How reproducible:
Has only happened once
Actual results:
File system cannot be umounted
Expected results:
File system can be umounted
Can you post the message logs from the machine that got the I/O error?
Created attachment 123173 [details]
tar of /var/log/messages file from failed node
The messages file shows everything from the previous reboot on Jan. 9 until the
node was rebooted on Jan. 12 at 16:53:19 following the umount hang.
Once the I/O error occurred, the node withdrew from the cluster. This is normal
behavior and explains why the filesystem was inaccessible from the node.
For some reason GFS gets stuck in gfs_releasepage() because the buffer's
b_count is non-zero. From looking at the log, it appears blkno 24 had a
b_count of 1. This appears to be the same block that initiated the I/O error.
Anyway, with b_count non-zero, we will spin in gfs_releasepage(). That
probably explains why the filesystem cannot be unmounted and why gfs_glockd
eats up the CPU.
Is this node/cluster running on a custom-built kernel?
Yes, it is. We have added some ip statistics gathering and a few other minor
changes to 2.6.9-22.
Also, can you please explain your previous comment that it is normal behavior
for an I/O error to cause a node to withdraw from the cluster? Do you mean it
should withdraw from the cluster and get fenced the same as if a missed
heartbeat was detected?
Prior to having the withdraw capability, when an I/O error occurred the kernel
would panic and thus be fenced. The other nodes could then proceed with recovery.
Now when an I/O error occurs, the node will withdraw from the cluster. This
stops I/O, waits for pending I/O to error out, and leaves the lockspace. Then,
usually, you can unmount the filesystem. A reboot of the node is recommended
regardless of success/failure of the unmount. The advantage of withdraw is that
you can stop any services, applications, etc. manually before the reboot.
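When a node withdraws after an I/O error, reconstructing the sequence of events usually means digging through /var/log/messages for the errors that preceded the withdraw. A small helper; the grep patterns are my assumption about typical log wording, not quotes from this report's attachments:

```shell
# scan_withdraw FILE: pull I/O-error and withdraw-related lines out of
# a syslog-format messages file, with line numbers, so the sequence of
# events around a GFS withdraw is easy to reconstruct.
scan_withdraw() {
    grep -nE 'I/O error|withdraw' "$1"
}

# Typical use:  scan_withdraw /var/log/messages
```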
Are you able to recreate this problem?
I'm looking at md multipathing as a potential root cause of this problem.
I think there are a few issues present here. First, I am looking into why you
got the SCSI error. What was happening on the other node around the time that
this node got the SCSI error? Did you happen to see a SCSI error with return
code 0x10000 on that node?
The reason I ask is that I have a cluster with 3 nodes, all using the qla2xxx
driver and connected to the same fibre channel switch. If I kill one of the
nodes' I/O paths (I do this by either pulling the FC cable or disabling the
port on the switch), the node that lost its I/O path gets an error and I see a SCSI
error with return code 0x10000. Then, some time later the other nodes on the
SAN, whose I/O paths are still fully intact, get a SCSI error with return code
0x20000. This is the same as what you are seeing and should not happen. In
short, if you lost an I/O path for one node, the other nodes on the switch
should be unaffected. So the SCSI error with return code 0x20000 seems to be the
result of some other event on the SAN. Can you verify that this is what happened
in your case?
Note that I removed multipathing from my configuration only to help me debug the
problems individually. I am/was trying to eliminate md multipathing as the
source of the problem and then discovered this problem with the SCSI errors.
I tend to think that the metadata corruption problem is a result of the SCSI
errors.
I have not yet been able to recreate the gfs_glockd problem.
Please attempt to capture debug info from the QLogic driver. To do this, remove
the qla2xxx modules (modprobe -r qla2xxx) and reload with the following parameters:
modprobe qla2xxx extended_error_logging=1
Need the following information:
1) Are there multiple HBAs in each node?
2) Is the storage single processor?
3) If the storage is multi-processor, is it active/active or active/passive?
A potential workaround exists, but it doesn't seem to work all the time. I am
investigating why. You can try this:
modprobe qla2xxx ql2xprocessrscn
Please update the bugzilla stating whether or not this works as well as the
model # of the qlogic HBAs you are using.
Sorry, the option for the qla2xxx driver (above) is incorrect. The option needs
a value. It should be:
modprobe qla2xxx ql2xprocessrscn=1
Without the "=1" you will see...
"qla2xxx: `' invalid for parameter `ql2xprocessrscn'"
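After reloading the driver, it is worth verifying that the option actually took effect rather than trusting the modprobe line; loadable-module parameters are normally exported under /sys/module. A sketch, with the helper name being my own:

```shell
# check_module_param MODULE PARAM: print the current value of a kernel
# module parameter from sysfs. /sys/module/<mod>/parameters/<param> is
# the standard location for parameters of loaded modules.
check_module_param() {
    f="/sys/module/$1/parameters/$2"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "parameter not exposed: $f" >&2
        return 1
    fi
}

# Expected use after reloading the driver:
#   check_module_param qla2xxx ql2xprocessrscn   # should print 1
```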
Answers to Q's in comment #18:
1) yes - multiple HBAs
2 and 3) - dual controller - active/passive
Additional information - here's a testcase that does not involve hardware
failure - i.e. no-one pulled any plugs or failed any ports - just plain
traffic running on the system...
Test Description and console output below. /var/log/messages from each node
will be attached next (two nodes).
** Two GFS File systems (volumes:TestExport_01 & TestExport_02) NFS mounted
and exported to a client system
** Started random IOs with block sizes of 256k & 512k on both NFS mount points
using dt utility
** Transferred up to 5 GB of data in and out of both NFS mount points
** One of the two clustered nodes panic'd during cman emergency_shutdown. See
the console output below for more details.
** The other node of the cluster did NOT panic, but reported a data mismatch to
dt (maybe just a dt hiccup)
Console logs from the node that panic'd while running the test above:
SCSI error : <3 0 1 0> return code = 0x20000
end_request: I/O error, dev sdaz, sector 1751304
SM: 00000001 sm_stop: SG still joined
SM: 01000002 sm_stop: SG still joined
SM: 02000004 sm_stop: SG still joined
SM: 03000009 sm_stop: SG still joined
TestVol_02 move flags 0,1,0 ids 13,15,13
TestVol_02 move use event 15
TestVol_02 recover event 15
TestVol_02 add node 2
TestVol_02 total nodes 2
TestVol_02 rebuild resource directory
TestVol_02 rebuilt 7 resources
TestVol_02 purge requests
TestVol_02 purged 0 requests
TestVol_02 mark waiting requests
TestVol_02 marked 0 requests
TestVol_02 recover event 15 done
TestVol_02 move flags 0,0,1 ids 13,15,15
TestVol_02 process held requests
TestVol_02 processed 0 requests
TestVol_02 resend marked requests
TestVol_02 resent 0 requests
TestVol_02 recover event 15 finished
Magma move flags 0,1,0 ids 0,18,0
Magma move use event 18
Magma recover event 18 (first)
Magma add nodes
Magma total nodes 2
Magma rebuild resource directory
Magma rebuilt 1 resources
Magma recover event 18 done
Magma move flags 0,0,1 ids 0,18,18
Magma process held requests
Magma processed 0 requests
Magma recover event 18 finished
Magma (16140) req reply einval 701c2 fr 2 r 2 usrm::vf
Magma (16140) req reply einval 6037f fr 2 r 2 usrm::vf
15572 pr_start last_stop 0 last_start 6 last_finish 0
15572 pr_start count 1 type 2 event 6 flags 250
15572 claim_jid 0
15572 pr_start 6 done 1
15572 pr_finish flags 5b
15572 pr_start last_stop 6 last_start 8 last_finish 6
15572 pr_start count 2 type 2 event 8 flags 21b
15559 recovery_done jid 0 msg 309 21b
15559 recovery_done nodeid 2 flg 0
15559 recovery_done jid 1 msg 309 21b
15559 recovery_done jid 2 msg 309 21b
15559 recovery_done jid 3 msg 309 21b
15559 recovery_done jid 4 msg 309 21b
15559 recovery_done jid 5 msg 309 21b
15559 recovery_done jid 6 msg 309 21b
15559 recovery_done jid 7 msg 309 21b
15559 recovery_done jid 8 msg 309 21b
15559 recovery_done jid 9 msg 309 21b
15572 pr_start delay done before omm 21b
15572 pr_start 8 done 0
15559 recovery_done jid 10 msg 309 101b
15559 recovery_done jid 11 msg 309 101b
15559 recovery_done jid 12 msg 309 101b
15559 recovery_done jid 13 msg 309 101b
15559 recovery_done jid 14 msg 309 101b
15559 recovery_done jid 15 msg 309 101b
15559 others_may_mount 101b
15559 others_may_mount start_done 8 181b
15572 pr_finish flags 181b
15826 pr_start last_stop 0 last_start 10 last_finish 0
15826 pr_start count 1 type 2 event 10 flags 250
15826 claim_jid 0
15826 pr_start 10 done 1
15826 pr_finish flags 5b
15814 recovery_done jid 0 msg 309 b
15814 recovery_done nodeid 1 flg 18
15814 recovery_done jid 1 msg 309 b
15825 pr_start last_stop 10 last_start 12 last_finish 10
15825 pr_start count 2 type 2 event 12 flags 21b
15814 recovery_done jid 2 msg 309 21b
15814 recovery_done jid 3 msg 309 21b
15814 recovery_done jid 4 msg 309 21b
15814 recovery_done jid 5 msg 309 21b
15814 recovery_done jid 6 msg 309 21b
15814 recovery_done jid 7 msg 309 21b
15814 recovery_done jid 8 msg 309 21b
15814 recovery_done jid 9 msg 309 21b
15825 pr_start delay done before omm 21b
15825 pr_start 12 done 0
15814 recovery_done jid 10 msg 309 101b
15814 recovery_done jid 11 msg 309 101b
15814 recovery_done jid 12 msg 309 101b
15814 recovery_done jid 13 msg 309 101b
15814 recovery_done jid 14 msg 309 101b
15814 recovery_done jid 15 msg 309 101b
15814 others_may_mount 101b
15814 others_may_mount start_done 12 181b
15825 pr_finish flags 181b
15882 pr_start last_stop 0 last_start 14 last_finish 0
15882 pr_start count 1 type 2 event 14 flags 250
15882 claim_jid 0
15882 pr_start 14 done 1
15882 pr_finish flags 5b
15868 recovery_done jid 0 msg 309 b
15868 recovery_done nodeid 1 flg 18
15881 pr_start last_stop 14 last_start 16 last_finish 14
15881 pr_start count 2 type 2 event 16 flags 21b
15868 recovery_done jid 1 msg 309 21b
15868 recovery_done jid 2 msg 309 21b
15868 recovery_done jid 3 msg 309 21b
15868 recovery_done jid 4 msg 309 21b
15868 recovery_done jid 5 msg 309 21b
15868 recovery_done jid 6 msg 309 21b
15868 recovery_done jid 7 msg 309 21b
15868 recovery_done jid 8 msg 309 21b
15868 recovery_done jid 9 msg 309 21b
15868 recovery_done jid 10 msg 309 21b
15868 recovery_done jid 11 msg 309 21b
15881 pr_start delay done before omm 21b
15881 pr_start 16 done 0
15868 recovery_done jid 12 msg 309 101b
15868 recovery_done jid 13 msg 309 101b
15868 recovery_done jid 14 msg 309 101b
15868 recovery_done jid 15 msg 309 101b
15868 others_may_mount 101b
15868 others_may_mount start_done 16 181b
15882 pr_finish flags 181b
lock_dlm: Assertion failed on line 428 of file fs/gfs_locking/lock_dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 4299504932
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:428
invalid operand: 0000  SMP
Entering kdb (current=0x000001007b8f4030, pid 15571) on processor 1 Oops:
due to oops @ 0xffffffffa023ae7c
r15 = 0x0000000000000003 r14 = 0x0000000000000001
r13 = 0x0000000000000000 r12 = 0x000001007bf29e00
rbp = 0x00000100d608a480 rbx = 0x00000000ffffffea
r11 = 0x0000000d80391c00 r10 = 0x0000000000000000
r9 = 0x00000000ffffffea r8 = 0x000000000000000d
rax = 0x0000000000000001 rcx = 0x0000000000000246
rdx = 0x00000000000a8f91 rsi = 0x0000000000000246
rdi = 0xffffffff80375de0 orig_rax = 0xffffffffffffffff
rip = 0xffffffffa023ae7c cs = 0x0000000000000010
eflags = 0x0000000000010212 rsp = 0x000001007c28fcc0
ss = 0x000001007c28e000 &regs = 0x000001007c28fc28
Stack traceback for pid 15571
0x000001007b8f4030 15571 6 1 1 R 0x000001007b8f4430 *lock_dlm1
RSP RIP Function (args)
0x1007c28fcc0 0xffffffffa023ae7c [lock_dlm]do_dlm_lock+0x16d (0x0,
0xffffff0000d32000, 0x100d6333080, 0x1 00d63330ac,
0x1007c28fd38 0xffffffffa023afbf [lock_dlm]lm_dlm_lock+0xd6
(0xffffffffffffffff, 0x100db979898, 0x100db97 9830,
0x1007c28ff28 0xffffffff8014796b kthread+0xc8
0x1007c28ff58 0xffffffff8010fbaf child_rip+0x8
Created attachment 124839 [details]
Created attachment 124840 [details]
Node2 - /var/log/messages
It does not appear that the ql2xprocessrscn=1 option was set on the qla2xxx
driver. Can you confirm this?
I've only seen the 0x20000 SCSI return code in response to an RSCN event on the
fabric. You can trigger one of these events (usually) by disabling a port or
pulling a cable. There may be other ways to trigger an RSCN event. Could there
have been anything else happening on the fabric that would have caused the
switch to trigger an RSCN event?
If you are sure that you had the ql2xprocessrscn=1 option set and you still saw
the 0x20000 error (which is DID_BUS_BUSY), then this points to something other
than the RSCN event.
Also, I am having trouble syncing up what is happening in the two log files
posted above. It appears the info listed as being displayed on the console is
absent from the log files.
What does the snapshot/run_snap status service do? It is interesting that even
after md failed over sdaz to another path, I/O continued to be generated on
that device; we got 2 more 0x20000 errors after that point:
Feb 16 10:44:00 sqaone01 clurgmgrd: : <info> Executing
Feb 16 10:44:28 sqaone01 kernel: SCSI error : <3 0 1 0> return code = 0x20000
Feb 16 10:44:28 sqaone01 kernel: end_request: I/O error, dev sdaz, sector 1648936
Feb 16 10:44:28 sqaone01 kernel: Operation continuing on 15 IO paths.
Feb 16 10:44:28 sqaone01 kernel: end_request: I/O error, dev sdaz, sector 1648944
Feb 16 10:44:31 sqaone01 kernel: SCSI error : <3 0 1 0> return code = 0x20000
Feb 16 10:44:31 sqaone01 kernel: end_request: I/O error, dev sdaz, sector 1649264
Feb 16 10:44:31 sqaone01 kernel: end_request: I/O error, dev sdaz, sector 1649272
Feb 16 10:44:34 sqaone01 clurgmgrd: : <info> Executing
Feb 16 10:45:01 sqaone01 kernel: SCSI error : <3 0 1 0> return code = 0x20000
Feb 16 10:45:01 sqaone01 kernel: end_request: I/O error, dev sdaz, sector 1720704
Feb 16 10:45:01 sqaone01 kernel: end_request: I/O error, dev sdaz, sector 1720712
It was at this point that sqaone2 detected lost heartbeats to sqaone1 and
proceeded to fence it, or at least tried to.
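For correlating these events, a quick per-device tally of the end_request errors in the messages file helps; pairing it with a grep for "return code = 0x20000" shows whether every failure came through the same path. A sketch against the log format quoted above, with the function name being my own:

```shell
# count_io_errors FILE: count end_request I/O errors per device in a
# syslog-format messages file.
count_io_errors() {
    awk '/end_request: I\/O error/ {
             # field layout: ... "dev sdaz," -> strip the trailing comma
             for (i = 1; i < NF; i++)
                 if ($i == "dev") { d = $(i+1); sub(/,$/, "", d); n[d]++ }
         }
         END { for (d in n) print d, n[d] }' "$1"
}

# Typical use:  count_io_errors /var/log/messages
```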
Are you running mdmpd? (/etc/init.d/mdmpd start)
If not, I would highly recommend it. It seems to be the only reliable way I can
prevent 0x20000 SCSI errors from taking down I/O paths when using md
multipathing and the qla2xxx driver.
Closing this bug.
1. md multipathing is no longer supported for GFS; dm multipathing is preferred.
2. Cannot recreate the problem of gfs_glockd consuming 98% of the CPU.