Description of problem:
Running a two-node cluster with five GFS file systems mounted. Got an I/O error on one of the nodes while trying to access one of the file systems. The other node was able to access the same file system without error. Ran ls -l and saw that the mount point of the failed file system on the failed node did not report any file permissions, file size, date, etc. Ran ls -i and the mount point showed inode 0. Tried to umount the file system on the failed node and the umount hung. Ran top and saw that gfs_glockd was using 98% of the CPU. Ran strace on the umount pid but saw nothing. Tried to kill the umount process but it did not return. Noticed that ls -l now showed valid data for the mount point and the file system could be accessed again, but gfs_glockd was still using up almost all of the CPU.

Version-Release number of selected component (if applicable):
dlm-1.0.0-5

How reproducible:
Has only happened once

Steps to Reproduce:
1.
2.
3.

Actual results:
File system cannot be umounted

Expected results:
File system can be umounted

Additional info:
Can you post the message logs from the machine that got the I/O error?
Created attachment 123173 [details]
tar of /var/log/messages file from failed node

The messages file shows everything from the previous reboot on Jan. 9 until the node was rebooted on Jan. 12 at 16:53:19 following the umount hang.
Once the I/O error occurred, the node withdrew from the cluster. This is normal behavior and explains why the filesystem was inaccessible from the node. For some reason GFS gets stuck in gfs_releasepage() because the buffer's b_count is non-zero. From looking at the log, it appears blkno 24 had a b_count of 1. This appears to be the same block that initiated the I/O error. Anyway, with b_count being non-zero, we will spin in gfs_releasepage(). That probably explains why the filesystem cannot be unmounted and why gfs_glockd eats up CPU cycles.
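If this happens again, it would help to confirm where gfs_glockd is spinning. A rough way to capture that (assuming sysrq is enabled on this kernel) is to dump the kernel stacks of all tasks and look for gfs_glockd / gfs_releasepage in the traces:

echo 1 > /proc/sys/kernel/sysrq      # make sure sysrq is enabled
echo t > /proc/sysrq-trigger         # dump kernel stacks of all tasks to the kernel log
dmesg > /tmp/sysrq-t.out             # save the traces and attach them to this bug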
Is this node/cluster running on a custom-built kernel?
Yes, it is. We have added some IP statistics gathering and a few other minor changes to 2.6.9-22. Also, can you please explain your previous comment that it is normal behavior for an I/O error to cause a node to withdraw from the cluster? Do you mean it should withdraw from the cluster and get fenced, the same as if a missed heartbeat were detected?
Prior to having the withdraw capability, when an I/O error occurred the kernel would panic and the node would thus be fenced. The other nodes could then proceed with recovery. Now when an I/O error occurs, the node will withdraw from the cluster. This stops I/O, waits for pending I/O to error out, and leaves the lockspace. Then, usually, you can unmount the filesystem. A reboot of the node is recommended regardless of the success or failure of the unmount. The advantage of withdraw is that you can stop any services, applications, etc. manually before the reboot.
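In other words, after a withdraw the cleanup on the affected node would normally look something like the following; the service name and mount point here are only placeholders for whatever is running on top of GFS:

service nfs stop          # stop whatever services use the GFS mount (example only)
umount /mnt/gfs01         # may succeed once pending I/O has errored out
reboot                    # recommended whether or not the umount worked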
Are you able to recreate this problem? I'm looking at md multipathing as a potential root cause of this problem.
I think there are a few issues present here.

First, I am looking into why you got the SCSI error. What was happening on the other node around the time that this node got the SCSI error? Did you happen to see a SCSI error with return code 0x10000 on that node? The reason I ask is that I have a cluster with 3 nodes, all using the qla2xxx driver and connected to the same fibre channel switch. If I kill one node's I/O path (I do this by either pulling the FC cable or disabling the port on the switch), the node that lost its I/O path gets an error and I see a SCSI error with return code 0x10000. Then, some time later, the other nodes on the SAN, whose I/O paths are still fully intact, get a SCSI error with return code 0x20000. This is the same as what you are seeing and should not happen. In short, if one node loses an I/O path, the other nodes on the switch should be unaffected. So the SCSI error with return code 0x20000 seems to be the result of some other event on the SAN. Can you verify that this is what happened in your case?

Note that I removed multipathing from my configuration only to help me debug the problems individually. I am/was trying to eliminate md multipathing as the source of the problem and then discovered this problem with the SCSI errors. I tend to think that the metadata corruption problem is a result of the SCSI error problem. I have not yet been able to recreate the gfs_glockd problem.
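To line the two nodes up, it would help to pull the SCSI errors (with timestamps) out of /var/log/messages on both machines, e.g.:

grep "return code = 0x" /var/log/messages
grep "end_request: I/O error" /var/log/messages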
Please attempt to capture debug info from the QLogic driver. To do this, remove the qla2xxx module (modprobe -r qla2xxx) and reload it with the following parameter:

modprobe qla2xxx extended_error_logging=1
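Note that the module can only be removed if nothing is using it, so the GFS filesystems (and any md devices) sitting on the qla2xxx paths need to be stopped first. Roughly, with the mount point as a placeholder:

umount /mnt/gfs01                             # repeat for each GFS mount on those paths
modprobe -r qla2xxx
modprobe qla2xxx extended_error_logging=1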
Need the following information:
1) Are there multiple HBAs in each node?
2) Is the storage single processor?
3) If the storage is multi-processor, is it active/active or active/passive?
Thanks.
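In case it helps in gathering that, the HBA count and model info can usually be read from lspci and the qla2xxx proc entries (paths from memory, adjust as needed):

lspci | grep -i fibre          # how many FC HBAs are in each node
ls /proc/scsi/qla2xxx/         # one entry per qla2xxx HBA instance
cat /proc/scsi/qla2xxx/*       # model, firmware and topology details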
A potential workaround exists, but it doesn't seem to work all the time. I am investigating why. You can try this:

modprobe qla2xxx ql2xprocessrscn

Please update the bugzilla stating whether or not this works, as well as the model # of the QLogic HBAs you are using.
Sorry, the option for the qla2xxx driver (above) is incorrect. The option needs a value. It should be:

modprobe qla2xxx ql2xprocessrscn=1

Without the "=1" you will see:

qla2xxx: `' invalid for parameter `ql2xprocessrscn'
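If you want the option to stick across reboots, one way (assuming the stock RHEL4 setup where qla2xxx is loaded from the initrd) is to add it to /etc/modprobe.conf and rebuild the initrd:

echo "options qla2xxx ql2xprocessrscn=1" >> /etc/modprobe.conf
mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)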
Answers to Q's in comment #18:
1) yes - multiple HBAs
2 and 3) dual controller - active/passive

Additional information - here's a testcase that does not involve hardware failure - i.e. no one pulled any plugs or failed any ports - just plain traffic running on the system. Test description and console output are below; /var/log/messages from each node will be attached next (two nodes).

Test Description:
** Two GFS file systems (volumes TestExport_01 & TestExport_02) NFS exported to and mounted on a client system
** Started random I/Os with block sizes of 256k & 512k on both NFS mount points using the dt utility
** Transferred up to 5G of data in and out of both NFS mount points

Test Results:
** One of the two clustered nodes panicked during the cman emergency shutdown. See the console output below for more details.
** The other node of the cluster did NOT panic, but reported a data mismatch to dt (maybe just a dt hiccup)

Console log from the node that panicked while running the test above:

SCSI error : <3 0 1 0> return code = 0x20000
end_request: I/O error, dev sdaz, sector 1751304
WARNING: dlm_emergency_shutdown
SM: 00000001 sm_stop: SG still joined
SM: 01000002 sm_stop: SG still joined
SM: 02000004 sm_stop: SG still joined
SM: 03000009 sm_stop: SG still joined
13,13,13
TestVol_02 move flags 0,1,0 ids 13,15,13
TestVol_02 move use event 15
TestVol_02 recover event 15
TestVol_02 add node 2
TestVol_02 total nodes 2
TestVol_02 rebuild resource directory
TestVol_02 rebuilt 7 resources
TestVol_02 purge requests
TestVol_02 purged 0 requests
TestVol_02 mark waiting requests
TestVol_02 marked 0 requests
TestVol_02 recover event 15 done
TestVol_02 move flags 0,0,1 ids 13,15,15
TestVol_02 process held requests
TestVol_02 processed 0 requests
TestVol_02 resend marked requests
TestVol_02 resent 0 requests
TestVol_02 recover event 15 finished
Magma move flags 0,1,0 ids 0,18,0
Magma move use event 18
Magma recover event 18 (first)
Magma add nodes
Magma total nodes 2
Magma rebuild resource directory
Magma rebuilt 1 resources
Magma recover event 18 done
Magma move flags 0,0,1 ids 0,18,18
Magma process held requests
Magma processed 0 requests
Magma recover event 18 finished
Magma (16140) req reply einval 701c2 fr 2 r 2 usrm::vf
Magma (16140) req reply einval 6037f fr 2 r 2 usrm::vf
15572 pr_start last_stop 0 last_start 6 last_finish 0
15572 pr_start count 1 type 2 event 6 flags 250
15572 claim_jid 0
15572 pr_start 6 done 1
15572 pr_finish flags 5b
15572 pr_start last_stop 6 last_start 8 last_finish 6
15572 pr_start count 2 type 2 event 8 flags 21b
15559 recovery_done jid 0 msg 309 21b
15559 recovery_done nodeid 2 flg 0
15559 recovery_done jid 1 msg 309 21b
15559 recovery_done jid 2 msg 309 21b
15559 recovery_done jid 3 msg 309 21b
15559 recovery_done jid 4 msg 309 21b
15559 recovery_done jid 5 msg 309 21b
15559 recovery_done jid 6 msg 309 21b
15559 recovery_done jid 7 msg 309 21b
15559 recovery_done jid 8 msg 309 21b
15559 recovery_done jid 9 msg 309 21b
15572 pr_start delay done before omm 21b
15572 pr_start 8 done 0
15559 recovery_done jid 10 msg 309 101b
15559 recovery_done jid 11 msg 309 101b
15559 recovery_done jid 12 msg 309 101b
15559 recovery_done jid 13 msg 309 101b
15559 recovery_done jid 14 msg 309 101b
15559 recovery_done jid 15 msg 309 101b
15559 others_may_mount 101b
15559 others_may_mount start_done 8 181b
15572 pr_finish flags 181b
15826 pr_start last_stop 0 last_start 10 last_finish 0
15826 pr_start count 1 type 2 event 10 flags 250
15826 claim_jid 0
15826 pr_start 10 done 1
15826 pr_finish flags 5b
15814 recovery_done jid 0 msg 309 b
15814 recovery_done nodeid 1 flg 18
15814 recovery_done jid 1 msg 309 b
15825 pr_start last_stop 10 last_start 12 last_finish 10
15825 pr_start count 2 type 2 event 12 flags 21b
15814 recovery_done jid 2 msg 309 21b
15814 recovery_done jid 3 msg 309 21b
15814 recovery_done jid 4 msg 309 21b
15814 recovery_done jid 5 msg 309 21b
15814 recovery_done jid 6 msg 309 21b
15814 recovery_done jid 7 msg 309 21b
15814 recovery_done jid 8 msg 309 21b
15814 recovery_done jid 9 msg 309 21b
15825 pr_start delay done before omm 21b
15825 pr_start 12 done 0
15814 recovery_done jid 10 msg 309 101b
15814 recovery_done jid 11 msg 309 101b
15814 recovery_done jid 12 msg 309 101b
15814 recovery_done jid 13 msg 309 101b
15814 recovery_done jid 14 msg 309 101b
15814 recovery_done jid 15 msg 309 101b
15814 others_may_mount 101b
15814 others_may_mount start_done 12 181b
15825 pr_finish flags 181b
15882 pr_start last_stop 0 last_start 14 last_finish 0
15882 pr_start count 1 type 2 event 14 flags 250
15882 claim_jid 0
15882 pr_start 14 done 1
15882 pr_finish flags 5b
15868 recovery_done jid 0 msg 309 b
15868 recovery_done nodeid 1 flg 18
15881 pr_start last_stop 14 last_start 16 last_finish 14
15881 pr_start count 2 type 2 event 16 flags 21b
15868 recovery_done jid 1 msg 309 21b
15868 recovery_done jid 2 msg 309 21b
15868 recovery_done jid 3 msg 309 21b
15868 recovery_done jid 4 msg 309 21b
15868 recovery_done jid 5 msg 309 21b
15868 recovery_done jid 6 msg 309 21b
15868 recovery_done jid 7 msg 309 21b
15868 recovery_done jid 8 msg 309 21b
15868 recovery_done jid 9 msg 309 21b
15868 recovery_done jid 10 msg 309 21b
15868 recovery_done jid 11 msg 309 21b
15881 pr_start delay done before omm 21b
15881 pr_start 16 done 0
15868 recovery_done jid 12 msg 309 101b
15868 recovery_done jid 13 msg 309 101b
15868 recovery_done jid 14 msg 309 101b
15868 recovery_done jid 15 msg 309 101b
15868 others_may_mount 101b
15868 others_may_mount start_done 16 181b
15882 pr_finish flags 181b
lock_dlm: Assertion failed on line 428 of file fs/gfs_locking/lock_dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 4299504932
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:428
invalid operand: 0000 [1] SMP
Entering kdb (current=0x000001007b8f4030, pid 15571) on processor 1 Oops: <NULL>
due to oops @ 0xffffffffa023ae7c
r15 = 0x0000000000000003   r14 = 0x0000000000000001
r13 = 0x0000000000000000   r12 = 0x000001007bf29e00
rbp = 0x00000100d608a480   rbx = 0x00000000ffffffea
r11 = 0x0000000d80391c00   r10 = 0x0000000000000000
r9  = 0x00000000ffffffea   r8  = 0x000000000000000d
rax = 0x0000000000000001   rcx = 0x0000000000000246
rdx = 0x00000000000a8f91   rsi = 0x0000000000000246
rdi = 0xffffffff80375de0   orig_rax = 0xffffffffffffffff
rip = 0xffffffffa023ae7c   cs = 0x0000000000000010
eflags = 0x0000000000010212   rsp = 0x000001007c28fcc0
ss = 0x000001007c28e000   &regs = 0x000001007c28fc28
[1]kdb> bt
Stack traceback for pid 15571
0x000001007b8f4030  15571  6  1  1  R  0x000001007b8f4430  *lock_dlm1
  RSP            RIP                 Function (args)
  0x1007c28fcc0  0xffffffffa023ae7c  [lock_dlm]do_dlm_lock+0x16d (0x0, 0xffffff0000d32000, 0x100d6333080, 0x100d63330ac, 0xffffff0000d32000)
  0x1007c28fd38  0xffffffffa023afbf  [lock_dlm]lm_dlm_lock+0xd6 (0xffffffffffffffff, 0x100db979898, 0x100db979830, 0x10037e973c0, 0x212)
  0x1007c28ff28  0xffffffff8014796b  kthread+0xc8
  0x1007c28ff58  0xffffffff8010fbaf  child_rip+0x8
[1]kdb>
Created attachment 124839 [details] Node1 /var/log/messages
Created attachment 124840 [details] Node2 - /var/log/messages
It does not appear that the ql2xprocessrscn=1 option was set on the qla2xxx driver. Can you confirm this? I've only seen the 0x20000 SCSI return code in response to an RSCN event on the fabric. You can trigger one of these events (usually) by disabling a port or pulling a cable. There may be other ways to trigger an RSCN event. Could there have been anything else happening on the fabric that would have caused the switch to trigger an RSCN event? If you are sure that you had the ql2xprocessrscn=1 option set and you still saw the 0x20000 error (which is DID_BUS_BUSY), then this points to something other than the RSCN event. Also, I am having trouble syncing up what is happening in the two log files posted above. It appears the info listed as being displayed on the console is absent from the log files.
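A quick way to double-check whether the option actually took effect on the running system (assuming the parameter is exported through sysfs on this kernel; if not, the modprobe.conf check still applies):

cat /sys/module/qla2xxx/parameters/ql2xprocessrscn   # should print 1 if it was set
grep qla2xxx /etc/modprobe.conf                      # what will be used at the next module load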
What does the snapshot/run_snap status service do? It is interesting that even after md failed over sdaz to another path, I/O continued to be generated on that device; there were more 0x20000 errors after that point:

Feb 16 10:44:00 sqaone01 clurgmgrd: [16140]: <info> Executing /mnt/crosswalk/snapshot/run_snap status
Feb 16 10:44:28 sqaone01 kernel: SCSI error : <3 0 1 0> return code = 0x20000
Feb 16 10:44:28 sqaone01 kernel: end_request: I/O error, dev sdaz, sector 1648936
Feb 16 10:44:28 sqaone01 kernel: Operation continuing on 15 IO paths.
Feb 16 10:44:28 sqaone01 kernel: end_request: I/O error, dev sdaz, sector 1648944
Feb 16 10:44:31 sqaone01 kernel: SCSI error : <3 0 1 0> return code = 0x20000
Feb 16 10:44:31 sqaone01 kernel: end_request: I/O error, dev sdaz, sector 1649264
Feb 16 10:44:31 sqaone01 kernel: end_request: I/O error, dev sdaz, sector 1649272
Feb 16 10:44:34 sqaone01 clurgmgrd: [16140]: <info> Executing /mnt/crosswalk/snapshot/run_snap status
Feb 16 10:45:01 sqaone01 kernel: SCSI error : <3 0 1 0> return code = 0x20000
Feb 16 10:45:01 sqaone01 kernel: end_request: I/O error, dev sdaz, sector 1720704
Feb 16 10:45:01 sqaone01 kernel: end_request: I/O error, dev sdaz, sector 1720712

It was at this point that sqaone2 detected lost heartbeats to sqaone1 and proceeded to fence it, or at least tried to.
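It would also be useful to see what md thought the path state was at that point. Next time this happens, please capture (the md device name below is just a placeholder):

cat /proc/mdstat
mdadm --detail /dev/md0       # substitute the actual multipath md device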
Are you running mdmpd? (/etc/init.d/mdmpd start) If not, I would highly recommend it. It seems to be the only reliable way I have found to prevent the 0x20000 SCSI errors from taking down I/O paths when using md multipathing and the qla2xxx driver.
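For reference, enabling it both now and at boot (standard init script shipped with mdadm):

service mdmpd start
chkconfig mdmpd on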
Closing this bug.
1. md multipathing is no longer supported for GFS; dm multipathing is preferred.
2. Cannot recreate the problem of gfs_glockd consuming 98% CPU.