Bug 177702 - Umount hung; gfs_glockd using up 98% of cpu
Summary: Umount hung; gfs_glockd using up 98% of cpu
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gfs
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Ryan O'Hara
QA Contact: GFS Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2006-01-12 23:35 UTC by Henry Harris
Modified: 2010-01-12 03:09 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-08-15 17:28:48 UTC
Embargoed:


Attachments
tar of /var/log/messages file from failed node (72.88 KB, application/octet-stream) - 2006-01-13 16:39 UTC, Henry Harris
Node1 /var/log/messages (489.23 KB, text/plain) - 2006-02-18 00:14 UTC, Henry Harris
Node2 - /var/log/messages (261.66 KB, text/plain) - 2006-02-18 00:15 UTC, Henry Harris

Description Henry Harris 2006-01-12 23:35:38 UTC
Description of problem: Running a two node cluster with five GFS file systems 
mounted.  Got an I/O error on one of the nodes when trying to access one of the 
file systems.  The other node was able to access the same file system without 
error.  Ran ls -l and saw that the mount point of the failed file system on 
the failed node did not report any file permissions, file size, date, etc.  Ran 
ls -i and the mount point showed inode 0.  Tried to umount the file system on 
the failed node and the umount hung.  Ran top and saw that gfs_glockd was using 
98% of the CPU.  Ran strace on the umount pid but saw nothing.  Tried to kill 
the umount process but it did not return.  Noticed that ls -l now showed valid 
data for the mount point and could now access the file system, but gfs_glockd 
was still using up almost all of the CPU.
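
For reference, the diagnostic sequence described above corresponds roughly to
the following commands (a sketch only; the mount point /mnt/gfs01 and the PID
placeholders are illustrative, not taken from the report):

ls -l /mnt/gfs01          # mount point showed no permissions, size, or date
ls -i /mnt/gfs01          # mount point reported inode 0
umount /mnt/gfs01         # hung
top                       # gfs_glockd at ~98% CPU
strace -p <umount_pid>    # produced no output
kill <umount_pid>         # umount still did not return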


Version-Release number of selected component (if applicable):
dlm-1.0.0-5

How reproducible:
Has only happened once

Steps to Reproduce:
1.
2.
3.
  
Actual results:
File system cannot be umounted

Expected results:
File system can be umounted

Additional info:

Comment 1 Kiersten (Kerri) Anderson 2006-01-13 14:40:26 UTC
Can you post the message logs from the machine that got the I/O error?

Comment 2 Henry Harris 2006-01-13 16:39:41 UTC
Created attachment 123173 [details]
tar of /var/log/messages file from failed node

The messages file shows everything from the previous reboot on Jan. 9 until the
node was rebooted on Jan. 12 at 16:53:19 following the umount hang.

Comment 3 Ryan O'Hara 2006-01-17 19:54:16 UTC
Once the I/O error occurred, the node withdrew from the cluster. This is normal
behavior and explains why the filesystem was inaccessible from the node.

For some reason GFS gets stuck in gfs_releasepage() because the buffer b_count
is a non-zero value. From looking at the log, it appears blkno 24 had a
b_count of 1. This appears to be the same block that initiated the I/O error.
Anyway, with b_count being non-zero, we will spin in gfs_releasepage(). That
probably explains why the filesystem cannot be unmounted and why gfs_glockd
eats up CPU cycles.



Comment 4 Ryan O'Hara 2006-01-17 21:45:12 UTC
Is this node/cluster running on a custom-built kernel?

Comment 5 Henry Harris 2006-01-17 21:56:13 UTC
Yes, it is.  We have added some ip statistics gathering and a few other minor 
changes to 2.6.9-22.

Also, can you please explain your previous comment that it is normal behavior 
for an I/O error to cause a node to withdraw from the cluster?  Do you mean it
should withdraw from the cluster and get fenced the same as if a missed 
heartbeat was detected?

Comment 6 Ryan O'Hara 2006-01-17 23:35:03 UTC
Prior to having the withdraw capability, when an I/O error occurred the kernel
would panic and thus be fenced. The other nodes could then proceed with recovery.

Now when an I/O error occurs, the node will withdraw from the cluster. This
stops I/O, waits for pending I/O to error out, and leaves the lockspace. Then,
usually, you can unmount the filesystem. A reboot of the node is recommended
regardless of whether the unmount succeeds. The advantage of withdraw is that
you can stop any services, applications, etc. manually before the reboot.
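
Assuming the services using the filesystem can be stopped cleanly, that manual
cleanup sequence would look roughly like this (a sketch; the service name and
mount point are placeholders):

service nfs stop          # stop whatever is using the filesystem (example service)
umount /mnt/gfs01         # usually succeeds once the withdraw has completed
reboot                    # recommended regardless of whether the umount succeeds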


Comment 7 Ryan O'Hara 2006-01-26 17:28:30 UTC
Are you able to recreate this problem?

I'm looking at md multipathing as a potential root of this problem.


Comment 8 Ryan O'Hara 2006-02-01 19:09:22 UTC
I think there are a few issues present here. First, I am looking into why you
got the SCSI error. What was happening on the other node around the time that
this node got the SCSI error? Did you happen to see a SCSI error with return
code 0x10000 on that node?

The reason I ask is that I have a cluster with 3 nodes, all using the qla2xxx
driver and connected to the same fibre channel switch. If I kill one node's I/O
path (I do this by either pulling the FC cable or disabling the port on the
switch), the node that lost its I/O path gets an error and I see a SCSI error
with return code 0x10000. Then, some time later, the other nodes on the SAN,
whose I/O paths are still fully intact, get a SCSI error with return code
0x20000. This is the same as what you are seeing and should not happen. In
short, if you lose an I/O path on one node, the other nodes on the switch
should be unaffected. So the SCSI error with return code 0x20000 seems to be
the result of some other event on the SAN. Can you verify that this is what
happened in your case?

Note that I removed multipathing from my configuration only to help me debug the
problems individually. I am/was trying to eliminate md multipathing as the
source of the problem and then discovered this problem with the SCSI errors.

I tend to think that the metadata corruption problem is a result of the SCSI
error problem.

I have not yet been able to recreate the gfs_glockd problem.



Comment 9 Ryan O'Hara 2006-02-08 16:47:45 UTC
Please attempt to capture debug info from the QLogic driver. To do this, remove
the qla2xxx module (modprobe -r qla2xxx) and reload it with the following parameter:

modprobe qla2xxx extended_error_logging=1
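
If nothing on the FC path is in use, the reload sequence is roughly the
following; the /etc/modprobe.conf entry is a suggestion for making the option
persistent and is not something verified in this bug:

modprobe -r qla2xxx                        # requires that the GFS filesystems on this path are unmounted
modprobe qla2xxx extended_error_logging=1

# To make the option persistent across reboots (an assumption, not verified here),
# add to /etc/modprobe.conf:
#   options qla2xxx extended_error_logging=1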



Comment 18 Ryan O'Hara 2006-02-16 19:51:57 UTC
Need the following information:

1) Are there multiple HBAs in each node?
2) Is the storage single processor?
3) If the storage is multi-processor, is it active/active or active/passive?

Thanks.



Comment 19 Ryan O'Hara 2006-02-17 21:55:02 UTC
A potential workaround exists, but it doesn't seem to work all the time. I am
investigating why. You can try this:

modprobe qla2xxx ql2xprocessrscn

Please update the bugzilla stating whether or not this works, as well as the
model # of the QLogic HBAs you are using.



Comment 20 Ryan O'Hara 2006-02-17 23:02:33 UTC
Sorry, the option for the qla2xxx driver (above) is incorrect. The option needs
a value. It should be:

modprobe qla2xxx ql2xprocessrscn=1

Without the "=1" you will see...

"qla2xxx: `' invalid for parameter `ql2xprocessrscn'"


Comment 21 Henry Harris 2006-02-18 00:10:19 UTC
Answers to Q's in comment #18:
1) yes - multiple HBAs
2 and 3) - dual controller - active/passive

Additional information - here's a test case that does not involve hardware 
failure - i.e., no one pulled any plugs or failed any ports - just plain 
traffic running on the system...

Test Description and console output below.  /var/log/messages from each node 
will be attached next (two nodes).

Test Description: 

** Two GFS file systems (volumes: TestExport_01 & TestExport_02) NFS mounted 
and exported to a client system

** Started random I/Os with block sizes of 256k & 512k on both NFS mount points 
using the dt utility

** Transferred up to 5 GB of data in and out of both NFS mount points

 

Test Results:

** One of the two clustered nodes panicked during cman emergency_shutdown. See 
the console output below for more details.

** The other node of the cluster did NOT panic, but reported a data mismatch 
in dt (maybe just a dt hiccup)

 

Console logs from the node that panicked while running the test above:

 

SCSI error : <3 0 1 0> return code = 0x20000

end_request: I/O error, dev sdaz, sector 1751304

WARNING: dlm_emergency_shutdown

SM: 00000001 sm_stop: SG still joined

SM: 01000002 sm_stop: SG still joined

SM: 02000004 sm_stop: SG still joined

SM: 03000009 sm_stop: SG still joined

 13,13,13

TestVol_02 move flags 0,1,0 ids 13,15,13

TestVol_02 move use event 15

TestVol_02 recover event 15

TestVol_02 add node 2

TestVol_02 total nodes 2

TestVol_02 rebuild resource directory

TestVol_02 rebuilt 7 resources

TestVol_02 purge requests

TestVol_02 purged 0 requests

TestVol_02 mark waiting requests

TestVol_02 marked 0 requests

TestVol_02 recover event 15 done

TestVol_02 move flags 0,0,1 ids 13,15,15

TestVol_02 process held requests

TestVol_02 processed 0 requests

TestVol_02 resend marked requests

TestVol_02 resent 0 requests

TestVol_02 recover event 15 finished

Magma move flags 0,1,0 ids 0,18,0

Magma move use event 18

Magma recover event 18 (first)

Magma add nodes

Magma total nodes 2

Magma rebuild resource directory

Magma rebuilt 1 resources

Magma recover event 18 done

Magma move flags 0,0,1 ids 0,18,18

Magma process held requests

Magma processed 0 requests

Magma recover event 18 finished

Magma (16140) req reply einval 701c2 fr 2 r 2 usrm::vf

Magma (16140) req reply einval 6037f fr 2 r 2 usrm::vf

15572 pr_start last_stop 0 last_start 6 last_finish 0

15572 pr_start count 1 type 2 event 6 flags 250

15572 claim_jid 0

15572 pr_start 6 done 1

15572 pr_finish flags 5b

15572 pr_start last_stop 6 last_start 8 last_finish 6

15572 pr_start count 2 type 2 event 8 flags 21b

15559 recovery_done jid 0 msg 309 21b

15559 recovery_done nodeid 2 flg 0

15559 recovery_done jid 1 msg 309 21b

15559 recovery_done jid 2 msg 309 21b

15559 recovery_done jid 3 msg 309 21b

15559 recovery_done jid 4 msg 309 21b

15559 recovery_done jid 5 msg 309 21b

15559 recovery_done jid 6 msg 309 21b

15559 recovery_done jid 7 msg 309 21b

15559 recovery_done jid 8 msg 309 21b

15559 recovery_done jid 9 msg 309 21b

15572 pr_start delay done before omm 21b

15572 pr_start 8 done 0

15559 recovery_done jid 10 msg 309 101b

15559 recovery_done jid 11 msg 309 101b

15559 recovery_done jid 12 msg 309 101b

15559 recovery_done jid 13 msg 309 101b

15559 recovery_done jid 14 msg 309 101b

15559 recovery_done jid 15 msg 309 101b

15559 others_may_mount 101b

15559 others_may_mount start_done 8 181b

15572 pr_finish flags 181b

15826 pr_start last_stop 0 last_start 10 last_finish 0

15826 pr_start count 1 type 2 event 10 flags 250

15826 claim_jid 0

15826 pr_start 10 done 1

15826 pr_finish flags 5b

15814 recovery_done jid 0 msg 309 b

15814 recovery_done nodeid 1 flg 18

15814 recovery_done jid 1 msg 309 b

15825 pr_start last_stop 10 last_start 12 last_finish 10

15825 pr_start count 2 type 2 event 12 flags 21b

15814 recovery_done jid 2 msg 309 21b

15814 recovery_done jid 3 msg 309 21b

15814 recovery_done jid 4 msg 309 21b

15814 recovery_done jid 5 msg 309 21b

15814 recovery_done jid 6 msg 309 21b

15814 recovery_done jid 7 msg 309 21b

15814 recovery_done jid 8 msg 309 21b

15814 recovery_done jid 9 msg 309 21b

15825 pr_start delay done before omm 21b

15825 pr_start 12 done 0

15814 recovery_done jid 10 msg 309 101b

15814 recovery_done jid 11 msg 309 101b

15814 recovery_done jid 12 msg 309 101b

15814 recovery_done jid 13 msg 309 101b

15814 recovery_done jid 14 msg 309 101b

15814 recovery_done jid 15 msg 309 101b

15814 others_may_mount 101b

15814 others_may_mount start_done 12 181b

15825 pr_finish flags 181b

15882 pr_start last_stop 0 last_start 14 last_finish 0

15882 pr_start count 1 type 2 event 14 flags 250

15882 claim_jid 0

15882 pr_start 14 done 1

15882 pr_finish flags 5b

15868 recovery_done jid 0 msg 309 b

15868 recovery_done nodeid 1 flg 18

15881 pr_start last_stop 14 last_start 16 last_finish 14

15881 pr_start count 2 type 2 event 16 flags 21b

15868 recovery_done jid 1 msg 309 21b

15868 recovery_done jid 2 msg 309 21b

15868 recovery_done jid 3 msg 309 21b

15868 recovery_done jid 4 msg 309 21b

15868 recovery_done jid 5 msg 309 21b

15868 recovery_done jid 6 msg 309 21b

15868 recovery_done jid 7 msg 309 21b

15868 recovery_done jid 8 msg 309 21b

15868 recovery_done jid 9 msg 309 21b

15868 recovery_done jid 10 msg 309 21b

15868 recovery_done jid 11 msg 309 21b

15881 pr_start delay done before omm 21b

15881 pr_start 16 done 0

15868 recovery_done jid 12 msg 309 101b

15868 recovery_done jid 13 msg 309 101b

15868 recovery_done jid 14 msg 309 101b

15868 recovery_done jid 15 msg 309 101b

15868 others_may_mount 101b

15868 others_may_mount start_done 16 181b

15882 pr_finish flags 181b

 

lock_dlm:  Assertion failed on line 428 of file fs/gfs_locking/lock_dlm/lock.c

lock_dlm:  assertion:  "!error"

lock_dlm:  time = 4299504932

 

----------- [cut here ] --------- [please bite here ] ---------

Kernel BUG at lock:428

invalid operand: 0000 [1] SMP

 

Entering kdb (current=0x000001007b8f4030, pid 15571) on processor 1 Oops: 
<NULL>

due to oops @ 0xffffffffa023ae7c

     r15 = 0x0000000000000003      r14 = 0x0000000000000001

     r13 = 0x0000000000000000      r12 = 0x000001007bf29e00

     rbp = 0x00000100d608a480      rbx = 0x00000000ffffffea

     r11 = 0x0000000d80391c00      r10 = 0x0000000000000000

      r9 = 0x00000000ffffffea       r8 = 0x000000000000000d

     rax = 0x0000000000000001      rcx = 0x0000000000000246

     rdx = 0x00000000000a8f91      rsi = 0x0000000000000246

     rdi = 0xffffffff80375de0 orig_rax = 0xffffffffffffffff

     rip = 0xffffffffa023ae7c       cs = 0x0000000000000010

  eflags = 0x0000000000010212      rsp = 0x000001007c28fcc0

      ss = 0x000001007c28e000 &regs = 0x000001007c28fc28

[1]kdb> bt

Stack traceback for pid 15571

0x000001007b8f4030    15571        6  1    1   R  0x000001007b8f4430 *lock_dlm1

RSP           RIP                Function (args)

0x1007c28fcc0 0xffffffffa023ae7c [lock_dlm]do_dlm_lock+0x16d (0x0, 0xffffff0000d32000, 0x100d6333080, 0x100d63330ac, 0xffffff0000d32000)

0x1007c28fd38 0xffffffffa023afbf [lock_dlm]lm_dlm_lock+0xd6 (0xffffffffffffffff, 0x100db979898, 0x100db979830, 0x10037e973c0, 0x212)

0x1007c28ff28 0xffffffff8014796b kthread+0xc8

0x1007c28ff58 0xffffffff8010fbaf child_rip+0x8

[1]kdb>

 


Comment 22 Henry Harris 2006-02-18 00:14:11 UTC
Created attachment 124839 [details]
Node1 /var/log/messages

Comment 23 Henry Harris 2006-02-18 00:15:03 UTC
Created attachment 124840 [details]
Node2 - /var/log/messages

Comment 24 Ryan O'Hara 2006-02-20 17:25:33 UTC
It does not appear that the ql2xprocessrscn=1 option was set on the qla2xxx
driver. Can you confirm this?

I've only seen the 0x20000 SCSI return code in response to an RSCN event on the
fabric. You can trigger one of these events (usually) by disabling a port or
pulling a cable. There may be other ways to trigger an RSCN event. Could there
have been anything else happening on the fabric that would have caused the
switch to trigger an RSCN event?

If you are sure that you had the ql2xprocessrscn=1 option set and you still saw
the 0x20000 error (which is DID_BUS_BUSY) then this points to something other
than an RSCN event.

Also, I am having trouble syncing up what is happening in the two log files
posted above. It appears the info listed as being displayed on the console is
absent from the log files.
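
Two quick ways to check whether the option actually took effect (a sketch; the
sysfs check only applies if this kernel exposes the module parameter there):

grep qla2xxx /etc/modprobe.conf                      # should show: options qla2xxx ql2xprocessrscn=1
cat /sys/module/qla2xxx/parameters/ql2xprocessrscn   # should print 1 if the parameter is exposed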

Comment 25 Kiersten (Kerri) Anderson 2006-02-20 17:53:26 UTC
What does the snapshot/run_snap status service do?  It is interesting that even
after md failed over sdaz to another path, I/O continued to be generated on
that device; we got 2 more 0x20000 errors after that point:

Feb 16 10:44:00 sqaone01 clurgmgrd: [16140]: <info> Executing
/mnt/crosswalk/snapshot/run_snap status 
Feb 16 10:44:28 sqaone01 kernel: SCSI error : <3 0 1 0> return code = 0x20000
Feb 16 10:44:28 sqaone01 kernel: end_request: I/O error, dev sdaz, sector 1648936
Feb 16 10:44:28 sqaone01 kernel: 	Operation continuing on 15 IO paths.
Feb 16 10:44:28 sqaone01 kernel: end_request: I/O error, dev sdaz, sector 1648944
Feb 16 10:44:31 sqaone01 kernel: SCSI error : <3 0 1 0> return code = 0x20000
Feb 16 10:44:31 sqaone01 kernel: end_request: I/O error, dev sdaz, sector 1649264
Feb 16 10:44:31 sqaone01 kernel: end_request: I/O error, dev sdaz, sector 1649272
Feb 16 10:44:34 sqaone01 clurgmgrd: [16140]: <info> Executing
/mnt/crosswalk/snapshot/run_snap status 
Feb 16 10:45:01 sqaone01 kernel: SCSI error : <3 0 1 0> return code = 0x20000
Feb 16 10:45:01 sqaone01 kernel: end_request: I/O error, dev sdaz, sector 1720704
Feb 16 10:45:01 sqaone01 kernel: end_request: I/O error, dev sdaz, sector 1720712

It was at this point that sqaone2 detected lost heartbeats to sqaone1 and
proceeded to fence him, or at least tried to.  


Comment 26 Ryan O'Hara 2006-02-20 20:42:11 UTC
Are you running mdmpd? (/etc/init.d/mdmpd start)

If not, I would highly recommend it. It seems to be the only reliable way I can
prevent 0x20000 SCSI errors from taking down I/O paths when using md
multipathing and the qla2xxx driver.
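
To start it now and have it come back after a reboot, the usual RHEL 4 service
handling applies (a sketch, assuming the mdmpd init script is installed):

service mdmpd start        # same as /etc/init.d/mdmpd start
chkconfig mdmpd on         # enable at boot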



Comment 27 Ryan O'Hara 2006-08-15 17:28:48 UTC
Closing this bug.

1. md multipathing is no longer supported for GFS; dm multipathing is preferred
(see the sketch below).
2. Cannot recreate the problem of gfs_glockd consuming 98% CPU.
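
For anyone migrating off md multipathing, a rough sketch of enabling dm
multipath on RHEL 4 (package installation method and device layout will vary
per site):

up2date device-mapper-multipath    # or install the package by other means
service multipathd start
chkconfig multipathd on
multipath -ll                      # verify the multipathed devices and paths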



