Description of problem:
I've been having this problem lately after GFS has been mounted for a while
(the shortest observed time is 1 hour). After traffic is stopped and a umount
is issued, the command either hangs in the D state (along with cman_serviced),
or spins at 99% CPU:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
15101 root      25   0  1424  356 1268 R 99.8  0.1  12:07.38 umount

Or this:

 8098 ?        S      0:00 ccsd
 8110 ?        SW<    0:01 [cman_comms]
 8112 ?        DW<    0:00 [cman_serviced]
 8111 ?        SW<    0:00 [cman_memb]
 8113 ?        SW<    0:00 [cman_hbeat]
 8142 ?        S      0:00 fenced
 8442 ?        S      0:00 clvmd
 8443 ?        SW<    0:01 [dlm_astd]
 8444 ?        SW<    0:02 [dlm_recvd]
 8445 ?        SW<    0:00 [dlm_sendd]
 8760 ?        SW<    0:00 [lock_dlm1]
 8761 ?        SW<    0:01 [lock_dlm2]
10827 pts/0    D      0:00 umount /mnt/gfs0

/proc/cluster/services for when you get the D state processes above looks like:

[root@link-11 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2]

DLM Lock Space:  "clvmd"                             3   4 run       -
[1 2]

DLM Lock Space:  "gfs0"                              4   7 run       -
[1 2]

GFS Mount Group: "gfs0"                              5   8 run       S-11,208,0
[1 2]

Version-Release number of selected component (if applicable):
DEVEL.1102693630 (built Dec 10 2004 09:48:39)

How reproducible:
It has happened numerous times to me this week. If you just mount/umount
right away it doesn't happen. I usually see the problem after leaving the fs
up with traffic over lunchtime or overnight; after that, I can't umount.
I've also been monkeying with NFS over GFS lately, so this may have something
to do with it.

Steps to Reproduce (a rough shell sketch follows at the end of this report):
1. Mount a GFS
2. Run traffic (I've seen it happen without as well)
3. Go to lunch
4. Attempt to umount

Actual results:
umount hangs or spins at 99% CPU

Expected results:
umount completes

Additional info:
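For what it's worth, here is a rough shell sketch of the sequence above. The
volume /dev/vg0/gfs0 and mount point /mnt/gfs0 are placeholders, not my real
names, and the traffic loop is just an example; any sustained I/O seems to do:

    # /dev/vg0/gfs0 and /mnt/gfs0 stand in for the real volume and mount point
    mount -t gfs /dev/vg0/gfs0 /mnt/gfs0

    # generate sustained traffic for a few hours (any I/O pattern seems to work)
    while true; do
        dd if=/dev/zero of=/mnt/gfs0/testfile bs=1M count=100
        cat /mnt/gfs0/testfile > /dev/null
    done &

    # ...several hours later: stop the traffic, then try to unmount
    kill %1
    umount /mnt/gfs0    # this is where it hangs in D state or spins at 99% CPU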
I haven't been able to reproduce anything like this. Could you please get a cluster hung this way and let me know where to access it, or give me access to the nodes it happens on so I can reproduce it myself? (It would be great to have kdb in the kernels, too.)
I've seen this twice now while testing other things. Both times the unmount
was stuck spinning in gfs_gl_hash_clear/examine_bucket. When this happens
again:

1. Get a dlm lock dump:

   echo "lockspace name" >> /proc/cluster/dlm_locks
   cat /proc/cluster/dlm_locks > dlm-locks

2. Wait for 10 minutes (set stall_secs for less) and gfs should dump the
   locks it's waiting for to the console. (A gfs_tool lockdump may work here
   instead; see the sketch below.)
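To make the two steps above concrete, this is the sort of thing I'd run,
assuming the stuck lockspace is the "gfs0" one from the original report and
the fs is mounted at /mnt/gfs0, and assuming stall_secs can be lowered with
gfs_tool settune (adjust names to match the actual setup):

    # assumes the hung lockspace is "gfs0" and the mount point is /mnt/gfs0
    echo "gfs0" >> /proc/cluster/dlm_locks          # tell the dlm which lockspace to dump
    cat /proc/cluster/dlm_locks > /tmp/dlm-locks    # save the dlm's view of the locks

    # optionally shorten the unmount stall timeout so gfs dumps the locks it is
    # waiting for to the console sooner (assuming stall_secs is a settune tunable)
    gfs_tool settune /mnt/gfs0 stall_secs 60

    # gfs-side lock dump, which may work even while umount is stuck
    gfs_tool lockdump /mnt/gfs0 > /tmp/gfs-locks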
Haven't seen this in quite some time. It may have been a secondary effect of another bug.
This problem still exists. NFS seems to be a way to produce it. https://www.redhat.com/archives/linux-cluster/2005-February/msg00231.html
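For anyone trying to reproduce via the NFS route, the kind of export I've
been assuming is something like the following; the path, fsid value, and
wide-open client spec are placeholders, not the exact config from that thread:

    # /etc/exports on the GFS node; /mnt/gfs0 and fsid=1 are placeholders
    /mnt/gfs0   *(rw,sync,fsid=1)

    # re-export, then drive NFS client traffic at the mount before trying umount
    exportfs -ra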
Removing it from the blocker list until we get a reproducible test case running on the latest stuff in cvs.
The problem for sunjw onewaveinc com disappeared when he updated to the 2.6.11 kernel.
I am also having this problem, running the 2.6.9-34 kernel. Is there a
specific patch I can apply to this kernel? This was happening over and over
again before I rebooted the node, along with high load, even though all but
one of the VIPs had been relocated to other nodes.

Sep 6 19:10:05 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1: Unmount seems to be stalled. Dumping lock state...
Sep 6 19:10:05 flsrv02 kernel:   Glock (2, 995)
Sep 6 19:10:05 flsrv02 kernel:     gl_flags =
Sep 6 19:10:05 flsrv02 kernel:     gl_count = 2
Sep 6 19:10:05 flsrv02 kernel:     gl_state = 0
Sep 6 19:10:05 flsrv02 kernel:     req_gh = no
Sep 6 19:10:05 flsrv02 kernel:     req_bh = no
Sep 6 19:10:05 flsrv02 kernel:     lvb_count = 0
Sep 6 19:10:05 flsrv02 kernel:     object = yes
Sep 6 19:10:05 flsrv02 kernel:     new_le = no
Sep 6 19:10:05 flsrv02 kernel:     incore_le = no
Sep 6 19:10:05 flsrv02 kernel:     reclaim = no
Sep 6 19:10:05 flsrv02 kernel:     aspace = 0
Sep 6 19:10:05 flsrv02 kernel:     ail_bufs = no
Sep 6 19:10:05 flsrv02 kernel:     Inode:
Sep 6 19:10:05 flsrv02 kernel:       num = 995/995
Sep 6 19:10:05 flsrv02 kernel:       type = 2
Sep 6 19:10:05 flsrv02 kernel:       i_count = 1
Sep 6 19:10:05 flsrv02 kernel:       i_flags =
Sep 6 19:10:05 flsrv02 kernel:       vnode = yes
Sep 6 19:10:05 flsrv02 kernel:   Glock (5, 995)
Sep 6 19:10:05 flsrv02 kernel:     gl_flags =
Sep 6 19:10:05 flsrv02 kernel:     gl_count = 2
Sep 6 19:10:05 flsrv02 kernel:     gl_state = 3
Sep 6 19:10:05 flsrv02 kernel:     req_gh = no
Sep 6 19:10:05 flsrv02 kernel:     req_bh = no
Sep 6 19:10:05 flsrv02 kernel:     lvb_count = 0
Sep 6 19:10:05 flsrv02 kernel:     object = yes
Sep 6 19:10:05 flsrv02 kernel:     new_le = no
Sep 6 19:10:05 flsrv02 kernel:     incore_le = no
Sep 6 19:10:05 flsrv02 kernel:     reclaim = no
Sep 6 19:10:05 flsrv02 kernel:     aspace = no
Sep 6 19:10:05 flsrv02 kernel:     ail_bufs = no
Sep 6 19:10:05 flsrv02 kernel:     Holder
Sep 6 19:10:05 flsrv02 kernel:       owner = -1
Sep 6 19:10:05 flsrv02 kernel:       gh_state = 3
Sep 6 19:10:05 flsrv02 kernel:       gh_flags = 5 7
Sep 6 19:10:06 flsrv02 kernel:       error = 0
Sep 6 19:10:06 flsrv02 kernel:       gh_iflags = 1 6 7