Description of problem:
I've been having this problem lately after GFS has been mounted for a while
(the shortest observed time is 1 hour). After traffic is stopped and a umount
is issued, the command either hangs in the D state (along with cman_serviced),
or spins at 99% CPU:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
15101 root      25   0  1424  356 1268 R 99.8  0.1  12:07.38 umount

Or this:

 8098 ?        S      0:00 ccsd
 8110 ?        SW<    0:01 [cman_comms]
 8112 ?        DW<    0:00 [cman_serviced]
 8111 ?        SW<    0:00 [cman_memb]
 8113 ?        SW<    0:00 [cman_hbeat]
 8142 ?        S      0:00 fenced
 8442 ?        S      0:00 clvmd
 8443 ?        SW<    0:01 [dlm_astd]
 8444 ?        SW<    0:02 [dlm_recvd]
 8445 ?        SW<    0:00 [dlm_sendd]
 8760 ?        SW<    0:00 [lock_dlm1]
 8761 ?        SW<    0:01 [lock_dlm2]
10827 pts/0    D      0:00 umount /mnt/gfs0

/proc/cluster/services for when you get the D state processes above looks like:

[root@link-11 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2]

DLM Lock Space:  "clvmd"                             3   4 run       -
[1 2]

DLM Lock Space:  "gfs0"                              4   7 run       -
[1 2]

GFS Mount Group: "gfs0"                              5   8 run       S-11,208,0
[1 2]

Version-Release number of selected component (if applicable):
DEVEL.1102693630 (built Dec 10 2004 09:48:39)

How reproducible:
It has happened numerous times to me this week. If you just mount/umount
right away it doesn't happen. I usually see the problem after leaving the fs
up with traffic over lunchtime or overnight; after that, I can't umount.
I've also been monkeying with NFS over GFS lately, so this may have something
to do with it.

Steps to Reproduce (a rough shell sketch follows at the end of this report):
1. Mount a GFS
2. Run traffic (I've seen it happen without as well)
3. Go to lunch
4. Attempt to umount

Actual results:
umount hangs or spins at 99% CPU

Expected results:
umount completes

Additional info:
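For what it's worth, here is a rough shell sketch of the sequence above. The
volume /dev/vg0/gfs0 and mount point /mnt/gfs0 are placeholders, not my real
names, and the traffic loop is just an example; any sustained I/O seems to do:

    # /dev/vg0/gfs0 and /mnt/gfs0 stand in for the real volume and mount point
    mount -t gfs /dev/vg0/gfs0 /mnt/gfs0

    # generate sustained traffic for a few hours (any I/O pattern seems to work)
    while true; do
        dd if=/dev/zero of=/mnt/gfs0/testfile bs=1M count=100
        cat /mnt/gfs0/testfile > /dev/null
    done &

    # ...several hours later: stop the traffic, then try to unmount
    kill %1
    umount /mnt/gfs0    # this is where it hangs in D state or spins at 99% CPU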
I haven't been able to reproduce anything like this. Could you please get a cluster hung this way and let me know where to access it, or give me access to the nodes it happens on so I can reproduce it myself? (It would be great to have kdb in the kernels, too.)
I've seen this twice now while testing other things. Both times the unmount
was stuck spinning in gfs_gl_hash_clear/examine_bucket. When this happens
again:

1. Get a dlm lock dump:

   echo "lockspace name" >> /proc/cluster/dlm_locks
   cat /proc/cluster/dlm_locks > dlm-locks

2. Wait for 10 minutes (set stall_secs for less) and gfs should dump the
   locks it's waiting for to the console. (A gfs_tool lockdump may work here
   instead; see the sketch below.)
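To make the two steps above concrete, this is the sort of thing I'd run,
assuming the stuck lockspace is the "gfs0" one from the original report and
the fs is mounted at /mnt/gfs0, and assuming stall_secs can be lowered with
gfs_tool settune (adjust names to match the actual setup):

    # assumes the hung lockspace is "gfs0" and the mount point is /mnt/gfs0
    echo "gfs0" >> /proc/cluster/dlm_locks          # tell the dlm which lockspace to dump
    cat /proc/cluster/dlm_locks > /tmp/dlm-locks    # save the dlm's view of the locks

    # optionally shorten the unmount stall timeout so gfs dumps the locks it is
    # waiting for to the console sooner (assuming stall_secs is a settune tunable)
    gfs_tool settune /mnt/gfs0 stall_secs 60

    # gfs-side lock dump, which may work even while umount is stuck
    gfs_tool lockdump /mnt/gfs0 > /tmp/gfs-locks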
Haven't seen this in quite some time. It may have been a secondary effect of another bug.
This problem still exists. NFS seems to be a way to produce it. https://www.redhat.com/archives/linux-cluster/2005-February/msg00231.html
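For anyone trying to reproduce via the NFS route, the kind of export I've
been assuming is something like the following; the path, fsid value, and
wide-open client spec are placeholders, not the exact config from that thread:

    # /etc/exports on the GFS node; /mnt/gfs0 and fsid=1 are placeholders
    /mnt/gfs0   *(rw,sync,fsid=1)

    # re-export, then drive NFS client traffic at the mount before trying umount
    exportfs -ra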
Removing it from the blocker list until we get a reproducible test case running on the latest stuff in cvs.
The problem for sunjw onewaveinc com disappeared when he updated to the 2.6.11 kernel.
I am also having this problem, running the 2.6.9-34 kernel. Is there a
specific patch I can apply to this kernel? This was happening over and over
again before I rebooted the node, along with high load, even though all but
one of the VIPs had been relocated to other nodes.

Sep 6 19:10:05 flsrv02 kernel: GFS: fsid=flsrv:CRSim_PSmith.1: Unmount seems to be stalled. Dumping lock state...
Sep 6 19:10:05 flsrv02 kernel:   Glock (2, 995)
Sep 6 19:10:05 flsrv02 kernel:     gl_flags =
Sep 6 19:10:05 flsrv02 kernel:     gl_count = 2
Sep 6 19:10:05 flsrv02 kernel:     gl_state = 0
Sep 6 19:10:05 flsrv02 kernel:     req_gh = no
Sep 6 19:10:05 flsrv02 kernel:     req_bh = no
Sep 6 19:10:05 flsrv02 kernel:     lvb_count = 0
Sep 6 19:10:05 flsrv02 kernel:     object = yes
Sep 6 19:10:05 flsrv02 kernel:     new_le = no
Sep 6 19:10:05 flsrv02 kernel:     incore_le = no
Sep 6 19:10:05 flsrv02 kernel:     reclaim = no
Sep 6 19:10:05 flsrv02 kernel:     aspace = 0
Sep 6 19:10:05 flsrv02 kernel:     ail_bufs = no
Sep 6 19:10:05 flsrv02 kernel:     Inode:
Sep 6 19:10:05 flsrv02 kernel:       num = 995/995
Sep 6 19:10:05 flsrv02 kernel:       type = 2
Sep 6 19:10:05 flsrv02 kernel:       i_count = 1
Sep 6 19:10:05 flsrv02 kernel:       i_flags =
Sep 6 19:10:05 flsrv02 kernel:       vnode = yes
Sep 6 19:10:05 flsrv02 kernel:   Glock (5, 995)
Sep 6 19:10:05 flsrv02 kernel:     gl_flags =
Sep 6 19:10:05 flsrv02 kernel:     gl_count = 2
Sep 6 19:10:05 flsrv02 kernel:     gl_state = 3
Sep 6 19:10:05 flsrv02 kernel:     req_gh = no
Sep 6 19:10:05 flsrv02 kernel:     req_bh = no
Sep 6 19:10:05 flsrv02 kernel:     lvb_count = 0
Sep 6 19:10:05 flsrv02 kernel:     object = yes
Sep 6 19:10:05 flsrv02 kernel:     new_le = no
Sep 6 19:10:05 flsrv02 kernel:     incore_le = no
Sep 6 19:10:05 flsrv02 kernel:     reclaim = no
Sep 6 19:10:05 flsrv02 kernel:     aspace = no
Sep 6 19:10:05 flsrv02 kernel:     ail_bufs = no
Sep 6 19:10:05 flsrv02 kernel:     Holder
Sep 6 19:10:05 flsrv02 kernel:       owner = -1
Sep 6 19:10:05 flsrv02 kernel:       gh_state = 3
Sep 6 19:10:05 flsrv02 kernel:       gh_flags = 5 7
Sep 6 19:10:06 flsrv02 kernel:       error = 0
Sep 6 19:10:06 flsrv02 kernel:       gh_iflags = 1 6 7