Description of problem:
There is already a bug in Sistina bugs, I believe, for the issue of Gulm and POSIX locks; however, I reproduced this hang using only flocks, with the tests genesis and accordion running on each of the machines in the cluster:

genesis -L flock -n 250 -d 50 -p 2
accordion -L flock -p 2 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5

Eventually the filesystems become stuck:

[root@morph-03 tmp]# ps -efFwTl | grep genesis
4 S root 5044 5044 5043 0 81 0 - 365 wait 336 0 Jan28 ? 00:00:00 genesis -L flock -n 250 -d 50 -p 2
5 D root 5049 5049 5044 0 78 0 - 368 glock_ 448 0 Jan28 ? 00:06:05 genesis -L flock -n 250 -d 50 -p 2
5 D root 5050 5050 5044 0 78 0 - 368 glock_ 448 0 Jan28 ? 00:06:02 genesis -L flock -n 250 -d 50 -p 2
4 S root 5055 5055 5054 0 84 0 - 525 wait 336 1 Jan28 ? 00:00:00 genesis -L flock -n 250 -d 50 -p 2
5 D root 5058 5058 5055 0 78 0 - 528 glock_ 448 0 Jan28 ? 00:05:08 genesis -L flock -n 250 -d 50 -p 2
5 D root 5060 5060 5055 0 78 0 - 528 glock_ 448 0 Jan28 ? 00:05:09 genesis -L flock -n 250 -d 50 -p 2
4 S root 5075 5075 5074 0 85 0 - 726 wait 332 0 Jan28 ? 00:00:00 genesis -L flock -n 250 -d 50 -p 2
5 D root 5076 5076 5075 0 78 0 - 730 glock_ 448 0 Jan28 ? 00:04:56 genesis -L flock -n 250 -d 50 -p 2
5 D root 5078 5078 5075 0 78 0 - 730 glock_ 448 0 Jan28 ? 00:04:56 genesis -L flock -n 250 -d 50 -p 2
4 S root 5086 5086 5085 0 85 0 - 843 wait 336 0 Jan28 ? 00:00:00 genesis -L flock -n 250 -d 50 -p 2
5 D root 5088 5088 5086 0 78 0 - 846 glock_ 448 0 Jan28 ? 00:05:47 genesis -L flock -n 250 -d 50 -p 2
5 D root 5090 5090 5086 0 78 0 - 846 glock_ 448 1 Jan28 ? 00:05:45 genesis -L flock -n 250 -d 50 -p 2

[root@morph-04 root]# strace df -h
.
.
.
statfs64("/mnt/gfs0", 84,

I tried to get more info from /proc/pid, but that was stuck as well.

Version-Release number of selected component (if applicable):
Gulm <CVS> (built Jan 28 2005 16:39:38) installed

How reproducible:
Sometimes
Pretty sure this doesn't have anything to do with plocks or flocks. It seems to be entirely load-based.
Adding to release blocker list
All that's required to hit this is two clients, gulm server[s] in either slm or rlm mode, and load (no clvm). Load for me is fsstresses and a couple of doios. Lighten the load (stop all fsstress or doio) and the deadlock isn't hit.
Found it. We're flooding ltpx faster than it can handle, so it locks up. A `gulm_tool getstats <deadlocked-client>:ltpx` will time out. Now to fix.....
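For anyone unfamiliar with the failure mode: when writers keep pushing data into a local connection faster than the receiver drains it, the kernel socket buffers fill and a blocking write() never returns. A minimal standalone sketch of that shape (illustration only, not Gulm source):

```c
/* Minimal sketch (NOT Gulm source): two processes that only write and
 * never read.  Once both directions of the socketpair's kernel buffers
 * fill, both blocking write() calls sleep forever -- the same shape of
 * wedge as flooding ltpx faster than it can service requests. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>

int main(void)
{
    int sv[2];
    char buf[4096];

    memset(buf, 'x', sizeof(buf));
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        perror("socketpair");
        return 1;
    }

    if (fork() == 0) {              /* child: flood one direction */
        close(sv[0]);
        for (;;)
            if (write(sv[1], buf, sizeof(buf)) < 0)
                _exit(1);
    }

    close(sv[1]);                   /* parent: flood the other, never read */
    for (;;)
        if (write(sv[0], buf, sizeof(buf)) < 0)
            return 1;
}
```

Both write() calls end up sleeping in the kernel indefinitely, which is why even a `gulm_tool getstats` against the stuck ltpx times out.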
Added outgoing queues to the local connects. The bug seems to have disappeared.
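The general technique described here, sketched minimally (hypothetical names, assuming a poll()-driven loop; this is not the actual Gulm change): make the local connection nonblocking, and when the kernel won't accept a write immediately, park the data on a per-connection outgoing queue that is flushed once the socket becomes writable again.

```c
/* Minimal sketch of an outgoing queue on a nonblocking local connection
 * (hypothetical names; NOT the actual Gulm ltpx fix).  The fd is assumed
 * to have been set nonblocking beforehand, e.g.:
 *   fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);            */
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct out_buf {
    struct out_buf *next;
    size_t len, off;                /* total length, bytes already sent */
    char data[];
};

static struct out_buf *out_head, *out_tail;

/* Try to send immediately; queue whatever the kernel refuses. */
int queue_write(int fd, const void *data, size_t len)
{
    if (!out_head) {                /* nothing queued: try a direct write */
        ssize_t n = write(fd, data, len);
        if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;
        if (n < 0)
            n = 0;
        if ((size_t)n == len)
            return 0;
        data = (const char *)data + n;
        len -= n;
    }
    struct out_buf *b = malloc(sizeof(*b) + len);
    if (!b)
        return -1;
    b->next = NULL;
    b->len = len;
    b->off = 0;
    memcpy(b->data, data, len);
    if (out_tail)
        out_tail->next = b;
    else
        out_head = b;
    out_tail = b;
    return 0;
}

/* Call when poll() reports POLLOUT on fd: drain as much as possible. */
int flush_out_queue(int fd)
{
    while (out_head) {
        struct out_buf *b = out_head;
        ssize_t n = write(fd, b->data + b->off, b->len - b->off);
        if (n < 0)
            return (errno == EAGAIN || errno == EWOULDBLOCK) ? 0 : -1;
        b->off += n;
        if (b->off < b->len)
            return 0;               /* kernel buffer full again; wait */
        out_head = b->next;
        if (!out_head)
            out_tail = NULL;
        free(b);
    }
    return 0;
}
```

With this shape, a burst of lock traffic backs up in user-space memory instead of blocking the daemon's event loop, so ltpx keeps servicing requests under load.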
Fix verified.