Bug 146670 - gulm filesystems deadlock with heavy load
Summary: gulm filesystems deadlock with heavy load
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gulm
Version: 4
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: michael conrad tadpol tilstra
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: 144795
 
Reported: 2005-01-31 17:29 UTC by Corey Marthaler
Modified: 2009-04-16 20:24 UTC (History)
1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-04-06 16:02:31 UTC
Embargoed:



Description Corey Marthaler 2005-01-31 17:29:36 UTC
Description of problem:
I believe there is already a bug in the Sistina bug tracker for the issue of Gulm
and POSIX locks; however, I reproduced this hang using only flocks, running the
genesis and accordion tests on each of the machines in the cluster:

genesis -L flock -n 250 -d 50 -p 2
accordion -L flock -p 2 accrdfile1 accrdfile2 accrdfile3 accrdfile4 accrdfile5
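
For context, the flock load these commands generate presumably comes down to
several processes repeatedly taking and dropping exclusive flocks on files in
the GFS mount while doing small writes. A minimal sketch of that pattern (an
assumption about what genesis/accordion exercise, not their actual source; the
file path under /mnt/gfs0 is only an example):

/* Minimal flock stress sketch -- an assumption of roughly what the
 * genesis/accordion load looks like, not the actual test source.
 * Each process repeatedly takes and drops an exclusive flock on a
 * file in the shared GFS mount while writing a little data. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : "/mnt/gfs0/accrdfile1";
        int i, fd = open(path, O_RDWR | O_CREAT, 0644);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        for (i = 0; i < 250; i++) {
                if (flock(fd, LOCK_EX) < 0) {   /* blocks here when wedged */
                        perror("flock");
                        return 1;
                }
                if (write(fd, "x", 1) < 0)
                        perror("write");
                flock(fd, LOCK_UN);
        }
        close(fd);
        return 0;
}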

Eventually the filesystems become stuck:

[root@morph-03 tmp]# ps -efFwTl | grep genesis
4 S root      5044  5044  5043  0  81   0 -   365 wait    336   0
Jan28 ?        00:00:00 genesis -L flock -n 250 -d 50 -p 2
5 D root      5049  5049  5044  0  78   0 -   368 glock_  448   0
Jan28 ?        00:06:05 genesis -L flock -n 250 -d 50 -p 2
5 D root      5050  5050  5044  0  78   0 -   368 glock_  448   0
Jan28 ?        00:06:02 genesis -L flock -n 250 -d 50 -p 2
4 S root      5055  5055  5054  0  84   0 -   525 wait    336   1
Jan28 ?        00:00:00 genesis -L flock -n 250 -d 50 -p 2
5 D root      5058  5058  5055  0  78   0 -   528 glock_  448   0
Jan28 ?        00:05:08 genesis -L flock -n 250 -d 50 -p 2
5 D root      5060  5060  5055  0  78   0 -   528 glock_  448   0
Jan28 ?        00:05:09 genesis -L flock -n 250 -d 50 -p 2
4 S root      5075  5075  5074  0  85   0 -   726 wait    332   0
Jan28 ?        00:00:00 genesis -L flock -n 250 -d 50 -p 2
5 D root      5076  5076  5075  0  78   0 -   730 glock_  448   0
Jan28 ?        00:04:56 genesis -L flock -n 250 -d 50 -p 2
5 D root      5078  5078  5075  0  78   0 -   730 glock_  448   0
Jan28 ?        00:04:56 genesis -L flock -n 250 -d 50 -p 2
4 S root      5086  5086  5085  0  85   0 -   843 wait    336   0
Jan28 ?        00:00:00 genesis -L flock -n 250 -d 50 -p 2
5 D root      5088  5088  5086  0  78   0 -   846 glock_  448   0
Jan28 ?        00:05:47 genesis -L flock -n 250 -d 50 -p 2
5 D root      5090  5090  5086  0  78   0 -   846 glock_  448   1
Jan28 ?        00:05:45 genesis -L flock -n 250 -d 50 -p 2


[root@morph-04 root]# strace df -h
.
.
.
statfs64("/mnt/gfs0", 84,

I tried to get more info from /proc/<pid> but that was stuck as well.
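
For what it's worth, the same check can be made without df; a small sketch
that issues the statfs() call directly on the mount point (the call the strace
above shows df stuck in) and simply hangs when the filesystem is wedged:

/* Standalone statfs check -- a sketch of the same call df makes.
 * When the deadlock is hit this blocks inside statfs() on the GFS
 * mount, just like the strace above; on a healthy mount it prints
 * the block counts and exits. */
#include <stdio.h>
#include <sys/vfs.h>

int main(void)
{
        struct statfs sbuf;

        if (statfs("/mnt/gfs0", &sbuf) < 0) {   /* hangs here when stuck */
                perror("statfs");
                return 1;
        }
        printf("blocks=%ld free=%ld\n",
               (long)sbuf.f_blocks, (long)sbuf.f_bfree);
        return 0;
}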

Version-Release number of selected component (if applicable):
Gulm <CVS> (built Jan 28 2005 16:39:38) installed


How reproducible:
Sometimes

Comment 1 michael conrad tadpol tilstra 2005-02-01 14:08:13 UTC
Pretty sure this doesn't have anything to do with plocks or flocks. It seems to
be entirely load-based.

Comment 2 Kiersten (Kerri) Anderson 2005-02-01 15:35:20 UTC
Adding to release blocker list

Comment 3 michael conrad tadpol tilstra 2005-02-01 18:32:01 UTC
All that's required to hit this is two clients, gulm server(s) in either SLM or
RLM mode, and load (no clvm).
Load for me is several fsstresses and a couple of doios. Lighten the load (stop
all fsstress or doio instances) and the deadlock isn't hit.


Comment 4 michael conrad tadpol tilstra 2005-02-01 21:10:04 UTC
Found it. ltpx is being flooded faster than it can handle, so it locks up.
A `gulm_tool getstats <deadlocked-client>:ltpx` will time out.
Now to fix...

Comment 5 michael conrad tadpol tilstra 2005-02-03 16:14:23 UTC
Added outgoing queues to the local connections. The bug seems to have disappeared.
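
For the record, the idea (a concept sketch only, not the actual ltpx code):
instead of blocking when a local connection can't take a whole message right
away, park it on a per-connection outgoing queue and flush the queue when the
socket becomes writable again. Roughly:

/* Concept sketch of an outgoing queue on a local connection.  This is an
 * illustration of the approach described above, not the actual ltpx code:
 * instead of blocking when send() cannot take a whole message, the message
 * is parked on a per-connection list and flushed later when poll() reports
 * the socket writable. */
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

struct outmsg {
        struct outmsg *next;
        size_t len, off;
        char data[];
};

struct conn {
        int fd;                         /* non-blocking local socket */
        struct outmsg *head, *tail;
};

/* Push queued bytes out until the queue drains or the socket would block. */
static int conn_flush(struct conn *c)
{
        while (c->head) {
                struct outmsg *m = c->head;
                ssize_t n = send(c->fd, m->data + m->off, m->len - m->off,
                                 MSG_DONTWAIT);

                if (n < 0) {
                        if (errno == EAGAIN || errno == EWOULDBLOCK)
                                return 0;       /* retry when writable */
                        return -1;
                }
                m->off += (size_t)n;
                if (m->off == m->len) {
                        c->head = m->next;
                        if (!c->head)
                                c->tail = NULL;
                        free(m);
                }
        }
        return 0;
}

/* Append a message to the queue, then try to push it out; flushing always
 * starts from the head, so message order is preserved. */
static int conn_send(struct conn *c, const void *buf, size_t len)
{
        struct outmsg *m = malloc(sizeof(*m) + len);

        if (!m)
                return -1;
        m->next = NULL;
        m->len = len;
        m->off = 0;
        memcpy(m->data, buf, len);
        if (c->tail)
                c->tail->next = m;
        else
                c->head = m;
        c->tail = m;
        return conn_flush(c);
}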

Comment 6 Corey Marthaler 2005-04-06 16:02:31 UTC
Fix verified.

