Bug 145090 - dlm_astd stuck on rwsem_down_read_failed

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	dlm
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	David Teigland
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-01-14 10:10 UTC by David Teigland
Modified:	2009-04-16 20:30 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2005-01-24 09:52:02 UTC
Embargoed:

Description David Teigland 2005-01-14 10:10:31 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5)
Gecko/20041111 Firefox/1.0

Description of problem:
I've run into this twice in the past week (once on my 4-node va
cluster and once on my 8-node bench cluster).  Both times it happened
while running my mu_loop script on all nodes:

while (1)
  mount /gfs
  sleeprand 8
  umount /gfs
  sleeprand 8
end

I'm thinking this must be related to a recent change because I often
run this test and have never seen this before.

It's hung in exactly the same spot both times.  lock_dlm is processing
a start and has just done:

3971 lk 10,0 id 0 -1,3 9

The dlm lock dump shows the lock has been granted, so the ast that
lock_dlm is waiting for is the one for this lock, 10,0.  Both times
lock_dlm is hung waiting here:

Stack traceback for pid 3971
0xcecec130     3971        7  0    1   D  0xcecec370  lock_dlm1
EBP        EIP        Function (args)
0xca6efe04 0xc033eccc schedule+0x2fc (0xc035d0d3, 0xb7c, 0xca7331c8,
0x0, 0xcecec130)
0xca6efe64 0xc033f0b4 wait_for_completion+0xa4 (0xca73318c, 0x0, 0x3,
0x5, 0x0)
0xca6efe84 0xd08dc1f9 [lock_dlm]lm_dlm_lock_sync+0x59 (0xca73318c,
0x0, 0x3, 0x5, 0x5a000020)
0xca6efec8 0xd08da163 [lock_dlm]id_test_and_set+0xa3 (0xca65db48, 0x0,
0x2, 0xca65db74, 0xcebd6c78)
0xca6efef4 0xd08da597 [lock_dlm]claim_jid+0x47 (0xca65db48, 0xd0, 0x2,
0x1f6, 0x250)
0xca6eff3c 0xd08dad00 [lock_dlm]process_start+0x480 (0xca65db48,
0xcfdac64c, 0xc9e6b390, 0xca65db80, 0xca65db78)
0xca6effbc 0xd08e1494 [lock_dlm]dlm_async+0x284 (0x0, 0x0, 0xca7ddc94,
0xca7ddc88, 0xc9e92a48)

The ast isn't being delivered to lock_dlm because dlm_astd is
hung here:

0xce3a3630     2583        7  0    1   D  0xce3a3870  dlm_astd
EBP        EIP        Function (args)
0xcc2dfee4 0xc033eccc schedule+0x2fc (0x1, 0xcceac7f0, 0xca65d1b8,
0xc0119fbc, 0xce3a3630)
0xcc2dff1c 0xc033f5fc rwsem_down_read_failed+0x9c (0xd099b1e0,
0xca2d5aec, 0xd09928f8, 0x2b, 0xcec0b257)
           0xd097d191 [dlm].text.lock.ast+0x7f
0xcc2dff60 0xd097c4a2 [dlm]process_asts+0xe2 (0xd099b260, 0x65bf68, 0x0)
0xcc2dffbc 0xd097cf55 [dlm]dlm_astd+0x1c5 (0x0, 0x0, 0xcc693e58,
0xcc693e4c, 0xcedb65b8)
           0xc011b640 default_wake_function (0x100100, 0x200200,
0xfbfc9fe4, 0x650, 0xce3a3798)
           0xc011b640 default_wake_function (0x0, 0xff, 0x0,
0xfffffffc, 0xd097cd90)
           0xc0133d4a kthread+0xba


Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. run the test described above

Actual Results:  The mount hangs as described above, and other mounts
hang waiting for the first.

Additional info:

Comment 1 David Teigland 2005-01-14 16:55:43 UTC

We should study (and test) whether it's safe to just remove the
locking that's causing the problem in process_asts:

  down_read(&ls->ls_in_recovery);
  release_lkb(ls, lkb);
  release_rsb(rsb);
  up_read(&ls->ls_in_recovery);

or else add these to a list for release after recovery.
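
The second option, deferring the release until recovery has finished,
could look roughly like the user-space sketch below.  It is only an
illustration under stated assumptions: the struct layouts and the names
sketch_lkb, sketch_ls, release_or_defer and finish_recovery are
hypothetical and not part of the dlm code, and the ls_in_recovery rwsem
is modeled as a plain flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the dlm lkb and lockspace. */
struct sketch_lkb {
    struct sketch_lkb *next_deferred; /* link in the deferred-release list */
    bool released;                    /* set once the lkb has been released */
};

struct sketch_ls {
    bool in_recovery;                 /* models ls_in_recovery held for write */
    struct sketch_lkb *deferred;      /* lkbs parked until recovery finishes */
};

/* Called from the ast thread; never blocks.  During recovery the lkb is
 * parked on the lockspace's deferred list instead of being released. */
static void release_or_defer(struct sketch_ls *ls, struct sketch_lkb *lkb)
{
    assert(ls != NULL && lkb != NULL);
    if (ls->in_recovery) {
        lkb->next_deferred = ls->deferred;
        ls->deferred = lkb;
        return;
    }
    lkb->released = true;
}

/* Called when recovery ends: clears the flag and releases everything
 * that was parked.  Returns the number of lkbs released. */
static int finish_recovery(struct sketch_ls *ls)
{
    int count = 0;

    assert(ls != NULL);
    ls->in_recovery = false;
    while (ls->deferred != NULL) {
        struct sketch_lkb *lkb = ls->deferred;

        ls->deferred = lkb->next_deferred;
        lkb->next_deferred = NULL;
        lkb->released = true;
        count++;
    }
    return count;
}
```

The point of the design is that the ast thread only ever takes the
non-blocking path, so it can keep delivering asts for other locks while
recovery is in progress.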

Comment 2 David Teigland 2005-01-24 09:52:02 UTC

The ast thread was blocking on ls->ls_in_recovery just prior to
that ls being freed.  dlm_astd now skips any lkbs from lockspaces
that aren't running.
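
As a rough illustration of the idea (not the actual patch), the
skipping behaviour can be sketched in user space as follows.  The
types ast_lockspace and ast_lkb, their fields, and the function
process_asts_sketch are hypothetical stand-ins invented for this
example; the real dlm structures and the real check differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real kernel types. */
struct ast_lockspace {
    bool running;             /* false during recovery or teardown */
};

struct ast_lkb {
    struct ast_lockspace *ls; /* lockspace that owns this lkb */
    bool ast_pending;         /* an ast is queued for delivery */
    bool delivered;           /* set once the ast has been delivered */
};

/* Deliver pending asts, skipping lkbs whose lockspace is not running.
 * Skipped lkbs stay pending, so a later pass can pick them up.
 * Returns the number of asts delivered in this pass. */
static int process_asts_sketch(struct ast_lkb *queue, size_t n)
{
    int delivered = 0;

    assert(queue != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct ast_lkb *lkb = &queue[i];

        if (!lkb->ast_pending)
            continue;
        if (!lkb->ls->running)
            continue;         /* do not wait on this lockspace */
        lkb->ast_pending = false;
        lkb->delivered = true;
        delivered++;
    }
    return delivered;
}
```

With this shape, an lkb that belongs to a lockspace that is stopped or
going away no longer stalls delivery for lkbs in other lockspaces, such
as the 10,0 lock that lock_dlm was waiting on.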
