Bug 200753

Summary: "ls" is blocked
Product: [Retired] Red Hat Cluster Suite
Component: dlm
Version: 4
Hardware: All
OS: Linux
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Reporter: Mathieu AVILA <mathieu.avila>
Assignee: David Teigland <teigland>
QA Contact: Cluster QE <mspqa-list>
CC: alban.crequy, ccaulfie, cluster-maint
Doc Type: Bug Fix
Last Closed: 2006-11-27 16:15:37 UTC

Description Mathieu AVILA 2006-07-31 14:40:19 UTC
Description of problem:

When running loops of "ls" on one node and loops of "tar"/"untar" on another
node, the "ls" blocked after running for more than one day.

Version-Release number of selected component (if applicable):
GFS-6.1.5-0
GFS-kernel-smp-2.6.9-49.1.centos4
GFS-kernheaders-2.6.9-49.1.centos4
dlm-1.0.0-5
dlm-kernel-smp-2.6.9-41.7.centos4
dlm-kernheaders-2.6.9-41.7.centos4
dlm-devel-1.0.0-5


How reproducible:

This is difficult to reproduce; the steps below may not trigger the bug every
time.

Steps to Reproduce:
1. On one node, run loops of tar/untar of a directory; on another node, run
loops of ls on the tar'd directory (a sketch of these loops follows the note below).
2. Wait a long time.

NOTE: there was no node crash and no device error.
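
A minimal sketch of these reproduction loops, for illustration only (the GFS
mount point /mnt/gfs, the data directory, and the script name repro.py are
assumptions, not taken from this report). Run it with the "tar" argument on one
node and the "ls" argument on the other:

#!/usr/bin/env python3
# Reproduction sketch; mount point and directory names are assumptions.
import os
import subprocess
import sys

MOUNT = "/mnt/gfs"                    # assumed GFS mount point
SRCDIR = os.path.join(MOUNT, "data")  # assumed directory that gets tar'd
TARBALL = os.path.join(MOUNT, "data.tar")

def tar_loop():
    # Node A: repeatedly tar and untar the shared directory.
    extract = os.path.join(MOUNT, "extract")
    os.makedirs(extract, exist_ok=True)
    while True:
        subprocess.run(["tar", "cf", TARBALL, SRCDIR], check=False)
        subprocess.run(["tar", "xf", TARBALL, "-C", extract], check=False)

def ls_loop():
    # Node B: repeatedly list the same directory, forcing stat() calls.
    while True:
        subprocess.run(["ls", "-lR", SRCDIR], check=False)

if __name__ == "__main__":
    # nodeA$ ./repro.py tar        nodeB$ ./repro.py ls
    {"tar": tar_loop, "ls": ls_loop}[sys.argv[1]]()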
  
Actual results:

On the node that ran "ls", syslog contains the following from around the time it blocked:
Jul 30 11:11:09 sam39 kernel: dlm: gfsbig: process_lockqueue_reply id 1dc01bc state 0
Jul 30 13:14:39 sam39 kernel: dlm: gfsbig: cancel reply ret 0
Jul 30 13:14:39 sam39 kernel: dlm: gfsbig: process_lockqueue_reply id 2010258 state 0
Jul 30 13:14:39 sam39 kernel: lock_dlm: unlock sb_status 0 2,6de3e4 flags 0


Additional info:

I have saved the contents of /proc/cluster/*. I can give additional info if needed.

On the blocked node, here is the task trace produced by "echo t > /proc/sysrq-trigger"
for the blocked "ls" process:

Jul 31 14:57:53 sam39 kernel: ls            D 0000000000000000     0 29333      1                8104 (NOTLB)
Jul 31 14:57:53 sam39 kernel: ffff8100138e1c88 0000000000000086 ffff81003f09935c ffffffff8817c284
Jul 31 14:57:53 sam39 kernel:        ffff8100138e1c28 0000000000000000 0000000000000000 ffff8100138e1c28
Jul 31 14:57:53 sam39 kernel:        ffff810037d0f8b0 ffff810037d0f8b0
Jul 31 14:57:53 sam39 kernel: Call Trace:<ffffffff8817c284>{:gfs:gfs_glock_dq+436} <ffffffff8019aaa6>{dput+54}
Jul 31 14:57:53 sam39 kernel:        <ffffffff803a8f13>{wait_for_completion+179} <ffffffff80130e40>{default_wake_function+0}
Jul 31 14:57:53 sam39 kernel:        <ffffffff80190534>{do_lookup+100} <ffffffff80130e40>{default_wake_function+0}
Jul 31 14:57:53 sam39 kernel:        <ffffffff8019aaa6>{dput+54} <ffffffff8817bda9>{:gfs:glock_wait_internal+345}
Jul 31 14:57:53 sam39 kernel:        <ffffffff8817c6e2>{:gfs:gfs_glock_nq+994} <ffffffff8817c9be>{:gfs:gfs_glock_nq_init+30}
Jul 31 14:57:53 sam39 kernel:        <ffffffff88194c03>{:gfs:gfs_getattr+83} <ffffffff8018b6db>{vfs_getattr+59}
Jul 31 14:57:53 sam39 kernel:        <ffffffff8018b83d>{vfs_lstat+61} <ffffffff8019ab27>{dput+183}
Jul 31 14:57:53 sam39 kernel:        <ffffffff801901bc>{path_release+12} <ffffffff8018bc5f>{sys_newlstat+31}
Jul 31 14:57:53 sam39 kernel:        <ffffffff8010dc66>{system_call+126}


After searching the existing bug database, it seems
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=147682
may be the same bug, but I'm unable to confirm that.
(The conclusion of that bug was that no one was able to reproduce it.)

Comment 1 David Teigland 2006-07-31 18:04:28 UTC
There's a chance this may have the same root cause as bz 172944.
One possible cause in this case is an unlock that happens while
another lock operation is still in progress on the same lock.
(GFS shouldn't do that, and there is plenty of room for problems
like this in the dlm if it does happen for some reason.)

In that scenario, the reply to the first operation reports
"cancel reply ret 0" because the dlm expects a return status of
EUNLOCK (for the unlock) but gets the 0 from the first op.
The reply to the unlock then reports "process_lockqueue_reply
... state 0" because the lock was already removed from the waiting
list.  If this is what happened, we should see a message of
"unlock cancel status %d" on one of the other nodes (even
though this has nothing to do with lock cancellation).
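
A minimal toy model of that interleaving, for illustration only (this is not
the dlm source; the Lock/lockqueue/process_reply names and the EUNLOCK value
are assumptions). It only shows how a reply matched against the wrong expected
status produces the two messages seen in the log:

EUNLOCK = -65538          # stand-in value for the status a completed unlock returns

class Lock:
    def __init__(self, lkid):
        self.lkid = lkid
        self.lockqueue_state = 1      # non-zero while a reply is outstanding

lockqueue = {}                        # lkid -> Lock, operations waiting for replies

def send_op(lock):
    # Put a lock on the waiting list until its remote reply arrives.
    lockqueue[lock.lkid] = lock

def process_reply(lkid, status, expected):
    # Handle a reply from the remote node for the lock identified by lkid.
    lock = lockqueue.get(lkid)
    if lock is None or lock.lockqueue_state == 0:
        # Reply for a lock already taken off the waiting list.
        print("process_lockqueue_reply id %x state 0" % lkid)
        return
    if status != expected:
        # Expected the unlock's status but got the earlier operation's 0.
        print("cancel reply ret %d" % status)
    lock.lockqueue_state = 0
    del lockqueue[lkid]

# Suspected interleaving: an unlock is sent while a previous operation
# on the same lock is still waiting for its reply.
lk = Lock(0x2010258)
send_op(lk)                               # first op, reply not yet received
# ... unlock sent for the same lock before the first reply comes back ...
process_reply(lk.lkid, 0, EUNLOCK)        # first op's reply  -> "cancel reply ret 0"
process_reply(lk.lkid, EUNLOCK, EUNLOCK)  # unlock's reply    -> "... state 0"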

We could print some additional debug info to try to say more
definitively what's going wrong.  If we're successful at that,
the best I think we could hope for is a workaround that detects
when this happens and does something special to handle it.


Comment 3 David Teigland 2006-10-17 16:35:14 UTC
If this is still a problem for the reporter then we should have our
QA group run a test like this.


Comment 4 Mathieu AVILA 2006-11-21 16:07:05 UTC
(sorry for the delay)

It never reproduced.
The problem is that this behaviour is not desirable anyway: the processes
spend a lot of time communicating and are therefore very slow.
We haven't continued investigating along those lines. We will try as hard as
possible not to have two nodes accessing the very same data at the very same
time, although GFS is designed precisely for that.
So the bug may still be there, but it hasn't happened again. Still, it would
be interesting to have this kind of test run by your QA group.

Comment 5 David Teigland 2006-11-27 16:15:37 UTC
Will reopen this if it becomes a problem again.