Red Hat Bugzilla – Bug 166772
NFS deadlock when multiple processes creating/deleting a file
Last modified: 2010-10-21 23:18:03 EDT
Creating a RHEL4 version of this bug since it's
not clear if we have the same problem in RHEL4.
+++ This bug was initially created as a clone of Bug #165993 +++
Description of problem:
When running the attached reproducer script, one or more NFS nodes can become
Version-Release number of selected component (if applicable):
tested U5 onward
Steps to Reproduce:
1. Mount an NFS share from two separate nodes.
2. Run the attached reproducer script on each node, pointing the script to the
NFS share mount point on each node.
Actual results:
One or both NFS clients will deadlock.
Expected results:
Systems will run without deadlock.
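For readers without access to the attachment, the kind of load described in this report (several processes creating and deleting the same file on a shared mount) can be sketched roughly as follows. This is NOT the attached reproducer script; the function names, the file name contended.dat, and the worker/iteration/size defaults are all assumptions made for illustration.

```python
import multiprocessing
import os


def churn(directory, name, iterations, size):
    """Repeatedly create, write, flush and delete one shared file.

    Hypothetical helper: every worker (and every node) targets the same
    path, so creates and deletes race with each other.
    """
    path = os.path.join(directory, name)
    payload = b"\0" * size
    attempts = 0
    for _ in range(iterations):
        try:
            with open(path, "wb") as handle:
                handle.write(payload)
                handle.flush()
                os.fsync(handle.fileno())  # push the data out to the server
            os.unlink(path)
        except OSError:
            # Another worker or node may have removed the file first;
            # errors such as ENOENT or ESTALE are expected under this load.
            pass
        attempts += 1
    return attempts


def run(directory, workers=4, iterations=1000, size=65536):
    """Start several processes that all churn the same file."""
    procs = [
        multiprocessing.Process(
            target=churn, args=(directory, "contended.dat", iterations, size)
        )
        for _ in range(workers)
    ]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
```

On each node this would be invoked as run(mount_point), with mount_point being that node's NFS mount of the shared export. Whether this particular sketch triggers the hang has not been verified.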
-- Additional comment from email@example.com on 2005-08-15 11:42 EST --
Created an attachment (id=117759)
script to reproduce deadlock problem
-- Additional comment from firstname.lastname@example.org on 2005-08-15 11:51 EST --
I've reproduced this, and I have a core available on
dl360g4.gsslab.rdu.redhat.com (root/redhat) in /root/hang_cores. Looking at it
it appears that there is a dd task which is stuck in nfs_wait_on_request, and a
sync task which is stuck in wait_on_inode. They appear stuck, as the inode in
__wait_on_inode (located at 0xe0c12380) has its I_LOCKED bit clear, and the
nfs_page being waited on in nfs_wait_on_request (located at 0xf706bd00) has its
PG_DIRTY bit clear. It would appear that we have a race condition in which
tasks can go to sleep waiting on a condition after the condition which was
supposed to wake them has occurred, leading to tasks that will wait forever. I
have not, as of yet, been able to pinpoint where that race is.
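The suspected failure mode above (a task going to sleep after the event that should wake it has already happened) is the classic lost-wakeup pattern. The following is a user-space sketch of that pattern only, not kernel code and not the actual NFS code path. The Request class, its busy flag, and the in_window hook are hypothetical names; the hook exists purely to force the race deterministically.

```python
import threading


class Request:
    """Hypothetical stand-in for an object guarded by a 'busy' bit."""

    def __init__(self):
        self.busy = True
        self.cond = threading.Condition()

    def complete(self):
        # Clear the bit and wake any waiters while holding the lock.
        with self.cond:
            self.busy = False
            self.cond.notify_all()

    def racy_wait(self, timeout, in_window=None):
        # BUG: the bit is tested before the waiter is able to receive a
        # wakeup. If complete() runs inside the window below, the
        # notification is lost and the waiter sleeps with the bit clear.
        if not self.busy:
            return True
        if in_window is not None:
            in_window()  # simulate the waker running inside the window
        with self.cond:
            return self.cond.wait(timeout)  # False means it timed out

    def safe_wait(self, timeout):
        # Test the bit under the same lock the waker takes, and re-test
        # after every wakeup, so a wakeup cannot slip past the waiter.
        with self.cond:
            return self.cond.wait_for(lambda: not self.busy, timeout)


if __name__ == "__main__":
    req = Request()
    # The waker fires inside the window: the waiter times out even
    # though the bit it was waiting on is already clear.
    print("racy wait woke up:", req.racy_wait(0.1, in_window=req.complete))
    print("busy bit still set:", req.busy)
    print("safe wait woke up:", req.safe_wait(0.1))
```

In this sketch a timeout stands in for "waits forever". The end state of the racy waiter matches what the core shows: a sleeping task whose wait condition is already satisfied.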
-- Additional comment from email@example.com on 2005-08-15 12:02 EST --
Neil, is this rhel3 or rhel4, since you mention U5?
-- Additional comment from firstname.lastname@example.org on 2005-08-15 12:37 EST --
RHEL3 latest and greatest
-- Additional comment from email@example.com on 2005-08-15 14:15 EST --
I've been working with it under RHEL3, as eric says. I've been meaning to test
under RHEL4 but haven't gotten around to it yet.
-- Additional comment from firstname.lastname@example.org on 2005-08-24 14:30 EST --
A quick status... I am able to reproduce this and
it appears I'm seeing the same thing Neil was seeing...
-- Additional comment from email@example.com on 2005-08-25 06:25 EST --
Created an attachment (id=118104)
Please give this patch a try. It stops an inode from being unhashed when
an ESTALE is returned on a getattr. This in turn stops the sync from
going into an infinite loop, which causes the machine to hang.
I was able to continuously run the above reproducer for
a 12 hour period without either RHEL3 client hanging.
-- Additional comment from firstname.lastname@example.org on 2005-08-25 08:39 EST --
Adding to the RHEL3U7Proposed list since we are late in the U6 cycle. Not sure
if this meets the criteria for inclusion in a RHEL3U6 respin this late in the
Created attachment 120517
An upstream patch that addressed this problem
I'm not able to reproduce this hang on RHEL4, so I'm closing this bug.