Bug 166772

Summary: NFS deadlock when multiple processes creating/deleting a file
Product: Red Hat Enterprise Linux 4
Version: 4.0
Component: kernel
Hardware: All
OS: Linux
Status: CLOSED NOTABUG
Severity: medium
Priority: medium
Reporter: Steve Dickson <steved>
Assignee: Steve Dickson <steved>
QA Contact: Brian Brock <bbrock>
CC: jbaron, rajeev, staubach, tao
Doc Type: Bug Fix
Last Closed: 2005-11-10 19:54:15 UTC
Attachments:
An upstream patch that addressed this problem (attachment 120517, see Comment 3)

Description Steve Dickson 2005-08-25 14:54:11 UTC
Creating a RHEL4 version of this bug since it's
not clear whether we have the same problem in RHEL4.

+++ This bug was initially created as a clone of Bug #165993 +++

Description of problem:
When running the attached reproducer script, one or more NFS nodes can become
deadlocked.

Version-Release number of selected component (if applicable):
tested U5 onward

How reproducible:
always

Steps to Reproduce:
1. Mount an NFS share from two separate nodes.
2. Run the attached reproducer script on each node, pointing the script at the
   NFS share mount point on that node (a rough stand-in for the script is
   sketched just below).
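
The attached script is not inlined in this report, so as a rough stand-in only
(not the attachment), here is the kind of workload the summary describes:
several processes racing to create and delete the same file on the NFS mount.
The mount point path and worker count below are assumptions.

/* Hypothetical stand-in for the attached reproducer (id=117759). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROCS   4                      /* assumed worker count */
#define TESTFILE "/mnt/nfs/testfile"    /* assumed NFS mount path */

static void worker(void)
{
        for (;;) {
                int fd = open(TESTFILE, O_CREAT | O_WRONLY, 0644);
                if (fd >= 0) {
                        if (write(fd, "x", 1) < 0)
                                perror("write");
                        close(fd);
                }
                unlink(TESTFILE);       /* ENOENT here is an expected race */
        }
}

int main(void)
{
        int i;

        for (i = 0; i < NPROCS; i++) {
                if (fork() == 0)
                        worker();       /* children loop until killed */
        }
        while (wait(NULL) > 0)
                ;                       /* parent parks here */
        return 0;
}

Run one copy of this (or the real script) on each client node against the same
export to match the steps above.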

  
Actual results:
One or both NFS clients will deadlock

Expected results:
Systems will run without deadlock.

Additional info:

-- Additional comment from nhorman on 2005-08-15 11:42 EST --
Created an attachment (id=117759)
script to reproduce deadlock problem


-- Additional comment from nhorman on 2005-08-15 11:51 EST --
I've reproduced this, and I have a core available on
dl360g4.gsslab.rdu.redhat.com (root/redhat) in /root/hang_cores.  Looking at
it, it appears that there is a dd task stuck in nfs_wait_on_request and a sync
task stuck in wait_on_inode.  They appear wedged: the inode in __wait_on_inode
(located at 0xe0c12380) has its I_LOCK bit clear, and the nfs_page being waited
on in nfs_wait_on_request (located at 0xf706bd00) has its PG_DIRTY bit clear.
It would appear that we have a race condition in which tasks can go to sleep
waiting on a condition after the condition that was supposed to wake them has
already occurred, leaving tasks that will wait forever.  I have not yet been
able to pinpoint where that race is.
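
To make the suspected race concrete, here is a user-space illustration in
pthread terms (this is not kernel code and is not taken from the core; it only
shows the "lost wakeup" shape described above, where the waiter checks the
condition, the waker fires in that window, and the waiter then sleeps forever):

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int done;

/* BUGGY waiter: tests the flag outside the lock and never re-tests it,
 * so a wakeup that fires between the test and pthread_cond_wait() is
 * lost and this thread sleeps forever. */
static void *buggy_waiter(void *arg)
{
        (void)arg;
        if (!done) {
                pthread_mutex_lock(&lock);
                pthread_cond_wait(&cond, &lock);        /* may never wake */
                pthread_mutex_unlock(&lock);
        }
        return 0;
}

/* Correct waiter: re-checks the condition under the lock in a loop, so
 * an "early" wakeup can never be lost. */
static void *safe_waiter(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!done)
                pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
        return 0;
}

static void *waker(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        done = 1;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
        return 0;
}

The hang signature above (I_LOCK and PG_DIRTY already clear while the tasks
still sleep) is exactly what the buggy_waiter pattern produces.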

-- Additional comment from jbaron on 2005-08-15 12:02 EST --
Neil, is this RHEL3 or RHEL4, since you mention U5?

-- Additional comment from eparis on 2005-08-15 12:37 EST --
RHEL3 latest and greatest

-- Additional comment from nhorman on 2005-08-15 14:15 EST --
I've been working with it under RHEL3, as Eric says.  I've been meaning to test
under RHEL4 but haven't gotten around to it yet.

-- Additional comment from steved on 2005-08-24 14:30 EST --
A quick status... I am able to reproduce this and
it appears I'm seeing the same thing Neil was seeing... 

-- Additional comment from steved on 2005-08-25 06:25 EST --
Created an attachment (id=118104)
Proposed Patch

Please give this patch a try.  It stops an inode from being unhashed when
an ESTALE is returned on a getattr.  This in turn stops sync from going
into an infinite loop, which is what causes the machine to hang.

I was able to run the above reproducer continuously for a 12-hour period
without either RHEL3 client hanging.
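
The exact diff is in the attachment; as an illustrative sketch only (not the
patch itself), the shape of the change in the 2.4-era revalidate path is
roughly the following, where nfs_getattr_call() is a made-up stand-in for the
GETATTR step:

/* Sketch only; not compilable as-is and not the attached patch. */
static int nfs_revalidate_sketch(struct inode *inode)
{
        int status = nfs_getattr_call(inode);   /* hypothetical helper */

        if (status == -ESTALE) {
                /*
                 * Old behavior: remove_inode_hash(inode) was called
                 * here, so sync, walking the dirty-inode lists, could
                 * spin forever on an inode that would never be cleaned
                 * and found again.
                 *
                 * Fixed behavior: leave the inode hashed and just mark
                 * it stale (NFS_FLAGS()/NFS_INO_STALE are the 2.4-era
                 * names), letting sync's pass terminate.
                 */
                NFS_FLAGS(inode) |= NFS_INO_STALE;
        }
        return status;
}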

-- Additional comment from kanderso on 2005-08-25 08:39 EST --
Adding to the RHEL3U7Proposed list since we are late in the U6 cycle.  Not sure
if this meets the criteria for inclusion in a RHEL3U6 respin this late in the
release cycle.

Comment 3 Steve Dickson 2005-10-28 20:07:44 UTC
Created attachment 120517 [details]
An upstream patch that addressed this problem

Comment 4 Steve Dickson 2005-11-10 19:54:15 UTC
I'm not able to reproduce this hang on RHEL4, so I'm closing this bug.