Bug 166772 - NFS deadlock when multiple processes creating/deleting a file
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Steve Dickson
QA Contact: Brian Brock
Reported: 2005-08-25 10:54 EDT by Steve Dickson
Modified: 2010-10-21 23:18 EDT (History)
Doc Type: Bug Fix
Last Closed: 2005-11-10 14:54:15 EST
Attachments
An upstream patch that addressed this problem (3.34 KB, text/plain)
2005-10-28 16:07 EDT, Steve Dickson
Description Steve Dickson 2005-08-25 10:54:11 EDT
Creating a RHEL4 version of this bug since it's
not clear whether we have the same problem in RHEL4.

+++ This bug was initially created as a clone of Bug #165993 +++

Description of problem:
When running the attached reproducer script, one or more NFS nodes can become
deadlocked.

Version-Release number of selected component (if applicable):
tested U5 onward

How reproducible:
always

Steps to Reproduce:
1. Mount an NFS share from two separate nodes.
2. Run the attached reproducer script on each node, pointing the script to the
NFS share mount point on each node.
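
The attached reproducer script is not shown in this report. As a rough idea of the kind of workload the steps above describe, here is a minimal sketch in Python. It is not the attached script: the function names `churn` and `run`, the file name `shared.dat`, the worker count, and the write size are all assumptions made for illustration, and `os.fsync` stands in for whatever flushing the real script does.

```python
import multiprocessing
import os


def churn(directory, name, iterations):
    """Repeatedly create, write, flush, and delete one shared file name."""
    path = os.path.join(directory, name)
    for _ in range(iterations):
        try:
            with open(path, "wb") as f:
                f.write(b"x" * 4096)      # hypothetical payload size
                f.flush()
                os.fsync(f.fileno())      # push the data out to the server
        except OSError:
            pass  # another process may have deleted the file underneath us
        try:
            os.unlink(path)
        except OSError:
            pass  # already removed by another process


def run(directory, workers=4, iterations=100):
    """Start several processes that all fight over the same file."""
    # The "fork" start method is POSIX-only, which matches an NFS client host.
    ctx = multiprocessing.get_context("fork")
    procs = [
        ctx.Process(target=churn, args=(directory, "shared.dat", iterations))
        for _ in range(workers)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]
```

To follow the steps, one would call `run()` with the NFS mount point as `directory` on each of the two nodes at the same time. On an affected client the call may never return.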

  
Actual results:
One or both NFS clients will deadlock

Expected results:
Systems will run without deadlock.

Additional info:

-- Additional comment from nhorman@redhat.com on 2005-08-15 11:42 EST --
Created an attachment (id=117759)
script to reproduce deadlock problem


-- Additional comment from nhorman@redhat.com on 2005-08-15 11:51 EST --
I've reproduced this, and I have a core available on
dl360g4.gsslab.rdu.redhat.com (root/redhat) in /root/hang_cores.  Looking at it,
it appears that there is a dd task which is stuck in nfs_wait_on_request, and a
sync task which is stuck in wait_on_inode.  They appear stuck, as the inode in
__wait_on_inode (located at 0xe0c12380) has its I_LOCKED bit clear, and the
nfs_page being waited on in nfs_wait_on_request (located at 0xf706bd00) has its
PG_DIRTY bit clear.  It would appear that we have a race condition in which
tasks can go to sleep waiting on a condition after the event that was
supposed to wake them has already occurred, leading to tasks that will wait
forever.  I have not, as of yet, been able to pinpoint where that race is.
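
The kind of race suspected in the comment above is commonly called a lost wakeup. The following user-space Python sketch illustrates the general pattern only; it is not kernel code and says nothing about where the actual race is. The class `Flag` and its methods are hypothetical names, with `_locked` loosely playing the role of a bit such as I_LOCKED.

```python
import threading


class Flag:
    """A bit that waiters sleep on until some other thread clears it."""

    def __init__(self):
        self._cond = threading.Condition()
        self._locked = True

    def clear(self):
        """Clear the bit and wake every current waiter."""
        with self._cond:
            self._locked = False
            self._cond.notify_all()

    def racy_wait(self, timeout):
        """Deliberately broken: sleeps without re-checking the bit.

        If clear() already ran, nobody will ever notify again, so without
        the timeout this would sleep forever. Returns False on timeout.
        """
        with self._cond:
            return self._cond.wait(timeout)

    def safe_wait(self, timeout=None):
        """Check the bit under the lock before and after every sleep."""
        with self._cond:
            return self._cond.wait_for(lambda: not self._locked, timeout)


flag = Flag()
flag.clear()                      # the wakeup happens before anyone sleeps
missed = flag.racy_wait(0.1)      # times out: the wakeup was lost
seen = flag.safe_wait(0.1)        # returns at once: the bit is already clear
```

The stuck tasks in the core look like `racy_wait` callers: the bit they wait on is already clear, yet they are still asleep.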

-- Additional comment from jbaron@redhat.com on 2005-08-15 12:02 EST --
Neil, is this rhel3 or rhel4, since you mention U5?

-- Additional comment from eparis@redhat.com on 2005-08-15 12:37 EST --
RHEL3 latest and greatest

-- Additional comment from nhorman@redhat.com on 2005-08-15 14:15 EST --
I've been working with it under RHEL3, as Eric says.  I've been meaning to test
under RHEL4 but haven't gotten around to it yet.

-- Additional comment from steved@redhat.com on 2005-08-24 14:30 EST --
A quick status... I am able to reproduce this and
it appears I'm seeing the same thing Neil was seeing... 

-- Additional comment from steved@redhat.com on 2005-08-25 06:25 EST --
Created an attachment (id=118104)
Proposed Patch

Please give this patch a try. It stops an inode from being unhashed when
an ESTALE is returned on a getattr. This in turn stops the sync from
going into an infinite loop, which causes the machine to hang.

I was able to run the above reproducer continuously for
a 12-hour period without either RHEL3 client hanging.

-- Additional comment from kanderso@redhat.com on 2005-08-25 08:39 EST --
Adding to the RHEL3U7Proposed list since we are late in the U6 cycle.  Not sure
if this meets the criteria for inclusion in a RHEL3U6 respin this late in the
release cycle.
Comment 3 Steve Dickson 2005-10-28 16:07:44 EDT
Created attachment 120517
An upstream patch that addressed this problem
Comment 4 Steve Dickson 2005-11-10 14:54:15 EST
I'm not able to reproduce this hang on RHEL4, so I'm closing this bug.
