Bug 166772

Summary: NFS deadlock when multiple processes creating/deleting a file
Product: Red Hat Enterprise Linux 4
Version: 4.0
Component: kernel
Hardware: All
OS: Linux
Status: CLOSED NOTABUG
Severity: medium
Priority: medium
Reporter: Steve Dickson <steved>
Assignee: Steve Dickson <steved>
QA Contact: Brian Brock <bbrock>
CC: jbaron, rajeev, staubach, tao
Doc Type: Bug Fix
Last Closed: 2005-11-10 19:54:15 UTC
Attachments:
An upstream patch that addressed this problem (attachment 120517, see Comment 3)

Description Steve Dickson 2005-08-25 14:54:11 UTC
Creating a RHEL4 version of this bug since it's
not clear whether we have the same problem in RHEL4.

+++ This bug was initially created as a clone of Bug #165993 +++

Description of problem:
When running the attached reproducer script, one or more NFS nodes can become
deadlocked.

Version-Release number of selected component (if applicable):
tested U5 onward

How reproducible:
always

Steps to Reproduce:
1. Mount an NFS share from two separate nodes.
2. Run the attached reproducer script on each node, pointing the script at the
   NFS share mount point on that node (a rough stand-in for the script is
   sketched just below).
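
The attached script is not inlined in this report, so as a rough stand-in only
(not the attachment), here is the kind of workload the summary describes:
several processes racing to create and delete the same file on the NFS mount.
The mount point path and worker count below are assumptions.

/* Hypothetical stand-in for the attached reproducer (id=117759). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROCS   4                      /* assumed worker count */
#define TESTFILE "/mnt/nfs/testfile"    /* assumed NFS mount path */

static void worker(void)
{
        for (;;) {
                int fd = open(TESTFILE, O_CREAT | O_WRONLY, 0644);
                if (fd >= 0) {
                        if (write(fd, "x", 1) < 0)
                                perror("write");
                        close(fd);
                }
                unlink(TESTFILE);       /* ENOENT here is an expected race */
        }
}

int main(void)
{
        int i;

        for (i = 0; i < NPROCS; i++) {
                if (fork() == 0)
                        worker();       /* children loop until killed */
        }
        while (wait(NULL) > 0)
                ;                       /* parent parks here */
        return 0;
}

Run one copy of this (or the real script) on each client node against the same
export to match the steps above.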

  
Actual results:
One or both NFS clients will deadlock

Expected results:
Systems will run without deadlock.

Additional info:

-- Additional comment from nhorman on 2005-08-15 11:42 EST --
Created an attachment (id=117759)
script to reproduce deadlock problem


-- Additional comment from nhorman on 2005-08-15 11:51 EST --
I've reproduced this, and I have a core available on
dl360g4.gsslab.rdu.redhat.com (root/redhat) in /root/hang_cores.  Looking at
it, it appears that there is a dd task stuck in nfs_wait_on_request and a sync
task stuck in wait_on_inode.  They appear wedged: the inode in __wait_on_inode
(located at 0xe0c12380) has its I_LOCK bit clear, and the nfs_page being waited
on in nfs_wait_on_request (located at 0xf706bd00) has its PG_DIRTY bit clear.
It would appear that we have a race condition in which tasks can go to sleep
waiting on a condition after the condition that was supposed to wake them has
already occurred, leaving tasks that will wait forever.  I have not yet been
able to pinpoint where that race is.
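
To make the suspected race concrete, here is a user-space illustration in
pthread terms (this is not kernel code and is not taken from the core; it only
shows the "lost wakeup" shape described above, where the waiter checks the
condition, the waker fires in that window, and the waiter then sleeps forever):

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int done;

/* BUGGY waiter: tests the flag outside the lock and never re-tests it,
 * so a wakeup that fires between the test and pthread_cond_wait() is
 * lost and this thread sleeps forever. */
static void *buggy_waiter(void *arg)
{
        (void)arg;
        if (!done) {
                pthread_mutex_lock(&lock);
                pthread_cond_wait(&cond, &lock);        /* may never wake */
                pthread_mutex_unlock(&lock);
        }
        return 0;
}

/* Correct waiter: re-checks the condition under the lock in a loop, so
 * an "early" wakeup can never be lost. */
static void *safe_waiter(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!done)
                pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
        return 0;
}

static void *waker(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        done = 1;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
        return 0;
}

The hang signature above (I_LOCK and PG_DIRTY already clear while the tasks
still sleep) is exactly what the buggy_waiter pattern produces.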

-- Additional comment from jbaron on 2005-08-15 12:02 EST --
Neil, is this RHEL3 or RHEL4, since you mention U5?

-- Additional comment from eparis on 2005-08-15 12:37 EST --
RHEL3 latest and greatest

-- Additional comment from nhorman on 2005-08-15 14:15 EST --
I've been working with it under RHEL3, as Eric says.  I've been meaning to test
under RHEL4 but haven't gotten around to it yet.

-- Additional comment from steved on 2005-08-24 14:30 EST --
A quick status... I am able to reproduce this and
it appears I'm seeing the same thing Neil was seeing... 

-- Additional comment from steved on 2005-08-25 06:25 EST --
Created an attachment (id=118104)
Proposed Patch

Please give this patch a try.  It stops an inode from being unhashed when
an ESTALE is returned on a getattr.  This in turn stops sync from going
into an infinite loop, which is what causes the machine to hang.

I was able to run the above reproducer continuously for a 12-hour period
without either RHEL3 client hanging.
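
The exact diff is in the attachment; as an illustrative sketch only (not the
patch itself), the shape of the change in the 2.4-era revalidate path is
roughly the following, where nfs_getattr_call() is a made-up stand-in for the
GETATTR step:

/* Sketch only; not compilable as-is and not the attached patch. */
static int nfs_revalidate_sketch(struct inode *inode)
{
        int status = nfs_getattr_call(inode);   /* hypothetical helper */

        if (status == -ESTALE) {
                /*
                 * Old behavior: remove_inode_hash(inode) was called
                 * here, so sync, walking the dirty-inode lists, could
                 * spin forever on an inode that would never be cleaned
                 * and found again.
                 *
                 * Fixed behavior: leave the inode hashed and just mark
                 * it stale (NFS_FLAGS()/NFS_INO_STALE are the 2.4-era
                 * names), letting sync's pass terminate.
                 */
                NFS_FLAGS(inode) |= NFS_INO_STALE;
        }
        return status;
}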

-- Additional comment from kanderso on 2005-08-25 08:39 EST --
Adding to the RHEL3U7Proposed list since we are late in the U6 cycle.  Not sure
if this meets the criteria for inclusion in a RHEL3U6 respin this late in the
release cycle.

Comment 3 Steve Dickson 2005-10-28 20:07:44 UTC
Created attachment 120517 [details]
An upstream patch that addressed this problem

Comment 4 Steve Dickson 2005-11-10 19:54:15 UTC
I'm not able to reproduce this hang on RHEL4, so I'm closing this bug.