Creating a RHEL4 version of this bug since it's not clear whether we have the same problem in RHEL4.

+++ This bug was initially created as a clone of Bug #165993 +++

Description of problem:
When running the attached reproducer script, one or more NFS nodes can become deadlocked.

Version-Release number of selected component (if applicable):
tested U5 onward

How reproducible:
always

Steps to Reproduce:
1. Mount an NFS share from two separate nodes.
2. Run the attached reproducer script on each node, pointing the script to the NFS share mount point on each node.

Actual results:
One or both NFS clients will deadlock.

Expected results:
Systems will run without deadlock.

Additional info:

-- Additional comment from nhorman on 2005-08-15 11:42 EST --
Created an attachment (id=117759)
script to reproduce deadlock problem

-- Additional comment from nhorman on 2005-08-15 11:51 EST --
I've reproduced this, and I have a core available on dl360g4.gsslab.rdu.redhat.com (root/redhat) in /root/hang_cores. Looking at it, it appears that there is a dd task which is stuck in nfs_wait_on_request, and a sync task which is stuck in wait_on_inode. They appear stuck, as the inode in __wait_on_inode (located at 0xe0c12380) has its I_LOCKED bit clear, and the nfs_page being waited on in nfs_wait_on_request (located at 0xf706bd00) has its PG_DIRTY bit clear. It would appear that we have a race condition in which tasks can go to sleep waiting on a condition after the event that was supposed to wake them has already occurred, leading to tasks that will wait forever. I have not, as of yet, been able to pinpoint where that race is.

-- Additional comment from jbaron on 2005-08-15 12:02 EST --
Neil, is this rhel3 or rhel4, since you mention U5?

-- Additional comment from eparis on 2005-08-15 12:37 EST --
RHEL3 latest and greatest

-- Additional comment from nhorman on 2005-08-15 14:15 EST --
I've been working with it under RHEL3, as eric says.
I've been meaning to test under RHEL4 but haven't gotten around to it yet.

-- Additional comment from steved on 2005-08-24 14:30 EST --
A quick status... I am able to reproduce this and it appears I'm seeing the same thing Neil was seeing...

-- Additional comment from steved on 2005-08-25 06:25 EST --
Created an attachment (id=118104)
Proposed Patch

Please give this patch a try. It stops an inode from being unhashed when an ESTALE is returned on a getattr. This in turn stops the sync from going into the infinite loop that causes the machine to hang. I was able to continuously run the above reproducer for a 12 hour period without either RHEL3 client hanging.

-- Additional comment from kanderso on 2005-08-25 08:39 EST --
Adding to the RHEL3U7Proposed list since we are late in the U6 cycle. Not sure if this meets the criteria for inclusion in a RHEL3U6 respin this late in the release cycle.
Created attachment 120517
An upstream patch that addressed this problem
I'm not able to reproduce this hang on RHEL4, so I'm closing this bug.