Description of problem:
When running the attached reproducer script, one or more NFS nodes can become deadlocked.

Version-Release number of selected component (if applicable):
Tested U5 onward

How reproducible:
Always

Steps to Reproduce:
1. Mount an NFS share from two separate nodes.
2. Run the attached reproducer script on each node, pointing the script at the NFS share mount point on that node.

Actual results:
One or both NFS clients will deadlock.

Expected results:
Systems will run without deadlock.

Additional info:
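For anyone setting this up from scratch, the reproduction steps could be sketched roughly as follows, run once on each of the two client nodes. This is only an illustration: the server name, export path, mount point, and script name (reproducer.sh, standing in for the attached script) are all placeholders that do not come from this report, and the sketch defaults to a dry run that prints the commands instead of executing them.

```shell
#!/bin/sh
# Hypothetical values: none of these names appear in the bug report.
SERVER="${SERVER:-nfsserver.example.com}"   # placeholder NFS server
EXPORT="${EXPORT:-/export/test}"            # placeholder export path
MNT="${MNT:-/mnt/nfstest}"                  # placeholder local mount point
REPRO="${REPRO:-./reproducer.sh}"           # placeholder for the attached script
DRY_RUN="${DRY_RUN:-1}"                     # 1 = print commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1: mount the share. Step 2: point the reproducer at the mount point.
setup_and_run() {
    run mkdir -p "$MNT"
    run mount -t nfs "$SERVER:$EXPORT" "$MNT"
    run "$REPRO" "$MNT"
}

setup_and_run
```

Setting DRY_RUN=0 (as root) would actually perform the mount and start the script; both clients need to be running it against the same export at the same time.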
Created attachment 117759
Script to reproduce deadlock problem
A quick status... I am able to reproduce this and it appears I'm seeing the same thing Neil was seeing...
Created attachment 118104
Proposed Patch

Please give this patch a try. It stops an inode from being unhashed when an ESTALE is returned on a getattr. This in turn stops the sync from going into an infinite loop, which is what causes the machine to hang. I was able to run the above reproducer continuously for a 12-hour period without either RHEL3 client hanging.
TomK/JayT, has Q/A approved of a fix for this bug being taken into the final RHEL3 U6 kernel respin? If so, could we please get the Q/A management ack and the bug moved to the CanFix list? SteveD, should committing your fix be gated on successful testing by the bug reporter? (This bug is still in NEEDINFO_REPORTER.) Removing block against RHEL4 bug 166772.
yes
Created attachment 118308
Crash dump with the NFS hang patch applied
Created attachment 118346
Crash log on test kernel IT 75445
Neil, we're waiting for you to confirm that SteveD's patch in comment #7 resolves the problem at your customer's site. We need this answer by the end of the day today! Thanks.
Steve, Ernie, Sorry for the delay. I can confirm that the attached patch fixes the reported problem. Now I think you said we just need a QA ACK to move this along.
Thanks, Neil. TomK/JayT, could you please do the final honors (QA ack and list move)? Thanks.
A fix for this problem has just been committed to the RHEL3 U6 patch pool this afternoon (in kernel version 2.4.21-36.EL).
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-663.html