Description of problem:
Under heavy load NFS client runs into deadlock
Version-Release number of selected component (if applicable):
2.6.18 up to 2.6.23-rc1
How reproducible:
Run the attached scripts for a few minutes.
Steps to Reproduce:
1. Run write_back.sh six times concurrently
2. Also run read.sh
Actual results:
After a few minutes the scripts hang.
'ps -ef' hangs, too.
Expected results:
The scripts keep running indefinitely and 'ps -ef' doesn't hang.
Additional info:
It's a two-way machine. With mount options tcp,nolock the problem
doesn't occur (probably due to different timing).
The bug is also in the RHEL5.1 code base, but cannot be reproduced there
either (probably due to different timing, too).
I will attach both scripts and the patch that fixes the problem.
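For anyone trying to reproduce this, the two steps above could be driven by a small wrapper along the following lines. This is only a sketch: the function name start_load and its arguments are hypothetical, and it assumes the attached write_back.sh and read.sh are available locally (their contents are not reproduced here).

```shell
#!/bin/sh
# Hypothetical driver for the reproduction steps in this report.
# write_back.sh and read.sh are the attachments; this sketch only launches them.

# start_load WRITER READER [N_WRITERS]
# Starts N_WRITERS background copies of WRITER plus one copy of READER
# and prints the PID of each started job, one per line.
start_load() {
    writer=$1
    reader=$2
    n_writers=${3:-6}   # the report runs 6 writers concurrently

    i=0
    while [ "$i" -lt "$n_writers" ]; do
        sh "$writer" >/dev/null 2>&1 &
        echo "$!"
        i=$((i + 1))
    done

    sh "$reader" >/dev/null 2>&1 &
    echo "$!"
}
```

A possible invocation would be `start_load ./write_back.sh ./read.sh`, followed by checking from another terminal after a few minutes whether the jobs and 'ps -ef' still respond.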
Created attachment 160321
write_back.sh - puts load on the NFS write path
Created attachment 160322
read.sh - puts stress on the NFS read path
Created attachment 160323
this patch fixes the problem
patch fixes concurrency issue in put_nfs_open_context
Forgot to mention that the scripts put load on /tmp and on /var/, so you need
NFS root to reproduce.
I have the same problem. I am using Fedora 7 with kernel 22.214.171.124-65.fc7. I
modified the scripts above to read and write /home/user, as that is what is NFS
mounted on my system. The scripts hung after approximately five minutes of
running. Once hung, I could not "ls /home/user", as this process would also hang.
In normal use, spamassassin seems to cause a hang when accessing the file
I tried the patch in comment #3.
I ran Christian's scripts for one hour and thirty minutes and did not see a hang.
However, I then ran the scripts while spamassassin was processing approximately
1,000 emails. In this case, NFS access hung as described in the previous
comment after 52 minutes.
I will continue to experiment and will report what I find.
I also have this problem when using Fedora 8 Test 2.
I believe I'm experiencing this problem in normal usage. My F7 server works for
a few days, serving home directories and other data, but eventually the F7 and
F8Test clients start reporting that the server lockd is not responding. The only
fix I've found is to reboot the server. I did not have this problem when the
server was running FC5.
I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the Fedora kernel.
I am CC'ing myself to this bug and will try to assist you in resolving it if I can.
There hasn't been much activity on this bug for a while. Could you tell me if
you are still having problems with the latest kernel?
If the problem no longer exists, please close this bug, or I'll do so in a
few days if no additional information is lodged.
I have not had this problem in some time, but I am not the original reporter.
The patch in Comment #3 is in both f8 and f7 at this point, which is probably
the reason you are no longer seeing this problem.