Description of problem:

I have a few NFS servers (running 2.6.18-128.el5) under very high load. From time to time all nfsd threads hang, and I see the following messages (two variations) in /var/log/messages:

kernel: nfsd: peername failed (err 107)
kernel: nfsd: non-standard errno: -107

After that, clients can't access the mounts and I am not even able to stop nfsd properly (I have to use kill -9). I searched for this and see that others have the same problem, but could not find a proper official solution.

Server export file:

/MNT_ROOT 10.0.0.3(rw,no_root_squash,no_wdelay,no_subtree_check,sync,crossmnt,fsid=1)
/MNT_ROOT *(rw,root_squash,no_wdelay,no_subtree_check,sync,crossmnt,fsid=1)

Client mountpoint:

server:/MNT_ROOT on /server type nfs (rw,_netdev,rsize=8192,wsize=8192,soft,tcp,addr=10.0.0.1)

Version-Release number of selected component (if applicable):

nfs-utils-1.0.9-40.el5
nfs-utils-lib-1.0.8-7.2.z2

How reproducible:

High load with lots of small files??

Steps to Reproduce:
1.
2.
3.

Actual results:

All mounts on clients hang. Unable to stop nfsd on the server except with kill -9.

Expected results:

Additional info:
We are also encountering this issue when two clients with the same MAC address try to mount an NFS directory. Are you planning to fix this?
If you have two clients with the same MAC address, I'm amazed that anything works at all; that's a separate problem. My first thought was that this is the same problem fixed by commit c51e88efa9bf31e0f0bdf872c61a0e921a9faffb "sunrpc: fix peername failed on closed listener", but that fixes a regression that never existed in RHEL 5. It might be interesting to know where the nfsd threads are hanging, if they in fact are. Perhaps a sysrq-t trace would help? (echo "t" > /proc/sysrq-trigger, then attach the results, which are dumped to the kernel log.)
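Something like the following should work, run as root on the NFS server (the output file name below is just an example):

  # make sure the magic SysRq key is enabled
  echo 1 > /proc/sys/kernel/sysrq
  # dump the state and stack trace of every task to the kernel log
  echo t > /proc/sysrq-trigger
  # collect the dump; it also ends up in /var/log/messages via syslog
  dmesg > /tmp/sysrq-t.txt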
Hi Bruce,

Thanks for helping. One question: can I run it now, while everything works, or should I run it at the exact moment it hangs? I ran this on another server to test the command and noticed it hung that server for a minute or so... is it supposed to do that?

Thanks,
Eyal.
Run it once nfsd hangs. No, I wouldn't expect it to hang the server for a minute. How do you know the server was hung during that time?
Looks like the same problem I am seeing with kernel 2.6.32-279.22.1.el6.x86_64:

nfsd: peername failed (err 107)!

There are two blocked-task messages before the err 107 message:

INFO: task nfsd:25721 blocked for more than 120 seconds.
INFO: task nfsd:25745 blocked for more than 120 seconds.

and the trace starts with:

Call Trace:
 [<ffffffff81090dee>] ? prepare_to_wait_exclusive+0x4e/0x80
 [<ffffffffa01ee6e0>] cv_wait_common+0xa0/0x1a0 [spl]
 [<ffffffff81090be0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0284190>] ? avl_find+0x60/0xb0 [zavl]
 [<ffffffffa01ee813>] __cv_wait+0x13/0x20 [spl]
 .......

The clients have no access while this happens, but after about 10 minutes nfsd recovered on its own. Is there any fix for this issue?

Thanks, Eva
How many clients do you have, and how many nfsd threads are you running?
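If it helps, on RHEL the running thread count can be read from /proc and the configured count lives in /etc/sysconfig/nfs (the count of 32 below is just an example, not a recommendation):

  # number of nfsd threads currently running
  cat /proc/fs/nfsd/threads
  # the "th" line also shows the thread count plus busy-time histogram data
  grep ^th /proc/net/rpc/nfsd
  # thread count configured for service startup
  grep RPCNFSDCOUNT /etc/sysconfig/nfs
  # change the number of running threads on the fly
  rpc.nfsd 32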
I am also seeing this problem with kernel 2.6.32-279.el6.x86_64. The NFS server displays:

nfsd: peername failed (err 107)!
...

The clients then display these messages repeatedly:

server SERVERNAME not responding, still trying
server SERVERNAME OK

The NFS mount points then become unresponsive on all clients (we're in a blade configuration, with multiple clients pointing at the same NFS server). The scenario sounds similar: we are transferring numerous files from the NFS server to a client via "cp". The files vary in size from several GB down to a few KB. However, the cp commands are executed sequentially, so I wouldn't expect excessive parallelism.

I do have one oddity which may be unrelated to this thread: the copies always fail on the same file, a ~10 MB gzip-compressed tar file. Is anyone else seeing their failures occur on the same file? My take is "a file is a file," so I don't know why a compressed tar would be handled any differently, and permissions are fine. This may be unrelated and coincidental; I just figured I should add it, since I see no resolution currently.
No additional minor releases are planned for Production Phase 2 in Red Hat Enterprise Linux 5, and therefore Red Hat is closing this bugzilla as it does not meet the inclusion criteria as stated in: https://access.redhat.com/site/support/policy/updates/errata/#Production_2_Phase