Bug 725547

Summary:	NFS server hangs with kernel: nfsd: peername failed (err 107)
Product:	Red Hat Enterprise Linux 5	Reporter:	Eyal <shimony>
Component:	kernel	Assignee:	Red Hat Kernel Manager <kernel-mgr>
Status:	CLOSED WONTFIX	QA Contact:	Red Hat Kernel QE team <kernel-qe>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	5.3	CC:	Bert.Deknuydt, bfields, dhowells, diana.chinces, hocks, jlayton, jmcaninl, rwheeler, sprabhu, steved
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:
Clones:	908876		Environment:
Last Closed:	2013-10-15 19:25:58 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	908876

Description Eyal 2011-07-25 20:03:37 UTC

Description of problem:
I have a few NFS servers (running 2.6.18-128.el5) under very high load. From time to
time, nfsd hangs completely, and I see the following messages (two variations) in
/var/log/messages:
kernel: nfsd: peername failed (err 107)
and
kernel: nfsd: non-standard errno: -107
After that, clients can't access the mounts, and I am not even able to stop nfsd
properly (I need to use kill -9).
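
For reference, errno 107 on Linux is ENOTCONN ("Transport endpoint is not connected"), so the message suggests that a getpeername-style call was made on a socket that was no longer connected. The following is a minimal user-space sketch of the same error code, not the kernel code path itself; the function name peername_errno is made up for illustration.

```python
import errno
import os
import socket


def peername_errno() -> int:
    """Call getpeername() on a TCP socket that was never connected.

    Returns the errno raised by the call, or 0 if it unexpectedly succeeds.
    (Hypothetical helper, for illustration only.)
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.getpeername()
        except OSError as exc:
            return exc.errno
    return 0


if __name__ == "__main__":
    err = peername_errno()
    # On Linux this is expected to report 107 / ENOTCONN.
    print(err, errno.errorcode.get(err), os.strerror(err))
```

This only illustrates what the error number means; it does not explain why the connection went away on the server.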

I searched Google for it and see that others have this problem, but I could not find a
proper official solution.

Server export file:
/MNT_ROOT    10.0.0.3(rw,no_root_squash,no_wdelay,no_subtree_check,sync,crossmnt,fsid=1)
/MNT_ROOT     *(rw,root_squash,no_wdelay,no_subtree_check,sync,crossmnt,fsid=1)

Client mountpoint:
server:/MNT_ROOT on /server type nfs (rw,_netdev,rsize=8192,wsize=8192,soft,tcp,addr=10.0.0.1)


Version-Release number of selected component (if applicable):
nfs-utils-1.0.9-40.el5
nfs-utils-lib-1.0.8-7.2.z2

How reproducible:
High load with lots of small files??

Steps to Reproduce:
1.
2.
3.
  
Actual results:
All mounts on clients hang.
Unable to stop nfsd on the server except with kill -9.

Expected results:


Additional info:

Comment 1 Diana Chinces 2011-08-10 10:48:20 UTC

We are also encountering this issue when two clients with the same MAC address try to mount an NFS directory. Are you planning to fix this?

Comment 2 J. Bruce Fields 2011-08-16 18:43:19 UTC

If you have two clients with the same MAC address, I'm amazed that anything works at all; that's a separate problem.

My first thought was that this was the same problem fixed by c51e88efa9bf31e0f0bdf872c61a0e921a9faffb "sunrpc: fix peername failed on closed listener", but that commit fixes a regression that never existed in rhel5.

It might be interesting to know where the nfsd threads are hanging, if they in fact are.  Perhaps a sysrq-t trace would help?  (echo "t" >/proc/sysrq-trigger, then attach the results which are dumped to the log).

Comment 3 Eyal 2011-08-16 20:07:38 UTC

Hi Bruce,

Thanks for helping.
One question: can I run it now, while things are working, or should I run it only at the moment it hangs?
I ran this on another server to test the command and noticed that it hung the server for a minute or so... is it supposed to do that?

Thanks,
Eyal.

Comment 4 J. Bruce Fields 2011-08-16 20:17:16 UTC

Run that after nfsd starts hanging, while the threads are still stuck.

No, I wouldn't expect that to hang the server for a minute.  How do you know the server was hung during that time?

Comment 5 Eva Hocks 2013-07-09 18:08:16 UTC

Looks like the same problem I am seeing with kernel 2.6.32-279.22.1.el6.x86_64

nfsd: peername failed (err 107)!

There are 2 blocked task messages before the err 107 message
INFO: task nfsd:25721 blocked for more than 120 seconds.
INFO: task nfsd:25745 blocked for more than 120 seconds.

and the trace starts with:  
Call Trace:
 [<ffffffff81090dee>] ? prepare_to_wait_exclusive+0x4e/0x80
 [<ffffffffa01ee6e0>] cv_wait_common+0xa0/0x1a0 [spl]
 [<ffffffff81090be0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0284190>] ? avl_find+0x60/0xb0 [zavl]
 [<ffffffffa01ee813>] __cv_wait+0x13/0x20 [spl]
.......
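
When comparing reports like this one, it can help to pull just the relevant kernel messages out of a log. The following is a rough sketch that counts "peername failed" messages per error number and lists the tasks reported as blocked. The patterns are based only on the message formats quoted in this report, and the script and function names are hypothetical.

```python
import re
import sys
from collections import Counter

# Patterns based on the kernel messages quoted in this report.
PEERNAME_RE = re.compile(r"nfsd: peername failed \(err (\d+)\)")
BLOCKED_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")


def summarize(lines):
    """Count peername failures per errno and collect blocked-task reports.

    Returns (Counter mapping errno -> count, list of (task, pid, seconds)).
    """
    peername_errors = Counter()
    blocked_tasks = []
    for line in lines:
        match = PEERNAME_RE.search(line)
        if match:
            peername_errors[int(match.group(1))] += 1
            continue
        match = BLOCKED_RE.search(line)
        if match:
            blocked_tasks.append(
                (match.group(1), int(match.group(2)), int(match.group(3)))
            )
    return peername_errors, blocked_tasks


if __name__ == "__main__":
    # Hypothetical usage: python scan_nfsd_log.py /var/log/messages
    with open(sys.argv[1], errors="replace") as handle:
        errors, blocked = summarize(handle)
    print(dict(errors))
    for task, pid, seconds in blocked:
        print(task, pid, seconds)
```

The output only shows how often and for which tasks the messages appear; the call traces themselves still have to be read by hand.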


The clients have no access during this error, but after 10 minutes nfsd recovered on its own.

Any fix for this issue? 
Thanks, Eva

Comment 6 J. Bruce Fields 2013-07-20 20:48:55 UTC

How many clients do you have, and how many nfsd threads are you running?
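
One way to answer the thread-count question is to read the "th" line of /proc/net/rpc/nfsd on the server. The following is a minimal sketch. It assumes the older layout in which the first number is the configured thread count and the second is how many times all threads were busy at once; on newer kernels the second value may always be zero. The function name parse_th_line is made up for illustration.

```python
from pathlib import Path


def parse_th_line(stats_text):
    """Return (threads, times_all_busy) from the 'th' line, or None if absent.

    Assumes the layout "th <threads> <fullcnt> ..." (an assumption based on
    older kernels; later fields are ignored).
    """
    for line in stats_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "th":
            return int(fields[1]), int(fields[2])
    return None


if __name__ == "__main__":
    # Requires an NFS server with the nfsd statistics file present.
    text = Path("/proc/net/rpc/nfsd").read_text()
    print(parse_th_line(text))
```

A second value that keeps growing under load would hint that the server is running short of nfsd threads, although that alone would not confirm the cause of the hang.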

Comment 7 John Smith 2013-08-28 14:35:36 UTC

I am also seeing this problem with kernel 2.6.32-279.el6.x86_64.

The NFS server displays:

   nfsd: peername failed (err 107)!
   ...

The client machine then displays these messages repeatedly:

   server SERVERNAME not responding, still trying
   server SERVERNAME OK

The NFS mount points then become unresponsive on all clients (we're in a blade configuration, multiple clients pointing to same NFS server).

The scenario sounds similar. We are transferring numerous files from the NFS server to a client via "cp". The files vary in size, from several GB to a few KB. However, the "cp" commands are executed sequentially, so I wouldn't expect excessive parallelism.

I do have one oddity which may be unrelated to this thread. The copies are always failing on the same file, which is an ~10 MB gzip-compressed TAR file. Is anyone else noticing their failures occurring on the same file?

My take is, "a file is a file," so I don't know why a compressed TAR would be handled any differently. Permissions are fine. Like I said, this fact might be unrelated and coincidental. I just figured I should add it, since I see no resolution currently.

Comment 9 Andrius Benokraitis 2013-10-15 19:25:58 UTC

No additional minor releases are planned for Production Phase 2 in Red Hat Enterprise Linux 5, and therefore Red Hat is closing this bugzilla as it does not meet the inclusion criteria as stated in:
https://access.redhat.com/site/support/policy/updates/errata/#Production_2_Phase

Comment 10 Red Hat Bugzilla 2023-09-14 01:24:29 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days