Bug 597356
| Summary: | nfsd will not start reliably, reports setting version failed: errno 16 | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Mike Gahagan <mgahagan> |
| Component: | rpcbind | Assignee: | Jeff Layton <jlayton> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Martin Cermak <mcermak> |
| Severity: | high | Priority: | high |
| Version: | 6.0 | CC: | chuck.lever, jburke, jlayton, mcermak, notting, rwheeler, steved, syeghiay |
| Target Milestone: | beta | Keywords: | Regression, TestBlocker |
| Target Release: | 6.0 | Hardware: | All |
| OS: | Linux | Doc Type: | Bug Fix |
| Fixed In Version: | rpcbind-0.2.0-5.el6 | Cloned to: | 611875 (view as bug list) |
| Last Closed: | 2010-07-02 19:49:36 UTC | Bug Blocks: | 611875 |
I'm also seeing this failure with nfs-utils-1.2.2-2.

This is probably a kernel problem, but I'll take a look and verify that before I change the component.

I've reproduced this with 2.6.32-31.el6.x86_64 and nfs-utils-1.2.2-3.el6.x86_64. An example (that reflects what nfsd is doing):

```
[root@apone ~]# cat /proc/fs/nfsd/versions
+2 +3 +4 -4.1
[root@apone ~]# echo -e "+2 +3 +4 \n" > /proc/fs/nfsd/versions
-bash: echo: write error: Device or resource busy
```

Sounds like the nfsd kernel threads are still running there. Is that the case on your box, Bill?

They appeared to be for a while, then it got into a state where nfs started while /proc/fs/nfsd/ was completely empty (with the module still loaded, and even rpc.mountd running). The -32 kernel doesn't seem to have helped this.

The problem seems to be something with rpcbind registration and unregistration. It looks almost more like a problem with rpcbind itself. What I see is that sometimes it gets into a state where the IPv4 TCP socket gets borked:

```
rpcbind  1489  rpc   6u  IPv4  12854               0t0  UDP *:sunrpc
rpcbind  1489  rpc   7u  IPv4  12858               0t0  UDP *:816
rpcbind  1489  rpc   8u  sock  0,6                 0t0  12859 can't identify protocol
rpcbind  1489  rpc   9u  unix  0xffff88003c2340d0  0t0  12861 socket
rpcbind  1489  rpc  10u  IPv6  12862               0t0  UDP *:sunrpc
rpcbind  1489  rpc  11u  IPv6  12864               0t0  UDP *:816
rpcbind  1489  rpc  12u  IPv6  12865               0t0  TCP *:sunrpc (LISTEN)
```

...in the above, fd 8 should be an IPv4 TCP listening socket, but it's fubar here. Looks like the problem is related to rpcbind reusing the socket when it's restarted quickly:

```
Jun  3 09:04:07 dhcp231-77 rpcbind: rpcbind terminating on signal. Restart with "rpcbind -w"
Jun  3 09:04:07 dhcp231-77 rpcbind: cannot bind tcp: Address already in use
```

Ok, I think I understand the problem. After starting the rpcbind and nfs services, the kernel keeps its socket to rpcbind open:

```
[root@dhcp231-77 ~]# netstat -an | grep 111
tcp        0      0 127.0.0.1:111    127.0.0.1:824    ESTABLISHED
```

Then, when we go to restart rpcbind, it's unable to bind to the socket:

```
Jun  3 09:17:58 dhcp231-77 rpcbind: cannot bind tcp: Address already in use
```

There may be some other bugs involved here too with error handling when the kernel can't talk to rpcbind, but this is the main one, I think. What's not clear to me is whether this has changed recently. Mike, is this a recent regression? Did RHEL6 handle this differently before?

Ok, I can reproduce this on rawhide too, so this is really an upstream problem. We can probably fix it by setting SO_REUSEADDR on the socket, but we need to discuss that upstream first.

I think we should drop the beta blocker status on this as well. The problem here is pretty synthetic -- people don't typically restart rpcbind at all. We can continue to work on this, but I don't see any need to hold up the beta for it.

Note that there may also be some problems with error handling in the kernel when rpcbind isn't responding. I think what happens is that the first attempt to write to the versions file gets back ECONNREFUSED, and then later we end up getting back EBUSY due to the refcount not getting cleaned up correctly. That's something we should also fix, but I think the first order of business is to allow rpcbind to restart more reliably.

Created attachment 419410 [details]
proposed patch
Proposed patch to set SO_REUSEADDR on listening sockets. If the kernel has a socket open to the daemon at the time it's killed, that connection will go into TIME_WAIT. If you then restart the daemon, it won't be able to bind to the port unless this option is set.
Mike, could you test this patched package and let me know whether it fixes the problem for you?

To explain this a little further... I don't think this is a regression from Beta1, as Beta1 also would have used TCP for rpcbind registrations. It was probably a regression from the Alpha, since the Alpha used UDP for rpcbind registration. It may just not have been noticed during Beta1.

There is some timing involved here -- if you wait long enough after starting nfsd (a minute or two), the kernel will close its socket to rpcbind and you can then restart rpcbind without any problem.

We may also have some lingering issues with error handling in the kernel when there is a problem doing rpcbind registration, but that's really a separate issue and definitely not something we should hold up the beta for.

2.6.34 (or .35-rc) has some changes to how RPC/TCP reconnection works. Before diving into kernel fixes, you might want to try the latest kernel to see if the kernel end of this problem was addressed or made worse.

The kernel I'm playing with here is pretty close to what's upstream for 2.6.35. I've also tested this on a 2.6.34 kernel on my rawhide guest, and the problem is there too.

Still, I think setting SO_REUSEADDR for rpcbind is probably reasonable. We generally don't want existing sockets to block the creation of a new listener (unless there's already a listener set up on the port, of course). In this case, the kernel was blocking rpcbind from opening a new listener, but that could just as easily have been a userspace process or even a connection from a different host.

I've tested the patch referenced in comment 16 and it appears to resolve this issue. Confirmed by restarting rpcbind/nfs multiple times and running the test case which reproduced the problem originally. => VERIFIED

Looks like the patch is in the upstream rpcbind repo now. The kernel problems with nfsd's error handling are being tracked in bug 599675.
Red Hat Enterprise Linux Beta 2 is now available and should resolve the problem described in this bug report. This report is therefore being closed with a resolution of CURRENTRELEASE. You may reopen this bug report if the solution does not work for you.
Description of problem:
It appears that in the latest trees, the NFS server has issues with starting reliably.

```
[root@ibm-z10-14 20060822]# /etc/init.d/rpcbind stop
Stopping rpcbind:                    [  OK  ]
[root@ibm-z10-14 20060822]# /etc/init.d/rpcbind restart
Stopping rpcbind:                    [FAILED]
Starting rpcbind:                    [  OK  ]
[root@ibm-z10-14 20060822]# /etc/init.d/nfs restart
Shutting down NFS mountd:            [FAILED]
Shutting down NFS daemon:            [FAILED]
Shutting down NFS quotas:            [FAILED]
Starting NFS services:               [  OK  ]
Starting NFS quotas:                 [  OK  ]
Starting NFS daemon: rpc.nfsd: Setting version failed: errno 16 (Device or resource busy)
                                     [  OK  ]
Starting NFS mountd:                 [  OK  ]
[root@ibm-z10-14 20060822]# /etc/init.d/nfs restart
Shutting down NFS mountd:            [  OK  ]
Shutting down NFS daemon:            [  OK  ]
Shutting down NFS quotas:            [  OK  ]
Starting NFS services:               [  OK  ]
Starting NFS quotas:                 [  OK  ]
Starting NFS daemon:                 [  OK  ]
Starting NFS mountd:                 [  OK  ]
```

Sometimes I'm getting this error instead:

```
[root@localhost ~]# /etc/init.d/nfs stop
Shutting down NFS mountd:            [  OK  ]
Shutting down NFS daemon:            [  OK  ]
[root@localhost ~]# /etc/init.d/rpcbind stop
Stopping rpcbind:                    [  OK  ]
[root@localhost ~]# /etc/init.d/rpcbind start
Starting rpcbind:                    [  OK  ]
[root@localhost ~]# /etc/init.d/nfs start
Starting NFS services:               [  OK  ]
Starting NFS daemon: rpc.nfsd: writing fd to kernel failed: errno 111 (Connection refused)
rpc.nfsd: unable to set any sockets for nfsd
                                     [FAILED]
```

Version-Release number of selected component (if applicable):

```
[root@ibm-z10-14 20060822]# rpm -q nfs-utils rpcbind kernel
nfs-utils-1.2.2-3.el6.s390x
rpcbind-0.2.0-4.1.el6.s390x
kernel-2.6.32-28.el6.s390x
kernel-2.6.32-30.el6.s390x   <- running kernel
```

I have reproduced this on s390x and x86_64. I also see the same behavior with the -25 kernel.
How reproducible: frequent

Steps to Reproduce:
1. Start rpcbind and nfs. It will usually fail on the first attempt and usually start on the second. The behavior seems to change somewhat when doing "restart" rather than "start", so our init script may no longer be initializing something correctly.

Actual results: NFS fails to start sometimes, or reports errors on startup.

Expected results: NFS starts without errors.

Additional info: This is blocking execution of at least 1 KernelTier2 test (/kernel/filesystems/ltpfstress)