From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7

Description of problem:
After we receive the first of these messages from our NFS clients (all RHEL4 U4):

Oct 16 13:43:01 n110 kernel: RPC: error 512 connecting to server rtcfiler
Oct 16 13:43:01 n40 kernel: RPC: error 512 connecting to server rtcfiler
Oct 16 13:43:01 n98 kernel: RPC: error 512 connecting to server rtcfiler
Oct 16 13:43:01 n22 kernel: RPC: error 512 connecting to server rtcfiler
Oct 16 13:43:01 n30 kernel: RPC: error 512 connecting to server rtcfiler

the NFS server starts misbehaving; something as trivial as a re-export of NFS shares doesn't even work:

[root@rtcfiler /]# cat /etc/exports
/users     192.168.0.0/255.255.255.0(rw,async) 192.168.0.250(rw,async,no_root_squash)
/apps      192.168.0.0/255.255.255.0(async) 192.168.0.250(rw,async,no_root_squash) 192.168.0.240(rw,async,no_root_squash)
/projects  192.168.0.0/255.255.255.0(rw,async,fsid=20) 192.168.0.250(rw,async,no_root_squash,fsid=20)
/backups   rcsg.rice.edu(rw,async,no_root_squash,fsid=40) 128.42.6.40(rw,async,fsid=40) neptuno.rice.edu(rw,async,fsid=40)

(I added my workstation neptuno to the /backups share ACL for this example.)

[root@rtcfiler /]# service nfs reload
[root@rtcfiler /]# showmount -e localhost
Export list for localhost:
/apps     192.168.0.0/255.255.255.0,management.rtc
/users    192.168.0.0/255.255.255.0
/backups  hera.cs.rice.edu,rcsg.rice.edu
/projects 192.168.0.0/255.255.255.0

On the server, /etc/sysconfig/nfs looks like:

[root@rtcfiler /]# cat /etc/sysconfig/nfs
RPCNFSDCOUNT=32
RPCNFSDARGS="--no-nfs-version 4"
MOUNTD_NFS_V3=default

Server NFS stats:

Server packet stats:
packets    udp       tcp        tcpconn
308225124  50886640  257312455  86472

Server rpc stats:
calls      badcalls  badauth  badclnt  xdrcall
308202850  0         0        0        0

Server reply cache:
hits   misses     nocache
53970  135127025  173021908

Server file handle cache:
lookup  anon  ncachedir  ncachedir  stale
0       0     0          0          111448

Server nfs v2:
null      getattr   setattr   root      lookup    readlink
0      0% 0      0% 0      0% 0      0% 0      0% 0      0%
read      wrcache   write     create    remove    rename
0      0% 0      0% 0      0% 0      0% 0      0% 0      0%
link      symlink   mkdir     rmdir     readdir   fsstat
0      0% 0      0% 0      0% 0      0% 0      0% 0      0%

Server nfs v3:
null         getattr       setattr      lookup        access        readlink
1397      0% 99325945  32% 5573248   1% 28216968   9% 17945580   5% 51162      0%
read         write         create       mkdir         symlink       mknod
9045297   2% 112054739 36% 9917846   3% 298554     0% 97662      0% 13         0%
remove       rmdir         rename       link          readdir       readdirplus
6377358   2% 67129      0% 420595    0% 373811     0% 7628       0% 861807     0%
fsstat       fsinfo        pathconf     commit
29184     0% 1718       0% 0         0% 17531392   5%

These are the client mount options:

rtcfiler:/users on /users type nfs (rw,bg,noacl,hard,intr,rsize=32768,wsize=32768,timeo=50,retrans=10,addr=192.168.0.247)
rtcfiler:/projects on /projects type nfs (rw,bg,noacl,hard,intr,rsize=32768,wsize=32768,timeo=50,retrans=10,addr=192.168.0.247)
rtcfiler:/apps on /opt/apps type nfs (rw,bg,noacl,hard,intr,rsize=32768,wsize=32768,timeo=50,retrans=10,addr=192.168.0.247)

Client NFS stats:

[root@n97 ~]# nfsstat -c
Client rpc stats:
calls    retrans  authrefrsh
528922   1        0

Client nfs v3:
null        getattr     setattr     lookup      access      readlink
0        0% 52631    9% 1760     0% 7908     1% 22234    4% 76       0%
read        write       create      mkdir       symlink     mknod
15390    2% 367759  69% 2347     0% 3        0% 0        0% 0        0%
remove      rmdir       rename      link        readdir     readdirplus
51       0% 3        0% 0        0% 0        0% 0        0% 52       0%
fsstat      fsinfo      pathconf    commit
18       0% 4        0% 0        0% 58686   11%

The ethernet switch used in this cluster is healthy; we ran several mtr sessions from different clients to the server for days without packet loss.

There is a strange pattern in when these errors are triggered: it seems to happen whenever the clients close() a file. This is the main fileserver for a Beowulf cluster, and we've had a lot of failures along with these error messages, always at the end of jobs, which is when the job output gets transferred and file handles get released.
Thanks in advance.

Version-Release number of selected component (if applicable):
nfs-utils-1.0.6-70.EL4

How reproducible:
Always

Steps to Reproduce:
1. Install and use NFS on a homogeneous RHEL4 U4 IA64 platform
2. Wait until the first client reports an RPC "error 512 connecting to server"
3. Modify /etc/exports and re-export it

Actual Results:
I couldn't re-export NFS shares without restarting the entire service. Other apps/services using NFS needed to be restarted (PBS mom in our case).

Expected Results:
The filesystems should have been re-exported, reflecting the changes made to /etc/exports. Services on the compute nodes shouldn't need to be restarted when this happens.

Additional info:
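For anyone trying to reproduce this, the check boils down to comparing what the kernel is actually exporting against /etc/exports after a reload. A minimal sketch (assuming root on the server; the command names are the standard RHEL4 ones, run against a live NFS service):

```shell
# 1. Edit /etc/exports to add or change an entry, then ask the init
#    script to pick up the change without a full restart:
service nfs reload

# 2. Compare the advertised and in-kernel export lists against the file:
showmount -e localhost        # what mountd advertises to clients
cat /proc/fs/nfsd/exports     # what the kernel export table contains

# If the changed entry shows up in /etc/exports but in neither of the
# two outputs above, the reload was silently ignored -- the misbehaviour
# described in this report.
```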
Would it be possible to post a bzip2'd binary ethereal network trace between the server and client? Something similar to:

tethereal -w /tmp/data.pcap host <server>
bzip2 /tmp/data.pcap
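If the full trace gets too large on a busy server, a narrower capture might still be useful. A sketch (hedged: the port-2049 capture filter assumes NFS on the standard port and would miss the portmap/mountd side traffic, so use the unfiltered form above if in doubt):

```shell
# Capture only NFS (port 2049) traffic to/from the server, then compress.
# <server> is a placeholder for the NFS server's hostname, as above.
tethereal -w /tmp/data.pcap -f "host <server> and port 2049"
bzip2 /tmp/data.pcap
```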
Since there doesn't seem to be any activity on this bug, I will add that we are seeing a similar problem. Our server is running RHEL4 i386, kernel 2.6.9-42.0.2.ELsmp, and the version of nfs is nfs-utils-1.0.6-70.EL4. We are reasonably up to date on patches.

Specifically, when I make a change to /etc/exports on the server, the changes do not appear to take effect when doing a "service nfs reload" like they have in the past. However, a "service nfs restart" did seem to work. I have not seen any errors in the messages files on either the client or server.

Thomas Walker
Thomas,

Would it be possible to get a tethereal network trace as described in Comment #1?
Created attachment 143062 [details]
output from tethereal

Output of:

redhat1.stsci.edu> tethereal -w /tmp/data.pcap host redhat-srvr2.stsci.edu
redhat1.stsci.edu> bzip2 /tmp/data.pcap
Just to be clear, you're getting the same RPC errors as described in the first bug comment?

Also, does doing an 'exportfs -arv' instead of a 'service nfs reload' make any difference?
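For reference, this is the sequence being suggested (a sketch, assuming root on the NFS server; the option meanings are from exportfs(8)):

```shell
# Re-export everything directly, bypassing the init script:
#   -a  export (or unexport) all directories listed in /etc/exports
#   -r  re-export, synchronizing /var/lib/nfs/etab with /etc/exports
#   -v  verbose, so each export (or failure) is printed as it happens
exportfs -arv
```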
> Just to be clear, you're getting the same RPC errors as described in
> the first bug comment?
>
> Also, does doing an 'exportfs -arv' instead of a 'service nfs reload' make any
> difference?

I do not see any RPC errors on either the client or server. My symptom is that changes to /etc/exports are not being picked up by a "service nfs reload" like they used to be.

I would rather not do an "exportfs -arv" at this time, since this is a production server and I can't risk causing a problem. I can try that command at a quieter time of day, but it won't be till next week.

Thomas Walker
This defect is describing a problem I'm seeing now on a production box.

Kernel: Linux george 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:54:53 EST 2006 i686 athlon i386 GNU/Linux
nfs-utils-1.0.6-65.EL4

The interesting part is that I don't see /proc/fs/nfsd/exports get updated after attempting to export a new filesystem within the realm of an already-running NFS server. I haven't spotted the RPC errors, but I have 400-500 clients and cannot check them all.

If you have suggestions on how to manually manipulate the exports via /proc without using exportfs, I'm willing to give that a try if it helps.

-mike

[root@george nfs]# cat /proc/fs/nfsd/exports
# Version 1.1
# Path Client(Flags) # IPs
/emc100	*(rw,no_root_squash,sync,wdelay)
/viewstore	*(rw,no_root_squash,sync,wdelay)

[root@george nfs]# cat /etc/exports
/viewstore *(rw,no_root_squash)
/emc100 *(rw,no_root_squash)
/emc102 *(rw,no_root_squash)

[root@george nfs]# exportfs -a
exportfs: /etc/exports [1]: No 'sync' or 'async' option specified for export "*:/viewstore".
  Assuming default behaviour ('sync').
  NOTE: this default has changed from previous versions
exportfs: /etc/exports [2]: No 'sync' or 'async' option specified for export "*:/emc100".
  Assuming default behaviour ('sync').
  NOTE: this default has changed from previous versions
exportfs: /etc/exports [3]: No 'sync' or 'async' option specified for export "*:/emc102".
  Assuming default behaviour ('sync').
  NOTE: this default has changed from previous versions

[root@george nfs]# cat /proc/fs/nfsd/exports
# Version 1.1
# Path Client(Flags) # IPs
/emc100	*(rw,no_root_squash,sync,wdelay)
/viewstore	*(rw,no_root_squash,sync,wdelay)

---

And looking from a Solaris client (fewer issues with new NFS shares on a pre-existing server):

> ls /net/george
emc100  viewstore
> ls /net/george/emc102
/net/george/emc102: No such file or directory
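On poking the exports by hand: /proc/fs/nfsd/exports is read-only, but on a 2.6.9 kernel the export-related caches can be flushed through the /proc/net/rpc interface, which is roughly what exportfs -f does in later nfs-utils. A sketch (hedged: the exact cache names can vary between kernel versions, this requires root, and it only flushes stale entries so mountd re-resolves them -- it does not add a new export by itself):

```shell
# Flush the nfsd export-related caches by writing the current time
# (seconds since the epoch) to each cache's flush file; entries older
# than that time are discarded and re-looked-up via mountd on next use.
now=$(date +%s)
for cache in auth.unix.ip nfsd.export nfsd.fh; do
    echo "$now" > /proc/net/rpc/$cache/flush
done
```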
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release you requested us to review is now End of Life; see https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to reconsider your feature request for an active release, please re-open the request via the appropriate support channels and provide additional supporting details about the importance of this issue.