Description of problem:
Actions such as a Subversion checkout that create lots of small files fail in NFS4-mounted directories if the NFS4 server is an OpenSolaris 10 host with the nbmand=on flag set. See http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg28412.html
The nbmand=on option must be set on CIFS-exported home folders on the Solaris side.
RedHat 6 beta does not have this problem with the NFS4 share.

Version-Release number of selected component (if applicable):
RHES 5.5
Kernel 2.6.18-194.3.1
nfs-utils-lib-1.0.8-7.6.el5
nfs-utils-1.0.9-44.el5

How reproducible:
# mount -t nfs4 -o rw,port=2049 solaris-homeserver:/home/kai /mnt
(also tested with sync: # mount -t nfs4 -o sync,rw,port=2049 solaris-homeserver:/home/kai /mnt)
# cd /mnt
# svn co svn+ssh://svn/repos
...
A repos/filesystem/RetryingPathRemoverTest.java
A repos/filesystem/remote
A repos/filesystem/remote/RemoteStoreCopyActivitySensorTest.java
...
svn: In directory 'repos/filesystem/remote'
svn: Can't move 'repos/filesystem/remote/.svn/tmp/RemoteStoreCopyActivitySensorTest.java.tmp.tmp' to 'repos/filesystem/remote/.svn/tmp/RemoteStoreCopyActivitySensorTest.java.tmp': Input/output error
#

Steps to Reproduce:
1. Set up opensolaris 5.11 svn_133 with a zfs share exported as cifs and nfs4 (with the nbmand=on option)
2. Mount that share
3. Work on it

Actual results:
The NFS4 client fails with an error when NFS4ERR_FILE_OPEN is returned.

Expected results:
The NFS4 client should be able to handle the problem (cite: "Retry a few times before we give up: the error is usually due to ordering issues with asynchronous RPC calls.")

Additional info:
snoop (tcpdump) on the server side:
...skipping...
83833 0.00405 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK SAVEFH NFS4_OK OPEN NFS4_OK ST=1AE7:1 RF=PL DT=N GETFH NFS4_OK FH=A5BB GETATTR NFS4...
83834 0.00011 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A5BB SETATTR ST=1AE7:1 GETATTR 10011a 30a23a
83835 0.00155 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK SETATTR 0 400002 GETATTR NFS4_OK
83836 0.00014 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A5BB CLOSE SQ=5 OST=1AE7:1 GETATTR 10011a 30a23a
83837 0.00197 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK CLOSE OST=1AE7:2 GETATTR NFS4_OK
83838 0.00011 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A752 SAVEFH OPEN RemoteStoreCopyActivitySensorTest.java.svn-base OT=NC SQ=6 CT=N AC=R DN...
83839 0.00021 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK SAVEFH NFS4_OK OPEN NFS4_OK ST=1907:1 RF=PL DT=R DST=1512:0 GETFH NFS4_OK FH=A44F G...
83840 0.00019 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A720 SAVEFH OPEN RemoteStoreCopyActivitySensorTest.java.tmp.tmp OT=CR(E) SQ=7 CT=N AC=RW...
83841 0.00107 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK SAVEFH NFS4_OK OPEN NFS4_OK ST=1937:1 RF=PL DT=N GETFH NFS4_OK FH=A5D0 GETATTR NFS4...
83842 0.00010 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A5D0 SETATTR ST=1937:1 GETATTR 10011a 30a23a
83843 0.00116 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK SETATTR 0 400002 GETATTR NFS4_OK
83844 0.00034 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A5D0 WRITE ST=1937:1 at 0 for 4096 <Fragmented RPC>
83850 0.00173 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 4096 (FSYNC) GETATTR NFS4_OK
83851 0.00014 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A44F CLOSE SQ=8 OST=1907:1 GETATTR 10011a 30a23a
83852 0.00006 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A5D0 WRITE ST=1937:1 at 4096 for 2418 <Fragmented RPC>
83855 0.00010 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK CLOSE OST=1907:2 GETATTR NFS4_OK
83856 0.00105 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 2418 (FSYNC) GETATTR NFS4_OK
83858 0.00004 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A5D0 CLOSE SQ=9 OST=1937:1 GETATTR 10011a 30a23a
### The actual error
83859 0.00013 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A720 SAVEFH PUTFH FH=A720 RENAME RemoteStoreCopyActivitySensorTest.java.tmp.tmp to Remot...
83861 0.00024 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4ERR_FILE_OPEN PUTFH NFS4_OK SAVEFH NFS4_OK PUTFH NFS4_OK RENAME NFS4ERR_FILE_OPEN
83862 0.00009 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK CLOSE OST=1937:2 GETATTR NFS4_OK
83864 2.05876 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A5D0 GETATTR 10011a 30a23a
83865 0.00013 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK GETATTR NFS4_OK
83866 0.00011 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A720 REMOVE RemoteStoreCopyActivitySensorTest.java.tmp.tmp GETATTR 10011a 30a23a
83867 0.00140 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK REMOVE NFS4_OK GETATTR NFS4_OK
83868 0.00018 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A720 REMOVE RemoteStoreCopyActivitySensorTest.java.tmp GETATTR 10011a 30a23a
83869 0.00137 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK REMOVE NFS4_OK GETATTR NFS4_OK
83870 0.00014 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A465 CLOSE SQ=10 OST=1AB7:2 GETATTR 10011a 30a23a
83871 0.00010 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK CLOSE OST=1AB7:3 GETATTR NFS4_OK
Since we depend heavily on the Sun NFS4 service, I looked for patches in later kernels myself and found the relevant patch in kernel 2.6.19 ... See here: https://kerneltrap.org/mailarchive/git-commits-head/2009/12/14/19033 Since this patch does not work with the current 2.6.18-194.3.1.el5 kernel, I added the needed lines and rebuilt the kernel, and the error is now gone. Please add this to future RHES5 kernels!
Created attachment 428928: backport of the NFS4ERR_FILE_OPEN handling in Linux/NFS patch of Kernel 2.6.19
Thanks for the patch. I'm not sure that the part in nfs4xdr.c is really necessary to fix this, but it seems like a better mapping than -EIO. I'll plan to add this patch to my test kernels in the near future.
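To make the mapping question concrete, here is a minimal user-space sketch of translating an NFSv4 status code into an errno value. The status numbers are the ones defined in RFC 3530; the function name `sketch_nfs4_stat_to_errno` is hypothetical, and both the -EBUSY target and the -EIO fallback are assumptions based on the comments in this report, not a copy of what nfs4xdr.c does.

```c
#include <errno.h>

/* NFSv4 status codes as defined in RFC 3530. */
enum {
    NFS4_OK                = 0,
    NFS4ERR_STALE_CLIENTID = 10022,
    NFS4ERR_STALE_STATEID  = 10023,
    NFS4ERR_FILE_OPEN      = 10046
};

/*
 * Hypothetical helper (not the kernel's function): map a server status
 * to a negative errno that is handed back to the caller.
 */
static int sketch_nfs4_stat_to_errno(int stat)
{
    switch (stat) {
    case NFS4_OK:
        return 0;
    case NFS4ERR_FILE_OPEN:
        /* The file is still open on the server; "busy" describes that
         * better than a generic I/O error. */
        return -EBUSY;
    default:
        /* Assumed fallback for codes without an explicit mapping. */
        return -EIO;
    }
}
```

With a mapping like this, the failing svn rename would surface as "Device or resource busy" rather than "Input/output error", which at least points at the real cause.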
Actually, now that I look more closely... this patch is broken:

         case -NFS4ERR_STALE_CLIENTID:
         case -NFS4ERR_STALE_STATEID:
+        case -NFS4ERR_FILE_OPEN:
+                if (exception->timeout > HZ) {
+                        /* We have retried a decent amount, time to fail */
+                        ret = -EBUSY;
+                        break;
+                }

Because you've put this in after NFS4ERR_STALE_CLIENTID and NFS4ERR_STALE_STATEID, you're making the kernel handle those errors the same way. I don't think that's what we want here.
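The effect of that label placement can be reproduced outside the kernel. The sketch below is a user-space stand-in, not the real nfs4_handle_exception(): `HZ`, `DO_STATE_RECOVERY`, and `handle_as_patched` are hypothetical names, and the single assignment after the `if` stands in for whatever code originally followed the two STALE_* labels.

```c
#include <errno.h>

#define HZ 1000               /* assumed tick rate, illustration only */
#define DO_STATE_RECOVERY 1   /* hypothetical marker: "start state recovery" */

/* NFSv4 status codes as defined in RFC 3530. */
enum {
    NFS4ERR_STALE_CLIENTID = 10022,
    NFS4ERR_STALE_STATEID  = 10023,
    NFS4ERR_FILE_OPEN      = 10046
};

/* Mirrors the label order of the quoted hunk. */
static int handle_as_patched(int nfs_err, long timeout)
{
    int ret = nfs_err;

    switch (nfs_err) {
    case -NFS4ERR_STALE_CLIENTID:
    case -NFS4ERR_STALE_STATEID:
    case -NFS4ERR_FILE_OPEN:
        /* All three labels share this body, so the timeout check now
         * also applies to the two STALE_* errors. */
        if (timeout > HZ) {
            ret = -EBUSY;
            break;
        }
        /* Stand-in for the code that followed the STALE_* labels;
         * FILE_OPEN now falls into it as well. */
        ret = DO_STATE_RECOVERY;
        break;
    }
    return ret;
}
```

Calling `handle_as_patched(-NFS4ERR_STALE_CLIENTID, 2 * HZ)` yields -EBUSY instead of recovery, and `handle_as_patched(-NFS4ERR_FILE_OPEN, 0)` ends up in the recovery path, which are the two side effects of sharing one case body.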
Created attachment 436296: patch -- backport of NFS4ERR_FILE_OPEN handling patch (try #2)

The earlier patch also makes it so that when you get this error, the kernel goes into state recovery. That's also not ideal. I think the attached patch is closer to what's needed. Kai, could you test this and let me know if it also fixes the problem?
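For readers who want to see the intended behaviour in isolation, the following is a rough user-space sketch of "retry a few times, then give up with -EBUSY", kept separate from any state recovery. It is not the attached patch: `retry_state`, `fake_server`, `fake_rename`, `handle_file_open`, and `rename_with_retry` are hypothetical, the doubling backoff starting at `RETRY_MIN` is an assumption, and no real sleeping happens (the wait is only added up in ticks).

```c
#include <errno.h>

#define HZ 1000                  /* assumed tick rate, illustration only */
#define RETRY_MIN (HZ / 10)      /* hypothetical first backoff delay */
#define NFS4ERR_FILE_OPEN 10046  /* status code from RFC 3530 */

struct retry_state {
    long timeout;  /* current backoff delay in ticks */
    long waited;   /* total simulated wait in ticks */
    int  retry;    /* set when the caller should reissue the call */
};

/* Hypothetical server: answers FILE_OPEN until busy_replies hits 0. */
struct fake_server {
    int busy_replies;
};

static int fake_rename(struct fake_server *srv)
{
    if (srv->busy_replies > 0) {
        srv->busy_replies--;
        return -NFS4ERR_FILE_OPEN;
    }
    return 0;
}

/* Handle only FILE_OPEN: back off and retry, or fail once the
 * accumulated backoff exceeds HZ. Other errors pass through. */
static int handle_file_open(int err, struct retry_state *st)
{
    st->retry = 0;
    if (err != -NFS4ERR_FILE_OPEN)
        return err;
    if (st->timeout > HZ)
        return -EBUSY;  /* retried a decent amount, time to fail */
    st->timeout = st->timeout ? st->timeout * 2 : RETRY_MIN;
    st->waited += st->timeout;  /* a real client would sleep here */
    st->retry = 1;
    return 0;
}

static int rename_with_retry(struct fake_server *srv, struct retry_state *st)
{
    int err;

    do {
        err = handle_file_open(fake_rename(srv), st);
    } while (st->retry);
    return err;
}
```

In the snoop trace above, the server answered the pending CLOSE only 0.00009 s after rejecting the RENAME, so under these assumptions a single short retry would already have been enough for the rename to succeed.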
I've also added this patch to the test kernels on my people.redhat.com page: http://people.redhat.com/jlayton/ Kai, if you're not able to test this then it may not make 5.6. I'll need to set up a test environment for it and may not have time to do that before the patch submission deadline.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-214.el5

You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
Tried with OpenSolaris 5.10, cannot reproduce this issue.

# OpenSolaris NFS Server
bash-3.00# uname -a
SunOS unknown 5.10 Generic_141445-09 i86pc i386 i86pc
bash-3.00# zfs get all | grep nbmand
tank nbmand on local
tank/fs nbmand on inherited from tank
bash-3.00# share
- /export/home rw ""
bash-3.00#

# NFS client
[root@nec-em9 fs]# uname -a
Linux nec-em9.rhts.eng.bos.redhat.com 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
[root@nec-em9 fs]# mount | grep nfs4
10.66.65.194:/ on /media type nfs4 (rw,addr=10.66.65.194)
[root@nec-em9 fs]# pwd
/media/export/home/fs

Copied a kernel source tree to the NFS mount, grepped for some string in the tree, then removed the whole tree. Also tried an svn checkout of a large project.

Ran fsstress on the NFS mount; no issue found on the -233 kernel.

Confirmed that patch linux-2.6-fs-nfs-fix-nfs4err_file_open-handling-in-linux-nfs.patch is applied correctly in kernel 2.6.18-233.el5.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0017.html