Bug 604044 - NFS4 breaks when server returns NFS4ERR_FILE_OPEN
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Jeff Layton
QA Contact: yanfu,wang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-06-15 09:22 UTC by Kai Mosebach
Modified: 2018-11-14 17:32 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-01-13 21:37:19 UTC
Target Upstream Version:
Embargoed:


Attachments
backport of the NFS4ERR_FILE_OPEN handling in Linux/NFS patch of Kernel 2.6.19 (892 bytes, patch)
2010-07-02 10:25 UTC, Kai Mosebach
patch -- backport of NFS4ERR_FILE_OPEN handling patch (try #2) (3.01 KB, patch)
2010-08-03 15:19 UTC, Jeff Layton


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:0017 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update 2011-01-13 10:37:42 UTC

Description Kai Mosebach 2010-06-15 09:22:48 UTC
Description of problem:

Operations that touch many small files, such as a Subversion checkout, fail in NFS4-mounted directories if the NFS4 server is an OpenSolaris 10 host with the nbmand=on property set.

See http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg28412.html

The nbmand=on option must be set on CIFS-exported home folders on the Solaris side. Red Hat 6 beta does not have this problem with the NFS4 share.

Version-Release number of selected component (if applicable):

RHEL 5.5

Kernel 2.6.18-194.3.1
nfs-utils-lib-1.0.8-7.6.el5
nfs-utils-1.0.9-44.el5

How reproducible:

# mount -t nfs4 -o rw,port=2049 solaris-homeserver:/home/kai /mnt
(also tested with sync: # mount -t nfs4 -o sync,rw,port=2049 solaris-homeserver:/home/kai /mnt)
# cd /mnt
# svn co svn+ssh://svn/repos
...
A   repos/filesystem/RetryingPathRemoverTest.java
A   repos/filesystem/remote
A   repos/filesystem/remote/RemoteStoreCopyActivitySensorTest.java
...
svn: In directory 'repos/filesystem/remote'
svn: Can't move 'repos/filesystem/remote/.svn/tmp/RemoteStoreCopyActivitySensorTest.java.tmp.tmp' to 'repos/filesystem/remote/.svn/tmp/RemoteStoreCopyActivitySensorTest.java.tmp': Input/output error
#

Steps to Reproduce:
1. Set up OpenSolaris 5.11 snv_133 with a ZFS share exported as CIFS and NFS4 (with the nbmand=on option)
2. Mount that share
3. Work on it
  
Actual results:

The NFS4 client fails with an Input/output error when the server returns NFS4ERR_FILE_OPEN.

Expected results:

The NFS4 client should be able to handle the error (quote: "Retry a few times before we give up: the error is usually due to ordering issues with asynchronous RPC calls.")

Additional info:

snoop (tcpdump) on the server side:

...skipping...
83833   0.00405 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK SAVEFH NFS4_OK OPEN NFS4_OK ST=1AE7:1 RF=PL DT=N GETFH NFS4_OK FH=A5BB GETATTR NFS4...
83834   0.00011 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A5BB SETATTR ST=1AE7:1 GETATTR 10011a 30a23a
83835   0.00155 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK SETATTR 0 400002 GETATTR NFS4_OK
83836   0.00014 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A5BB CLOSE SQ=5 OST=1AE7:1 GETATTR 10011a 30a23a
83837   0.00197 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK CLOSE OST=1AE7:2 GETATTR NFS4_OK
83838   0.00011 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A752 SAVEFH OPEN RemoteStoreCopyActivitySensorTest.java.svn-base OT=NC SQ=6 CT=N AC=R DN...
83839   0.00021 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK SAVEFH NFS4_OK OPEN NFS4_OK ST=1907:1 RF=PL DT=R DST=1512:0 GETFH NFS4_OK FH=A44F G...
83840   0.00019 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A720 SAVEFH OPEN RemoteStoreCopyActivitySensorTest.java.tmp.tmp OT=CR(E) SQ=7 CT=N AC=RW...
83841   0.00107 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK SAVEFH NFS4_OK OPEN NFS4_OK ST=1937:1 RF=PL DT=N GETFH NFS4_OK FH=A5D0 GETATTR NFS4...
83842   0.00010 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A5D0 SETATTR ST=1937:1 GETATTR 10011a 30a23a
83843   0.00116 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK SETATTR 0 400002 GETATTR NFS4_OK
83844   0.00034 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A5D0 WRITE ST=1937:1 at 0 for 4096 <Fragmented RPC>
83850   0.00173 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 4096 (FSYNC) GETATTR NFS4_OK
83851   0.00014 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A44F CLOSE SQ=8 OST=1907:1 GETATTR 10011a 30a23a
83852   0.00006 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A5D0 WRITE ST=1937:1 at 4096 for 2418 <Fragmented RPC>
83855   0.00010 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK CLOSE OST=1907:2 GETATTR NFS4_OK
83856   0.00105 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 2418 (FSYNC) GETATTR NFS4_OK
83858   0.00004 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A5D0 CLOSE SQ=9 OST=1937:1 GETATTR 10011a 30a23a

### The actual error

83859   0.00013 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A720 SAVEFH PUTFH FH=A720 RENAME RemoteStoreCopyActivitySensorTest.java.tmp.tmp to Remot...
83861   0.00024 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4ERR_FILE_OPEN PUTFH NFS4_OK SAVEFH NFS4_OK PUTFH NFS4_OK RENAME NFS4ERR_FILE_OPEN
83862   0.00009 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK CLOSE OST=1937:2 GETATTR NFS4_OK

83864   2.05876 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A5D0 GETATTR 10011a 30a23a
83865   0.00013 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK GETATTR NFS4_OK
83866   0.00011 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A720 REMOVE RemoteStoreCopyActivitySensorTest.java.tmp.tmp GETATTR 10011a 30a23a
83867   0.00140 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK REMOVE NFS4_OK GETATTR NFS4_OK
83868   0.00018 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A720 REMOVE RemoteStoreCopyActivitySensorTest.java.tmp GETATTR 10011a 30a23a
83869   0.00137 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK REMOVE NFS4_OK GETATTR NFS4_OK
83870   0.00014 rhes5-client.localdomain -> svn-server-s.localdomain NFS C 4 () PUTFH FH=A465 CLOSE SQ=10 OST=1AB7:2 GETATTR 10011a 30a23a
83871   0.00010 svn-server-s.localdomain -> rhes5-client.localdomain NFS R 4 () NFS4_OK PUTFH NFS4_OK CLOSE OST=1AB7:3 GETATTR NFS4_OK

Comment 1 Kai Mosebach 2010-07-02 10:22:38 UTC
Since we depend heavily on the Sun NFS4 service, I looked through later kernels for a fix myself and found a patch for this in kernel 2.6.19. See here:

https://kerneltrap.org/mailarchive/git-commits-head/2009/12/14/19033

Since this patch does not work with the current 2.6.18-194.3.1.el5 kernel as-is, I added the needed lines and rebuilt it, and the error is now gone.

Please add this to future RHEL 5 kernels!

Comment 2 Kai Mosebach 2010-07-02 10:25:17 UTC
Created attachment 428928
backport of the NFS4ERR_FILE_OPEN handling in Linux/NFS patch of Kernel 2.6.19

Comment 3 Jeff Layton 2010-07-07 18:41:26 UTC
Thanks for the patch. I'm not sure that the part in nfs4xdr.c is really necessary to fix this, but it seems like a better mapping than -EIO. I'll plan to add this patch to my test kernels in the near future.
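
To illustrate what a status-to-errno mapping of this kind does, here is a simplified, self-contained sketch. It is not the contents of nfs4xdr.c: sketch_map and sketch_stat_to_errno are hypothetical names, the table is cut down to a few entries, and the -EIO fallback is an assumption chosen to match the "Input/output error" the reporter saw. The status values are the ones defined by the NFSv4 protocol.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* NFSv4 status codes (protocol-defined values). */
enum {
    NFS4_OK           = 0,
    NFS4ERR_PERM      = 1,
    NFS4ERR_NOENT     = 2,
    NFS4ERR_FILE_OPEN = 10046
};

struct stat_map {
    int stat;   /* status as received from the server */
    int err;    /* local errno handed to the caller */
};

/* Hypothetical, heavily shortened translation table. */
static const struct stat_map sketch_map[] = {
    { NFS4_OK,            0       },
    { NFS4ERR_PERM,      -EPERM   },
    { NFS4ERR_NOENT,     -ENOENT  },
    { NFS4ERR_FILE_OPEN, -EBUSY   },  /* the mapping discussed here */
};

int sketch_stat_to_errno(int stat)
{
    size_t i;

    assert(stat >= 0);  /* on-the-wire status codes are never negative */
    for (i = 0; i < sizeof(sketch_map) / sizeof(sketch_map[0]); i++)
        if (sketch_map[i].stat == stat)
            return sketch_map[i].err;
    return -EIO;        /* assumption: untranslated codes become -EIO */
}
```

With an entry like this, an application such as svn would see "Device or resource busy" rather than a generic I/O error if the retries are exhausted.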

Comment 5 Jeff Layton 2010-08-03 15:08:31 UTC
Actually, now that I look more closely...this patch is broken:

 		case -NFS4ERR_STALE_CLIENTID:
 		case -NFS4ERR_STALE_STATEID:
+		case -NFS4ERR_FILE_OPEN:
+			if (exception->timeout > HZ) {
+				/* We have retried a decent amount, time to fail */
+				ret = -EBUSY;
+				break;
+			}

Because you've put this in after NFS4ERR_STALE_CLIENTID and NFS4ERR_STALE_STATEID, you're making the kernel handle those errors the same way. I don't think that's what we want here.

Comment 6 Jeff Layton 2010-08-03 15:19:52 UTC
Created attachment 436296
patch -- backport of NFS4ERR_FILE_OPEN handling patch (try #2)

This patch also makes it so that when you get this error, the kernel goes into state recovery. That's also not ideal. I think this patch is closer to what's needed.

Kai, could you test this and let me know if it also fixes the problem?

Comment 7 Jeff Layton 2010-08-06 11:37:06 UTC
I've also added this patch to the test kernels on my people.redhat.com page:

    http://people.redhat.com/jlayton/

Kai, if you're not able to test this then it may not make 5.6. I'll need to set up a test environment for it and may not have time to do that before the patch submission deadline.

Comment 10 RHEL Program Management 2010-08-11 11:40:02 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 12 Jarod Wilson 2010-08-31 01:17:37 UTC
in kernel-2.6.18-214.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 14 Eryu Guan 2010-11-30 08:53:06 UTC
Tried with OpenSolaris 5.10, cannot reproduce this issue.

# OpenSolaris NFS Server
bash-3.00# uname -a
SunOS unknown 5.10 Generic_141445-09 i86pc i386 i86pc
bash-3.00# zfs get all | grep nbmand
tank     nbmand                on                     local
tank/fs  nbmand                on                     inherited from tank
bash-3.00# share
-               /export/home   rw   ""  
bash-3.00#

# NFS client
[root@nec-em9 fs]# uname -a
Linux nec-em9.rhts.eng.bos.redhat.com 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
[root@nec-em9 fs]# mount | grep nfs4
10.66.65.194:/ on /media type nfs4 (rw,addr=10.66.65.194)
[root@nec-em9 fs]# pwd
/media/export/home/fs

Copied a kernel source tree to the NFS mount, grepped for a string in the tree, then removed the whole tree. Also tried an svn checkout of a large project.

Ran fsstress on the NFS mount, no issue found on -233 kernel.
Confirmed that patch linux-2.6-fs-nfs-fix-nfs4err_file_open-handling-in-linux-nfs.patch is applied correctly in kernel 2.6.18-233.el5.

Comment 16 errata-xmlrpc 2011-01-13 21:37:19 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

