Bug 749181

Summary: Infrequent 'permission denied' errors with NFS3 on client
Product: Red Hat Enterprise Linux 6 Reporter: Stefan Walter <walteste>
Component: kernel Assignee: nfs-maint
Status: CLOSED DUPLICATE QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.1 CC: baumanmo, eparis, jlayton, steved
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: NFS server
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-03-20 18:27:50 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Stefan Walter 2011-10-26 12:26:04 UTC
Description of problem:

We run RHEL 6.1 NFS servers and clients and mount home directories via NFS3.
This setup generally works but we sometimes get a 'permission denied' error,
for instance when running an 'ls' in ssh sessions that were left idle for
hours:

$ cd ~/a/b/c
$ ls 
1 2 3
... (wait a few hours)
$ ls
ls: cannot open directory .: Permission denied

The export is not the issue because the error immediately goes away
when the directory is accessed using its absolute path:

$ ls
ls: cannot open directory .: Permission denied
$ ls ~/a/b/c
1 2 3
$ ls
1 2 3
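
The difference between the two 'ls' calls can be probed from a small program. The following is a minimal sketch, not something taken from the affected systems: try_list() and only_relative_denied() are hypothetical helper names. It assumes that a plain 'ls' boils down to opening ".", for which the client presumably reuses the file handle it already holds for the working directory, while the absolute path is looked up afresh.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: open a directory the way 'ls' would and return
 * 0 on success or the errno value reported by opendir() on failure. */
static int try_list(const char *path)
{
    assert(path != NULL);
    DIR *dir = opendir(path);
    if (dir == NULL)
        return errno;
    closedir(dir);
    return 0;
}

/* Hypothetical helper: returns 1 when "." is denied with EACCES while the
 * same directory, named by its absolute path, can be opened.
 * The relative path is probed first because, per the report, touching the
 * absolute path makes the relative access work again. */
static int only_relative_denied(const char *abs_path)
{
    int rel_err = try_list(".");
    int abs_err = try_list(abs_path);

    fprintf(stderr, ".: %s\n%s: %s\n",
            rel_err ? strerror(rel_err) : "ok",
            abs_path, abs_err ? strerror(abs_err) : "ok");
    return rel_err == EACCES && abs_err == 0;
}
```

Called from an idle shell's working directory with that directory's absolute path as the argument, only_relative_denied() returning 1 would match the symptom shown above.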

Turning off iptables and SELinux on the server and clients does not help.
Mounting with various options such as lookupcache=none or sync
has no effect either.

The error can more or less be forced to show up immediately by modifying
the file caching behaviour of the server:

# echo 20000 > /proc/sys/vm/vfs_cache_pressure
# echo 3 > /proc/sys/vm/drop_caches
# sync 

On the wire, Wireshark shows the following when the 'ls' fails:

1	0.000000	x	y	NFS	V3 ACCESS Call, FH:0xd01522c9
2	0.000115	x	y	NFS	V3 ACCESS Reply (Call In 1) Error:NFS3ERR_ACCES

I did some initial bug hunting with a kernel built from 
kernel-2.6.32-131.12.1.el6.src.rpm that has some more dprintk()s spread
over the nfsd code, starting with nfsd_access() in fs/nfsd/nfs3proc.c.
Here are my results so far:

The NFS3ERR_ACCES comes from an EACCES detected by the following code in the
function nfsd_set_fh_dentry() in fs/nfsd/nfsfh.c:

   242          if (fileid_type == FILEID_ROOT)
   243                  dentry = dget(exp->ex_path.dentry);
   244          else {
   245                  dentry = exportfs_decode_fh(exp->ex_path.mnt, fid,
   246                                  data_left, fileid_type,
   247                                  nfsd_acceptable, exp);
   248          }
   249          if (dentry == NULL)
   250                  goto out;
   251          if (IS_ERR(dentry)) {
   252                  if (PTR_ERR(dentry) != -EINVAL)
   253                          error = nfserrno(PTR_ERR(dentry));
   254                  goto out;
   255          }

fileid_type is FILEID_ROOT when the error occurs, so exp->ex_path.dentry
is used, and it seems to hold an -EACCES error code instead of a pointer
to a valid dentry. I have not found out which code sets exp->ex_path.dentry
to -EACCES, but I suspect that this should never happen; otherwise the
dget() in line 243 is unsafe because it assumes that it operates on a
valid pointer.
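
For readers unfamiliar with the IS_ERR()/PTR_ERR() idiom in lines 249-255, here is a user-space sketch of the convention. It re-creates the idea behind the kernel's include/linux/err.h with lower-case names; errno_from_lookup() and its fallback parameter are hypothetical and only mirror the shape of the checks quoted above. It is not the nfsd code itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* The kernel reserves the top MAX_ERRNO values of the address space to
 * carry negative errno values inside a pointer. */
#define MAX_ERRNO 4095

static void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

static long ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static int is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical mirror of lines 249-255: returns 0 for a usable pointer,
 * the encoded negative errno for an error pointer, and 'fallback' for
 * NULL or -EINVAL (the cases where the quoted code leaves 'error' alone). */
static long errno_from_lookup(const void *dentry, long fallback)
{
    assert(fallback < 0);
    if (dentry == NULL)
        return fallback;
    if (is_err(dentry)) {
        long err = ptr_err(dentry);
        return err != -EINVAL ? err : fallback;
    }
    return 0;
}
```

With this encoding, errno_from_lookup(err_ptr(-EACCES), -ESTALE) yields -EACCES. Note that the check only decodes the value; any code that dereferences such a pointer beforehand touches an address in the last page of the address space, which would be expected to fault rather than return an error.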

If one of the nfs-utils or kernel NFS wizards could help debug this
further and fix the issue, it would be greatly appreciated.

Version-Release number of selected component (if applicable):

kernel-2.6.32-131.12.1.el6.x86_64
nfs-utils-1.2.3-7.el6.x86_64

How reproducible:

As described above. So far we have failed to reproduce it on a simple test
server. On our production server with tens of exports and many clients it
always shows up.

Steps to Reproduce:
1. Configure a fairly large NFS3 server 
2. Mount a home directory via NFS3
3. Change the current directory to some subdirectory
4. Try to do an 'ls'
  
Actual results:

As described above.

Expected results:

There should be no errors.

Comment 2 Stefan Walter 2011-10-27 13:50:21 UTC
It seems that our testing with SELinux was not thorough enough (our team is
experimenting with a server that runs a production service, after all). After
a few reboots the server started to randomly report SELinux messages like the
following:

type=1400 audit(1319712116.696:712): avc:  denied  { 0x400000 } for  pid=3164 comm="nfsd" name="" dev=dm-18 ino=16124690 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file

This avc message did not show up in our logs before today.
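
The audit record above prints the denied permission as a raw hex mask instead of a name. When scanning logs for similar records, the mask can be extracted and reduced to a bit position. This is a sketch only: avc_raw_mask() and single_bit_index() are hypothetical helpers, and the report does not state which permission the bit stands for.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the hex mask found right after '{' in an
 * AVC line, or 0 if the braces hold a permission name (or are missing).
 * The "0x" prefix is required so that names such as "append" are not
 * misread as hex digits. */
static unsigned long avc_raw_mask(const char *line)
{
    assert(line != NULL);
    const char *p = strchr(line, '{');
    if (p == NULL)
        return 0;
    p++;
    while (*p == ' ')
        p++;
    if (strncmp(p, "0x", 2) != 0)
        return 0;
    return strtoul(p, NULL, 16);
}

/* Hypothetical helper: return the bit position if exactly one bit is set
 * in the mask, otherwise -1. */
static int single_bit_index(unsigned long mask)
{
    if (mask == 0 || (mask & (mask - 1)) != 0)
        return -1;
    int bit = 0;
    while ((mask >>= 1) != 0)
        bit++;
    return bit;
}
```

For the record quoted above, the mask is 0x400000, a single bit at position 22.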

A quick Google search for that message leads me to believe that we hit the bug
reported in BZ576207, which is not fixed in RHEL6. Disabling SELinux on the
server now makes the problem go away.

I am going to run our server with a kernel that incorporates the patch for a
few days. If that runs stably I will request a back-port of the fix from
BZ576207 to RHEL6.

Comment 3 RHEL Program Management 2011-10-31 05:47:23 UTC
Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 6 Jeff Layton 2012-03-20 18:13:29 UTC
Supposedly, this bug has been fixed in RHEL6 though I don't know the BZ# right offhand. Are you still able to reproduce this on more recent kernels?

Comment 7 Jeff Layton 2012-03-20 18:27:50 UTC
Ok, for the record this is probably the same as bug 656458, and should be fixed in 6.2. I'll go ahead and close this as a duplicate. Please reopen if it's not fixed in 6.2.

*** This bug has been marked as a duplicate of bug 656458 ***