Bug 110421 (IT_35748_41569_41486)

Summary: fh_verify: no root_squashed access ... when accessing nfs share
Product: Red Hat Enterprise Linux 3
Reporter: Need Real Name <sveinrn>
Component: kernel
Assignee: Steve Dickson <steved>
Status: CLOSED NOTABUG
QA Contact:
Severity: medium
Docs Contact:
Priority: medium
Version: 3.0
CC: andrew, aron.vrtala, buysse, cmc, ee-cap-admin-dl, emea-presales, equus, flgal3, hp, jwulf, kanderso, lwhatley, martin.pelikan, nalin, pamadio, petrides, p.van.egdom, riek, riel, shl1, tao, ubeck, uthomas, voetelink, weage98
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2005-10-13 13:59:22 UTC
Bug Blocks: 170445    
Attachments (description, flags):
- Changes an error code from nfserr_stale to nfserr_acces (flags: none)
- Tcpdump and app trace (tusc) output from onsite testing. (flags: none)
- oke-traces-040805-01.tar.bz2 - New traces with no_subtree_checks (flags: none)
- Possible Fix for this problem (flags: none)
- RHEL3 version of the previous (flags: none)

Description Need Real Name 2003-11-19 14:47:00 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5) Gecko/20031118

Description of problem:
After replacing Red Hat 9 with Red Hat AS 3, lots of error messages
(>10MB) appear in /var/log/messages:

fh_verify: no root_squashed access at [file from nfs-share]

Also, file locking does not work reliably when locking files from
another computer. 

The files are accessed from HP-UX 11.00 clients. The share is exported
with "insecure" and "insecure_ports". 

The problem disappears completely when we give
"no_subtree_check" as an option in the exports file. 

The problem only shows up on files in certain folders, but there are no
symbolic links, no folders have been renamed, and the permissions
appear to be right. 

Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-4.0.1.EL.i686.rpm

How reproducible:
Always

Steps to Reproduce:
1. export a share from the AS3-box 
2. try accessing and especially locking the files from HP-UX 11.00
3. locking fails, error message is logged to /var/log/messages
    

Additional info:

Comment 2 Tomas Drajsajtl 2004-03-08 15:40:54 UTC
I have the same problem. I cannot run Cadence on our HP-UX 
11.00/11.11 systems because of this error. Does anybody have a 
solution? The "no_subtree_check" option helps only partially. This is 
really critical for us. We paid for RHEL because RHL reached its EOL and 
I expected a better system, but this bug is pushing me to downgrade.


Comment 3 James Wulf 2004-03-09 12:51:25 UTC
I am seeing the same issue as well with HP-UX 11.00 clients and
Enterprise 3. We are in the process of evaluating Redhat to replace a
number of our HP-UX NFS server boxes but will still need to serve our
HP-UX workstations as clients. I will be adding myself to the cc list
to keep abreast of this bug. -jwulf (jwulf.com)

Comment 4 Andrew Meredith 2004-03-10 13:50:51 UTC
I had this issue on one of 18 exported data volumes (a mixture of
partitions and LVM volumes). I was originally exporting the mount
point for this partition and mounting a subdirectory. Once I exported
the subdirectory I wanted to mount, rather than its parent directory,
the mount worked fine.

FYI this was between a collection of FC1 machines updated current to
the date of this entry.

Comment 5 aron vrtala 2004-05-03 11:52:18 UTC
We have the same problem on 2.4.21-9.0.1.ELsmp #1 SMP, but it does not
seem to be only a RHEL problem: it also occurs under Fedora Core 1
2.4.22-1.2188.nptlsmp #1 SMP (fully updated). Maybe this is a
hint about where to look.

It typically occurs with non-Linux clients, when locking is required:
May  3 08:57:00 arthur kernel: fh_verify: no root_squashed access at
mail/saved-messages.
May  3 09:02:00 arthur kernel: fh_verify: no root_squashed access at
mail/saved-messages.

The other side fails with a message, for instance when opening a
folder with pine:
Problem detected: "Unexpected file locking failure: Stale NFS file
handle".
Pine exiting.
IOT/Abort trap
which is on a Tru64 System.

Filesystems are exported rw,sync,no_root_squash. If no_subtree_check
is included, the message is not generated, but the locks do not work
either: they block indefinitely, and pine never finishes opening a
folder.

It seems to be related to another problem: a user cannot log in to
X/GNOME on an FC1 Linux client (same patch level as the FC1 system
above); the login blocks while nautilus is initializing.

Please help; time is running out, and otherwise we will have to change
our complete strategy (we currently have 600 users waiting for a
solution).

Comment 6 Jeff Needle 2004-05-03 20:20:17 UTC
Does anyone have a sniffer trace that we could look at?

Comment 7 aron vrtala 2004-05-04 07:04:17 UTC
With our topology, a sniffer trace is difficult to capture on the
network. If a tcpdump is sufficient, I'll make one.

- Aron

Comment 10 aron vrtala 2004-05-04 15:04:41 UTC
TCP Dump has been sent out of band to jneedle.

Comment 11 aron vrtala 2004-05-10 09:44:50 UTC
Could anyone tell me the current state of investigation ?

Thanx, Aron

Comment 12 Chris Ritson 2004-05-26 08:46:15 UTC
This is not confined to RHEL or HPUX. We have something similar. The 
server is Redhat 9 (download edition) with a locally built kernel:-

Linux ---- 2.4.24 #2 SMP Tue Jan 6 13:08:09 GMT 2004 i686 i686 i386 
GNU/Linux

The client is:-

SunOS ---- 5.7 Generic_106541-07 sun4m sparc SUNW,SPARCclassic

Comment 13 aron vrtala 2004-05-26 11:13:45 UTC
This problem is also the cause of Bugzilla bug 102402. It is also
responsible for the X Window login problem with GNOME, as can be seen
there.

Comment 12 and bug 102402 clearly show that this is a long-standing
issue and a basic problem across the whole RH Linux line. Does anyone
know whether this problem also exists in other distributions such
as SuSE or Debian?

We here in Vienna will not be able to wait much longer than a few days
for a solution or workaround.

Aron

Comment 14 Rik van Riel 2004-05-26 12:11:49 UTC
Comment #12 (showing that the bug also occurs in 2.4.24) suggests that
most, if not all, distributions should show this behaviour.

Comment 15 aron vrtala 2004-06-04 13:54:28 UTC
Hello again,

The graphical problem (see comments #5 and #13) is fixed by using an FC2
NFS v4 server. The email locking problem continues, however.

Aron

Comment 16 Girts 2004-06-09 10:51:52 UTC
Hello!

Could be the same problem here. 

Server: Debian 3.0, custom built 2.4.26, kernel NFS v3.
Client: Debian 3.0, 2.4.26-grsec.

Server has many
"fh_verify: no root_squashed access at x/y."
lines in dmesg.

On the client, the following happened:
client:/dir/to/x/y# ls
ls: .: Stale NFS file handle
client:/dir/to/x/y# cd .
client:/dir/to/x/y# ls
file1            file2

The directory was neither deleted, nor moved when it happened.

Server /etc/exports:
/home_shared    client(rw,no_root_squash) client2...

Client /etc/fstab:
server:/home_shared /home nfs rw,hard,intr,nolock,rsize=8192,wsize=8192 0 0


Currently I can't reproduce that, not sure when it happens, but the 
dmesg log has 281 such fh_verify lines since last reboot (50 days).

Girts

Comment 17 Girts 2004-06-09 12:51:42 UTC
Having done some additional testing I have found ways to reproduce 
the problem. Notice the file permissions and owners ("x" owned by 
root, rwx------, "y" owned by test3, rwxr-xr-x).

TEST 1

client:/home/staff/test3/x# ls -la
total 4
drwx------    3 root     root           14 Jun  9 15:16 .
drwxr-xr-x   15 test3    staff        4096 Jun  9 15:16 ..
drwxr-xr-x    2 test3    staff          30 Jun  9 15:12 y
client:/home/staff/test3/x# cd y
client:/home/staff/test3/x/y# ls -la
total 0
drwxr-xr-x    2 test3    staff          30 Jun  9 15:12 .
drwx------    3 root     root           14 Jun  9 15:16 ..
-rw-r--r--    1 test3    staff           0 Jun  9 15:12 file1
-rw-r--r--    1 test3    staff           0 Jun  9 15:12 file2
client:/home/staff/test3/x/y# man ls

The manpage is shown.
At the same time, the kern.log on NFS server writes:
Jun  9 15:29:18 server kernel: fh_verify: no root_squashed access at 
x/y.
I push "q" to exit man. Back to console:

Reformatting ls(1), please wait...
client:/home/staff/test3/x/y# ls -l
ls: .: Stale NFS file handle

TEST 2

Back again in the same directory, now with 2 logins to client 
machine, denoted by [1] and [2]:

[1]:
client:/home/staff/test3/x/y# ls -al
total 0
drwxr-xr-x    2 test3    staff          30 Jun  9 15:12 .
drwx------    3 root     root           14 Jun  9 15:16 ..
-rw-r--r--    1 test3    staff           0 Jun  9 15:12 file1
-rw-r--r--    1 test3    staff           0 Jun  9 15:12 file2
[2]:
client:/home/staff/test3/x/y# ls -la
total 0
drwxr-xr-x    2 test3    staff          30 Jun  9 15:12 .
drwx------    3 root     root           14 Jun  9 15:16 ..
-rw-r--r--    1 test3    staff           0 Jun  9 15:12 file1
-rw-r--r--    1 test3    staff           0 Jun  9 15:12 file2
client:/home/staff/test3/x/y# su test3 (notice no -!!)
test3@client:~/x/y$ ls
ls: .: Stale NFS file handle

Server gets "no root_squashed access at x/y." This is OK, because of 
the check in fh_verify: user test3 cannot access the file, as parent 
"x" is owned by root and not executable by test3.
Now back to console [1]. Guess what:
[1]:
client:/home/staff/test3/x/y# ls -l
ls: .: Stale NFS file handle

This is strange. Why should the failed permission check on console [2] 
have affected the current directory handle in console [1]? I 
guess it might have something to do with caching. 

Can someone confirm this? Is this just a misconfiguration, is it 
supposed to be this way, or is it a bug?

Girts
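
The parent-directory check described in the tests above can be approximated in user space. The following is a simplified model only, not the kernel's fh_verify code: it walks from a file's parent directory up to the export root and reports the first directory that a given uid cannot search. The names `can_search` and `first_unsearchable_ancestor` are hypothetical, and the model looks only at mode bits (no ACLs, no root squashing).

```python
import os
import stat


def can_search(st, uid, gids):
    """Return True if uid/gids have search (x) permission per the mode bits."""
    if uid == 0:
        return True  # assumption: root is not squashed in this model
    if st.st_uid == uid:
        return bool(st.st_mode & stat.S_IXUSR)
    if st.st_gid in gids:
        return bool(st.st_mode & stat.S_IXGRP)
    return bool(st.st_mode & stat.S_IXOTH)


def first_unsearchable_ancestor(path, export_root, uid, gids):
    """Walk from path's parent up to export_root (inclusive).

    Returns the first directory that uid cannot search, or None if
    every ancestor up to the export root is searchable.
    """
    export_root = os.path.realpath(export_root)
    current = os.path.dirname(os.path.realpath(path))
    while True:
        if not can_search(os.stat(current), uid, gids):
            return current
        # Stop at the export root, or at "/" if path was outside the export.
        if current == export_root or current == os.path.dirname(current):
            return None
        current = os.path.dirname(current)
```

Run on the server against the layout from TEST 2 (x owned by root with mode rwx------, y owned by test3), a call with test3's uid would return the path of x, which matches the directory named in the server's log message.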

Comment 20 Paul Szabo 2004-06-24 04:09:13 UTC
I think this issue is related to (could be solved by?)

   http://bugs.debian.org/255931

Cheers,

Paul Szabo - psz.edu.au
http://www.maths.usyd.edu.au:8000/u/psz/
School of Mathematics and Statistics  University of Sydney   2006 
Australia

Comment 21 Steve Dickson 2004-06-30 15:31:09 UTC

It turns out the RHEL and upstream kernels have half of the
Debian patch. The part they don't have is:

-			error = nfserr_stale;
+			error = nfserr_acces;
+/* PSz 23 Jun 04  Not STALE but ACCES: so NFS client code (RPC really)
+ *                net/sunrpc/clnt.c will handle and re-try as real user,
+ *                do not want fs/nfs/inode.c to remove the inode. */
+/* Should not say root_squashed without checking ROOTSQUASH or ALLSQUASH
+ * and UID/GID. (Probably should be dprintk: lucky it was not.) */

According to the patch comment, access errors will be retried by
RPC, which may not be true in every case, at least w.r.t. a Linux client.
But since this code runs during a security check (i.e. subtree 
checking), returning EACCES may not be a bad idea, since EACCES will 
always be more recoverable than ESTALE.

But I don't see how this patch will help when no_subtree_check
is on, since this code is not executed.

Unfortunately, I'm still unable to reproduce this even with an
HPUX client. So is there a test case that will reliably reproduce
this error? If so, would you please post it?
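
The difference between the two error codes is also visible to applications on the client. As a rough illustration (not taken from any NFS client's code), the sketch below classifies an OSError by its errno: ESTALE means the file handle is no longer recognised and the path has to be looked up again, while EACCES is a plain permission failure that leaves the handle usable. The function name `classify_nfs_error` and the returned strings are hypothetical.

```python
import errno


def classify_nfs_error(exc):
    """Map an OSError from a file operation to a coarse recovery hint."""
    if exc.errno == errno.ESTALE:
        # The server rejected the file handle itself; a fresh lookup of
        # the path is needed before the operation can be retried.
        return "stale-handle: re-lookup required"
    if exc.errno == errno.EACCES:
        # Permission denied; the handle is still valid, so a retry with
        # other credentials (or after fixing permissions) can succeed.
        return "access-denied: handle still valid"
    return "other"
```

An application could call this inside an `except OSError as exc:` block around `os.listdir()` or a lock request to decide whether retrying makes sense.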


Comment 22 Martin Pelikan 2004-07-01 07:12:42 UTC
I've designed a test scenario for that problem.
Please take a look in the Issue Trigger : 41569

Kind regards,
Martin Pelikan

Comment 25 Steve Dickson 2004-07-14 10:16:34 UTC
Created attachment 101891 [details]
Changes an error code from nfserr_stale to nfserr_acces

Please try this untested patch to see if it helps....

Comment 26 Bastien Nocera 2004-07-16 13:31:07 UTC
Test kernels for the patch above available at:
http://people.redhat.com/bnocera/kernel-nfs-fixes/

Those packages are not supported, but we'd be glad to hear if it fixes
the issue at hand.

Contains the patches included in bug #121475 and bug #110421

Comment 36 Daniel Riek 2004-08-06 15:49:25 UTC
Adding this because we are getting more requests about this problem
from the automotive sector. 

For other people running into this problem: 

There are currently, at least in connection with the test kernel,
two possible workarounds:

- set the NFS export options to: 
  (rw,sync,no_root_squash,insecure,insecure_locks,no_subtree_check)

- or do a chmod a+rx on all directories in the hierarchy above the 
  CATIA files.
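
To see which entries in an existing /etc/exports would need the first workaround, something like the following sketch could be used. It is illustrative only: `clients_missing_option` is a hypothetical helper, it handles one line at a time, and it assumes the simple `path client(options)` form without quoted paths or line continuations. An option list with no client name in front is reported as "*".

```python
import re

# Matches "client(opt1,opt2)"; the client part may be empty.
CLIENT_RE = re.compile(r"(\S*?)\(([^)]*)\)")


def clients_missing_option(exports_line, option="no_subtree_check"):
    """Return the client specs on one exports line that lack `option`."""
    line = exports_line.split("#", 1)[0].strip()  # drop comments
    parts = line.split(None, 1)  # exported path, then the client list
    if len(parts) < 2:
        return []
    missing = []
    for client, opts in CLIENT_RE.findall(parts[1]):
        if option not in [o.strip() for o in opts.split(",")]:
            missing.append(client or "*")
    return missing
```

For example, `clients_missing_option("/home_shared client(rw,no_root_squash)")` returns `["client"]`, flagging that export as one where subtree checking is still governed by the default.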


Comment 46 Steve Dickson 2004-08-21 01:40:23 UTC
Created attachment 102950 [details]
Possible Fix for this problem

It has been reported that the attached patch fixes this problem,
but there has been no hard confirmation that this is true. So could
the people who can easily reproduce this problem please test
this patch?

Comment 51 Steve Dickson 2004-10-01 11:47:36 UTC
Created attachment 104626 [details]
RHEL3 version of the previous

This patch basically changes the return status of the subtree 
check from ESTALE to EACCES. I've been told that HP clients 
handle this error in a more reasonable way.....

Comment 52 Chris Wilkinson 2004-10-01 14:24:37 UTC
Steve, as I mentioned in our offline conversations, I will apply this
patch to our servers over the weekend. I should have some definitive
results for you on Monday. 

Cheers.

Comment 53 Steve Dickson 2004-10-05 15:44:41 UTC
To hopefully move this bug along, I've
created UP and SMP i686 test kernel RPMs
in http://people.redhat.com/steved/.bz110421/

Please give one of these test kernels a try to see 
if it takes care of the problem.

Comment 54 Steven Lee 2004-11-08 14:47:33 UTC
This problem has been affecting our department's main file server
hosting UNIX home directories for a while.  About a month ago our
server went down a few times in a row, so I installed the test kernel.
The "fh_verify" error messages stopped and it worked for about a
month, until this past weekend:

---
[root@panda root]# uname -a
Linux panda.cs.cornell.edu 2.4.21-20.EL.bz110421smp #1 SMP Tue Oct 5
11:09:30 EDT 2004 i686 i686 i386 GNU/Linux
[root@panda root]# uptime
 09:50:38  up 30 days, 22:33,  1 user,  load average: 0.38, 0.56, 0.46
[root@panda root]# head /var/log/messages
Nov  7 04:02:05 panda syslogd 1.4.1: restart.
Nov  7 04:02:08 panda kernel: fh_verify: no root_squashed access at
working/svm_loss_learn.
Nov  7 04:02:19 panda kernel: fh_verify: no root_squashed access at
cesar/rsolver.
Nov  7 04:02:23 panda last message repeated 2 times
Nov  7 04:03:03 panda kernel: rpc-srv/tcp: nfsd: sent only -107 bytes
of 268 - shutting down socket
Nov  7 04:03:05 panda kernel: rpc-srv/tcp: nfsd: sent only -107 bytes
of 236 - shutting down socket
Nov  7 04:03:06 panda kernel: fh_verify: no root_squashed access at
bin/vim.
Nov  7 04:03:06 panda last message repeated 4 times
Nov  7 04:03:19 panda kernel: rpc-srv/tcp: nfsd: sent only -107 bytes
of 132 - shutting down socket
Nov  7 04:03:29 panda last message repeated 3 times
---

Comment 55 EE CAP Admin 2004-11-28 10:26:58 UTC
Guys,

I am also seeing this.  We have a few RHEL3 WS boxes mounting an EL3
AS NFS server and they are not producing the errors, but there are
errors from a Gentoo box mounting the AS box.

Has anything moved forward with resolving this bug?

Paul

Comment 56 Steve Dickson 2004-11-29 11:52:40 UTC
Could you try the posted patch or kernels to see if the 
problem goes away?

Comment 57 EE CAP Admin 2004-11-29 12:03:41 UTC
Hi Steve,

I'd rather not do it on the machine which is displaying the errors, as
it is mission critical, and the errors are in no way affecting the
main role of the box as a Samba server.

In a week or so I should have a new machine online which I may be able
to test on.

Paul

Comment 58 Rob Thiemann 2005-01-20 22:51:34 UTC
I seem to have fixed it by adjusting permissions on my NFS shares.

here's my /etc/exports:

/home/grada   (rw,insecure,async,no_root_squash)
/home/gradb   (rw,insecure,async,no_root_squash)

And here's the original directory listings for the shares:

drwxr-s---   64 root     stud05       4096 Jan  8 23:39 grada
drwxr-s---   79 root     stud06       4096 Jan  9 02:32 gradb

Changing the permissions to:

drwxr-sr-x   64 root     stud05       4096 Jan  8 23:39 grada
drwxr-sr-x   79 root     stud06       4096 Jan  9 02:32 gradb

Made my "kernel: fh_verify: no root_squashed access at X" errors go away.



Comment 60 R. Michael Richer 2005-04-01 16:32:27 UTC
Simple Fix For Me Anyways:

Original Configuration That Caused fh_verify errors:
Server 1:  /etc/exports contained /local/home 192.168.2.0/24
(rw,no_root_squash,sync)
Client 1: /etc/fstab contained 192.168.2.2:/home /home nfs defaults 0 0

Fixed The Client To Fix The Error:
Server 1: This server stayed the same
Client 1: Changed /etc/fstab to properly mount with:
            192.168.2.2:/local/home /home nfs default 0 0

No more errors detected.
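
The mismatch described here (the client mounting a path that is not under any exported path) can be checked mechanically. The sketch below is a hypothetical helper, not part of any NFS tooling: given the remote path from the client's fstab and the list of paths exported by the server, it returns the longest exported path that contains the remote path, or None if there is none.

```python
import posixpath


def exported_parent(remote_path, exported_paths):
    """Return the longest exported path containing remote_path, or None."""
    remote = posixpath.normpath(remote_path)
    best = None
    for exp in exported_paths:
        exp_norm = posixpath.normpath(exp)
        prefix = exp_norm.rstrip("/") + "/"
        if remote == exp_norm or remote.startswith(prefix):
            if best is None or len(exp_norm) > len(best):
                best = exp_norm
    return best
```

With the original configuration, `exported_parent("/home", ["/local/home"])` returns None, while the corrected fstab entry gives `exported_parent("/local/home", ["/local/home"])`, which returns "/local/home".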

Comment 61 Steve Dickson 2005-10-13 13:59:22 UTC
I'm going to close the bug since it appears to be
a configuration problem.... Please feel free to 
reopen it if this is not the case.

Comment 62 C.M. Connelly 2005-10-13 16:19:20 UTC
Um...

If you mean the use of the keyword ``defaults'' rather than ``default'' (in Comment 60), then the docs 
are wrong and need to be fixed.  mount(8) says

   defaults
                     Use  default  options: rw, suid, dev, exec, auto, nouser,
                     and async.

Comment 60 also shows a change in the server filesystem name (in the client's fstab); I see this issue 
with configurations that have (always) had the correct server-side filesystem name in the client's fstab.

Comment 32 suggests specifying some options in /etc/exports on the server that ``have mild security 
implications'' (according to exports(5)), and therefore aren't particularly attractive.

Comment 58 suggests opening up the permissions on enclosing directories (allowing everyone r-x access), 
which, again, isn't exactly an attractive option when you're talking about people's home directories.

If you're talking about something else, then please make it clear what that configuration error is.  


Comment 63 K Donate 2005-10-18 20:01:19 UTC
Steve,

Please clarify what "configuration error" you are referring to.  

Opening up public access to directories is not a solution in the environment I
support.

The automounter is being used to mount the file systems for which this same
error is generated.  Therefore, potential changes in /etc/fstab are no option
either.

Thank you.