Bug 110421 (IT_35748_41569_41486)
Field | Value
---|---
Summary: | fh_verify: no root_squashed access ... when accessing nfs share
Product: | Red Hat Enterprise Linux 3
Component: | kernel
Version: | 3.0
Hardware: | All
OS: | Linux
Status: | CLOSED NOTABUG
Severity: | medium
Priority: | medium
Reporter: | Need Real Name <sveinrn>
Assignee: | Steve Dickson <steved>
CC: | andrew, aron.vrtala, buysse, cmc, ee-cap-admin-dl, emea-presales, equus, flgal3, hp, jwulf, kanderso, lwhatley, martin.pelikan, nalin, pamadio, petrides, p.van.egdom, riek, riel, shl1, tao, ubeck, uthomas, voetelink, weage98
Doc Type: | Bug Fix
Last Closed: | 2005-10-13 13:59:22 UTC
Bug Blocks: | 170445
Description
Need Real Name
2003-11-19 14:47:00 UTC
I have the same problem. I cannot run Cadence on our HP-UX 11.00/11.11 systems because of this error. Does anybody have a solution? The "no_subtree_check" option helps only partially. This is really critical for us. We paid for RHEL because RHL reached its EOL, and I expected a better system, but this bug is pushing me to downgrade.

I am seeing the same issue with HP-UX 11.00 clients and Enterprise 3. We are in the process of evaluating Red Hat to replace a number of our HP-UX NFS server boxes, but we will still need to serve our HP-UX workstations as clients. I will be adding myself to the cc list to keep abreast of this bug. -jwulf (jwulf.com)

I had this issue on one of 18 exported data volumes (a mixture of partitions and LVM volumes). I was originally exporting the mount point for this partition and mounting a subdirectory. Once I exported the subdirectory I wanted to mount, rather than its parent directory, the mount worked fine. FYI, this was between a collection of FC1 machines, fully updated as of the date of this entry.

We have the same problem on 2.4.21-9.0.1.ELsmp #1 SMP, but it does not seem to be only a RHEL problem: it also occurs under Fedora Core 1, 2.4.22-1.2188.nptlsmp #1 SMP (fully updated). Maybe this is a hint about where to look. It typically occurs with non-Linux clients when locking is required:
May 3 08:57:00 arthur kernel: fh_verify: no root_squashed access at mail/saved-messages.
May 3 09:02:00 arthur kernel: fh_verify: no root_squashed access at mail/saved-messages.
The other side fails with a message, for instance when opening a folder with pine:
Problem detected: "Unexpected file locking failure: Stale NFS file handle".
Pine exiting.
IOT/Abort trap
This is on a Tru64 system. Filesystems are exported rw,sync,no_root_squash. If no_subtree_check is included, the message is not generated, but the locks do not work either: they block indefinitely, and pine never finishes opening a folder.
It seems to be related to another problem: a user cannot log in on X/GNOME on an FC1 Linux client (same patch level as the FC1 system above); the login blocks while nautilus is initializing. Please help, time is running out, and otherwise we will have to change our complete strategy (we currently have 600 users waiting for a solution).

Does anyone have a sniffer trace that we could look at?

With our topology that is difficult to do on the network. If a tcpdump is sufficient I'll make one. - Aron

TCP dump has been sent out of band to jneedle.

Could anyone tell me the current state of the investigation? Thanks, Aron

This is not confined to RHEL or HP-UX. We have something similar. The server is Red Hat 9 (download edition) with a locally built kernel:
Linux ---- 2.4.24 #2 SMP Tue Jan 6 13:08:09 GMT 2004 i686 i686 i386 GNU/Linux
The client is:
SunOS ---- 5.7 Generic_106541-07 sun4m sparc SUNW,SPARCclassic

The problem is also the cause of Bugzilla Bug 102402. It is also responsible for the X Window login problem with GNOME, as can be seen there. Comment 12 and bug 102402 clearly show that this is a long-standing issue and a basic problem in the whole RH Linux chain. Does anyone know whether this problem also exists in other distributions such as SuSE or Debian? We here in Vienna will not be able to wait much longer than a few days for a solution or workaround. Aron

Comment #12 (showing that the bug also occurs in 2.4.24) suggests that most, if not all, distributions should show this behaviour.

Hello again, the graphical problem (see comment #5,13) is fixed by having an FC2 NFS V4 server. The email locking problem continues, however. Aron

Hello! Could be the same problem here. Server: Debian 3.0, custom built 2.4.26, kernel NFS v3. Client: Debian 3.0, 2.4.26-grsec. The server has many "fh_verify: no root_squashed access at x/y." lines in dmesg. On the client, the following happened:
client:/dir/to/x/y# ls
ls: .: Stale NFS file handle
client:/dir/to/x/y# cd .
client:/dir/to/x/y# ls
file1 file2
The directory was neither deleted nor moved when it happened.
Server /etc/exports:
/home_shared client(rw,no_root_squash) client2...
Client /etc/fstab:
server:/home_shared /home nfs rw,hard,intr,nolock,rsize=8192,wsize=8192 0 0
Currently I can't reproduce that and I am not sure when it happens, but the dmesg log has 281 such fh_verify lines since the last reboot (50 days). Girts

Having done some additional testing, I have found ways to reproduce the problem. Notice the file permissions and owners ("x" owned by root, rwx------; "y" owned by test3, rwxr-xr-x).

TEST 1
client:/home/staff/test3/x# ls -la
total 4
drwx------ 3 root root 14 Jun 9 15:16 .
drwxr-xr-x 15 test3 staff 4096 Jun 9 15:16 ..
drwxr-xr-x 2 test3 staff 30 Jun 9 15:12 y
client:/home/staff/test3/x# cd y
client:/home/staff/test3/x/y# ls -la
total 0
drwxr-xr-x 2 test3 staff 30 Jun 9 15:12 .
drwx------ 3 root root 14 Jun 9 15:16 ..
-rw-r--r-- 1 test3 staff 0 Jun 9 15:12 file1
-rw-r--r-- 1 test3 staff 0 Jun 9 15:12 file2
client:/home/staff/test3/x/y# man ls
The manpage is shown. At the same time, kern.log on the NFS server records:
Jun 9 15:29:18 server kernel: fh_verify: no root_squashed access at x/y.
I press "q" to exit man. Back at the console:
Reformatting ls(1), please wait...
client:/home/staff/test3/x/y# ls -l
ls: .: Stale NFS file handle

TEST 2
Back in the same directory, now with 2 logins to the client machine, denoted by [1] and [2]:
[1]: client:/home/staff/test3/x/y# ls -al
total 0
drwxr-xr-x 2 test3 staff 30 Jun 9 15:12 .
drwx------ 3 root root 14 Jun 9 15:16 ..
-rw-r--r-- 1 test3 staff 0 Jun 9 15:12 file1
-rw-r--r-- 1 test3 staff 0 Jun 9 15:12 file2
[2]: client:/home/staff/test3/x/y# ls -la
total 0
drwxr-xr-x 2 test3 staff 30 Jun 9 15:12 .
drwx------ 3 root root 14 Jun 9 15:16 ..
-rw-r--r-- 1 test3 staff 0 Jun 9 15:12 file1
-rw-r--r-- 1 test3 staff 0 Jun 9 15:12 file2
client:/home/staff/test3/x/y# su test3 (notice no -!!)
test3@client:~/x/y$ ls
ls: .: Stale NFS file handle
The server logs "no root_squashed access at x/y." This is OK, because of the check in fh_verify: user test3 cannot access the file, as parent "x" is owned by root and not executable by test3. Now back to console [1]. Guess what:
[1]: client:/home/staff/test3/x/y# ls -l
ls: .: Stale NFS file handle
This is strange. Why should the failed permission check on console [2] have affected the current directory handle in console [1]? I guess it might be related to caching. Can someone confirm this? Is this just a misconfiguration, is it supposed to work this way, or is it a bug? Girts

I think this issue is related to (could be solved by?) http://bugs.debian.org/255931
Cheers, Paul Szabo - psz.edu.au http://www.maths.usyd.edu.au:8000/u/psz/
School of Mathematics and Statistics University of Sydney 2006 Australia

It turns out the RHEL and upstream kernels have half of the Debian patch. The part they don't have is:
- error = nfserr_stale;
+ error = nfserr_acces;
+/* PSz 23 Jun 04 Not STALE but ACCES: so NFS client code (RPC really)
+ * net/sunrpc/clnt.c will handle and re-try as real user,
+ * do not want fs/nfs/inode.c to remove the inode. */
+/* Should not say root_squashed without checking ROOTSQUASH or ALLSQUASH
+ * and UID/GID. (Probably should be dprintk: lucky it was not.) */
According to the comment, access errors will be retried by RPC, which may not be true in every case, at least with respect to a Linux client. But since this code runs during a security check (i.e. subtree checking), returning eacces may not be a bad idea, since eacces will always be more recoverable than estale. However, I don't see how this patch will help when no_subtree_check is on, since this code is then not executed.

Unfortunately, I'm still unable to reproduce this, even with an HP-UX client. So is there a test case that will reliably reproduce this error? If so, would you please post it?

I've designed a test scenario for that problem.
Please take a look at Issue Tracker 41569.
Kind regards, Martin Pelikan

Created attachment 101891 [details]
Changes an error code from nfserr_stale to nfserr_acces
Please try this untested patch to see if it helps....
Test kernels for the patch above are available at: http://people.redhat.com/bnocera/kernel-nfs-fixes/
Those packages are not supported, but we'd be glad to hear whether they fix the issue at hand. They contain the patches included in bug #121475 and bug #110421.

Adding this because we are getting more requests about this problem from the automotive sector.

For other people running into this problem: there are currently, at least in connection with the test kernel, two possible workarounds:
- set the NFS export options to: (rw,sync,no_root_squash,insecure,insecure_locks,no_subtree_check)
- or do a chmod a+rx on all directories in the hierarchy above the catia files.

Created attachment 102950 [details]
Possible Fix for this problem
It has been reported that the attached patch fixes this problem,
but there has been no hard confirmation that is true. So could
the people who can easily reproduce this problem, please test
this patch.
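
For anyone who wants to see which directories the chmod a+rx workaround mentioned earlier would actually touch, here is a minimal Python sketch. It is only a diagnostic aid under stated assumptions: the function name and its arguments are made up for illustration, it must be run on the server against the local paths, and it simply reports every directory from the export root down to the file's parent that does not already have r and x set for user, group, and other. It does not change any permissions.

```python
import os
import stat

# Permission bits that "chmod a+rx" would set (read + search for u, g and o).
A_RX = (stat.S_IRUSR | stat.S_IXUSR |
        stat.S_IRGRP | stat.S_IXGRP |
        stat.S_IROTH | stat.S_IXOTH)


def dirs_needing_chmod(export_root, target):
    """List directories between export_root (inclusive) and target's parent
    that lack a+rx, ordered from the export root downwards.

    Both names are hypothetical helpers for this sketch, not part of any
    NFS tooling.
    """
    export_root = os.path.abspath(export_root)
    target = os.path.abspath(target)
    if os.path.commonpath([export_root, target]) != export_root:
        raise ValueError("target is not below export_root")

    missing = []
    current = target
    # Walk upwards one directory at a time until the export root is checked.
    while current != export_root:
        current = os.path.dirname(current)
        mode = os.stat(current).st_mode
        if (mode & A_RX) != A_RX:
            missing.append(current)
    missing.reverse()
    return missing
```

Calling it with the export root and one affected file (for example the server-side equivalents of the x/y paths from the test case above) returns the list of directories that the second workaround would have to open up, which also shows how much access that workaround gives away.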
Created attachment 104626 [details]
RHEL3 version of the previous
This patch basically changes the return status from the subtree
check from estale to eacces. I've been told that HP clients
handle this error in a more reasonable way.....
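
To make the change easier to discuss, here is a simplified Python model of what the patch alters. It is not the kernel code: the real check in fh_verify differs in detail, and the Dir class, the function names, and the uid/gid values below are all invented for illustration. The model only captures the two points made in this thread: the subtree check fails when a parent directory is not searchable by the requesting user, and the patch changes the error reported for that failure from estale to eacces.

```python
import errno
import stat
from dataclasses import dataclass


@dataclass
class Dir:
    """Hypothetical stand-in for a parent directory's ownership and mode."""
    name: str
    uid: int
    gid: int
    mode: int


def may_search(d, uid, gid):
    """Classic owner/group/other test for the search (x) bit."""
    if uid == d.uid:
        return bool(d.mode & stat.S_IXUSR)
    if gid == d.gid:
        return bool(d.mode & stat.S_IXGRP)
    return bool(d.mode & stat.S_IXOTH)


def subtree_check(parents, uid, gid, patched):
    """Return 0 if every parent is searchable, else the error code.

    parents: directories between the export root and the target.
    patched: False models the original behaviour (ESTALE),
             True models the behaviour with the patch applied (EACCES).
    """
    for d in parents:
        if not may_search(d, uid, gid):
            return errno.EACCES if patched else errno.ESTALE
    return 0


# The layout from the reproduction above: "x" is owned by root, mode rwx------.
x = Dir("x", uid=0, gid=0, mode=0o700)
test3_uid, staff_gid = 1003, 100  # made-up numeric ids for user test3

print(errno.errorcode[subtree_check([x], test3_uid, staff_gid, patched=False)])
print(errno.errorcode[subtree_check([x], test3_uid, staff_gid, patched=True)])
```

In this model the only difference between the two runs is the error code handed back to the client; whether a given client recovers better from eacces than from estale is exactly what the test kernels are meant to confirm.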
Steve, as I mentioned in our offline conversations, I will apply this patch to our servers over the weekend. I should have some definitive results for you on Monday. Cheers.

To hopefully move this bug along, I've created UP and SMP i686 test kernel rpms in http://people.redhat.com/steved/.bz110421/
Please give one of these test kernels a try to see if it takes care of the problem.

This problem has been affecting our department's main file server hosting UNIX home directories for a while. About a month ago our server went down a few times in a row, so I installed the test kernel. The "fh_verify" error messages stopped and it worked for about a month, until this past weekend:
---
[root@panda root]# uname -a
Linux panda.cs.cornell.edu 2.4.21-20.EL.bz110421smp #1 SMP Tue Oct 5 11:09:30 EDT 2004 i686 i686 i386 GNU/Linux
[root@panda root]# uptime
09:50:38 up 30 days, 22:33, 1 user, load average: 0.38, 0.56, 0.46
[root@panda root]# head /var/log/messages
Nov 7 04:02:05 panda syslogd 1.4.1: restart.
Nov 7 04:02:08 panda kernel: fh_verify: no root_squashed access at working/svm_loss_learn.
Nov 7 04:02:19 panda kernel: fh_verify: no root_squashed access at cesar/rsolver.
Nov 7 04:02:23 panda last message repeated 2 times
Nov 7 04:03:03 panda kernel: rpc-srv/tcp: nfsd: sent only -107 bytes of 268 - shutting down socket
Nov 7 04:03:05 panda kernel: rpc-srv/tcp: nfsd: sent only -107 bytes of 236 - shutting down socket
Nov 7 04:03:06 panda kernel: fh_verify: no root_squashed access at bin/vim.
Nov 7 04:03:06 panda last message repeated 4 times
Nov 7 04:03:19 panda kernel: rpc-srv/tcp: nfsd: sent only -107 bytes of 132 - shutting down socket
Nov 7 04:03:29 panda last message repeated 3 times
---
Guts,

I am also seeing this. We have a few RHEL3 WS boxes mounting an EL3 AS NFS server and they are not producing the errors, but there are errors from a Gentoo box mounting the AS box. Has anything moved forward with resolving this bug?
Paul

Could you try the posted patch or kernels to see if the problem goes away?

Hi Steve, I'd rather not do it on the machine which is displaying the errors, as it is mission critical and the errors are in no way affecting the main role of the box as a samba server. In a week or so I should have a new machine online which I may be able to test on. Paul

I seem to have fixed it by adjusting permissions on my NFS shares. Here's my /etc/exports:
/home/grada (rw,insecure,async,no_root_squash)
/home/gradb (rw,insecure,async,no_root_squash)
And here are the original directory listings for the shares:
drwxr-s--- 64 root stud05 4096 Jan 8 23:39 grada
drwxr-s--- 79 root stud06 4096 Jan 9 02:32 gradb
Changing the permissions to:
drwxr-sr-x 64 root stud05 4096 Jan 8 23:39 grada
drwxr-sr-x 79 root stud06 4096 Jan 9 02:32 gradb
made my "kernel: fh_verify: no root_squashed access at X" errors go away.

Simple fix, for me anyway. Original configuration that caused the fh_verify errors:
Server 1: /etc/exports contained
/local/home 192.168.2.0/24 (rw,no_root_squash,sync)
Client 1: /etc/fstab contained
192.168.2.2:/home /home nfs defaults 0 0
Fixed the client to fix the error:
Server 1: This server stayed the same
Client 1: Changed /etc/fstab to properly mount with:
192.168.2.2:/local/home /home nfs default 0 0
No more errors detected.

I'm going to close the bug since it appears to be a configuration problem.... Please feel free to reopen it if this is not the case.

Um... If you mean the use of the keyword ``defaults'' rather than ``default'' (in Comment 62), then the docs are wrong and need to be fixed. mount(8) says
defaults Use default options: rw, suid, dev, exec, auto, nouser, and async.
Comment 62 also shows a change in the server filesystem name (in the client's fstab); I see this issue with configurations that have (always) had the correct server-side filesystem name in the client's fstab.
Comment 32 suggests specifying some options in /etc/exports on the server that ``have mild security implications'' (according to exports(5)), and therefore aren't particularly attractive. Comment 58 suggests opening up the permissions on enclosing directories (allowing everyone r-x access), which, again, isn't exactly an attractive option when you're talking about people's home directories. If you're talking about something else, then please make it clear what that configuration error is.

Steve, please clarify what "configuration error" you are referring to. Opening up public access to directories is not a solution in the environment I support. The automounter is being used to mount the file systems for which this same error is generated. Therefore, potential changes in /etc/fstab are not an option either. Thank you.