+++ This bug was initially created as a clone of Bug #211293 +++

Description of problem:
NFS imported filesystems from Tru64 do not recognize directories as such, but return 'Not a directory'. From Solaris systems, there seems to be no problem.

Version-Release number of selected component (if applicable):
2.6.18-1.2200.fc5 and 2.6.18-1.2200.fc5smp

How reproducible:
Always

Steps to Reproduce:

Actual results:
1. mount tru64machine:/whatever /mnt
2. ls /mnt/subdir
   ls: /mnt/subdir: Not a directory
3. cd /mnt/subdir
   cd: /mnt/subdir: Not a directory

Expected results:
Behave like 2.6.17-1.2187_FC5 (or any previous kernel), and have NFS working to Tru64.

Additional info:
Trying to mount with 'nfsvers=2' results in other errors (Input/output error).

Also, this looks very unhealthy:
> mount -o nfsvers=2 tru64machine:/whatever /mnt
(seems to work; now without unmounting!)
> mount -o nfsvers=3 tru64machine:/whatever /mnt
> df
Filesystem       1K-blocks      Used Available Use% Mounted on
/dev/sda2        102861656  69463672  28172844  72% /
none                517476         0    517476   0% /dev/shm
tmpfs               517476         0    517476   0% /tmp
tmpfs               517476        48    517428   1% /var/tmp
tru64:/whatever  177743168  86595008  90918656  49% /mnt
tru64:/whatever  177743168  86595008  90918656  49% /mnt

Twice mounted, without error! (And you can go far beyond 2.) Previous versions loudly complained about double mount attempts.

-- Additional comment from deknuydt.be on 2006-10-19 06:56 EST --

I did some experiments with 2.6.18-1.2798.fc6.
*) Bug is still there.
*) Mounting with '-o nfsvers=2' seems to solve it here.
*) Multiple mounting is still possible, as long as you change version numbers; so you can repeat these lines indefinitely:
mount -o nfsvers=3 tru64machine:/whatever /mnt
mount -o nfsvers=2 tru64machine:/whatever /mnt
You'll need the same number of unmounts to get rid of this construct ...

-- Additional comment from deknuydt.be on 2006-11-07 06:38 EST --

I tried 2.6.18-1.2224.fc5 (from testing). Not fixed.
Forcefully mounting with '-o nfsvers=2' seems to alleviate the problem; but for a few directories, you get an 'Input/output error' instead. All in all: unusable. Forcing NFS over TCP or UDP makes no difference.

-- Additional comment from steved on 2006-11-07 08:27 EST --

I wonder if https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=212471 is the same problem...

-- Additional comment from deknuydt.be on 2006-11-12 10:16 EST --

As expected, 2.6.18-1.2239.fc5 does not make any difference.

As for the resemblance to 212471, I'm not sure, because the 'mounting' of the fs works without problem, both manually and with autofs, on Tru64 machines with a single and with multiple active network interfaces. Might be related anyhow though...

Strangely enough, on a single FC6 test machine, I am able to mount a Tru64 fs without problems; but _with_ the same 'Not a directory' problems as in FC5 (that is with 2.6.18-1.2798.fc6).

-- Additional comment from alfredo.ferrari on 2006-11-21 07:42 EST --

I see exactly the same problem with an x86_64 dual CPU system running Fedora 5 fully updated (kernel 2.6.18-1.2239.fc5). The issue appears with 6 different NFS mounts coming from two different Tru64 servers, both with autofs and "normal" NFS. I spent days before realizing today it is just the "new" kernel. Obviously it is a serious issue in mixed system centers like ours. I confirm that forcing nfsvers=2 helps but still leaves issues (and big worries of filesystem corruption; any hint about such a possibility?). If I can help in any way in pinpointing the problem, let me know.

Alfredo Ferrari

-- Additional comment from matthew.amos on 2006-11-22 06:19 EST --

We have exactly the same problem: Tru64 NFS server and dual Opteron workstation running FC5. The latest kernel, 2.6.18-1.2239, leaves NFS mounts almost totally unusable, as it reports that almost all directories under the mount point are not directories. However, "ls -l" still shows them as directories.
Mounts from other Linux, Solaris and IRIX servers are fine, however. With some digging through tcpdump traces, the only difference I can see between the Tru64 server and others is that the Tru64 server uses a very large (or negative) fsid and a very small fileid. For example, Tru64 is returning fileids less than about 32, whereas the other systems are using fileids in the thousands.

If you'd like me to run some tests, or take some more packet dumps, just let me know.

-- Additional comment from steved on 2006-12-04 14:49 EST --

Yes... could you please post some bzip2 binary tethereal network traces of this problem... tia...

-- Additional comment from steved on 2006-12-04 14:53 EST --

An example of what I'm looking for is:
tethereal -w /tmp/bz211293.pcap host <server> ; bzip2 /tmp/bz211293.pcap

-- Additional comment from alfredo.ferrari on 2006-12-04 15:08 EST --

Created an attachment (id=142769)
requested tethereal output

-- Additional comment from alfredo.ferrari on 2006-12-04 15:11 EST --

Voila, 1 test. The server is alf3.cern.ch (192.91.242.170), the client pceet030.cern.ch (192.91.242.30), the mount is via autofs (it doesn't really matter), mounting the remote (alf3) /user1 partition on misc/alf3_user1. I made a few ls /misc/alf3_user1/home and ls /misc/alf3_user1/home/xxx where xxx are various directories (the mount was done with root privileges and I am root while issuing those commands). These ls mostly failed (not all, and not always for the same directory; i.e. ls /misc/alf3_user1/home/alfredo succeeded twice and failed many more times) with:

[root@pceet030 log]# ls /misc/alf3_user1/home/alfredo/
ls: /misc/alf3_user1/home/alfredo/: Not a directory

-- Additional comment from matthew.amos on 2006-12-05 06:24 EST --

Created an attachment (id=142823)
Packet capture while experiencing "not a directory" bug.

The packets were captured while a logged-in user attempted to cd to several sub-directories of his home directory, receiving the "not a directory" error each time.
-- Additional comment from steved on 2006-12-05 18:36 EST --

Unfortunately, neither of the traces showed anything out of the ordinary... In the bz041206.pcap trace, I see the lookup of 'alfredo', and the file type that is being returned is a directory... In both traces, I also see a number of READLINKs, meaning symlinks are somehow involved... which should not matter...

Overall, both traces look like normal NFS traffic... I guess the only oddity is the lack of a non-zero NFS status. I was hoping to see the server return some type of error which might give a clue as to what is happening...

Would it be possible to post a bzip2'd strace of the ls or stat command failing? Something like 'strace -o /tmp/strace.txt ls /mnt/foo'. What I'm looking for is to see which system call is failing, which will (hopefully) give me a starting point...

Also, what is the status of SELinux? On, off? If it's on, please try using the 'setenforce 0' command to turn it off to see if that makes a difference.

And is it true that moving one or two of the mounts out of autofs land and into /etc/fstab makes no difference?

-- Additional comment from deknuydt.be on 2006-12-06 05:36 EST --

Created an attachment (id=142939)
Strace, NFSV3, 'Not a directory'

-- Additional comment from deknuydt.be on 2006-12-06 05:38 EST --

Created an attachment (id=142940)
Strace, NFSV2, 'Input/output error'

-- Additional comment from deknuydt.be on 2006-12-06 05:51 EST --

This is with 2.6.18-1.2849.fc6; mount is manual, not via autofs (but this makes no difference).

> getenforce
Disabled
> # So SELinux is disabled completely
> mount ibrahim:/raid3/users/deknuydt /mnt
> strace -o /tmp/strace.txt ls /mnt/tex
ls: /mnt/tex: Not a directory
> # This is the first attachment
> umount /mnt
> mount -o vers=2 ibrahim:/raid3/users/deknuydt /mnt
> strace -o /tmp/strace1.txt ls /mnt/mail
ls: reading directory /mnt/mail: Input/output error
> # This is the second attachment

-- Additional comment from steved on 2006-12-06 14:26 EST --

For the record... here are the system calls that are failing:

open("/mnt/tex", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = -1 ENOTDIR (Not a directory)

and

getdents(3, 0x61b9c8, 4096) = -1 EIO (Input/output error)

-- Additional comment from steved on 2006-12-06 14:40 EST --

When the 'ls /mnt/tex' returns ENOTDIR, could you post the output of 'stat /mnt/tex'?

-- Additional comment from steved on 2006-12-06 15:22 EST --

Also, are all of these clients 32-bit architectures? Is anybody seeing this problem with a 64-bit client?

-- Additional comment from alfredo.ferrari on 2006-12-06 16:42 EST --

Mine is a 64-bit client.

-- Additional comment from deknuydt.be on 2006-12-07 07:10 EST --

> mount ibrahim:/raid3/users/deknuydt /mnt
> ls /mnt/text
ls: /mnt/text: Not a directory
> stat /mnt/text
  File: `/mnt/text'
  Size: 16384  Blocks: 32  IO Block: 4096  directory
Device: 12h/18d  Inode: 985772  Links: 7
Access: (0755/drwxr-xr-x)  Uid: ( 4653/deknuydt)  Gid: ( 46/ visics)
Access: 2006-12-06 11:13:35.000999000 +0100
Modify: 2006-11-30 13:33:30.493267000 +0100
Change: 2006-11-30 13:33:30.493267000 +0100

> mount -o vers=2 ibrahim:/raid3/users/deknuydt /mnt
> ls /mnt/mail
ls: reading directory /mnt/mail: Input/output error
> stat /mnt/mail
  File: `/mnt/mail'
  Size: 8192  Blocks: 16  IO Block: 4096  directory
Device: 12h/18d  Inode: 981630  Links: 3
Access: (0700/drwx------)  Uid: ( 4653/deknuydt)  Gid: ( 46/ visics)
Access: 2006-12-06 01:15:18.000999000 +0100
Modify: 2006-07-20 17:17:55.006699000 +0200
Change: 2006-07-20 17:17:55.006699000 +0200

Strange
... If I chmod o+rx this mail directory, then suddenly the IO error disappears. So a 'Permission Denied' turns into an 'IO error'. For NFSV2 at least.

For the record: these problems occur both with the i686 and x86_64 kernels, both under FC5 and FC6; in short: all kernels more recent than 2.6.17-1.2187_FC5 have this.

-- Additional comment from matthew.amos on 2006-12-07 07:12 EST --

Ours is 64-bit also. Here is the requested output from ls and stat when experiencing the bug:

> ls CFD
ls: CFD: Not a directory
> stat CFD
  File: `CFD'
  Size: 8192  Blocks: 16  IO Block: 4096  directory
Device: 1ah/26d  Inode: 724277  Links: 20
Access: (0755/drwxr-xr-x)  Uid: ( 5154/ amos)  Gid: ( 110/ mamod)
Access: 2006-12-04 03:39:19.387121000 +0000
Modify: 2006-11-03 10:32:02.870369000 +0000
Change: 2006-11-03 10:32:02.870369000 +0000

In case it is significant, here is an excerpt from /proc/cpuinfo:

vendor_id  : AuthenticAMD
cpu family : 15
model      : 33
model name : Dual Core AMD Opteron(tm) Processor 270
stepping   : 2

-- Additional comment from steved on 2006-12-07 11:03 EST --

Cool... thanks for all the info... What version of Tru64 is everyone using? I'm trying to dig one up so I can reproduce this locally...

-- Additional comment from deknuydt.be on 2006-12-07 11:09 EST --

>ssh ibrahim uname -a
OSF1 ibrahim V5.1 2650 alpha alpha Tru64
>ssh ampere uname -a
OSF1 ampere V5.1 732 alpha alpha Tru64

Reasonably patched :)

-- Additional comment from matthew.amos on 2006-12-07 13:46 EST --

> uname -a
OSF1 alphasc0 V5.1 2650 alpha

-- Additional comment from alfredo.ferrari on 2006-12-07 17:53 EST --

In my case:

alf3> uname -a
OSF1 alf3.cern.ch V4.0 1229 alpha
srveet01> uname -a
OSF1 srveet01.cern.ch V4.0 1229 alpha

-- Additional comment from r.garth.au on 2006-12-10 20:11 EST --

*** Bug 218393 has been marked as a duplicate of this bug.
***

-- Additional comment from steved on 2006-12-14 15:50 EST --

Well, I was finally able to track down a Tru64 box running 'OSF1 <hostname> T5.1 2359 alpha' and (unfortunately) I was *able* to mount the machine w/out any problem... (with FC5, FC5, and RHEL5 B2 clients)... But I did notice that all the Tru64 OS versions that were posted start with a 'V' (i.e. V4.0 or V5.0, etc.) and the version on my box starts with a 'T'... Does that mean anything to anybody?

-- Additional comment from dchapman on 2006-12-14 15:58 EST --

The T5.1 means it was a pre-release V5.1; however, I think this box is running a late T5.1, so I would not expect there to be a significant difference.

-- Additional comment from deknuydt.be on 2006-12-19 06:14 EST --

2.6.18-1.2868.fc6 is out, and still has this problem too. This is taking too long... Should we relabel this to FC6 now?

As for not being able to reproduce this (see comment #27), did you export a UFS or an AdvFS file system on the Tru64?

-- Additional comment from alfredo.ferrari on 2006-12-19 06:28 EST --

This is really taking too long... btw, in my case it is AdvFS file systems which generate the problem (and I have no other file system type to test with on those machines). The problem is clearly common to FC5 and FC6; I would say to all 2.6.18 kernels released for both, at least in my experience. Again, if I can help, let me know.

-- Additional comment from matthew.amos on 2006-12-19 09:18 EST --

All of the exports on our Tru64 server are AdvFS also. Since this server is not under my control, it is not possible for me to test UFS.

I second Alfredo's comments; this seems to be a kernel bug that was introduced between 2.6.16 and 2.6.18. I am happy to test kernel patches, as the affected machine is not a production server.

-- Additional comment from steved on 2006-12-20 21:41 EST --

I can't agree with you more... wrt taking too long... believe me, if the Tru64 system I just acquired showed the problem... I would be all over this... since it is almost guaranteed that the upcoming RHEL5 release will have the same problem...

My next step will be to start going through the diffs between 2.6.16 and 2.6.18 and start throwing out some test kernels...
I have this problem as well. We have two Tru64 5.1 machines and two Tru64 4.0 machines. The problem only occurs for us w/ the 5.1 machines. I captured and examined some packet traces and I don't see anything wrong w/ the transactions themselves. The kernel just seems to be doing the wrong thing w/ the results.

Linux quadra.franz.com 2.6.18-1.2869.fc6 #1 SMP Wed Dec 20 14:51:34 EST 2006 x86_64 x86_64 x86_64 GNU/Linux

Sample bad behavior:

[root@quadra /]# mount epsilon:/acl /mnt
[root@quadra /]# stat /mnt/dancy
  File: `/mnt/dancy'
  Size: 8192  Blocks: 16  IO Block: 4096  directory
Device: 3fh/63d  Inode: 156057  Links: 3
Access: (0755/drwxr-xr-x)  Uid: ( 443/ dancy)  Gid: ( 50/ ftp)
Access: 2007-01-24 15:02:25.592081000 -0800
Modify: 2006-07-18 13:15:34.169051000 -0700
Change: 2006-07-18 13:15:34.169051000 -0700
[root@quadra /]# ls /mnt/dancy
total 24
8 ./  8 ../  8 acl-64/
[root@quadra /]# stat /mnt/dancy/acl-64
  File: `/mnt/dancy/acl-64'
  Size: 8192  Blocks: 16  IO Block: 4096  directory
Device: 3fh/63d  Inode: 95568  Links: 32
Access: (0755/drwxr-xr-x)  Uid: ( 443/ dancy)  Gid: ( 50/ ftp)
Access: 2007-01-24 01:12:59.248123000 -0800
Modify: 2006-05-18 09:03:52.744025000 -0700
Change: 2006-05-18 09:03:52.744025000 -0700
[root@quadra /]# ls /mnt/dancy/acl-64
ls: /mnt/dancy/acl-64: Not a directory
[root@quadra /]# umount /mnt

Clearly bogus. Packet traces available upon request.
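For anyone who wants to check a whole mount for the mismatch shown above (stat reports a directory, yet the directory cannot be entered), something like the following might help. It is only a rough sketch: the function name find_bogus_dirs is made up for illustration, it looks at one directory level only, and a plain permission problem would be flagged too, so treat its output as a hint.

```shell
# Sketch only: list entries that stat() reports as directories but that
# cannot be entered. "find_bogus_dirs" is a hypothetical helper name.
find_bogus_dirs() {
    top=$1
    for entry in "$top"/*; do
        # [ -d ] is answered from stat(); "cd" (run in a subshell so the
        # caller's working directory is untouched) must really enter it.
        if [ -d "$entry" ] && ! (cd "$entry") 2>/dev/null; then
            printf 'stat says directory, but cannot enter: %s\n' "$entry"
        fi
    done
}
```

Running `find_bogus_dirs /mnt/dancy` on an affected client would be expected to print the subdirectories that fail; on a healthy mount it should print nothing.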
2.6.19-1.2911.fc6 does not fix the problem. I guess by now RedHat must have given up on this one?
No! I was hoping that HP was going to bring Tru64 to this year's Connectathon... Unfortunately that was not the case... so I asked the HP people that did attend to send me the name of the support person that could help with this...

I also had a conversation with the upstream NFS maintainer (who also attended Connectathon) and we both are a bit stumped on this one... With the 2.6.20 kernel, there have been some readdir changes that might address this problem... I'm looking into that now...

You'll note that the FS-Cache patches are no longer in the latest FC6 kernel... so that takes them out of play...
Created attachment 149293
Purposed Patch

It turns out that this is not a Linux client problem at all! Although it's true Linux clients do fail against Tru64, they fail because Tru64 servers are sending different fsids for the same filesystem.

Now, it appears only the Linux client actually looks at the values of the fsids being returned, and that started with a patch back in the Linux 2.6.12 time frame. So I have a feeling the Tru64 servers have always been broken... For the gory details, see the upstream posting (I would post the link, but as usual sourceforge.net is down).

So the attached patch does seem to fix the problem, but it's not clear how accepting upstream will be of this patch...
Thanks Steve. I'm keeping my fingers crossed and sacrificing a chicken.
Here is the upstream discussion...

http://sourceforge.net/mailarchive/forum.php?thread_id=31756540&forum_id=4930

It appears this patch will not be accepted. But I would suggest you use this patch until I can figure something else out...

It is true that there is simply no way HP will fix this problem, correct? HP is no longer sending out updates, right?
As far as I know, there is no chance of getting an update from HP. It is even hard to get hardware repairs (the last one was nonsensically expensive). BTW, thanks for all your efforts.
Just curious... is there a way to turn off READDIRPLUS support on Tru64 servers?
With all due respect to Tru64 and HP, perhaps it is time to consider a replacement? :-)
Tru64 does still appear to be maintained to some degree: http://h30097.www3.hp.com/pdf/Tru64_Roadmap_Current.pdf
(In reply to comment #8)
> Just curious... is there a way to turn off READDIRPLUS
> support on Tru64 servers?

Yes! The following worked for me. There is probably some better way to do this so that it remains permanent across boots, but I haven't figured it out yet:

ladebug -k
assign doreaddirplus=0
quit
If anyone is interested, I have hacked up a new nfs_server.mod kernel module for Tru64 which fixes the bug. I have it deployed on a local system and it works fine. I built it against a host that identifies itself as Tru64 V5.1B (Rev: 2650). It can probably be adjusted for other versions as well.
*** Bug 217705 has been marked as a duplicate of this bug. ***
Regarding Comment #11, I have some further advice: You may not have "ladebug" installed on your Tru64 server, but you probably have /usr/bin/dbx. You may then modify the "doreaddirplus" variable until the next reboot by:

# dbx -k /vmunix
(dbx) assign doreaddirplus=0
(dbx) quit

If you want the change to be persistent across reboots, do:

# dbx -k /vmunix
(dbx) patch doreaddirplus=0
(dbx) quit

If you upgrade the kernel, this will have to be repeated.

We have verified that this workaround solves the NFS problem with a Tru64 NFS server and a RHEL5 Linux NFS client.
Fixed in nfs-utils-1.0.10-10.fc6 by adding the -o nordirplus mount option, which will have kernel support in the next kernel update.
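On the client side, using the new option might look roughly like the sketch below. The helper name nordirplus_mount_cmd is made up for illustration, and tru64machine:/whatever and /mnt are just the placeholders from the original report. The helper only prints the command instead of running it, since the real mount needs root plus an nfs-utils and kernel that both know the option.

```shell
# Sketch only: print (do not run) a mount command that disables
# READDIRPLUS via the nordirplus option mentioned in the fix.
# "nordirplus_mount_cmd" is a hypothetical helper name.
nordirplus_mount_cmd() {
    server_export=$1   # e.g. tru64machine:/whatever (placeholder)
    mount_point=$2     # e.g. /mnt
    printf 'mount -o nordirplus %s %s\n' "$server_export" "$mount_point"
}

nordirplus_mount_cmd tru64machine:/whatever /mnt
```

The printed line can then be run as root once the updated packages are installed. Unlike the doreaddirplus change on the Tru64 side, this leaves the server untouched.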