Bug 220451 - NFS to Tru64 badly broken
Status: CLOSED RAWHIDE
Product: Fedora
Classification: Fedora
Component: kernel
Version: 6
Hardware: i686 Linux
Priority: medium
Severity: high
Assigned To: Steve Dickson
Brian Brock
Duplicates: 217705
Depends On: 211293
Reported: 2006-12-21 08:11 EST by Steve Dickson
Modified: 2007-11-30 17:11 EST (History)
Doc Type: Bug Fix
Last Closed: 2007-08-29 19:46:50 EDT

Attachments
Purposed Patch (1.41 KB, text/x-patch)
2007-03-05 16:05 EST, Steve Dickson
Description Steve Dickson 2006-12-21 08:11:33 EST
+++ This bug was initially created as a clone of Bug #211293 +++

Description of problem:

NFS imported filesystems from Tru64 do not recognize directories as such,
but return 'Not a directory'.

From Solaris systems, there seems to be no prob.

Version-Release number of selected component (if applicable):

2.6.18-1.2200.fc5 and 2.6.18-1.2200.fc5smp

How reproducible:

Always


Steps to Reproduce:

1. mount tru64machine:/whatever /mnt
2. ls /mnt/subdir
3. cd /mnt/subdir

Actual results:

ls: /mnt/subdir: Not a directory
cd: /mnt/subdir: Not a directory

Expected results:

Behave like 2.6.17-1.2187_FC5 (or any previous kernel), and have NFS
working to Tru64

Additional info:

Trying to mount with 'nfsvers=2' results in other errors (Input/output error).
Also, this looks very unhealthy:

> mount -o nfsvers=2 tru64machine:/whatever /mnt
(seems to work; now without unmounting!)
> mount -o nfsvers=3 tru64machine:/whatever /mnt
> df 
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda2            102861656  69463672  28172844  72% /
none                    517476         0    517476   0% /dev/shm
tmpfs                   517476         0    517476   0% /tmp
tmpfs                   517476        48    517428   1% /var/tmp
tru64:/whatever
                     177743168  86595008  90918656  49% /mnt
tru64:/whatever
                     177743168  86595008  90918656  49% /mnt

Twice mounted, without error! (And you can go far beyond 2).

Previous versions loudly complained for double mount attempts.

-- Additional comment from deknuydt@esat.kuleuven.be on 2006-10-19 06:56 EST --
I did some experiments with 2.6.18-1.2798.fc6.

*) Bug is still there.
*) Mounting with '-o nfsvers=2' seems to solve it here.
*) Multiple mounting is still possible, as long as you
   change version numbers; so you can repeat these lines
   indefinitely:
 
    mount -o nfsvers=3 tru64machine:/whatever /mnt
    mount -o nfsvers=2 tru64machine:/whatever /mnt

   You'll need the same number of unmounts to get rid of this
   construct ...
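
One way to see how many unmounts such a stack needs is to count repeated
mount points in the mount table. The following is only a sketch: the helper
name stacked_mounts is made up for illustration, and it assumes input in the
/proc/mounts layout (device, mount point, fstype, options, ...).

```shell
#!/bin/sh
# Hypothetical helper: list mount points that occur more than once.
# Input layout assumed to match /proc/mounts:
#   device mountpoint fstype options dump pass
stacked_mounts() {
    # $1: mounts table to read (defaults to /proc/mounts)
    awk '{ count[$2]++ }
         END {
             for (mp in count)
                 if (count[mp] > 1)
                     print count[mp], mp   # "<times mounted> <mount point>"
         }' "${1:-/proc/mounts}"
}

# Usage: stacked_mounts            # inspect the live mount table
#        stacked_mounts saved.txt  # inspect a saved copy
```

Each line of output gives the number of stacked mounts followed by the mount
point, i.e. the number of umount calls that mount point would need.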

-- Additional comment from deknuydt@esat.kuleuven.be on 2006-11-07 06:38 EST --
I tried 2.6.18-1.2224.fc5 (from testing).  Not fixed.

Forcefully mounting with '-o nfsvers=2' seems to alleviate the problem; but
for a few directories, you get an 'Input/output error' instead.  All in all:
unusable.  Forcing NFS over TCP or UDP makes no difference.



-- Additional comment from steved@redhat.com on 2006-11-07 08:27 EST --
I wonder if https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=212471
is the same problem...

-- Additional comment from deknuydt@esat.kuleuven.be on 2006-11-12 10:16 EST --
As expected, 2.6.18-1.2239.fc5 does not make any difference.

As for the resemblance to 212471, I'm not sure, because the 'mounting' of the fs works
without problems, both manually and with autofs, on Tru64 machines with a single and
with multiple active network interfaces. It might be related anyhow though...

Strangely enough, on a single FC6 test machine, I am able to mount a Tru64 fs
without problems; but _with_ the same 'Not a directory' problems as in FC5 (that
is with 2.6.18-1.2798.fc6). 

-- Additional comment from alfredo.ferrari@cern.ch on 2006-11-21 07:42 EST --
I see exactly the same problem with a x86_64 dual CPU system running fedora 5
fully updated (kernel 2.6.18-1.2239.fc5). The issue appears with 6 different nfs
mounts coming from two different Tru64 servers both with autofs and "normal"
nfs. I spent days before realizing today it is just the "new" kernel. Obviously
it is a serious issue in mixed-system centers like ours. I confirm that forcing
nfsvers=2 helps but still leaves issues (and big worries about filesystem
corruption; any hint about such a possibility?)
If I can help in any way in pinpointing the problem let me know.

Alfredo Ferrari

-- Additional comment from matthew.amos@baesystems.com on 2006-11-22 06:19 EST --
We have exactly the same problem: Tru64 NFS server and dual Opteron workstation
running FC5, the latest kernel 2.6.18-1.2239 leaves NFS mounts almost totally
unusable, as it reports that almost all directories under the mount point are
not directories. However, "ls -l" still shows them as directories. Mounts from
other Linux, Solaris and IRIX servers are fine, however.

With some digging through tcpdump traces the only difference I can see between
the Tru64 server and others is that the Tru64 server uses a very large (or
negative) fsid and a very small fileid. For example, Tru64 is returning fileids
less than about 32, whereas the other systems are using fileids in the thousands.

If you'd like me to run some tests, or take some more packet dumps, just let me
know.


-- Additional comment from steved@redhat.com on 2006-12-04 14:49 EST --
Yes... could you please post some bzip2'd binary tethereal network traces
of this problem... tia...



-- Additional comment from steved@redhat.com on 2006-12-04 14:53 EST --
An example of what I'm looking for is:

   tethereal -w /tmp/bz211293.pcap host <server> ; bzip2 /tmp/bz211293.pcap

-- Additional comment from alfredo.ferrari@cern.ch on 2006-12-04 15:08 EST --
Created an attachment (id=142769)
requested tethereal output


-- Additional comment from alfredo.ferrari@cern.ch on 2006-12-04 15:11 EST --
Voilà, one test. The server is alf3.cern.ch (192.91.242.170), the client is
pceet030.cern.ch (192.91.242.30), and the mount is via autofs (it doesn't really
matter), mounting the remote (alf3) /user1 partition on /misc/alf3_user1.
I ran a few
  ls /misc/alf3_user1/home
and
  ls /misc/alf3_user1/home/xxx
where xxx are various directories (the mount was made with root privileges and I
was root while issuing those commands). These ls commands mostly failed (not all,
and not always for the same directory; e.g. ls /misc/alf3_user1/home/alfredo
succeeded twice and failed many more times) with

[root@pceet030 log]# ls /misc/alf3_user1/home/alfredo/
ls: /misc/alf3_user1/home/alfredo/: Not a directory

-- Additional comment from matthew.amos@baesystems.com on 2006-12-05 06:24 EST --
Created an attachment (id=142823)
Packet capture while experiencing "not a directory" bug.

The packets were captured while a logged-in user attempted to cd to several
sub-directories of his home directory, receiving the "not a directory" error
each time.

-- Additional comment from steved@redhat.com on 2006-12-05 18:36 EST --
Unfortunately, neither of the traces showed anything out of the
ordinary... In the bz041206.pcap trace, I see the lookup of 'alfredo',
and the file type being returned is a directory...
In both traces, I also see a number of READLINKs, meaning
that symlinks are somehow involved... which should not matter...
Overall, both traces look like normal NFS traffic...

I guess the only oddity is the lack of any non-zero NFS status.
I was hoping to see the server return some type of error
which might give a clue as to what is happening...

Would it be possible to post a bzip2'd strace of the
ls or stat command failing? Something like
'strace -o /tmp/strace.txt ls /mnt/foo'. What I'm looking
for is which system call is failing, which will
(hopefully) give me a starting point...

Also, what is the status of SELinux? On or off? If it's
on, please try using the 'setenforce 0' command to
turn it off and see if that makes a difference.

And is it true that moving one or two of the mounts out of
autofs land and into /etc/fstab makes no difference?


-- Additional comment from deknuydt@esat.kuleuven.be on 2006-12-06 05:36 EST --
Created an attachment (id=142939)
Strace, NFSV3, 'Not a directory'


-- Additional comment from deknuydt@esat.kuleuven.be on 2006-12-06 05:38 EST --
Created an attachment (id=142940)
Strace, NFSV2, 'Input/output error'


-- Additional comment from deknuydt@esat.kuleuven.be on 2006-12-06 05:51 EST --
This is with 2.6.18-1.2849.fc6; mount is manual, not via autofs (but this makes
no difference).

> getenforce
Disabled
> # So SELinux is disabled completely
> mount ibrahim:/raid3/users/deknuydt /mnt
> strace -o /tmp/strace.txt ls /mnt/tex
ls: /mnt/tex: Not a directory
> # This is the first attachment


> umount /mnt
> mount -o vers=2 ibrahim:/raid3/users/deknuydt /mnt
> strace -o /tmp/strace1.txt ls /mnt/mail
ls: reading directory /mnt/mail: Input/output error
> # This is the second attachment

-- Additional comment from steved@redhat.com on 2006-12-06 14:26 EST --
For the record.... here are the system calls that are failing:

open("/mnt/tex", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = -1 ENOTDIR (Not a directory)
and
getdents(3, 0x61b9c8, 4096)             = -1 EIO (Input/output error)
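
The symptom can be checked from a script by comparing a stat()-based test
with an actual attempt to list the directory. This is a rough sketch, not a
definitive diagnostic: the function name check_dir is invented here, and a
"mismatch" can also be caused by ordinary permission problems, so it only
narrows things down.

```shell
#!/bin/sh
# Hypothetical helper: compare what stat() reports with what listing does.
#   ok              - stat() says directory and it can be listed
#   mismatch        - stat() says directory but listing fails
#                     (ENOTDIR as above, but also e.g. EACCES or EIO)
#   not-a-directory - stat() itself does not report a directory
check_dir() {
    path=$1
    if [ ! -d "$path" ]; then
        echo "not-a-directory"
    elif ls -- "$path/" >/dev/null 2>&1; then
        echo "ok"
    else
        echo "mismatch"
    fi
}

# Usage: check_dir /mnt/tex
```

Running it over every entry under the mount point gives a quick list of the
directories that stat as directories but cannot be opened as such.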


-- Additional comment from steved@redhat.com on 2006-12-06 14:40 EST --
When the 'ls /mnt/tex' returns ENOTDIR, could you post the 
output of 'stat /mnt/tex'.



-- Additional comment from steved@redhat.com on 2006-12-06 15:22 EST --
Also, are all of these clients 32bit architectures? Is anybody seeing this
problem with a 64bit client?

-- Additional comment from alfredo.ferrari@cern.ch on 2006-12-06 16:42 EST --
Mine is a 64-bit client.

-- Additional comment from deknuydt@esat.kuleuven.be on 2006-12-07 07:10 EST --
> mount ibrahim:/raid3/users/deknuydt /mnt
> ls /mnt/text
ls: /mnt/text: Not a directory
> stat /mnt/text
  File: `/mnt/text'
  Size: 16384     	Blocks: 32         IO Block: 4096   directory
Device: 12h/18d	Inode: 985772      Links: 7
Access: (0755/drwxr-xr-x)  Uid: ( 4653/deknuydt)   Gid: (   46/  visics)
Access: 2006-12-06 11:13:35.000999000 +0100
Modify: 2006-11-30 13:33:30.493267000 +0100
Change: 2006-11-30 13:33:30.493267000 +0100

> mount -o vers=2 ibrahim:/raid3/users/deknuydt /mnt
> ls /mnt/mail
ls: reading directory /mnt/mail: Input/output error
> stat /mnt/mail
  File: `/mnt/mail'
  Size: 8192      	Blocks: 16         IO Block: 4096   directory
Device: 12h/18d	Inode: 981630      Links: 3
Access: (0700/drwx------)  Uid: ( 4653/deknuydt)   Gid: (   46/  visics)
Access: 2006-12-06 01:15:18.000999000 +0100
Modify: 2006-07-20 17:17:55.006699000 +0200
Change: 2006-07-20 17:17:55.006699000 +0200

Strange ... If I chmod o+rx this mail directory, then suddenly the IO error
disappears.  So a 'Permission Denied' turns into an 'IO error'.  For NFSV2 at least.

For the record: these problems occur with both the i686 and x86_64 kernels, under
both FC5 and FC6; in short: all kernels more recent than 2.6.17-1.2187_FC5 have
this.

-- Additional comment from matthew.amos@baesystems.com on 2006-12-07 07:12 EST --
Ours is 64-bit also. Here is the requested output from ls and stat when
experiencing the bug:

> ls CFD
ls: CFD: Not a directory
> stat CFD
  File: `CFD'
  Size: 8192            Blocks: 16         IO Block: 4096   directory
Device: 1ah/26d Inode: 724277      Links: 20
Access: (0755/drwxr-xr-x)  Uid: ( 5154/    amos)   Gid: (  110/   mamod)
Access: 2006-12-04 03:39:19.387121000 +0000
Modify: 2006-11-03 10:32:02.870369000 +0000
Change: 2006-11-03 10:32:02.870369000 +0000

In case it is significant, here is an excerpt from /proc/cpuinfo.

vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : Dual Core AMD Opteron(tm) Processor 270
stepping        : 2


-- Additional comment from steved@redhat.com on 2006-12-07 11:03 EST --
Cool... thanks for all the info...

What version of Tru64 is everyone using... I'm trying to dig one up 
so I can reproduce this locally... 

-- Additional comment from deknuydt@esat.kuleuven.be on 2006-12-07 11:09 EST --
>ssh ibrahim uname -a
OSF1 ibrahim V5.1 2650 alpha alpha Tru64

>ssh ampere uname -a
OSF1 ampere V5.1 732 alpha alpha Tru64

Reasonably patched :)

-- Additional comment from matthew.amos@baesystems.com on 2006-12-07 13:46 EST --
> uname -a
OSF1 alphasc0 V5.1 2650 alpha


-- Additional comment from alfredo.ferrari@cern.ch on 2006-12-07 17:53 EST --
In my case
alf3> uname -a
OSF1 alf3.cern.ch V4.0 1229 alpha

srveet01> uname -a
OSF1 srveet01.cern.ch V4.0 1229 alpha

-- Additional comment from r.garth@uws.edu.au on 2006-12-10 20:11 EST --
*** Bug 218393 has been marked as a duplicate of this bug. ***

-- Additional comment from steved@redhat.com on 2006-12-14 15:50 EST --
Well I was finally able to track down a Tru64 box running
'OSF1  <hostname> T5.1 2359 alpha' and (unfortunately)
I was *able* to mount the machine w/out any problem...
(with FC5, FC5, and RHEL5 B2 clients).....

But I did notice that all the Tru64 OS versions that were posted
start with a 'V' (i.e. V4.0 or V5.0, etc.) and the version on my box
starts with a 'T'.... Does that mean anything to anybody?

-- Additional comment from dchapman@redhat.com on 2006-12-14 15:58 EST --
The T5.1 means it was a pre-released V5.1 however I think this box is running a
late T5.1 so I would not expect there to be a significant difference.



-- Additional comment from deknuydt@esat.kuleuven.be on 2006-12-19 06:14 EST --
2.6.18-1.2868.fc6 is out, and still has this problem too.  This is taking too
long...

Should we relabel this to FC6 now?

As for not being able to reproduce this (see comment #27), did you export
a UFS or a AdvFS file system on the Tru64?

-- Additional comment from alfredo.ferrari@cern.ch on 2006-12-19 06:28 EST --
This is really taking too long... BTW, in my case it is AdvFS file systems
which generate the problem (and I have no other file system type to test with
on those machines). The problem is clearly common to FC5 and FC6, I would say to
all 2.6.18 kernels released for both, at least in my experience. Again, if I can
help, let me know.

-- Additional comment from matthew.amos@baesystems.com on 2006-12-19 09:18 EST --
All of the exports on our Tru64 server are AdvFS also. Since this server is not
under my control it is not possible for me to test UFS.

I second Alfredo's comments; this seems to be a kernel bug that was introduced
between 2.6.16 and 2.6.18. I am happy to test kernel patches, as the affected
machine is not a production server.


-- Additional comment from steved@redhat.com on 2006-12-20 21:41 EST --
I can't agree with you more.... wrt taking too long... believe me,
if the Tru64 system I just acquired showed the problem... I would
be all over this... since it is almost guaranteed that the upcoming RHEL5
release will have the same problem...

My next step will be to start going through the diffs
between 2.6.16 and 2.6.18 and start throwing out
some test kernels...
Comment 1 Ahmon Dancy 2007-01-24 18:05:24 EST
I have this problem as well.  We have two Tru64 5.1 machines and two Tru64 4.0
machines.  The problem only occurs for us w/ the 5.1 machines.  I captured and
examined some packet traces and I don't see anything wrong w/ the transactions
themselves.  The kernel just seems to be doing the wrong thing w/ the results.  

Linux quadra.franz.com 2.6.18-1.2869.fc6 #1 SMP Wed Dec 20 14:51:34 EST 2006
x86_64 x86_64 x86_64 GNU/Linux

Sample bad behavior:

[root@quadra /]# mount epsilon:/acl /mnt
[root@quadra /]# stat /mnt/dancy
  File: `/mnt/dancy'
  Size: 8192            Blocks: 16         IO Block: 4096   directory
Device: 3fh/63d Inode: 156057      Links: 3
Access: (0755/drwxr-xr-x)  Uid: (  443/   dancy)   Gid: (   50/     ftp)
Access: 2007-01-24 15:02:25.592081000 -0800
Modify: 2006-07-18 13:15:34.169051000 -0700
Change: 2006-07-18 13:15:34.169051000 -0700
[root@quadra /]# ls /mnt/dancy
total 24
8 ./  8 ../  8 acl-64/
[root@quadra /]# stat /mnt/dancy/acl-64
  File: `/mnt/dancy/acl-64'
  Size: 8192            Blocks: 16         IO Block: 4096   directory
Device: 3fh/63d Inode: 95568       Links: 32
Access: (0755/drwxr-xr-x)  Uid: (  443/   dancy)   Gid: (   50/     ftp)
Access: 2007-01-24 01:12:59.248123000 -0800
Modify: 2006-05-18 09:03:52.744025000 -0700
Change: 2006-05-18 09:03:52.744025000 -0700
[root@quadra /]# ls /mnt/dancy/acl-64
ls: /mnt/dancy/acl-64: Not a directory
[root@quadra /]# umount /mnt

Clearly bogus.


Packet traces available upon request.

Comment 2 Bert DeKnuydt 2007-02-14 07:28:28 EST
2.6.19-1.2911.fc6 does not fix the problem.

I guess by now RedHat must have given up on this one?  
Comment 3 Steve Dickson 2007-02-14 12:06:39 EST
No! I was hoping that HP was going to bring Tru64 to this year's
Connectathon... Unfortunately that was not the case... so I 
asked the HP people that did attend to send me the name
of the support person that could help with this... 

I also had a conversation with the upstream NFS maintainer
(who also attended Connectathon) and we are both a bit
stumped on this one...

With the 2.6.20 kernel, there have been some readdir changes
that might address this problem... I'm looking into
that now... You'll note that the FS-Cache patches are 
no longer in the latest FC6 kernel... so that takes them out
of play... 
Comment 4 Steve Dickson 2007-03-05 16:05:13 EST
Created attachment 149293 [details]
Purposed Patch

It turns out that this is not a Linux client problem at all!
Although it's true that Linux clients do fail against Tru64 servers,
they fail because the Tru64 servers are sending different
fsids for the same filesystem. 

Now it appears only the Linux client actually looks
at the values of the fsids being returned, and that 
started with a patch back in the Linux 2.6.12 time frame.
So I have a feeling the Tru64 servers have always been
broken... 

For the gory details see the upstream posting
(I would post the link but as usual sourceforge.net is down) 

So the attached patch does seem to fix the problem,
but it's not clear how accepting upstream will be
of this patch...
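
For anyone going through their own traces, the condition described above
(more than one fsid for what should be a single filesystem) can be checked
mechanically. This is only an illustrative sketch and not the kernel's
actual check: the function name fsid_consistent and the "fsid name" input
layout are assumptions, with the pairs copied by hand from a decoded trace.

```shell
#!/bin/sh
# Hypothetical helper: read "fsid name" pairs on stdin, one per line.
# Prints the number of distinct fsids seen and exits non-zero when
# more than one fsid shows up for what should be a single export.
fsid_consistent() {
    awk 'NF && !($1 in seen) { seen[$1] = 1; n++ }
         END { print n + 0; exit (n > 1) }'
}

# Usage (the values below are made up):
#   printf '%s\n' '0x1a2b home' '0x1a2b home/alfredo' | fsid_consistent
```

A count above 1 for entries that all live on the same exported filesystem
would match the behaviour described in this comment.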
Comment 5 Ahmon Dancy 2007-03-05 16:36:43 EST
Thanks Steve.  I'm keeping my fingers crossed and sacrificing a chicken.
Comment 6 Steve Dickson 2007-03-06 07:12:15 EST
Here is the upstream discussion... 

http://sourceforge.net/mailarchive/forum.php?thread_id=31756540&forum_id=4930

It appears this patch will not be accepted. But I would 
suggest you use this patch until I can figure something
else out... 

Is it true that there is simply no way HP will fix this 
problem? HP is no longer sending out updates,
right?
Comment 7 Alfredo Ferrari 2007-03-06 08:34:17 EST
As far as I know, there is no chance of getting an update from HP. It is even hard
to get hardware repairs (the last one was nonsensically expensive).
BTW, thanks for all your efforts.
Comment 8 Steve Dickson 2007-03-06 09:52:11 EST
Just curious... is there a way to turn off READDIRPLUS
support on Tru64 servers?
Comment 9 Peter Staubach 2007-03-06 09:52:53 EST
With all due respect to Tru64 and HP, perhaps it is time to consider
a replacement?  :-)
Comment 10 Ahmon Dancy 2007-03-06 11:41:52 EST
Tru64 does still appear to be maintained to some degree:

http://h30097.www3.hp.com/pdf/Tru64_Roadmap_Current.pdf
Comment 11 Ahmon Dancy 2007-03-08 15:35:54 EST
(In reply to comment #8)
> Just curious... is there a way to turn off READDIRPLUS
> support on Tru64 servers?

Yes!  The following worked for me.  There is probably some better way to do this
so that it remains permanent across boots, but I haven't figured it out yet:

ladebug -k
assign doreaddirplus=0
quit




Comment 12 Ahmon Dancy 2007-03-08 19:47:21 EST
If anyone is interested, I have hacked up a new nfs_server.mod kernel module for
Tru64 which fixes the bug.  I have it deployed on a local system and it works
fine.   I built it against a host that identifies itself as Tru64 V5.1B (Rev:
2650).  It can probably be adjusted for other versions as well.
Comment 13 Steve Dickson 2007-03-12 16:21:34 EDT
*** Bug 217705 has been marked as a duplicate of this bug. ***
Comment 14 Ole Holm Nielsen 2007-04-17 16:48:04 EDT
Regarding Comment #11 I have some further advice:  You may not have "ladebug" 
installed on your Tru64 server, but probably you have /usr/bin/dbx.  
You may then modify the "doreaddirplus" variable until the next reboot by:

# dbx -k /vmunix
(dbx) assign doreaddirplus=0
(dbx) quit

If you want the change to be persistent across reboots do:

# dbx -k /vmunix
(dbx) patch doreaddirplus=0
(dbx) quit

If you upgrade the kernel this will have to be repeated.

We have verified that this workaround solves the NFS problem
with a Tru64 NFS server and a RHEL5 Linux NFS client. 
Comment 15 Steve Dickson 2007-05-15 11:00:41 EDT
Fixed in nfs-utils-1.0.10-10.fc6 by adding the -o nordirplus mount option;
the matching kernel support will be in the next kernel update.
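
As a sketch of how the new option would be used on the client, the helper
below only prints the mount command instead of running it. The function name
is made up, and the export and mount point are the placeholders from the
original report. It assumes an nfs-utils and kernel that both know about
nordirplus.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a mount command that asks the
# client not to use READDIRPLUS against the given server export.
nordirplus_mount_cmd() {
    # $1: server export, e.g. tru64machine:/whatever
    # $2: local mount point, e.g. /mnt
    printf 'mount -t nfs -o nordirplus %s %s\n' "$1" "$2"
}

# Usage: nordirplus_mount_cmd tru64machine:/whatever /mnt
```

The printed line can be run as root once the updated packages are installed.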
