211293 – NFS to Tru64 badly broken

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	5
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Steve Dickson
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:	bzcl34nup
Duplicates (1):	218393
Depends On:
Blocks:	218393 220451

Reported:	2006-10-18 15:08 UTC by Bert DeKnuydt
Modified:	2008-05-01 10:50 UTC
CC List:	8 users
Fixed In Version:	nfs-utils-1.0.10-10
Clone Of:
Environment:
Last Closed:	2008-05-01 10:50:20 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
requested tethereal output (60.21 KB, application/x-bzip) 2006-12-04 20:08 UTC, Alfredo Ferrari
Packet capture while experiencing "not a directory" bug. (2.75 KB, application/octet-stream) 2006-12-05 11:24 UTC, Matt Amos
Strace, NFSV3, 'Not a directory' (2.39 KB, application/x-bzip2) 2006-12-06 10:36 UTC, Bert DeKnuydt
Strace, NFSV2, 'Input/output error' (2.52 KB, application/x-bzip2) 2006-12-06 10:38 UTC, Bert DeKnuydt

Description Bert DeKnuydt 2006-10-18 15:08:19 UTC

Description of problem:

On filesystems NFS-mounted from Tru64 servers, directories are not recognized
as such; accessing them returns 'Not a directory'.

With mounts from Solaris systems, there seems to be no problem.

Version-Release number of selected component (if applicable):

2.6.18-1.2200.fc5 and 2.6.18-1.2200.fc5smp

How reproducible:

Always


Steps to Reproduce:

1. mount tru64machine:/whatever /mnt
2. ls /mnt/subdir
3. cd /mnt/subdir

Actual results:

ls: /mnt/subdir: Not a directory
cd: /mnt/subdir: Not a directory

Expected results:

Behave like 2.6.17-1.2187_FC5 (or any previous kernel), and have NFS
working to Tru64

Additional info:

Trying to mount with 'nfsvers=2' results in other errors (Input/output error).
Also, this looks very unhealthy:

> mount -o nfsvers=2 tru64machine:/whatever /mnt
(seems to work; now without unmounting!)
> mount -o nfsvers=3 tru64machine:/whatever /mnt
> df 
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda2            102861656  69463672  28172844  72% /
none                    517476         0    517476   0% /dev/shm
tmpfs                   517476         0    517476   0% /tmp
tmpfs                   517476        48    517428   1% /var/tmp
tru64:/whatever
                     177743168  86595008  90918656  49% /mnt
tru64:/whatever
                     177743168  86595008  90918656  49% /mnt

Twice mounted, without error! (And you can go far beyond 2).

Previous versions complained loudly about double mount attempts.
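
One way to spot stacked mounts like the ones shown by df above is to count how often each mount point appears in the mount table. This is only a minimal sketch, assuming a Linux client where /proc/mounts is available; the function name dup_mounts is hypothetical.

```shell
# dup_mounts [FILE]: print "COUNT MOUNTPOINT" for every mount point that is
# listed more than once in a mount table (field 2 is the mount point).
# FILE defaults to /proc/mounts (assumption: Linux client).
dup_mounts() {
    awk '{ count[$2]++ }
         END { for (m in count) if (count[m] > 1) print count[m], m }' \
        "${1:-/proc/mounts}"
}
```

Running dup_mounts with no argument after the two mount commands above would be expected to list /mnt with a count of 2 or more; no output means no mount point is stacked.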

Comment 1 Bert DeKnuydt 2006-10-19 10:56:33 UTC

I did some experiments with 2.6.18-1.2798.fc6.

*) Bug is still there.
*) Mounting with '-o nfsvers=2' seems to solve it here.
*) Multiple mounting is still possible, as long as you
   change version numbers; so you can repeat these lines
   indefinitely:
 
    mount -o nfsvers=3 tru64machine:/whatever /mnt
    mount -o nfsvers=2 tru64machine:/whatever /mnt

   You'll need the same number of unmounts to get rid of this
   construct ...

Comment 2 Bert DeKnuydt 2006-11-07 11:38:27 UTC

I tried 2.6.18-1.2224.fc5 (from testing).  Not fixed.

Forcefully mounting with '-o nfsvers=2' seems to alleviate the problem; but
for a few directories, you get an 'Input/output error' instead.  All in all:
unusable.  Forcing NFS over TCP or UDP makes no difference.

Comment 3 Steve Dickson 2006-11-07 13:27:27 UTC

I wonder if https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=212471
is the same problem...

Comment 4 Bert DeKnuydt 2006-11-12 15:16:12 UTC

As expected, 2.6.18-1.2239.fc5 does not make any difference.

As for the resemblance to 212471, I'm not sure, because the 'mounting' of the fs works
without problem, both manually and with autofs, on Tru64 machines with a single and
with multiple active network interfaces. Might be related anyhow though...

Strangely enough, on a single FC6 test machine, I am able to mount a Tru64 fs
without problems; but _with_ the same 'Not a directory' problems as in FC5 (that
is with 2.6.18-1.2798.fc6).

Comment 5 Alfredo Ferrari 2006-11-21 12:42:33 UTC

I see exactly the same problem with an x86_64 dual-CPU system running Fedora 5,
fully updated (kernel 2.6.18-1.2239.fc5). The issue appears with 6 different NFS
mounts coming from two different Tru64 servers, both with autofs and "normal"
NFS. I spent days before realizing today that it is just the "new" kernel. Obviously
it is a serious issue in mixed-system centers like ours. I confirm that forcing
nfsvers=2 helps but still leaves issues (and big worries about filesystem
corruption; any hint about such a possibility?).
If I can help in any way in pinpointing the problem, let me know.

Alfredo Ferrari

Comment 6 Matt Amos 2006-11-22 11:19:57 UTC

We have exactly the same problem: a Tru64 NFS server and a dual Opteron workstation
running FC5. The latest kernel, 2.6.18-1.2239, leaves NFS mounts almost totally
unusable, as it reports that almost all directories under the mount point are
not directories. However, "ls -l" still shows them as directories. Mounts from
other Linux, Solaris and IRIX servers are fine.

With some digging through tcpdump traces the only difference I can see between
the Tru64 server and others is that the Tru64 server uses a very large (or
negative) fsid and a very small fileid. For example, Tru64 is returning fileids
less than about 32, whereas the other systems are using fileids in the thousands.

If you'd like me to run some tests, or take some more packet dumps, just let me
know.

Comment 7 Steve Dickson 2006-12-04 19:49:33 UTC

Yes... could you please post some bzip2'd binary tethereal network traces
of this problem... tia...

Comment 8 Steve Dickson 2006-12-04 19:53:20 UTC

An example of what I'm looking for is:

   tethereal -w /tmp/bz211293.pcap host <server> ; bzip2 /tmp/bz211293.pcap

Comment 9 Alfredo Ferrari 2006-12-04 20:08:35 UTC

Created attachment 142769
requested tethereal output

Comment 10 Alfredo Ferrari 2006-12-04 20:11:01 UTC

Voilà, one test. The server is alf3.cern.ch (192.91.242.170), the client
pceet030.cern.ch (192.91.242.30), and the mount is via autofs (it doesn't really
matter), mounting the remote (alf3) /user1 partition on /misc/alf3_user1.
I made a few
  ls /misc/alf3_user1/home
and
  ls /misc/alf3_user1/home/xxx
where xxx are various directories (the mount was made with root privileges and I
was root while issuing those commands). These ls mostly failed (not all, and not
always for the same directory; e.g. ls /misc/alf3_user1/home/alfredo succeeded
twice and failed many more times) with

[root@pceet030 log]# ls /misc/alf3_user1/home/alfredo/
ls: /misc/alf3_user1/home/alfredo/: Not a directory

Comment 11 Matt Amos 2006-12-05 11:24:21 UTC

Created attachment 142823
Packet capture while experiencing "not a directory" bug.

The packets were captured while a logged-in user attempted to cd to several
sub-directories of his home directory, receiving the "not a directory" error
each time.

Comment 12 Steve Dickson 2006-12-05 23:36:08 UTC

Unfortunately, neither of the traces showed anything out of the
ordinary... In the bz041206.pcap trace, I see the lookup of 'alfredo'
and the file type being returned is a directory...
In both traces, I also see a number of READLINKs, meaning
symlinks are somehow involved... which should not matter...
Overall, both traces look like normal NFS traffic...

I guess the only oddity is the lack of any non-zero NFS status.
I was hoping to see the server return some type of error
which might give a clue as to what is happening...

Would it be possible to post a bzip2'd strace of the
ls or stat command failing? Something like
'strace -o /tmp/strace.txt ls /mnt/foo'. What I'm looking
for is to see which system call is failing, which will
(hopefully) give me a starting point...

Also, what is the status of SELinux? On or off? If it's
on, please try using the 'setenforce 0' command to
turn it off to see if that makes a difference.

And is it true that moving one or two of the mounts out of
autofs land and into /etc/fstab makes no difference?

Comment 13 Bert DeKnuydt 2006-12-06 10:36:34 UTC

Created attachment 142939
Strace, NFSV3, 'Not a directory'

Comment 14 Bert DeKnuydt 2006-12-06 10:38:21 UTC

Created attachment 142940
Strace, NFSV2, 'Input/output error'

Comment 15 Bert DeKnuydt 2006-12-06 10:51:04 UTC

This is with 2.6.18-1.2849.fc6; mount is manual, not via autofs (but this makes
no difference).

> getenforce
Disabled
> # So SELinux is disabled completely
> mount ibrahim:/raid3/users/deknuydt /mnt
> strace -o /tmp/strace.txt ls /mnt/tex
ls: /mnt/tex: Not a directory
> # This is the first attachment


> umount /mnt
> mount -o vers=2 ibrahim:/raid3/users/deknuydt /mnt
> strace -o /tmp/strace1.txt ls /mnt/mail
ls: reading directory /mnt/mail: Input/output error
> # This is the second attachment

Comment 16 Steve Dickson 2006-12-06 19:26:10 UTC

For the record... here are the system calls that are failing:

open("/mnt/tex", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = -1 ENOTDIR (Not a directory)
and
getdents(3, 0x61b9c8, 4096)             = -1 EIO (Input/output error)
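
The two failing calls suggest a simple client-side check: ask whether stat reports a directory, then try to list it (which, per the strace lines above, goes through open with O_DIRECTORY and getdents). A minimal sketch; the function name check_dir is hypothetical, and note that "inconsistent" is also what an ordinary permission problem on a real directory would produce.

```shell
# check_dir PATH: compare what stat(2) reports with whether the directory
# can actually be opened and read.
check_dir() {
    if [ ! -d "$1" ]; then
        echo "not-a-directory"   # stat itself does not report a directory
    elif ls -A "$1" >/dev/null 2>&1; then
        echo "ok"                # stat and open/getdents agree
    else
        echo "inconsistent"      # stat says directory, but listing fails
    fi
}
```

On an affected mount, something like `check_dir /mnt/tex` would be expected to print "inconsistent", whereas a healthy directory prints "ok".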

Comment 17 Steve Dickson 2006-12-06 19:40:07 UTC

When 'ls /mnt/tex' returns ENOTDIR, could you post the
output of 'stat /mnt/tex'?

Comment 18 Steve Dickson 2006-12-06 20:22:43 UTC

Also, are all of these clients 32bit architectures? Is anybody seeing this
problem with a 64bit client?

Comment 19 Alfredo Ferrari 2006-12-06 21:42:25 UTC

Mine is a 64-bit client.

Comment 20 Bert DeKnuydt 2006-12-07 12:10:25 UTC

> mount ibrahim:/raid3/users/deknuydt /mnt
> ls /mnt/text
ls: /mnt/text: Not a directory
> stat /mnt/text
  File: `/mnt/text'
  Size: 16384     	Blocks: 32         IO Block: 4096   directory
Device: 12h/18d	Inode: 985772      Links: 7
Access: (0755/drwxr-xr-x)  Uid: ( 4653/deknuydt)   Gid: (   46/  visics)
Access: 2006-12-06 11:13:35.000999000 +0100
Modify: 2006-11-30 13:33:30.493267000 +0100
Change: 2006-11-30 13:33:30.493267000 +0100

> mount -o vers=2 ibrahim:/raid3/users/deknuydt /mnt
> ls /mnt/mail
ls: reading directory /mnt/mail: Input/output error
> stat /mnt/mail
  File: `/mnt/mail'
  Size: 8192      	Blocks: 16         IO Block: 4096   directory
Device: 12h/18d	Inode: 981630      Links: 3
Access: (0700/drwx------)  Uid: ( 4653/deknuydt)   Gid: (   46/  visics)
Access: 2006-12-06 01:15:18.000999000 +0100
Modify: 2006-07-20 17:17:55.006699000 +0200
Change: 2006-07-20 17:17:55.006699000 +0200

Strange ... If I chmod o+rx this mail directory, then suddenly the IO error
disappears.  So a 'Permission Denied' turns into an 'IO error'.  For NFSV2 at least.

For the record: these problems occur both with the i686 and x86_64 kernels, both
under FC5 and FC6; in short: all kernels more recent than 2.6.17-1.2187_FC5 have
this.

Comment 21 Matt Amos 2006-12-07 12:12:43 UTC

Ours is 64-bit also. Here is the requested output from ls and stat when
experiencing the bug:

> ls CFD
ls: CFD: Not a directory
> stat CFD
  File: `CFD'
  Size: 8192            Blocks: 16         IO Block: 4096   directory
Device: 1ah/26d Inode: 724277      Links: 20
Access: (0755/drwxr-xr-x)  Uid: ( 5154/    amos)   Gid: (  110/   mamod)
Access: 2006-12-04 03:39:19.387121000 +0000
Modify: 2006-11-03 10:32:02.870369000 +0000
Change: 2006-11-03 10:32:02.870369000 +0000

In case it is significant, here is an excerpt from /proc/cpuinfo.

vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : Dual Core AMD Opteron(tm) Processor 270
stepping        : 2

Comment 22 Steve Dickson 2006-12-07 16:03:13 UTC

Cool... thanks for all the info...

What version of Tru64 is everyone using... I'm trying to dig one up 
so I can reproduce this locally...

Comment 23 Bert DeKnuydt 2006-12-07 16:09:01 UTC

>ssh ibrahim uname -a
OSF1 ibrahim V5.1 2650 alpha alpha Tru64

>ssh ampere uname -a
OSF1 ampere V5.1 732 alpha alpha Tru64

Reasonably patched :)

Comment 24 Matt Amos 2006-12-07 18:46:19 UTC

> uname -a
OSF1 alphasc0 V5.1 2650 alpha

Comment 25 Alfredo Ferrari 2006-12-07 22:53:52 UTC

In my case
alf3> uname -a
OSF1 alf3.cern.ch V4.0 1229 alpha

srveet01> uname -a
OSF1 srveet01.cern.ch V4.0 1229 alpha

Comment 26 Rob Garth 2006-12-11 01:11:16 UTC

*** Bug 218393 has been marked as a duplicate of this bug. ***

Comment 27 Steve Dickson 2006-12-14 20:50:24 UTC

Well I was finally able to track down a Tru64 box running
'OSF1  <hostname> T5.1 2359 alpha' and (unfortunately)
I was *able* to mount the machine w/out any problem...
(with FC5, FC5, and RHEL5 B2 clients).....

But I did notice that all the Tru64 OS versions that were posted
start with a 'V' (i.e. V4.0 or V5.0, etc.) and the version on my box
starts with a 'T'.... Does that mean anything to anybody?

Comment 28 Doug Chapman 2006-12-14 20:58:19 UTC

The T5.1 means it was a pre-release of V5.1; however, I think this box is running a
late T5.1, so I would not expect there to be a significant difference.

Comment 29 Bert DeKnuydt 2006-12-19 11:14:18 UTC

2.6.18-1.2868.fc6 is out, and still has this problem too.  This is taking too
long...

Should we relabel this to FC6 now?

As for not being able to reproduce this (see comment #27), did you export
a UFS or an AdvFS file system on the Tru64?

Comment 30 Alfredo Ferrari 2006-12-19 11:28:27 UTC

This is really taking too long... btw in my case they are AdvFS file systems
which generate the problem (and I have no other file system type to test with
on those machines). The problem is clearly common to FC5 and FC6, I would say to
all 2.6.18 kernels released for both, at least in my experience. Again, if I can
help, let me know.

Comment 31 Matt Amos 2006-12-19 14:18:32 UTC

All of the exports on our Tru64 server are AdvFS also. Since this server is not
under my control it is not possible for me to test UFS.

I second Alfredo's comments; this seems to be a kernel bug that was introduced
between 2.6.16 and 2.6.18. I am happy to test kernel patches, as the affected
machine is not a production server.

Comment 32 Steve Dickson 2006-12-21 02:41:51 UTC

I can't agree with you more.... wrt taking too long... believe me,
if the Tru64 system I just acquired showed the problem... I would
be all over this... since it is almost guaranteed that the upcoming RHEL5
release will have the same problem...

My next step will be to start going through the diffs
between 2.6.16 and 2.6.18 and start throwing out
some test kernels...

Comment 33 Steve Dickson 2006-12-23 20:49:19 UTC

The diffs between 2.6.16 and 2.6.18 are extremely large... over 1800 lines.
So in http://people.redhat.com/steved/bz211293/ is the
kernel-smp-2.6.17-1.3002 kernel, which is about halfway between 2.6.16
and 2.6.18... Please give that a try to see if the problem exists...
(Note: if you need a different type of kernel, just let me know.)

Again, I apologize for taking so long on this one...

Comment 34 Bert DeKnuydt 2006-12-24 09:49:14 UTC

Don't know if this is good or bad news for you, but kernel-smp-2.6.17-1.3002
suffers from the 'Not a directory' problem too.  So I guess you'll have to
continue the binary search...
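
The binary search over kernel builds mentioned here can be sketched as follows. This is an illustration only: the function name bisect_first_bad is hypothetical, the predicate is a placeholder (in practice it would mean booting the given kernel and checking whether ls on the Tru64 mount fails), and the list is assumed to be ordered oldest to newest with the first entry known good and the last known bad.

```shell
# bisect_first_bad IS_BAD VERSION...
# IS_BAD is a command that returns 0 when the given version shows the bug.
# Prints the first bad version in the ordered list.
bisect_first_bad() {
    is_bad=$1; shift
    lo=1; hi=$#                    # 1-based positions: lo is good, hi is bad
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        eval "v=\${$mid}"          # fetch the mid-th remaining argument
        if "$is_bad" "$v"; then
            hi=$mid                # bug present: first bad is at or before mid
        else
            lo=$mid                # bug absent: first bad is after mid
        fi
    done
    eval "echo \${$hi}"
}
```

Each round halves the remaining range, so a handful of test kernels is enough to cover a long list of builds.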

Comment 35 Steve Dickson 2007-01-01 23:46:52 UTC

Ok... please try
http://people.redhat.com/steved/bz211293/kernel-smp-2.6.16-1.3001_FC5.i686.rpm

it's the last kernel before we moved to 2.6.17...

Comment 36 Rob Garth 2007-01-02 00:19:25 UTC

Tried the Kernel. The problem went away. NFS mounts worked to OSF boxes as expected.

Comment 37 Steve Dickson 2007-01-02 14:53:20 UTC

cool... at least we are making some progress... 

Now I'm off to see what the diffs are between these kernels,
but if possible, could you try this kernel?

http://people.redhat.com/steved/bz211293/kernel-smp-2.6.17-1.2136_FC5.i686.rpm

It's the first 2.6.17 kernel released... tia...

Comment 38 Bert DeKnuydt 2007-01-04 14:16:37 UTC

2.6.17-1.2136_FC5 is clean. So is kernel-2.6.17-1.2187_FC5 btw.

I just have the gut feeling that FS-Cache is to blame ... Where did that get added?

Comment 39 Rob Garth 2007-01-12 04:34:49 UTC

This may be related, as it doesn't happen in FC4, but I cannot share NFS
mounts via Samba. NFS server: FC6; Samba server/NFS client: FC6.

Running the NFS server on an older kernel version (RHEL4.3 or FC4), the problem
does not show itself.

Windows clients attempting to write to files fail with a "Delayed Write Fail"
and worse still, the file which was written to is clobbered.

Comment 40 Bert DeKnuydt 2007-01-19 09:20:55 UTC

FYI: 2.6.19-1.2895.fc6 still has it.

Comment 41 Bug Zapper 2008-04-04 04:01:27 UTC

Fedora apologizes that these issues have not been resolved yet. We're
sorry it's taken so long for your bug to be properly triaged and acted
on. We appreciate the time you took to report this issue and want to
make sure no important bugs slip through the cracks.

If you're currently running a version of Fedora Core between 1 and 6,
please note that Fedora no longer maintains these releases. We strongly
encourage you to upgrade to a current Fedora release. In order to
refocus our efforts as a project we are flagging all of the open bugs
for releases which are no longer maintained and closing them.
http://fedoraproject.org/wiki/LifeCycle/EOL

If this bug is still open against Fedora Core 1 through 6, thirty days
from now, it will be closed 'WONTFIX'. If you can reproduce this bug in
the latest Fedora version, please change to the respective version. If
you are unable to do this, please add a comment to this bug requesting
the change.

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we are following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

And if you'd like to join the bug triage team to help make things
better, check out http://fedoraproject.org/wiki/BugZappers

Comment 42 Steve Dickson 2008-05-01 10:50:20 UTC

Fixed in nfs-utils-1.0.10-10.fc6 by adding the -o nordirplus mount option;
kernel support for it will arrive in the next kernel update.
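
For reference, a mount using this option might look like the sketch below. The server name and export path are the placeholders from the original report, the helper name mount_cmd is hypothetical, and it assumes a client whose nfs-utils and kernel both recognize nordirplus (which tells the client not to use READDIRPLUS requests on that mount). The helper only prints the command, so it can be reviewed before being run as root.

```shell
# mount_cmd REMOTE MOUNTPOINT: print (do not run) an NFSv3 mount command
# with READDIRPLUS disabled via the nordirplus option.
mount_cmd() {
    echo "mount -o nfsvers=3,nordirplus $1 $2"
}

mount_cmd tru64machine:/whatever /mnt
```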
