Bug 106584

Summary: 'cp -p' returns error when destination is an nfs directory
Product: Red Hat Enterprise Linux 3
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Keywords: Reopened
Reporter: Tim Johnson <tim_johnson>
Assignee: Steve Dickson <steved>
CC: agruen, apuch, bugzilla, david.jafferian, dferbert, fabud, hagberg, hollowec, k.georgiou, kmori, nhorman, paulwaterman, petrides, rproffit, tao, tcallawa, tim_johnson, twaugh, vanhoof, zachary_reneau
Doc Type: Bug Fix
Last Closed: 2007-03-14 18:42:42 UTC
Attachments:
 - Output of strace on 'mv' command
 - Output of strace for 'mv' command
 - Patch adding "noacl" option for NFS to mount
 - tcpdump of a bad NFSACL transaction btw. solaris server and linux client
 - strace output of successful /bin/cp -p to an NFS mounted VxFS 3.3.3 filesystem from a Solaris 7 host, performed on an RHEL 3.0 system
 - tcpdump showing interaction on a failed /bin/cp -p
 - tcpdump showing interaction on a successful /bin/cp -p
 - tcpdump showing interaction on a failed /bin/cp -p
 - tcpdump showing interaction on a successful /bin/cp -p
 - tcpdump showing /bin/cp -p with a previous setfacl
 - Solaris truss (strace equivalent) results for /usr/bin/cp -p run under Solaris on NFS mounted VxFS 3.4 filesystem
 - 2.4 fix
 - 2.6 fix
 - The backported patch that reportedly fixes this bug.

Description Tim Johnson 2003-10-08 16:28:40 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030902

Description of problem:
If the destination directory is a remote (NFS) directory, 'cp -p' generates a
message and returns a non-zero (1) status. The copy itself appears to succeed.

'mv' generates the same message but returns a zero status. It also appears to succeed.

When either command is run and the destination is a local directory:
    % cp -p one /tmp
    % mv -p one /tmp
all is well.

The 'cp -p' & 'mv' commands work as expected on RH 7.1 & RH 8.0 to same remote
file system.


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
% cp -p one /a/b/c
cp: preserving permissions for `/a/b/c/one': Invalid argument
% echo $?
1
% mv one /a/b/c
mv: preserving permissions for `/a/b/c/one': Invalid argument
% echo $?
0


Additional info:

Comment 1 Tim Johnson 2003-10-08 16:31:48 UTC
The mount options for the remote filesystem:
orfsrv1:/export/vol/engr/vanguard on /ims/vanguard type nfs (rw,addr=10.4.10.153)

Comment 2 Matt Wilson 2003-10-08 20:19:20 UTC
What version of the kernel are you running?  Can you try with the latest from RHN?


Comment 3 Tim Johnson 2003-10-08 21:23:22 UTC
I see the behavior in: 2.4.21-3.ELsmp & 2.4.21-1.1931.2.423.entsmp



Comment 4 Tim Waugh 2003-10-09 10:23:01 UTC
Please do this:

strace mv one /a/b/c 2>log

and attach 'log'.  Thanks.

Comment 5 Tim Johnson 2003-10-09 18:28:40 UTC
Created attachment 95081
Output of strace on 'mv' command

% uname -a
2.4.21-1.1931.2.423.entsmp
% strace mv one /ims/vanguard/qa/Lite/tmp 2>mv_log
%

Comment 6 Matt Wilson 2003-10-09 19:14:51 UTC
Is the strace the same for 3.EL?


Comment 7 Tim Johnson 2003-10-09 21:55:41 UTC
Created attachment 95088
Output of strace for 'mv' command

% strace mv one /ims/vanguard/qa/Lite/tmp 2>mv_log
% uname -r
2.4.21.3.EL

Comment 8 Eric Hagberg 2003-11-26 14:45:53 UTC
We have observed the same problem here, under 2.4.21-4.0.1.ELhugemem.

Changing the nfs mount to use version 2 instead of version 3 makes the
problem go away, in the testing I've done, so it looks like an nfsv3
issue.

Comment 9 Neil Horman 2003-11-26 17:54:36 UTC
Another customer has called in with this problem (or what appears to
be the same problem), and has captured an strace that shows the
difference between a v2 and a v3 NFS mount:

5408  getxattr("/var/tmp/test/foo", "system.posix_acl_access", 0xfeffbc90, 132) = -1 EOPNOTSUPP (Operation not supported)
5408  setxattr("/tmp/foo/rhiso/foo/foo", "system.posix_acl_access", 0x8056040, 28, ) = -1 EOPNOTSUPP (Operation not supported)

vs. version 3's failure/warning:

5363  getxattr("/var/tmp/test/foo", "system.posix_acl_access", 0xfeff9bb0, 132) = -1 EOPNOTSUPP (Operation not supported)
5363  setxattr("./foo", "system.posix_acl_access", 0x8056030, 28, ) = -1 EINVAL (Invalid argument)

Comment 10 Tim Waugh 2003-11-27 17:55:04 UTC
Sounds like it might be a kernel issue?

Comment 11 Arjan van de Ven 2003-11-27 17:56:55 UTC
Does this happen in the actual RHEL3 product too, and not just the beta?

Comment 12 Eric Hagberg 2003-11-27 23:11:05 UTC
Comments 8 and 9 are both really from me - and yes, I observe this
when using the RHEL release product, fully updated via up2date.

Comment 14 Steve Dickson 2003-12-08 19:01:26 UTC
Please make sure you're using 2.4.21.4.EL (Gold) or 2.4.21.5.EL (U1),
since this is very similar to a problem I fixed in one of the last
beta (or rc) releases....



Comment 15 Eric Hagberg 2003-12-08 19:04:49 UTC
I'm using 2.4.21-4.0.1.ELhugemem. 2.4.21.5.EL isn't yet available via
any means of which I'm aware.

Comment 16 Eric Hagberg 2003-12-12 16:57:38 UTC
Still broken in 2.4.21-5.EL:

saias36 $ mv /var/tmp/zzz .
mv: setting permissions for `./zzz': Invalid argument

Same strace info:

4160  utime("./zzz", [2003/12/12-11:59:03, 2003/12/12-11:59:03]) = 0
4160  chown32(0x8055b40, 0x1486, 0x64)  = 0
4160  getxattr("/var/tmp/zzz", "system.posix_acl_access", 0xfeffd530, 132) = -1 EOPNOTSUPP (Operation not supported)
4160  setxattr("./zzz", "system.posix_acl_access", 0x8056030, 28, ) = -1 EINVAL (Invalid argument)
4160  write(2, "mv: ", 4)               = 4
4160  write(2, "setting permissions for `./zzz\'", 31) = 31
4160  write(2, ": Invalid argument", 18) = 18


Comment 17 Eric Hagberg 2003-12-12 16:58:24 UTC
saias36 $ uname -r
2.4.21-5.ELhugemem

Comment 18 Eric Hagberg 2003-12-17 16:59:19 UTC
Broken as well under 2.4.21-6.ELhugemem

Comment 19 Don Howard 2004-01-10 03:16:38 UTC
This sounds like a dup of BZ 108088.  If so it should be closed/dup
and the issue pursued there.

Adding to the QU2 blocker for the time being.

Comment 20 Tim Burke 2004-01-14 23:28:33 UTC
I don't think this should be on the U2 blocker list until it has been
verified that the issue still exists as of the U1 kernel.


Comment 21 Eric Hagberg 2004-01-15 02:09:45 UTC
Comment 18 shows it is still present in the U1 beta kernel. Unless
someone has actually put in a patch to fix it, I'd say that chances
are that the U1 release kernel isn't going to fare any better on this
issue.

Comment 22 Steve Dickson 2004-01-15 16:19:29 UTC
What coreutils version are you using (i.e. rpm -qf `which mv`)?



Comment 23 Eric Hagberg 2004-01-15 16:25:28 UTC
$ rpm -qf `which mv`
coreutils-4.5.3-26

That's the latest in the RHEL3 channels, including the Update 1 beta.

Comment 24 Eric Hagberg 2004-01-16 18:16:20 UTC
Still broken in U1:

getxattr("/var/tmp/foo", "system.posix_acl_access", 0xfeffbc00, 132) = -1 EOPNOTSUPP (Operation not supported)
setxattr("./foo", "system.posix_acl_access", 0x8056030, 28, ) = -1 EINVAL (Invalid argument)

saias11 $ uname -r
2.4.21-9.ELhugemem

Comment 30 Johnray Fuller 2004-03-18 22:02:16 UTC
Here is the strace for beta U2:

getxattr("/var/tmp/foo", "system.posix_acl_access", 0xfeffbc00, 132) = -1 EOPNOTSUPP (Operation not supported)
setxattr("./foo", "system.posix_acl_access", 0x8056030, 28, ) = -1 EINVAL (Invalid argument)

saias11 $ uname -r
2.4.21-9.ELhugemem


Comment 31 Eric Hagberg 2004-03-18 22:11:27 UTC
If you want the whole strace, I can send it, but it is identical to
the above when running with 2.4.21-11.ELhugemem.

It's even the same as all the other straces in this BZ.

So it isn't fixed in the U2 beta.

Comment 33 Steve Dickson 2004-03-24 15:49:20 UTC
Using a RHEL3 U2 client (2.4.21-11.ELsmp) I'm getting
the following:

getxattr("/var/tmp/zz", "system.posix_acl_access", 0xbfffa910, 132) = -1 EOPNOTSUPP (Operation not supported)
setxattr("./zz", "system.posix_acl_access", 0x8056dc8, 28, ) = -1 EOPNOTSUPP (Operation not supported)

Which is expected.... I used both a FC1 and AS2.1 server which do 
not support NFS ACLs.... So where is the difference in my setup
and yours?




Comment 34 Eric Hagberg 2004-03-24 21:38:30 UTC
Perhaps your nfs server? We use Solaris 8 servers (that's my test case
here).

Regardless, when I run the same command against the same filesystem on
the same server (same mount parameters according to /proc/mounts) but
just switch from AS2.1 to RHEL3, I get the invalid argument error only
from RHEL3.

Comment 35 Lenny 2004-03-30 16:02:25 UTC
This is a coreutils problem and is fixed in the current GNU coreutils 
v 5.2.1.  See bug # 119355


Comment 36 Lenny 2004-03-30 16:10:16 UTC
*** Bug 119355 has been marked as a duplicate of this bug. ***

Comment 37 Lenny 2004-03-31 00:27:29 UTC
Do we know when this fix will be integrated in the RHEL 3 coreutils RPM?

Comment 38 Tim Waugh 2004-03-31 09:13:29 UTC
Do you happen to know off-hand what the fix actually is?  I'll need to
back-port it.

Comment 39 Tim Waugh 2004-03-31 09:48:22 UTC
FWIW, GNU coreutils doesn't have ACL support, so if you just tried
unpatched GNU coreutils 5.2.1 and the problem went away that doesn't
really give us any useful data. :-(

The real problem is that, while getxattr() gives EOPNOTSUPP and we
handle it, setxattr() gives EINVAL instead.

So why is it doing that for NFSv3 (and not for NFSv2)?  See comment #9.
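
To illustrate the distinction drawn in this comment, here is a minimal Python sketch (not coreutils code) of how a cp-like tool might classify the errno from a failed ACL call. The names `IGNORABLE_ACL_ERRNOS`, `acl_error_is_ignorable` and `classify` are hypothetical, and the set of tolerated errnos is an assumption: the thread only establishes that EOPNOTSUPP is handled quietly and that EINVAL is reported.

```python
import errno

# Errnos this sketch treats as "no ACL support here, skip quietly".
# Only EOPNOTSUPP is confirmed by the thread; the set is illustrative.
IGNORABLE_ACL_ERRNOS = {errno.EOPNOTSUPP}


def acl_error_is_ignorable(err: int) -> bool:
    """Return True if a failed ACL get/set can be skipped without a message."""
    return err in IGNORABLE_ACL_ERRNOS


def classify(syscall: str, err: int) -> str:
    """Describe what a cp-like tool would do after a failed ACL syscall."""
    if acl_error_is_ignorable(err):
        return f"{syscall}: ACLs unsupported, continue silently"
    name = errno.errorcode.get(err, str(err))
    return f"{syscall}: report {name} and exit non-zero"


print(classify("getxattr", errno.EOPNOTSUPP))
print(classify("setxattr", errno.EINVAL))
```

Under these assumptions the getxattr() failure is swallowed while the setxattr() failure surfaces as the "Invalid argument" message, which matches the behavior reported in the straces above.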

Comment 40 Steve Dickson 2004-03-31 11:08:24 UTC
RHEL 3 has nfsacl support for version 3 not version 2

Comment 41 Tim Waugh 2004-03-31 11:16:03 UTC
If it doesn't support it, surely EOPNOTSUPP is the correct errno to set?

Comment 42 Robert Proffitt 2004-04-30 17:17:59 UTC
Echoing above on behalf of my customer, Do we know when this fix will
be integrated in the RHEL 3 coreutils RPM?

Comment 43 Tim Waugh 2004-04-30 19:46:29 UTC
What fix, precisely?  Is this a coreutils bug, or a kernel bug?  I
haven't seen evidence either way yet.

Comment 45 John Flanagan 2004-05-12 01:07:40 UTC
An errata has been issued which should help the problem described in this bug report. 
This report is therefore being closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, please follow the link below. You may reopen 
this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2004-188.html


Comment 46 Paul Waterman 2004-05-13 17:41:45 UTC
Bugzilla does not seem to want to let me reopen this bug.

This bug is NOT resolved in RHEL 3.0 Update 2:

(lnxtst02 : /var/tmp)
% cat /etc/redhat-release
Red Hat Enterprise Linux WS release 3 (Taroon Update 2)

(lnxtst02 : /var/tmp)
% touch file

(lnxtst02 : /var/tmp)
% /bin/cp -p file /usr/test/scmbld01/rhapsody
/bin/cp: setting permissions for `/usr/test/scmbld01/rhapsody/file':
Invalid argument

(lnxtst02 : /var/tmp)
% echo $status
1


As shown by strace, setxattr is still returning EINVAL when it should
be returning EOPNOTSUPP instead:

setxattr("/usr/test/scmbld01/rhapsody/file", "system.posix_acl_access", 0x8055540, 28, ) = -1 EINVAL (Invalid argument)


Comment 47 Johnray Fuller 2004-05-13 18:09:30 UTC
I am reopening this bug since, clearly, it is not resolved.

Johnray

Comment 48 Zachary Reneau 2004-05-14 23:11:10 UTC
I have a customer that is also experiencing this issue. The setup is 
a Solaris 9 cluster sharing NFS volumes and a RHEL3 client server 
that is mounting them. The RHEL3 system is running 2.4.21-15.ELsmp.

Whenever the customer tries to copy a file on an NFS volume he 
observes the following:

# cp -p testfile testfile3
cp: preserving permissions for `testfile3': Invalid argument


Comment 50 Franklin Abud 2004-06-08 05:22:43 UTC
Another customer has the same issue, using RHEL 3 WS, CRM 330104.

NFS server = Solaris
NFS client = RHEL 3 WS


Comment 51 Tom "spot" Callaway 2004-06-09 22:14:04 UTC
Another one of my customers is having this issue with RHEL 3.

Comment 52 Chris Hollowell 2004-06-11 16:22:32 UTC
Is there a reason why the mount package provided in RHEL 3 doesn't
support the "noacl" mount option for NFS filesystems?  The NFS
implementation in the latest errata kernel (2.4.21-15.EL), at least,
seems to support it.  From linux-2.4.21-15.EL/fs/nfs/inode.c:

static int nfs_show_options(struct seq_file *m, struct vfsmount *mnt)
{
        static struct proc_nfs_info {
                int flag;
                char *str;
                char *nostr;
        } nfs_info[] = {
                { NFS_MOUNT_SOFT, ",soft", ",hard" },
                 ...
                { NFS_MOUNT_NONLM, ",nolock", ",lock" },
                { NFS_MOUNT_BROKEN_SUID, ",broken_suid", "" },
                { NFS_MOUNT_NOACL, ",noacl", "" },
                { 0, NULL, NULL }
        };

I included a patch (obtained from another distribution, but
applied cleanly) to support the "noacl" option in the util-linux 
spec file, and rebuilt the mount RPM.  Now, when I mount NFSv3 file
systems on a test machine from Solaris 8/9 servers with the "noacl"
option, it appears that I no longer encounter the 'cp -p' bug
described in this report. This is not a complete solution, but it
works for me, as I have no need to use ACLs on these filesystems.
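
Since the kernel snippet above shows that a set NFS_MOUNT_NOACL flag is printed as ",noacl" in the mount options, one way to confirm the workaround took effect is to look for that option in /proc/mounts. The following Python sketch assumes the usual six-field /proc/mounts layout; the helper names and the sample line with `noacl` (using a documentation address) are hypothetical.

```python
def nfs_mount_options(mounts_line: str) -> set:
    """Return the option set from one /proc/mounts-style line."""
    fields = mounts_line.split()
    # Expected fields: device, mount point, fs type, options, dump, pass.
    if len(fields) < 4:
        raise ValueError("not a /proc/mounts line: %r" % mounts_line)
    return set(fields[3].split(","))


def acls_disabled(mounts_line: str) -> bool:
    """True if the mount was made with the 'noacl' option."""
    return "noacl" in nfs_mount_options(mounts_line)


# Hypothetical example line; server name and address are placeholders.
sample = "server:/export /mnt/test nfs rw,v3,noacl,addr=192.0.2.1 0 0"
print(acls_disabled(sample))
```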


Comment 53 Tom "spot" Callaway 2004-06-11 16:39:37 UTC
Chris:

Can you attach this patch to the bug entry?

Comment 54 Chris Hollowell 2004-06-11 19:39:50 UTC
Created attachment 101069
Patch adding "noacl" option for NFS to mount

Comment 55 Frank Hirtz 2004-06-11 21:29:05 UTC
I'm trying to replicate and am unable to get the failure thus far:

[bob@dhcp122 test]$ grep test /proc/mounts
172.16.59.187:/test /mnt/test nfs rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,addr=172.16.59.187 0 0
[bob@dhcp122 test]$ uname -r
2.4.21-15.ELhugemem
[bob@dhcp122 test]$ rpm -q coreutils
coreutils-4.5.3-26
[bob@dhcp122 test]$ touch /tmp/testfile
[bob@dhcp122 test]$ strace -f -o /tmp/strace.out cp -p /tmp/testfile /mnt/test/
[bob@dhcp122 test]$


<from the strace>

2230  getxattr("/tmp/testfile", "system.posix_acl_access", 0xfeffd600, 132) = -1 EOPNOTSUPP (Operation not supported)
2230  setxattr("/mnt/test/testfile", "system.posix_acl_access", 0x8055d50, 28, ) = 0
2230  exit_group(0)                     = ?


The server is a fresh Solaris 9 installation. What may I be missing here?

Comment 56 Chris Hollowell 2004-06-14 20:26:03 UTC
> The server is a fresh Solaris 9 installation. What may I be 
> missing here?

In our configuration the Solaris NFS servers are exporting VxFS
filesystems.  It seems that I'm not able to reproduce the error
when a UFS filesystem is exported from a Solaris 8 (haven't
tried Solaris 9) system via NFS.

Comment 57 Paul Waterman 2004-06-15 00:42:18 UTC
Most interesting... The problem is definitely not just isolated to NFS
v2 vs. NFS v3, nor is it strictly a VxFS filesystem issue.

I'm going to see if I can isolate this further. Preliminary testing
shows the following:

WORKS:
 - VxFS 3.2.5 filesystem exported from a Solaris 2.6 host
 - VxFS 3.3.2 filesystem exported from a Solaris 7 (2.7) host
 - VxFS 3.3.3 filesystem exported from a Solaris 7 (2.7) host
 - VxFS 3.4 filesystem exported from a Solaris 7 (2.7) host
 - filesystem exported from a Network Appliance 
 - xfs filesystem exported from an IRIX 6.5 host

FAILS:
 - VxFS 3.4 filesystem exported from a Solaris 8 (2.8) host
 - VxFS 3.5 filesystem exported from a Solaris 2.6 host
 - VxFS 3.5 filesystem exported from a Solaris 8 (2.8) host

Comment 58 Neil Horman 2004-06-15 12:45:34 UTC
Ummm... all I see under the FAILS list are VxFS filesystems on various
versions of Solaris.  To me that would indicate (with the data we have
available) that this is strictly a VxFS issue, or at least that VxFS
is a contributing factor, without which the problem doesn't manifest
itself.
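
The two readings of the lists in Comment 57 can be checked mechanically. This Python sketch encodes only the combinations reported there (the `Report` tuple and variable names are made up for illustration) and tests both claims: that every failure involves VxFS, and that VxFS always fails.

```python
from collections import namedtuple

Report = namedtuple("Report", "fs fs_version server works")

# Data transcribed from the WORKS/FAILS lists in Comment 57.
REPORTS = [
    Report("VxFS", "3.2.5", "Solaris 2.6", True),
    Report("VxFS", "3.3.2", "Solaris 7", True),
    Report("VxFS", "3.3.3", "Solaris 7", True),
    Report("VxFS", "3.4", "Solaris 7", True),
    Report("unspecified", None, "Network Appliance", True),
    Report("xfs", None, "IRIX 6.5", True),
    Report("VxFS", "3.4", "Solaris 8", False),
    Report("VxFS", "3.5", "Solaris 2.6", False),
    Report("VxFS", "3.5", "Solaris 8", False),
]

failing = [r for r in REPORTS if not r.works]

# Every reported failure involves VxFS ...
all_failures_are_vxfs = all(r.fs == "VxFS" for r in failing)
# ... but not every VxFS export fails.
vxfs_always_fails = all(not r.works for r in REPORTS if r.fs == "VxFS")

print(all_failures_are_vxfs, vxfs_always_fails)
```

On this data VxFS looks like a necessary factor but not a sufficient one, so both observations hold at once.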

Comment 59 Paul Waterman 2004-06-15 16:19:46 UTC
As Neil states, all the failures that I have observed are on NFS
mounts of various VxFS filesystems exported from Solaris systems.

It would probably be helpful if other users who have seen this problem
could report the filesystem type(s) and operating system(s) of the
exported NFS filesystems on which they've seen this problem.

Comment 60 Franklin Abud 2004-06-27 22:37:10 UTC
According to my customer, this only happens on:

The OS is Solaris 8 and the patch level is Generic_108528-27.  The
filesystems that this occurs with are all Veritas filesystems (VXFS).

The problem does not occur on exported UFS filesystems...only on the
VXFS ones.  The VXFS version we are using is 3.4,REV=GA03. 

Comment 61 Zachary Reneau 2004-06-28 20:15:47 UTC
My customer confirmed that this problem only occurs with VxFS, 
although I don't know what revision.

Comment 62 Paul Waterman 2004-06-29 18:38:14 UTC
Any suggestions on how we go about determining whether the fault here
lies with RHEL, Solaris, or Veritas?

Comment 63 Tim Johnson 2004-06-29 23:18:13 UTC
If you look at the original problem description, this works just fine
in earlier (RH 8) versions of Red Hat. Although I did not mention it
in the original post, in my case the files exist on a Veritas (VxFS)
file system. When I originally saw the problem the server was running
Solaris 8; it is now running Solaris 9 (with a newer version of VxFS).
The behavior is unchanged (works on RH 8, does not work on RHEL 3). If I
remember correctly it worked in some of the early (beta) versions of
RHEL 3 (can't swear to that).

Has anyone seen this problem with cpio or tar? I would suspect they have
the same problem but may not be generating an error message. Remember
that 'cp -p' and 'mv' do work; they just generate an error message (and
'cp -p' returns '1' for its status).

Comment 64 Chris Van Hoof 2004-07-01 17:41:22 UTC
Working with a customer with this setup:

WORKS:
 - UFS filesystem exported from Solaris 8 host 

FAILS: 
 - VxFS filesystem exported from a Solaris 9 host 



Comment 65 Neil Horman 2004-10-06 13:28:52 UTC
Let me ask this: is it possible to set ACLs on a Veritas file system
locally?  I.e., can the customer log in locally to the Solaris 9 system
and use chacl, or a similar ACL control tool, to modify
the ACL of a file or directory contained on the Veritas file system?

Comment 66 Kostas Georgiou 2004-10-06 14:17:07 UTC
In my sol8+veritas system it works fine, it also works fine from the
rhel3 side as well. The only thing that fails is cp -p file1 file2
and only when file1 doesn't have an ACL. 

Comment 67 Neil Horman 2004-10-06 14:43:34 UTC
I'm sorry, can you clarify that?  When you say that your sol8 system
works fine, does that mean that you can set an ACL on a file in a
veritas filesystem without error from the local system (not over NFS)?
 What about on a solaris 9 system? Does the same hold true?

Also, are you saying that cp -p file1 file2 fails on a veritas file
system only when you have _no_ acl's configured.  That would seem to
run contrary to everything else in this ticket I think.

Comment 68 Kostas Georgiou 2004-10-06 15:00:19 UTC
I can set an ACL on a file under vxfs without an error from the local
system (sol8). I can also set ACLs on that filesystem when it's NFS
exported to an RHEL3 system without problems. I cannot test under
sol9, I am afraid, but under sol8 setfacl/getfacl and vxfs work fine.

The only error that i get from a RHEL3 machine (2.4.21-20.9.EL) and a
NFS exported VXFS is:

# touch a
# cp -p a b
cp: preserving permissions for `b': Invalid argument
# setfacl -m user:nobody:rwx a
# cp -p a c
#

Now that I think about it: why does cp use setxattr after getxattr
failed with ENODATA? Is it possible that the args are trashed and
that's why we get EINVAL?

Comment 69 Neil Horman 2004-10-06 16:29:49 UTC
No, I think ENODATA response to getxattr just means no acls are set at
all on the object, and the subsequent setxattr is issued to set the
default unix permissions in the ACL for the file (the ACL spec
indicates ACLS for the file owner, primary group and others need to be
set to emulate UNIX permissions).
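
As a rough illustration of such a minimal ACL, the Python sketch below encodes the three base entries (owner, owning group, others) in the binary layout Linux uses for the system.posix_acl_access extended attribute: a 4-byte version header followed by 8-byte entries. The constants are taken from the current Linux ACL xattr definitions and are assumed, not verified, to match the RHEL 3 kernel; the function name `minimal_acl_xattr` is hypothetical.

```python
import struct

# Constants from the Linux POSIX ACL xattr format (assumed unchanged
# for the RHEL 3 kernel discussed in this bug).
POSIX_ACL_XATTR_VERSION = 0x0002
ACL_USER_OBJ = 0x01
ACL_GROUP_OBJ = 0x04
ACL_OTHER = 0x20
ACL_UNDEFINED_ID = 0xFFFFFFFF  # id field is unused for these entry types


def minimal_acl_xattr(mode: int) -> bytes:
    """Encode the three-entry ACL equivalent to a UNIX permission mode."""
    entries = [
        (ACL_USER_OBJ, (mode >> 6) & 7),
        (ACL_GROUP_OBJ, (mode >> 3) & 7),
        (ACL_OTHER, mode & 7),
    ]
    blob = struct.pack("<I", POSIX_ACL_XATTR_VERSION)
    for tag, perm in entries:
        # Each entry: 16-bit tag, 16-bit permission bits, 32-bit id.
        blob += struct.pack("<HHI", tag, perm, ACL_UNDEFINED_ID)
    return blob


blob = minimal_acl_xattr(0o664)
print(len(blob))
```

The encoded value is 28 bytes long, the same size as the value passed to setxattr() in the straces in this report. That is consistent with cp sending a three-entry ACL of this shape, though the straces alone do not prove it.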

So I think this sounds like the rest of the ticket then: Solaris 8 +
vxfs works fine both locally and via NFS, and Solaris 9 + vxfs fails
over NFS.  The missing piece here is: does Solaris 9 allow ACLs to be
set on files on the local machine (_not_ over NFS)?  I think this will
answer once and for all whether or not Veritas
needs to address this issue.  So if someone with a Solaris 9 box +
vxfs out there could test whether they can set an ACL on a file in a
Veritas file system, that would be very helpful.

Comment 70 Paul Waterman 2004-10-06 19:31:10 UTC
I'm not sure if you just stated that badly, but it's important to note
that "Solaris 8 + vxfs works fine both locally and via NFS" is not a
correct statement.

Take a look at the following steps that I did:

--------------------
Solaris 8 system:
--------------------

% cd /path/to/exported/vxfs/filesystem
[success -- no output]

% touch solfile1 solfile2
[success -- no output]

--------------------
RHEL 3.0 system:
--------------------

% cd /path/to/nfs/mounted/vxfs/filesystem
[success -- no output]

% /bin/cp -p solfile1 linuxcopy1
/bin/cp: preserving permissions for `linuxcopy1': Invalid argument

% /bin/cp -p solfile2 linuxcopy2
/bin/cp: preserving permissions for `linuxcopy2': Invalid argument

% /bin/rm linuxcopy1 linuxcopy2
[success -- no output]

% /bin/cp -p solfile linuxcopy
/bin/cp: preserving permissions for `linuxcopy': Invalid argument

--------------------
Solaris 8 system:
--------------------

% setfacl -m user:nobody:rwx solfile1
[success -- no output]

--------------------
RHEL 3.0 system:
--------------------

% setfacl -m user:nobody:rwx solfile2
[success -- no output]

% /bin/cp -p solfile1 linuxcopy1
[success -- no output]

% /bin/cp -p solfile2 linuxcopy2
[success -- no output]


NOTE that the setfacl command is successful on BOTH the Linux and
Solaris hosts, but the /bin/cp -p command ALWAYS fails on the Linux
host UNTIL a setfacl has been performed (either locally or remotely).

ALSO NOTE that the cp command performs both a chown32 and chmod (see
the strace previously attached), so why would it also need to perform
a setxattr if there's no acl previously attached to the file?

It looks like all our Solaris hosts here which export VxFS filesystems
are running Solaris 8 or previous; I'll see if I can set up the same
test on a Solaris 9 host.

Comment 71 Paul Waterman 2004-10-06 19:33:30 UTC
Whoops... ignore the following at the end of the first RHEL 3.0 section:

% /bin/cp -p solfile linuxcopy
/bin/cp: preserving permissions for `linuxcopy': Invalid argument

That was a cut-and-paste error from an earlier test... :)

Comment 72 Neil Horman 2004-10-06 20:15:07 UTC
ok, so it sounds to me like regardless of the NFS server operating
system at this point, vxfs + NFS fails to set ACLS properly.  I assume
that solaris was the NFS server here?  Then it sounds to me like the
solaris NFS component has some culpability in this. I just dug up a
tcpdump from another ticket that shows the Solaris box returning a
result code of 22 in its NFSACL reply.  Can you run a similar
tcpdump on your system between the Solaris server and the RHEL3 client,
and confirm that you are seeing the same result?

Comment 73 Neil Horman 2004-10-06 20:19:26 UTC
Created attachment 104862
tcpdump of a bad NFSACL transaction btw. solaris server and linux client

this is the trace with the bad NFSACL transaction in it

Comment 74 Paul Waterman 2004-10-06 21:12:54 UTC
"ok, so it sounds to me like regardless of the NFS server operating
system at this point, vxfs + NFS fails to set ACLS properly."

Not true -- see Comment #57 where I describe several Solaris/VxFS/NFS
combos that *do* appear to work.

For example, I just did the following using an NFS mounted VxFS 3.3.3
filesystem exported from a Solaris 7 system:

% cd /path/to/nfs/mounted/vxfs/filesystem
[success -- no output]

% touch file
[success -- no output]

% /bin/cp -p file file2
[success -- no output]

% setfacl -m user:nobody:rwx file
[success -- no output]

I'll also attach an strace of this successful /bin/cp -p. It's
interesting to note that in this successful case, the setxattr gets a
return value of 0...

Comment 75 Paul Waterman 2004-10-06 21:14:27 UTC
Created attachment 104866
strace output of successful /bin/cp -p to an NFS mounted VxFS 3.3.3 filesystem from a Solaris 7 host, performed on an RHEL 3.0 system

Comment 76 Paul Waterman 2004-10-06 21:20:14 UTC
Re: Comment #72, "I assume that solaris was the NFS server here?"

Yes.

To clarify, in Comment #70, "/path/to/exported/vxfs/filesystem" was
intended to convey that the filesystem is a local VxFS filesystem
which is NFS exported. "/path/to/nfs/mounted/vxfs/filesystem" is
intended to convey the same filesystem mounted via NFS on another host.

Comment 77 Neil Horman 2004-10-06 23:26:21 UTC
I'm sorry; to be clear, it doesn't seem to matter whether it's a Solaris 8
or 9 server, as long as we're exporting vxfs via NFS on either of those
machines.

I'll take a look at the straces you uploaded.  Can you take the
tcpdump that I asked about and compare it to the one I recently uploaded?

Comment 78 Neil Horman 2004-10-07 13:14:59 UTC
So, I just looked at the strace that you provided, and the only
tangible difference that I really see between the two is the obvious
return value of setxattr.  It's an error under Linux and OK under
Solaris.  My guess is that Solaris is ignoring, or interpreting
differently, the return code of the NFSACL setxattr.  It would be good to
have a tcpdump of the successful transaction of the Solaris NFS client
to compare, as well as a tcpdump of a failed Solaris client/Solaris
server transaction. The dump I uploaded shows the Solaris server
responded to the setxattr request with an NFSACL error code of 22 in
the failed case, which ethereal reports as an unknown return code.

Comment 79 Paul Waterman 2004-10-07 15:13:50 UTC
Hrmmm... I guess I wasn't clear in Comment #74. The tasks described in
that comment, and the strace subsequently attached were performed on
an RHEL 3.0 system. (The Solaris equivalent of strace is truss;
/usr/sbin/strace on Solaris is something quite different.)

Comment 80 Neil Horman 2004-10-07 15:21:19 UTC
OK, so in Comment #74 you're saying cp -p foo bar works with a RHEL3.0
NFS client and a Solaris 7 NFS server which is exporting a vxfs
filesystem via NFS, right?

Regardless, all I'm asking for is two comparative tcpdumps: one which
results in a successful call to setxattr, and one which results in a
failed call to setxattr.  If you can provide both of these tcpdumps I
think we can show that the difference lies in the return code embedded
in the NFSACL response message (you will probably need to capture
frames with -s 0 to get all of the message).  If you can conduct these
tests using the same RHEL3 NFS client, so much the better. That way
we'll have more similar NFS traffic than if we had to use a different
client to get successful results.

Comment 81 Paul Waterman 2004-10-07 15:42:53 UTC
Created attachment 104900
tcpdump showing interaction on a failed /bin/cp -p

This tcpdump was generated using 'tcpdump -w [file] host [nfs server]'. Prior
to starting the tcpdump, 'touch [file]' was executed on an RHEL 3.0 system in
an NFS-mounted VxFS 3.4 filesystem exported from a Solaris 8 host. While the
tcpdump was running, '/bin/cp -p [file] [file1]' was run on the RHEL 3.0
system.

This tcpdump captures the FAILURE described in the bug.

Comment 82 Paul Waterman 2004-10-07 15:44:28 UTC
Created attachment 104901
tcpdump showing interaction on a successful /bin/cp -p

This tcpdump was generated using 'tcpdump -w [file] host [nfs server]'. Prior
to starting the tcpdump, 'touch [file]' was executed on an RHEL 3.0 system in
an NFS-mounted VxFS 3.3.3 filesystem exported from a Solaris 7 host. While the
tcpdump was running, '/bin/cp -p [file] [file1]' was run on the RHEL 3.0
system.

This tcpdump captures a successful execution of /bin/cp -p -- an exception to
the bug described.

Comment 83 Paul Waterman 2004-10-07 15:55:15 UTC
Created attachment 104905
tcpdump showing interaction on a failed /bin/cp -p

Updated tcpdump, run as 'tcpdump -w [file] -s 0 host [nfs server]' to capture
full packet data.

Comment 84 Paul Waterman 2004-10-07 15:57:02 UTC
Created attachment 104906
tcpdump showing interaction on a successful /bin/cp -p

Updated tcpdump, run as 'tcpdump -w [file] -s 0 host [nfs server]' to capture
full packet data.

Comment 85 Neil Horman 2004-10-07 17:51:40 UTC
Thank you.  I'm looking at the dumps taken from your last comments:
Comment #83, showing the failed response (bad trace), and Comment
#84, showing the successful response (good trace).  I think frames
21/22 in the good trace and frames 23/24 in the bad trace tell the
story.  If you bring them up in ethereal you can see that frame 21 in
the good trace and frame 23 in the bad trace are effectively
identical, and are conducting the same operation on behalf of the cp
process's setxattr request.  There are some minor differences in the
ACL message data, but they are irrelevant things like file handle
values, which will change from machine to machine anyway.  There is,
however, a key difference between frame 22 of the good trace and frame
24 of the bad trace.  Note the status field of the NFSACL data.  In frame 22
of the good trace, Solaris 7 returns an ACL3_OK message, while frame
24 of the bad trace returns an error code 22 in its status field.  Not
surprisingly, 22 is defined in errno.h as EINVAL, which is exactly what
cp is reporting back to the user.  This leads me to conclude that either:

a) Later versions of Solaris or Veritas have a bug that causes them to
return EINVAL inappropriately, and that needs to be brought up with them.

b) Solaris or Veritas think they are returning EINVAL legitimately (in
their view) in response to some piece of malformed data in the
setxattr request packet.  I don't think this can be the case, because
any of the unique data in the setxattr request from the bad trace had
been used in prior transaction of the same trace. 

Either way, this really needs to be brought up with Sun and Veritas
for a solution.
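
The mapping from status 22 to EINVAL mentioned above is easy to confirm on a Linux client. This is a small Python sketch; `describe_status` is a hypothetical helper, and the message text comes from the local C library, so it may differ between systems.

```python
import errno
import os


def describe_status(code: int) -> str:
    """Map a numeric status to its errno name and the local C library text."""
    name = errno.errorcode.get(code, "unknown")
    return f"{code} = {name} ({os.strerror(code)})"


# 22 is the status value seen in frame 24 of the bad trace.
print(describe_status(22))
```

NFSv3 itself also defines an NFS3ERR_INVAL status with the value 22 (RFC 1813), so the number carries the same meaning on the wire and in the client's errno.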

Comment 86 Paul Waterman 2004-10-07 20:06:09 UTC
I've already asked our storage guys to bring this up with Veritas (and
Sun if appropriate), so hopefully things will be pursued from that
angle shortly.

However, some additional delving that I've just been doing shows that
the problem is more complicated. Details to follow...

Comment 87 Paul Waterman 2004-10-07 20:10:10 UTC
While looking at the details of the tcpdumps that I previously
attached in ethereal, I found something odd:

Drill down into the failed NFS SETACL -> NFSACL -> attr -> attributes
-> mode. It shows as 664. This is correct (what the original file had).

Do the same thing on the "successful" NFS SETACL, and the mode shows
as 646. 

This made me go huh, so I did a little test. These commands were run
on the same RHEL 3.0 system that the tcpdumps were performed on, on
the NFS-mounted VxFS 3.3.3 filesystem exported from a Solaris 7 host:

% touch file
[success - no output]

% ls -l file
-rw-rw-r--    1 qa4669   npaero          0 Oct  7 13:40 file

% /bin/cp -p file file2
[success - no output]

% ls -l file file2
-rw-rw-r--    1 qa4669   npaero          0 Oct  7 13:40 file
-rw-r--rw-    1 qa4669   npaero          0 Oct  7 13:40 file2

So the combination of RHEL 3.0 client + Solaris 7 server + VxFS 3.3.3
NFS-exported filesystem that we previously thought worked actually
doesn't -- it's just buggy in a different way: It silently fails with
a security-problematic permissions shift.
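
To make the permissions shift concrete, here is a small C sketch that
renders a mode as an ls-style string and checks whether a copy gained
bits the source did not have. Both function names are hypothetical and
exist only for this illustration.

```c
#include <sys/types.h>

/* Write the nine permission characters for mode (e.g. "rw-rw-r--")
 * into out, which must hold 10 bytes including the terminator. */
void mode_to_string(mode_t mode, char out[10])
{
    static const char flags[] = "rwxrwxrwx";

    for (int i = 0; i < 9; i++)
        out[i] = (mode & (1u << (8 - i))) ? flags[i] : '-';
    out[9] = '\0';
}

/* Return nonzero if dst grants any permission bit that src lacks. */
int gained_permissions(mode_t src, mode_t dst)
{
    return (dst & ~src & 0777) != 0;
}
```

With the modes from the test above, mode_to_string(0646, buf) gives
"rw-r--rw-", and gained_permissions(0664, 0646) is nonzero because
"other" picked up write access that the original file did not grant.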

Comment 88 Paul Waterman 2004-10-07 20:20:39 UTC
The interesting thing is that all of these problems are introduced in
the /bin/cp binary in coreutils-4.5.3-26, which is used in all RHEL
3.0 updates. These problems do *not* exist in the /bin/cp binary
included in coreutils-4.5.3-19 (from Red Hat 9).

I extracted the /bin/cp binary from coreutils-4.5.3-19, put it in my
home directory as 'cp.rh9', and did the following on my RHEL 3.0 test
host:

% cd /vxfs-3.3.3/nfs/mounted/from/solaris-7/
[success -- no output]

% touch file
[success -- no output]

% ls -l file
-rw-rw-r--    1 qa4669   npaero          0 Oct  7 15:19 file

% ~/cp.rh9 -p file file2
[success -- no output]

% ls -l file file2
-rw-rw-r--    1 qa4669   npaero          0 Oct  7 15:19 file
-rw-rw-r--    1 qa4669   npaero          0 Oct  7 15:19 file2

% cd /vxfs-3.4/nfs/mounted/from/solaris-8/
[success -- no output]

% touch file
[success -- no output]

% ls -l file
-rw-rw-r--    1 qa4669   npaero          0 Oct  7 15:21 file

% ~/cp.rh9 -p file file2
[success -- no output]

% ls -l file file2
-rw-rw-r--    1 qa4669   npaero          0 Oct  7 15:21 file
-rw-rw-r--    1 qa4669   npaero          0 Oct  7 15:21 file2



Comment 89 Paul Waterman 2004-10-07 20:25:46 UTC
And having straced the rh-9 /bin/cp commands, I find that the reason
they work is that they don't try to do anything with ACLs at all. They
just do a chown32 and chmod -- no getxattr or setxattr.
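
A rough C sketch of that ACL-free behaviour follows. It is not the
coreutils source; preserve_basic_attrs is a hypothetical name, and the
sketch only mirrors the chown-then-chmod sequence seen in the strace.

```c
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy owner, group and permission bits from src
 * to dst without touching ACLs or extended attributes.
 * Returns 0 on success, -1 on failure with errno set. */
int preserve_basic_attrs(const char *src, const char *dst)
{
    struct stat st;

    if (stat(src, &st) != 0)
        return -1;
    /* chown first: changing ownership may clear setuid/setgid bits,
     * so the mode is applied afterwards. */
    if (chown(dst, st.st_uid, st.st_gid) != 0)
        return -1;
    return chmod(dst, st.st_mode & 07777);
}
```

Because no getxattr or setxattr call is made, nothing in this path can
trigger an NFSACL SETACL request.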

Comment 90 Neil Horman 2004-10-07 20:43:58 UTC
Can you double check that please?  I just re-looked at the bad trace
and good trace from comments 83 and 84, and I see the same permission
set in both frame 21 of the good trace and frame 23 of the bad trace:

NA_USER_OBJ = 6
NA_GROUP_OBJ = 4
NA_OTHER_OBJ = 4
NA_CLASS_OBJ = 6


I do see what you are talking about in the traces you uploaded in
comment 81 and comment 82.  I notice however that the entire frame
format for the ACL data is completely different from the format of the
frames in the traces from comment 83 and comment 84.  These are also
the captures that you made without adding the -s 0 option to tcpdump
(which captures the whole frame, rather than just the first 128
bytes).  It certainly bears investigation, but I would suggest that
first you rerun these particular tests with -s 0 specified on the
tcpdump.  I'd like to be sure that we're not just seeing some
brain-deadness from the ethereal parser when it tries to decode too
far into a short frame.

As for the rh-9 distro, I'm not even sure that rh9 supports ACLs,
which would explain why cp and related commands are missing the
set/getxattr calls.

Comment 91 Paul Waterman 2004-10-07 20:52:44 UTC
Don't look at the SETACL *call* -- look at the SETACL *reply* (frames
22 and 24). The call looks okay, but the reply is whacked.

I'm gonna start pounding on our storage guys -- this looks to me like
there was an ACL bug in Veritas VxFS, they found it, "fixed" it, but
broke something else in the process.

Comment 92 Paul Waterman 2004-10-07 21:01:23 UTC
Just as a reference, the earlier and later tcpdumps *are* for the same
test cases (same servers, filesystems, etc.). The only differences
between the two *should* be (a) when they were run, (b) the -s 0
option, and (c) potentially different filenames.

Comment 93 Neil Horman 2004-10-07 21:03:47 UTC
I agree that the reply looks munged from the request, but I think the
entire frame data looks munged.  The decode format is  completely
different from the other two traces you sent in which you captured all
the frame data. I really think you should retry the first two traces
with -s 0 specified.  I really think you are seeing some ethereal
decode issue with those first two traces.

Even if you meant to not specify -s 0 in the first trace, that was the
wrong thing to do, since you don't have all the frame data captured.

Comment 94 Paul Waterman 2004-10-08 15:52:06 UTC
First, a quick reiteration: The second set of tcpdumps is the same set
of test cases as the first set; the only significant difference is the
-s 0 option on tcpdump.

I've had a night to think on this, and I have a very strong idea as to
what the problem is.

Take a look at the more recent tcpdumps that I attached in Comment #83
and Comment #84, paying special attention to the SETFACL call in frames
23 and 21, respectively.

In each case, the SETFACL has *four* ACL entries, one of which is of
type "NA_CLASS_OBJ (16)" with a UID of 4294967295. I'm not sure what
this is, where it's coming from, or why it's being generated by cp.

However, the maximum UID supported by Solaris is 2147483647.

Thus, the EINVAL being returned is valid. (The permissions skew for a
VxFS 3.3.3 filesystem looks like it's a bug in handling invalid high
UIDs that was corrected in 3.4.)

So why is /bin/cp -p generating this particular (bogus) ACL? After
all, -p is supposed to *preserve* the existing permissions, and that's
certainly not there on the original file...

Comment 95 Paul Waterman 2004-10-08 16:05:51 UTC
Created attachment 104945 [details]
tcpdump showing /bin/cp -p with a previous setfacl

Earlier, we showed that if you perform a setfacl on the test file and then
perform a /bin/cp -p, it succeeds (see, e.g., Comment #68).

I double checked that with my VxFS 3.3.3 filesystem. If I perform a setfacl on
the test file and then do a /bin/cp -p, no permissions skew occurs.

I then ran the following on my RHEL 3.0 host on a VxFS 3.4 filesystem NFS
mounted from a Solaris 8 host:

% touch file
[success - no output]

% setfacl -m user:nobody:rwx file
[success - no output]

% /bin/cp -p file file2
[success - no output]

I did a 'tcpdump -w [file] -s 0 host [nfs server]' during the /bin/cp -p and
captured the result (attached). 

Examine the SETFACL call in frame 21 of this tcpdump. You'll note that the
fourth ACL entry is of type "NA_CLASS_OBJ (16)", but in this case, the UID is 0
-- which the Solaris system happily accepts.

Comment 96 Neil Horman 2004-10-08 17:17:08 UTC
First, regarding the difference in traces, I'm aware of the fact that
the only difference between the traces in comments 81/82 and 83/84 is
that you didn't use the -s 0 option on the former.  The -s 0 option
specifies to tcpdump that the entire frame be captured, rather than
just the first 96 bytes.  Since we don't have complete frames, and we
are digging farther into the frame than we have captured, we can't
rely on the traces in 81/82.  That's all I'm saying.

As for the 4th ACL entry, what you are seeing is another deficiency in
the ethereal NFSACL parser.  An ACL entry of type 16 is actually
(according to the nfsacl rpc program in the kernel) an ACL_MASK entry,
which is used to maintain consistency between the group acl class and
the owning group acl class.  This link explains the use of the
ACL_MASK entry fairly well:
http://www.suse.de/~agruen/acl/linux-acls/online/
Moving on, ethereal appears to have a bug in its parser in the sense
that it's treating the fields of the MASK entry the same as any other
ACL entry.  MASK entries only contain a type (of 16) and a mask value
where the UID value normally appears.  Note that the UID value of
4294967295 is the maximum unsigned 32-bit integer, aka 0xffffffff,
which is the mask used to indicate that owning group and group class
permissions are identical.
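
The two readings of that field can be sketched in C as follows. This is
only an illustration of the arithmetic: the function names are
hypothetical, and whether any server actually applies a range check
like fits_signed_uid to a mask entry is not established here.

```c
#include <stdint.h>

/* The same 32 bits, reinterpreted as the unsigned value a decoder
 * would display: -1 becomes 4294967295 (0xffffffff). */
uint32_t id_on_wire(int32_t id)
{
    return (uint32_t)id;
}

/* Hypothetical check: does the wire value fit in a signed 32-bit UID
 * (maximum 2147483647)? */
int fits_signed_uid(uint32_t wire_id)
{
    return wire_id <= (uint32_t)INT32_MAX;
}
```

Here id_on_wire(-1) is 4294967295, and fits_signed_uid() rejects that
value while accepting 0, which matches the two cases seen in the traces.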

That all being said, it does appear there is a problem somewhere
interpreting this mask field.  It is possible that Solaris has shifted
the value of the MASK type and is using 16 to designate some other ACL
entry at this point for which the supplied data is invalid, but Sun
has not informed the community of their change.  And they would have
to if anything was going to be done about this.

A good test to perform may be the reverse of what we have been doing.
Rather than using RHEL3 as a client, use RHEL3 as a server (don't
worry about including veritas in this test).  Use solaris 7 or 8 as a
client, and mount an NFS share exported by the RHEL3 server.  Make
sure ACLs are enabled and settable by the version of cp running on
solaris (or just use setfacl).  The solaris client should generate a
similar NFSACL packet to the one the RHEL3 client did, which should
include a mask ACL entry.  If you run tcpdump -s 0 on that transaction
we should be able to tell what type the solaris client uses for its
mask ACL entry.  If it's some value other than 16, we can have some
relative level of confidence that Sun has changed the acl type values
around, and from there we need to decide what to do about it.  But if
the solaris client sends a mask entry using type 16 to identify it,
then we can conclude that Sun has an NFS client which is using type 16
to send ACL mask entries and is interpreting type 16 as something
different when it receives mask entries on the NFS server.  That I
think would clearly point to a bug in solaris.

Comment 97 Paul Waterman 2004-10-08 18:34:07 UTC
Created attachment 104954 [details]
Solaris truss (strace equivalent) results for /usr/bin/cp -p run under Solaris on NFS mounted VxFS 3.4 filesystem

Unfortunately, that's not as easy as it sounds.

It looks like '/usr/bin/cp -p' in Solaris handles things a bit more
intelligently than GNU cp. I've attached truss (Solaris strace equivalent)
results for a /usr/bin/cp -p that I ran on a Solaris 9 client on an NFS-mounted
VxFS 3.4 filesystem exported from a Solaris 8 host.

It appears that Solaris examines the ACLs (line 39), but never attempts to set
any ACLs on the new file. I'm guessing that it determines the file just has
standard permissions, so it only sets them with chown and chmod (lines 51-52).

I grabbed GNU fileutils-4.1, but it appears that the ACL patches added in
coreutils-4.5.3-23 and 4.5.3-24 haven't yet made it into there, so there
doesn't appear to be a simple way to do an apples-to-apples comparison here...

Comment 98 Neil Horman 2004-10-08 18:40:00 UTC
There must be a way to drive an NFSACL transaction from a solaris
client.  Is there a setfacl utility on solaris that you can use to set
an acl on a RHEL3 NFS server?  If there is, just use that (you may need
to explicitly set an ACL mask to ensure that it includes an ACL mask
entry in the NFSACL transaction).

Comment 99 Neil Horman 2004-10-11 18:53:15 UTC
So interestingly, I just came to discover something.  I dug up a
solaris 9 machine and attempted the experiment that I had mentioned
above.  Any attempt I made to set acls over NFS from a solaris client
to a RHEL3 server failed with an ENOTSUPP error.  It would seem
solaris does not support the NFSACL program in its NFS client.  So
that kinda scraps the previous reverse test idea.  However, using a
RHEL3 client and RHEL3 server, it successfully set ACLs on NFS mounts,
which included an ACL mask of 0xffffffff.  So the question remains:
why is solaris rejecting an all ones ACL mask?

Comment 100 David N. Jafferian 2004-10-20 17:42:32 UTC
Perhaps VxFS is rejecting the all ones ACL mask and Solaris NFS is
simply relaying that rejection.  The problem does not occur on exported
UFS filesystems.


Comment 101 Steve Dickson 2004-10-20 18:50:02 UTC
In util-linux-2.11y-31.2 a noacl nfs mount option was added.
In theory (although it's truly the big hammer approach) using
this new option should clear up the problem (once and for all)

Comment 102 Paul Waterman 2004-10-27 16:27:04 UTC
Neil - could you post the commands that you were using to set ACLs on
Solaris clients? I did a little playing around with ACL setting of an
exported ext3 filesystem before "real work" interfered with my ability
to contribute to this bug's resolution, and it seemed to work fine for
me from a Solaris host.

I'm back at a point where I can run some tests and post results, so
being able to run the same commands you did would be helpful.

Comment 103 David N. Jafferian 2004-10-29 14:04:33 UTC
I've run a test in which the following was done :

- VxFS 3.5 was installed on a Sun Blade 100 running Solaris 9.
- A vxfs was created on a spare internal disk and exported.
- A file was created on the vxfs :

# touch bepa
# ls -l bepa
-rw-r--r--   1 root     other          0 Oct 29 06:41 bepa
# getfacl bepa

# file: bepa
# owner: root
# group: other
user::rw-
group::r--              #effective:r--
mask:r--
other:r--

On another Sun Blade 100 running Solaris 9 a client program was run.
The client program did the following :

- Opened "bepa".
- Using popen, executed mdb to obtain the NFS file handle of "bepa".
- Used the NFS file handle and the rpc_call library function to make
  a SETACL call to the server machine.  Here are the 4 acls supplied
  in that call :

static aclent_t acls[4] = {
	{
		NA_USER_OBJ,	/* a_type, the type of ACL entry */
		MYUID,		/* a_id, the entry in -uid or gid */
		6		/* a_perm, the permission field */
	},
	{
		NA_GROUP_OBJ,
		MYGID,
		6
	},
	{
		NA_OTHER_OBJ,
		0,
		4
	},
	{
		NA_CLASS_OBJ,
		-1,		/* is this the problem ? */
		6
	}
};

For reference :

# grep NA_CLASS_OBJ /usr/include/nfs/nfs_acl.h
#define NA_CLASS_OBJ    0x10

The client program was successful :

# getfacl bepa

# file: bepa
# owner: root
# group: other
user::rw-
group::rw-              #effective:rw-
mask:rw-
other:r--

A status of EINVAL (22) was *not* returned, so the server machine
did not have a problem with the UID of -1 (4294967295 in unsigned
decimal) in the NA_CLASS_OBJ acl.

I can try this with a server running Solaris 8 and VxFS 3.4.  If I
can get access to a client machine running RHEL3 I may be able to
run more extensive tests.


Comment 104 David N. Jafferian 2004-11-12 17:21:19 UTC
I have tried the previous test with a server running Solaris 8 and
VxFS 3.4 and still a status of EINVAL (22) was not returned.

Comment 105 John Flanagan 2004-12-21 21:39:40 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-601.html


Comment 106 Neil Horman 2005-01-19 13:13:12 UTC
I'm sorry, I missed the last few updates to this ticket, and it fell off my radar.

David, it's been a while now, so I'm going to need to re-read this ticket.  While
I understand that your tests have not reproduced the error, we have several
tcpdumps in this bugzilla that show a solaris NFS server exporting a VxFS
filesystem returning EINVAL to an NFS client for various SETATTR requests.  Can
you please look at those and suggest why the server would return that error
message given the data in the tests?

Paul, I'll dig up those commands for you, as soon as I can find that solaris
machine of mine.



Comment 107 Kostas Georgiou 2005-03-02 11:59:16 UTC
Will it be ok as a temporary workaround to patch copy_acl() to ignore
the errors and fall back to chmod?
From a quick look it seems that adding EINVAL to the test:
if (errno == ENOSYS || errno == ENOTSUP)
might be enough.

I really hate to mount the filesystem with noacl since I have people
that use them :(
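
A minimal sketch of that workaround in C, for illustration only. This
is not the coreutils copy_acl() code; the helper name and the
tolerate_einval switch are hypothetical, and treating EINVAL as
ignorable would also hide genuine errors from other servers.

```c
#include <errno.h>

/* Hypothetical helper: decide whether a failed ACL copy should fall
 * back to a plain chmod. ENOSYS and ENOTSUP are the values from the
 * test quoted above; EINVAL is only tolerated when the caller opts in. */
int acl_error_allows_chmod_fallback(int err, int tolerate_einval)
{
    if (err == ENOSYS || err == ENOTSUP)
        return 1;
    return tolerate_einval && err == EINVAL;
}
```

Keeping the EINVAL case behind an explicit switch makes it easy to turn
the workaround off again once the underlying cause is fixed.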

Comment 108 Neil Horman 2005-03-02 12:24:07 UTC
That, or you can just build an old version of coreutils.  Old copies
of cp don't attempt to copy ACLs at all.  Alternatively it might be a
nice feature to add a -noacl flag to the fsutils suite.

Comment 109 Andreas Gruenbacher 2005-09-12 18:35:45 UTC
Ignoring EINVAL results is not a proper fix; this shouldn't happen in the first 
place. I'm attaching the real fixes for 2.4 and 2.6 kernels. 

Comment 110 Andreas Gruenbacher 2005-09-12 18:37:39 UTC
Created attachment 118729 [details]
2.4 fix

Comment 111 Andreas Gruenbacher 2005-09-12 18:38:50 UTC
Created attachment 118730 [details]
2.6 fix

Comment 112 Neil Horman 2005-09-12 18:58:59 UTC
What part of the NFSACL spec indicates that ACL entries need to be in a
canonical order?  I don't see where it's required to send ACL entries in order.

Comment 113 Andreas Gruenbacher 2005-10-13 23:01:01 UTC
A complete specification doesn't exist. Those things were partially reverse 
engineered. When I originally wrote this code I found that Solaris didn't care 
in which order it received acl entries, but I didn't test against VxFS. Tests 
later revealed that VxFS requires sorted entries. It's weird that the 
underlying filesystem plays a role at all, but it apparently does. 
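
As a rough illustration of sending sorted entries, here is a short C
sketch. It is not the attached kernel patch: struct acl_entry is a
hypothetical stand-in for the real entry type, and ascending order by
type and then by id is an assumption about what "sorted" means here.

```c
#include <stdlib.h>

/* Hypothetical stand-in for one NFSACL entry. */
struct acl_entry {
    int type;            /* entry type, e.g. 0x10 for the mask entry */
    unsigned int id;     /* uid, gid, or unused, depending on type */
    unsigned short perm; /* permission bits */
};

/* Order by type first, then by id (assumed canonical order). */
static int cmp_acl_entry(const void *a, const void *b)
{
    const struct acl_entry *x = a;
    const struct acl_entry *y = b;

    if (x->type != y->type)
        return (x->type > y->type) - (x->type < y->type);
    return (x->id > y->id) - (x->id < y->id);
}

/* Sort the entries in place before they are encoded for the wire. */
void sort_acl_entries(struct acl_entry *entries, size_t count)
{
    qsort(entries, count, sizeof *entries, cmp_acl_entry);
}
```

Sorting on the client side costs little for the handful of entries a
typical ACL contains, and it does not change what the ACL means.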
 
(I wasn't in the CC of this bug, so I didn't notice your comment up to now.) 

Comment 114 David N. Jafferian 2005-10-14 19:06:44 UTC
Thank you, Andreas.  This is being escalated to Veritas engineering.

Comment 115 David N. Jafferian 2005-11-16 01:09:37 UTC
Symantec, who now owns Veritas, has responded by stating their intent to fix
this in VxFS 5.0 and the next 4.1 MP train.

Comment 116 Andreas Gruenbacher 2005-11-16 01:27:39 UTC
Good news, thanks. IMO it still makes sense to fix this for interacting with 
old systems, as the fix is pretty simple and cheap. It's also fixed this way 
upstream, 
  http://www.kernel.org/hg/linux-2.6/?cs=e5a01bb1b38d. 

Comment 117 Andrew Ferbert 2006-02-06 23:35:43 UTC
Were the patches that Andreas posted going to be released in any upcoming RHEL3
kernel?

Comment 120 Neil Horman 2006-12-12 18:34:25 UTC
Andrew - Please see comment #115.  Despite the attached patches to this BZ, this
is a problem in VxFS, which the good people at Symantec have agreed to repair.
That is where the fix will stem from.

Comment 122 Steve Dickson 2006-12-13 16:01:53 UTC
Created attachment 143529 [details]
The backported path that reportedly fixes this bug.

Comment 124 RHEL Program Management 2006-12-13 16:08:29 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 125 Brian Long 2006-12-14 16:16:15 UTC
(In reply to comment #115)
> Symantec, who now owns Veritas, has responded by stating their intent to fix
> this in VxFS 5.0 and the next 4.1 MP train.

Is there a bug number or support request at Veritas that you can share?  I would
like to know if the latest MP for VxFS 4.1 fixes this issue or if the fix is
still in the works.

Thanks.

Comment 128 David N. Jafferian 2007-02-20 13:56:24 UTC
The Sun Change Request is 6337098, the Veritas Incident is 512616.
The fix is in Veritas File System (tm) 4.1 MP1 Rolling Patch 1 for
Solaris, available in Solaris Point Patches 121388-01(Solaris 8),
121389-01(Solaris 9), and 121390-01(Solaris 10).  AFAIK, MP2 for
VxFS 4.1 for Solaris has not been released.