Description of problem:
=======================
Configuration: 1x2 replicate volume, nfs mount.

On a pure replicate volume (1x2), when one of the bricks is offline and the ln command is executed from the nfs mount, the command fails with an "Invalid argument" error message.

Version-Release number of selected component (if applicable):
=============================================================
[11/04/12 - 04:07:56 root@darrel ~]# gluster --version
glusterfs 3.3.0.5rhs built on Nov 2 2012 01:29:35

[11/04/12 - 04:11:02 root@darrel ~]# rpm -qa | grep gluster
glusterfs-3.3.0.5rhs-35.el6rhs.x86_64
glusterfs-server-3.3.0.5rhs-35.el6rhs.x86_64

How reproducible:
=================
Often

script1.sh:
===========
mkdir test_hardlink_self_heal
cd test_hardlink_self_heal
for i in `seq 1 5`; do
    mkdir dir.$i
    for j in `seq 1 10`; do
        dd if=/dev/input_file of=dir.$i/file.$j bs=1k count=$j
    done
done
cd ../

script2.sh:
===========
cd test_hardlink_self_heal
for i in `seq 1 5`; do
    for j in `seq 1 10`; do
        ln dir.$i/file.$j dir.$i/link_file.$j
    done
done
cd ../

Steps to Reproduce:
===================
1. Create a pure replicate volume (1x2) and start the volume.
2. Create an nfs mount from the client.
3. Execute "script1.sh" from the nfs mount.
4. After "script1.sh" completes, bring down brick1.
5. Execute "script2.sh" from the nfs mount.

Actual results:
===============
The ln command fails with "Invalid argument".

ln command output:
==================
ln: accessing `dir.1/file.1': Invalid argument
ln: accessing `dir.1/file.2': Invalid argument
ln: accessing `dir.1/file.3': Invalid argument
ln: accessing `dir.1/file.4': Invalid argument
ln: accessing `dir.1/file.5': Invalid argument
ln: accessing `dir.1/file.6': Invalid argument
ln: accessing `dir.1/file.7': Invalid argument
ln: accessing `dir.1/file.8': Invalid argument
ln: accessing `dir.1/file.9': Invalid argument
ln: accessing `dir.1/file.10': Invalid argument

nfs log messages:
=================
[2012-11-04 04:48:38.888919] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-rep_new-client-1: remote operation failed: Invalid argument. Path: /test_hardlink_self_heal/dir.1/file.1 (5b8f7106-5b0f-4613-9aee-21ec44b428e4)
[2012-11-04 04:48:38.888977] W [nfs3.c:707:nfs3svc_getattr_lookup_cbk] 0-nfs: 7aa2b74c: /test_hardlink_self_heal/dir.1/file.1 => -1 (Invalid argument)
[2012-11-04 04:48:38.889008] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: XID: 7aa2b74c, GETATTR: NFS: 22(Invalid argument for operation), POSIX: 22(Invalid argument)
[2012-11-04 04:48:38.891498] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-rep_new-client-1: remote operation failed: Invalid argument. Path: /test_hardlink_self_heal/dir.1/file.2 (9c1aca42-136f-44a2-abfb-eb543e0446f7)
[2012-11-04 04:48:38.891556] W [nfs3.c:707:nfs3svc_getattr_lookup_cbk] 0-nfs: 7ca2b74c: /test_hardlink_self_heal/dir.1/file.2 => -1 (Invalid argument)
[2012-11-04 04:48:38.891587] W [nfs3-helpers.c:3389:nfs3_log_common_res] 0-nfs-nfsv3: XID: 7ca2b74c, GETATTR: NFS: 22(Invalid argument for operation), POSIX: 22(Invalid argument)

Expected results:
=================
ln command execution should be successful.
I have been able to reproduce this once on master, out of >20 attempts.
Use this. Happens every time.

#!/bin/bash -x

glusterd
HOSTNAME=`hostname`

mkdir /gfs
gluster --mode=script volume create r2 replica 2 `hostname`:/gfs/r2_0 `hostname`:/gfs/r2_1
gluster --mode=script volume start r2
sleep 5

mount -t nfs `hostname`:/r2 /mnt/r2 -o vers=3,nolock
cd /mnt/r2

mkdir test_hardlink_self_heal
cd test_hardlink_self_heal
for i in `seq 1 5`; do
    mkdir dir.$i
    for j in `seq 1 10`; do
        dd if=/dev/zero of=dir.$i/file.$j bs=1k count=$j
    done
done
cd ../

kill -15 `cat /var/lib/glusterd/vols/r2/run/$HOSTNAME-gfs-r2_0.pid`
sleep 2

cd test_hardlink_self_heal
for i in `seq 1 5`; do
    for j in `seq 1 10`; do
        ln dir.$i/file.$j dir.$i/link_file.$j
    done
done
cd ../
echo $?

cd
umount /mnt/r2
I'm still not having much luck. Either there's something very timing-dependent here, or we're using different versions. Will try with the 3.3 branch.

Also, the suggestion has been made that this started with http://review.gluster.org/#change,4058. If that's the case, then we could trivially make it go away again by having the changed part of nfs3_getattr_resume check for a null parent GFID and revert to the old behavior if no GFID is present. What worries me is that, without understanding why nfs3_fh_resolve_and_resume is returning such a loc, we might just be covering up a more fundamental problem. That might bring back the problem 4058 was meant to fix, or even introduce new ones.
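To make that suggestion concrete, here is a rough, illustrative-only sketch of the guard. The names and types below (fake_loc, gfid_is_null, getattr_resume_sketch) are simplified stand-ins, not gluster's real loc_t or the actual nfs3_getattr_resume code; they only show the shape of "if the loc has no parent GFID, fall back to the pre-4058 path":

/* Illustrative sketch only -- simplified stand-ins for gluster's loc_t
 * and gfid handling, not the real nfs3_getattr_resume() code path. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

typedef uint8_t gfid_t[16];

struct fake_loc {                 /* hypothetical stand-in for loc_t */
    gfid_t      gfid;             /* gfid of the file itself         */
    gfid_t      pargfid;          /* gfid of the parent directory    */
    const char *name;             /* basename of the file            */
};

static int gfid_is_null(const gfid_t g)
{
    static const gfid_t zero = {0};
    return memcmp(g, zero, sizeof(gfid_t)) == 0;
}

static void getattr_resume_sketch(struct fake_loc *loc)
{
    if (gfid_is_null(loc->pargfid) || loc->name == NULL) {
        /* No parent gfid in the resolved loc: fall back to whatever
         * getattr did before change 4058, instead of issuing a lookup
         * with {NULL pargfid, basename} that the server rejects. */
        printf("no parent gfid: revert to old behavior\n");
    } else {
        /* Parent gfid present: the {parent gfid, basename} lookup
         * introduced by 4058 can proceed. */
        printf("lookup by {parent gfid, basename=%s}\n", loc->name);
    }
}

int main(void)
{
    struct fake_loc loc = { .name = "file.1" };   /* pargfid left NULL */
    getattr_resume_sketch(&loc);
    return 0;
}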
*** Bug 872924 has been marked as a duplicate of this bug. ***
This bug is to be verified for 2.1. The clone of this bug, https://bugzilla.redhat.com/show_bug.cgi?id=874051, has been verified for update_3. Moving this bug to ON_QA.
This fix is not yet in 2.1; moving the bug to MODIFIED.
Cause: After a recent change, getattr internally performs a lookup. For a lookup, the server needs the parent gfid and the basename of the file, but because the lookup is issued from the getattr operation, the parent gfid is not available (it is NULL).

Consequence: ln fails because the lookup issued internally by getattr fails with EINVAL.

Fix: Populate the parent inode (which contains the gfid) in inode_loc_fill so that servers can perform the lookup based on {parent gfid, basename}.

Result: The problem described does not happen with the fix. Bug 872924 is also solved by this fix.
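A minimal sketch of the idea behind the fix, for illustration. The types and names here (fake_inode, fake_loc, loc_fill_sketch) are placeholders, not gluster's inode_t/loc_t or the actual inode_loc_fill change merged below; the point is only that the parent inode is resolved and its gfid copied into the loc so the server-side {parent gfid, basename} lookup can succeed:

/* Illustrative sketch only: simplified placeholders for gluster's
 * inode_t and loc_t, showing the fix idea -- resolve the parent inode
 * while filling the loc so the lookup carries {pargfid, name}. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

typedef uint8_t gfid_t[16];

struct fake_inode {                /* placeholder for inode_t          */
    gfid_t             gfid;
    struct fake_inode *parent;     /* what a parent lookup would yield */
};

struct fake_loc {                  /* placeholder for loc_t */
    struct fake_inode *inode;
    struct fake_inode *parent;
    gfid_t             gfid;
    gfid_t             pargfid;
    const char        *name;
};

static void loc_fill_sketch(struct fake_loc *loc,
                            struct fake_inode *inode, const char *name)
{
    memset(loc, 0, sizeof(*loc));
    loc->inode = inode;
    loc->name  = name;
    memcpy(loc->gfid, inode->gfid, sizeof(gfid_t));

    /* The added step: also populate the parent inode and its gfid so
     * a server-side lookup by {parent gfid, basename} has what it
     * needs, instead of sending a NULL pargfid and getting EINVAL. */
    if (inode->parent != NULL) {
        loc->parent = inode->parent;
        memcpy(loc->pargfid, inode->parent->gfid, sizeof(gfid_t));
    }
}

int main(void)
{
    struct fake_inode dir  = { .gfid = {0xaa} };
    struct fake_inode file = { .gfid = {0xbb}, .parent = &dir };
    struct fake_loc   loc;

    loc_fill_sketch(&loc, &file, "file.1");
    printf("parent gfid populated: %s\n", loc.parent ? "yes" : "no");
    return 0;
}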
CHANGE: http://review.gluster.org/4157 (nfs: resolve parent inode during inode_loc_fill) merged in master by Vijay Bellur (vbellur)
Verified the fix on build:

root@king [Jul-10-2013-10:53:48] >rpm -qa | grep glusterfs
glusterfs-fuse-3.4.0.12rhs.beta3-1.el6rhs.x86_64
glusterfs-geo-replication-3.4.0.12rhs.beta3-1.el6rhs.x86_64
glusterfs-3.4.0.12rhs.beta3-1.el6rhs.x86_64
glusterfs-server-3.4.0.12rhs.beta3-1.el6rhs.x86_64
glusterfs-rdma-3.4.0.12rhs.beta3-1.el6rhs.x86_64
glusterfs-debuginfo-3.4.0.12rhs.beta3-1.el6rhs.x86_64
glusterfs-devel-3.4.0.12rhs.beta3-1.el6rhs.x86_64

root@king [Jul-10-2013-10:54:01] >gluster --version
glusterfs 3.4.0.12rhs.beta3 built on Jul 6 2013 14:35:18

Bug is fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1262.html