Description of problem:

Some of our users reported "Stale file handle" messages while working with directories or files on the GlusterFS file system. For example:

$ git status
fatal: unable to access '.git/config': Stale file handle

In this case we were able to find matching log entries in the associated $volume.log on the client (user directory names and volume names replaced with $dir01 & $VOLUME):

[2018-04-18 13:52:31.892514] W [MSGID: 109009] [dht-common.c:2210:dht_lookup_linkfile_cbk] 4-$VOLUME-dht: /$dir01/.git/config: gfid different on data file on $VOLUME-client-2, gfid local = 00000000-0000-0000-0000-000000000000, gfid node = e8894c42-63f8-4f66-bdc2-ffef3e2eaef4
[2018-04-18 13:52:31.893624] W [MSGID: 109009] [dht-common.c:1949:dht_lookup_everywhere_cbk] 4-$VOLUME-dht: /$dir01/.git/config: gfid differs on subvolume $VOLUME-client-2, gfid local = 39edeb37-ebf8-4ac3-8f93-dda51de5783f, gfid node = e8894c42-63f8-4f66-bdc2-ffef3e2eaef4
[2018-04-18 13:52:31.893832] W [MSGID: 109009] [dht-common.c:1949:dht_lookup_everywhere_cbk] 4-$VOLUME-dht: /$dir01/.git/config: gfid differs on subvolume $VOLUME-client-11, gfid local = e8894c42-63f8-4f66-bdc2-ffef3e2eaef4, gfid node = 39edeb37-ebf8-4ac3-8f93-dda51de5783f
[2018-04-18 13:52:31.894042] W [fuse-bridge.c:540:fuse_entry_cbk] 0-glusterfs-fuse: 526595853: LOOKUP() /$dir01/.git/config => -1 (Stale file handle)
[2018-04-18 13:52:31.896231] W [fuse-bridge.c:540:fuse_entry_cbk] 0-glusterfs-fuse: 526595876: LOOKUP() /$dir01/.git/config => -1 (Stale file handle)
[2018-04-18 13:52:31.895040] W [MSGID: 109009] [dht-common.c:2210:dht_lookup_linkfile_cbk] 4-$VOLUME-dht: /$dir01/.git/config: gfid different on data file on $VOLUME-client-2, gfid local = 00000000-0000-0000-0000-000000000000, gfid node = e8894c42-63f8-4f66-bdc2-ffef3e2eaef4
[2018-04-18 13:52:31.895555] W [MSGID: 109009] [dht-common.c:1949:dht_lookup_everywhere_cbk] 4-$VOLUME-dht: /$dir01/.git/config: gfid differs on subvolume $VOLUME-client-2, gfid local = 39edeb37-ebf8-4ac3-8f93-dda51de5783f, gfid node = e8894c42-63f8-4f66-bdc2-ffef3e2eaef4
[2018-04-18 13:52:31.895992] W [MSGID: 109009] [dht-common.c:1949:dht_lookup_everywhere_cbk] 4-$VOLUME-dht: /$dir01/.git/config: gfid differs on subvolume $VOLUME-client-11, gfid local = e8894c42-63f8-4f66-bdc2-ffef3e2eaef4, gfid node = 39edeb37-ebf8-4ac3-8f93-dda51de5783f

Strangely, the file appears twice (once on each server), and the access permissions look very unusual. This is the state of the two files (username, group & path replaced):

On gluster01:

$ ls -lah /srv/glusterfs/bricks/DATA113/data/$dir01/.git/
total 44K
drwxr-sr-x  8 $user $group 4.0K May 11  2017 .
drwxr-sr-x  7 $user $group   93 May 11  2017 ..
drwxr-sr-x  2 $user $group   10 May 11  2017 branches
---------T  2 root  $group    0 May 11  2017 config
drwxr-sr-x  2 $user $group   10 Feb  1 21:27 hooks
drwxr-sr-x  2 $user $group   10 Feb  1 21:27 info
drwxr-sr-x  3 $user $group   25 May 11  2017 logs
drwxr-sr-x 19 $user $group 4.0K May 11  2017 objects
drwxr-sr-x  5 $user $group   59 May 11  2017 refs

On gluster02:

$ ls -lah /srv/glusterfs/bricks/DATA203/data/$dir01/.git/
total 48K
drwxr-sr-x  8 $user $group 4.0K Sep 26  2017 .
drwxr-sr-x  7 $user $group  112 Feb  2 19:28 ..
drwxr-sr-x  2 $user $group   10 May 11  2017 branches
-rwxr--r--  2 $user $group  293 May 11  2017 config
drwxr-sr-x  2 $user $group   10 Sep 26  2017 hooks
drwxr-sr-x  2 $user $group   10 May 11  2017 info
drwxr-sr-x  3 $user $group   25 May 11  2017 logs
drwxr-sr-x 19 $user $group 4.0K May 11  2017 objects
drwxr-sr-x  5 $user $group   59 May 11  2017 refs

We use a distributed volume with 2 hardware servers (& 2 test machines within the same peer group) and 23 bricks. The output of "gluster volume status" and "gluster volume info" is at the end of this post.

Version-Release number of selected component (if applicable):

We use version 3.12.7 on Debian Stretch (9 / Stable) from the gluster.org repository.

$ gluster --version
glusterfs 3.12.7

$ dpkg -l | grep glusterfs-server
ii  glusterfs-server  3.12.7-1  amd64  clustered file-system (server package)

$ cat /etc/apt/sources.list.d/gluster.list
deb [arch=amd64] https://download.gluster.org/pub/gluster/glusterfs/3.12/LATEST/Debian/stretch/amd64/apt stretch main

$ uname -a
Linux gluster01 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) x86_64 GNU/Linux

How reproducible:

The error appears infrequently and in different directories. Usually a second try of the same command does not produce the error message, so reproducing it is difficult. Git repositories seem to be affected in particular.

Actual results:

The error message:
fatal: unable to access '.git/config': Stale file handle

Expected results:

No error message.

Additional info:

I'm not sure if it's relevant, but the same volume is affected by this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1564071

$ gluster volume status $VOLUME
Status of volume: $VOLUME
Gluster process                                       TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster02:/srv/glusterfs/bricks/DATA201/data    49152     0          Y       4090
Brick gluster02:/srv/glusterfs/bricks/DATA202/data    49153     0          Y       4098
Brick gluster02:/srv/glusterfs/bricks/DATA203/data    49154     0          Y       4106
Brick gluster02:/srv/glusterfs/bricks/DATA204/data    49155     0          Y       4115
Brick gluster02:/srv/glusterfs/bricks/DATA205/data    49156     0          Y       4129
Brick gluster02:/srv/glusterfs/bricks/DATA206/data    49157     0          Y       4128
Brick gluster02:/srv/glusterfs/bricks/DATA207/data    49158     0          Y       4142
Brick gluster02:/srv/glusterfs/bricks/DATA208/data    49159     0          Y       4152
Brick gluster01:/srv/glusterfs/bricks/DATA110/data    49152     0          Y       4419
Brick gluster01:/srv/glusterfs/bricks/DATA111/data    49153     0          Y       4427
Brick gluster01:/srv/glusterfs/bricks/DATA112/data    49154     0          Y       4435
Brick gluster01:/srv/glusterfs/bricks/DATA113/data    49155     0          Y       4445
Brick gluster01:/srv/glusterfs/bricks/DATA114/data    49156     0          Y       4491
Brick gluster02:/srv/glusterfs/bricks/DATA209/data    49160     0          Y       4158
Brick gluster01:/srv/glusterfs/bricks/DATA101/data    49157     0          Y       4456
Brick gluster01:/srv/glusterfs/bricks/DATA102/data    49158     0          Y       4498
Brick gluster01:/srv/glusterfs/bricks/DATA103/data    49159     0          Y       4468
Brick gluster01:/srv/glusterfs/bricks/DATA104/data    49160     0          Y       4510
Brick gluster01:/srv/glusterfs/bricks/DATA105/data    49161     0          Y       4479
Brick gluster01:/srv/glusterfs/bricks/DATA106/data    49162     0          Y       4499
Brick gluster01:/srv/glusterfs/bricks/DATA107/data    49163     0          Y       4516
Brick gluster01:/srv/glusterfs/bricks/DATA108/data    49164     0          Y       4522
Brick gluster01:/srv/glusterfs/bricks/DATA109/data    49165     0          Y       4534
Quota Daemon on localhost                             N/A       N/A        Y       4403
Quota Daemon on gluster04.omics.uni-luebeck.de        N/A       N/A        Y       800
Quota Daemon on gluster03.omics.uni-luebeck.de        N/A       N/A        Y       689
Quota Daemon on gluster05.omics.uni-luebeck.de        N/A       N/A        Y       2890
Quota Daemon on gluster02.omics.uni-luebeck.de        N/A       N/A        Y       4066

$ gluster volume info $VOLUME
Volume Name: $VOLUME
Type: Distribute
Volume ID: 0d210c70-e44f-46f1-862c-ef260514c9f1
Status: Started
Snapshot Count: 0
Number of Bricks: 23
Transport-type: tcp
Bricks:
Brick1: gluster02:/srv/glusterfs/bricks/DATA201/data
Brick2: gluster02:/srv/glusterfs/bricks/DATA202/data
Brick3: gluster02:/srv/glusterfs/bricks/DATA203/data
Brick4: gluster02:/srv/glusterfs/bricks/DATA204/data
Brick5: gluster02:/srv/glusterfs/bricks/DATA205/data
Brick6: gluster02:/srv/glusterfs/bricks/DATA206/data
Brick7: gluster02:/srv/glusterfs/bricks/DATA207/data
Brick8: gluster02:/srv/glusterfs/bricks/DATA208/data
Brick9: gluster01:/srv/glusterfs/bricks/DATA110/data
Brick10: gluster01:/srv/glusterfs/bricks/DATA111/data
Brick11: gluster01:/srv/glusterfs/bricks/DATA112/data
Brick12: gluster01:/srv/glusterfs/bricks/DATA113/data
Brick13: gluster01:/srv/glusterfs/bricks/DATA114/data
Brick14: gluster02:/srv/glusterfs/bricks/DATA209/data
Brick15: gluster01:/srv/glusterfs/bricks/DATA101/data
Brick16: gluster01:/srv/glusterfs/bricks/DATA102/data
Brick17: gluster01:/srv/glusterfs/bricks/DATA103/data
Brick18: gluster01:/srv/glusterfs/bricks/DATA104/data
Brick19: gluster01:/srv/glusterfs/bricks/DATA105/data
Brick20: gluster01:/srv/glusterfs/bricks/DATA106/data
Brick21: gluster01:/srv/glusterfs/bricks/DATA107/data
Brick22: gluster01:/srv/glusterfs/bricks/DATA108/data
Brick23: gluster01:/srv/glusterfs/bricks/DATA109/data
Options Reconfigured:
diagnostics.brick-sys-log-level: WARNING
nfs.addr-namelookup: off
transport.address-family: inet
nfs.disable: on
diagnostics.brick-log-level: WARNING
performance.readdir-ahead: on
auth.allow: $IP-RANGE
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
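Side note: the zero-byte "---------T" entry on gluster01 looks like a DHT link file rather than a real copy of the file. A quick way to check that directly on the brick (run as root; the path is the one from the listing above, the xattr names are the standard GlusterFS ones) would be roughly:

# on gluster01, directly on the brick
getfattr --absolute-names -n trusted.glusterfs.dht.linkto -e text \
    /srv/glusterfs/bricks/DATA113/data/$dir01/.git/config
getfattr --absolute-names -n trusted.gfid -e hex \
    /srv/glusterfs/bricks/DATA113/data/$dir01/.git/config

A healthy link file names the subvolume holding the data file in trusted.glusterfs.dht.linkto, and its trusted.gfid matches the data file's gfid on the other brick; the log lines above suggest that is not the case here.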
Hi,

according to our users, this affects exclusively git projects.

Today I found something interesting in the logs of our rebalances. They contain the following lines (the path varies):

[2018-04-25 16:44:45.443553] E [MSGID: 109023] [dht-rebalance.c:2669:gf_defrag_migrate_single_file] 0-$vol-dht: Migrate file failed: /$somepath/.git/config lookup failed [Stale file handle]
[2018-06-19 22:50:55.621752] I [dht-rebalance.c:3437:gf_defrag_process_dir] 0-$vol-dht: Migration operation on dir /$somepath/.git took 0.03 secs

Side note: this does not count as a failure in the rebalance status. I'm not sure whether that is intended, since it is classified as an error.

The path varies, but it is always something inside a .git folder, mostly the config, sometimes other files. This goes back to February. The logs before February contain messages like this:

[2017-12-15 23:53:02.919615] W [MSGID: 109023] [dht-rebalance.c:2503:gf_defrag_get_entry] 0-$vol-dht: lookup failed for file:/$somepath/.git/config [Stale file handle]
[2018-06-19 22:50:55.621752] I [dht-rebalance.c:3437:gf_defrag_process_dir] 0-$vol-dht: Migration operation on dir /$somepath/.git took 0.03 secs

Similar, but different. In February we migrated from 3.8 to 3.12, so perhaps the two versions reported the same thing slightly differently.

Searching the rebalance log for one of those files showed the following (both before and after the upgrade to 3.12):

$ zgrep "$somepath/.git/config" $vol-rebalance.log.5
[2018-05-01 12:01:00.633768] W [MSGID: 109009] [dht-common.c:2210:dht_lookup_linkfile_cbk] 0-$vol-dht: /$somepath/.git/config: gfid different on data file on $vol-client-22, gfid local = f61909e4-cfbf-4802-af00-70ffb9e79d28, gfid node = f61909e4-cfbf-4802-af00-70ffb9e79d28
[2018-05-01 12:01:00.634506] W [MSGID: 109009] [dht-common.c:1949:dht_lookup_everywhere_cbk] 0-$vol-dht: /$somepath/.git/config: gfid differs on subvolume $vol-client-22, gfid local = 070f37a1-de0c-439b-97dd-ce5311988737, gfid node = f61909e4-cfbf-4802-af00-70ffb9e79d28
[2018-05-01 12:01:00.645447] E [MSGID: 109023] [dht-rebalance.c:2669:gf_defrag_migrate_single_file] 0-$vol-dht: Migrate file failed: /$somepath/.git/config lookup failed [Stale file handle]

It looks like the files somehow have two different GFIDs. dht_lookup_linkfile_cbk actually reports identical GFIDs, but dht_lookup_everywhere_cbk finds two different ones. Could that be the reason for the stale file handles? And if so, why does this happen (and why only to .git files), and how can I fix it?
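To get a full list of affected paths, one option is to pull every ESTALE migration failure out of the rotated rebalance logs. A rough sketch, assuming the default log location under /var/log/glusterfs and the 3.12 message format shown above:

# collect every path whose migration failed with a stale file handle
zcat -f /var/log/glusterfs/$vol-rebalance.log* \
  | grep "Migrate file failed" | grep "Stale file handle" \
  | sed -e 's/.*Migrate file failed: //' -e 's/ lookup failed.*//' \
  | sort -u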
Additional info: I dug around a bit with the GFIDs, and this is the result (these are the GFIDs found in the log file above):

# getfattr -n trusted.glusterfs.pathinfo -e text /mnt/.gfid/f61909e4-cfbf-4802-af00-70ffb9e79d28
# file: mnt/.gfid/f61909e4-cfbf-4802-af00-70ffb9e79d28
trusted.glusterfs.pathinfo="(<DISTRIBUTE:$vol-dht> <POSIX(/srv/glusterfs/bricks/DATA109/data):gluster01:/srv/glusterfs/bricks/DATA109/data/.glusterfs/f6/19/f61909e4-cfbf-4802-af00-70ffb9e79d28>)"

# getfattr -n trusted.glusterfs.pathinfo -e text /mnt/.gfid/070f37a1-de0c-439b-97dd-ce5311988737
getfattr: mnt/.gfid/070f37a1-de0c-439b-97dd-ce5311988737: No such file or directory

# getfattr -n glusterfs.gfid.string /mnt/$somepath
# file: mnt/$somepath
glusterfs.gfid.string="1d4a5ccc-986f-4051-aee9-d31dd2ca3775"

# getfattr -n glusterfs.gfid.string /mnt/$somepath/.git
# file: mnt/$somepath/.git
glusterfs.gfid.string="ca2dfd69-7466-4949-911a-b8874c364987"

# getfattr -n glusterfs.gfid.string /mnt/$somepath/.git/config
getfattr: /mnt/$somepath/.git/config: Stale file handle

# cat /mnt/$somepath/.git/config
cat: /mnt/$somepath/.git/config: Stale file handle

# cd /mnt/$somepath/.git/
/mnt/$somepath/.git# ls
branches  COMMIT_EDITMSG  config  description  FETCH_HEAD  HEAD  hooks  index  info  logs  objects  ORIG_HEAD  packed-refs  refs

/mnt/$somepath/.git# cat config
[core]
        repositoryformatversion = 0
        filemode = true
        bare = false
        logallrefupdates = true
[remote "origin"]
...
[branch "master"]
        remote = origin
        merge = refs/heads/master

/mnt/$somepath/.git# getfattr -n glusterfs.gfid.string /mnt/$somepath/.git/config
# file: mnt/$somepath/.git/config
glusterfs.gfid.string="f61909e4-cfbf-4802-af00-70ffb9e79d28"

/mnt/$somepath/.git# cd

# getfattr -n glusterfs.gfid.string /mnt/$somepath/.git/config
# file: mnt/$somepath/.git/config
glusterfs.gfid.string="f61909e4-cfbf-4802-af00-70ffb9e79d28"

# getfattr -n trusted.glusterfs.pathinfo -e text /mnt/.gfid/f61909e4-cfbf-4802-af00-70ffb9e79d28
# file: mnt/.gfid/f61909e4-cfbf-4802-af00-70ffb9e79d28
trusted.glusterfs.pathinfo="(<DISTRIBUTE:$vol-dht> <POSIX(/srv/glusterfs/bricks/DATA109/data):gluster01:/srv/glusterfs/bricks/DATA109/data/$somepath/.git/config>)"

It seems like cat'ing the file via a relative path name fixes the GFID.
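For anyone trying to reproduce these lookups: the /mnt/.gfid/<gfid> paths above are only visible when the volume is mounted with the gfid-access feature enabled. A minimal sketch, assuming a FUSE client and a spare mount point /mnt/gfid-test (the mount point name is just for illustration):

# mount the volume with the auxiliary .gfid access path enabled
mount -t glusterfs -o aux-gfid-mount gluster01:/$VOLUME /mnt/gfid-test

# resolve a gfid back to brick paths via the virtual pathinfo xattr
getfattr -n trusted.glusterfs.pathinfo -e text \
    /mnt/gfid-test/.gfid/f61909e4-cfbf-4802-af00-70ffb9e79d28

# and the reverse direction: ask the client for the gfid of a path
getfattr -n glusterfs.gfid.string /mnt/gfid-test/$somepath/.git/config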
So I went and looked at where those two GFIDs come from. All of the affected config files have a link file, and that link file has a different gfid than the original. I checked a few other link files, and they seem to have the same gfid as the original file.

Can I fix my broken files? And if so, how?
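For reference, candidate link files on a brick can be enumerated so that their gfid can be compared with the data file on the other server. A rough sketch (run as root; the brick path is just one example from this volume, and the loop is only a starting point):

BRICK=/srv/glusterfs/bricks/DATA113/data    # example brick, adjust as needed

# DHT link files are zero-byte files whose mode is exactly the sticky bit (---------T)
find "$BRICK" -path "$BRICK/.glusterfs" -prune -o \
     -type f -perm 1000 -size 0 -print0 |
while IFS= read -r -d '' f; do
    echo "== $f"
    # which subvolume the link file points to
    getfattr --absolute-names -n trusted.glusterfs.dht.linkto -e text "$f" 2>/dev/null
    # the gfid stored on the link file; it should match the data file's gfid
    getfattr --absolute-names -n trusted.gfid -e hex "$f" 2>/dev/null
done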
> Can I fix my broken files? And if so, how?

I recommend removing the link file. It can get auto-created again by GlusterFS if needed.
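For the record, a minimal sketch of how such a removal is typically done on the brick side (the brick path is the example from comment 0; the .glusterfs path is a placeholder that has to be built from the link file's own gfid):

# on the server whose brick holds the stale 0-byte ---------T link file
BRICK=/srv/glusterfs/bricks/DATA113/data    # example from comment 0
F="$BRICK/$dir01/.git/config"

# note the link file's gfid first; its backend hard link lives under
# $BRICK/.glusterfs/<first 2 hex chars>/<next 2 hex chars>/<full gfid>
getfattr --absolute-names -n trusted.gfid -e hex "$F"

# remove the link file and its .glusterfs hard link
rm "$F"
# rm "$BRICK/.glusterfs/<xx>/<yy>/<full-gfid>"   # path built from the gfid printed above

# then trigger a fresh named lookup from a client mount; DHT recreates a
# correct link file if the layout still needs one
stat /mnt/$dir01/.git/config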
Are you still seeing this problem?
Unknown. The previously stale files were still stale. Removing the link file fixed them, thanks for the tip. I checked the GFIDs, and they now match.

Our users haven't reported another stale file handle. That doesn't necessarily mean there isn't one. The only way I know to make sure there isn't one is a rebalance, and there hasn't been a need for one in a while.

We'll switch from a purely distributed to a distributed-dispersed volume soon. I'm not sure whether that influences this behaviour, but if it does, this problem might solve itself for us.
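(A cheaper, read-only sweep from a client mount might also surface remaining stale entries without a rebalance; a rough sketch, assuming the volume is mounted at /mnt:)

# stat every file through the FUSE mount and collect lookup errors;
# stale entries show up as "Stale file handle" in stale.log
find /mnt -type f -print0 2>stale.log | xargs -0 -r stat > /dev/null 2>> stale.log
grep -i "stale file handle" stale.log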
(In reply to g.amedick from comment #6)
> Unknown. The previously stale files were still stale. Removing the link
> file fixed them, thanks for the tip. I checked the GFIDs, and they now
> match.
>
> Our users haven't reported another stale file handle. That doesn't
> necessarily mean there isn't one. The only way I know to make sure there
> isn't one is a rebalance, and there hasn't been a need for one in a while.
>
> We'll switch from a purely distributed to a distributed-dispersed volume
> soon. I'm not sure whether that influences this behaviour, but if it does,
> this problem might solve itself for us.

Thanks. Please let us know if you see this again.
Can I close this BZ? We can reopen it if you see it again.
Hi, yes, this can be closed
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days