Bug 1569074

Summary: fatal: unable to access file: Stale file handle
Product: [Community] GlusterFS
Reporter: Frank Rühlemann <ruehlemann>
Component: distribute
Assignee: bugs <bugs>
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: medium
Version: 3.12
CC: atumball, bugs, g.amedick, nbalacha, ruehlemann
Keywords: Triaged
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2018-10-10 10:25:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---

Description Frank Rühlemann 2018-04-18 14:54:58 UTC
Description of problem:
Some of our users have reported "Stale file handle" errors while working with directories or files on the GlusterFS file system.

For example:
$ git status
fatal: unable to access '.git/config': Stale file handle

In this case we at least found matching log entries in the associated $volume.log on the client (user directory names and volume names replaced with $dir01 & $VOLUME):
[2018-04-18 13:52:31.892514] W [MSGID: 109009] [dht-common.c:2210:dht_lookup_linkfile_cbk] 4-$VOLUME-dht: /$dir01/.git/config: gfid different on data file on $VOLUME-client-2, gfid local = 00000000-0000-0000-0000-000000000000, gfid node = e8894c42-63f8-4f66-bdc2-ffef3e2eaef4
[2018-04-18 13:52:31.893624] W [MSGID: 109009] [dht-common.c:1949:dht_lookup_everywhere_cbk] 4-$VOLUME-dht: /$dir01/.git/config: gfid differs on subvolume $VOLUME-client-2, gfid local = 39edeb37-ebf8-4ac3-8f93-dda51de5783f, gfid node = e8894c42-63f8-4f66-bdc2-ffef3e2eaef4
[2018-04-18 13:52:31.893832] W [MSGID: 109009] [dht-common.c:1949:dht_lookup_everywhere_cbk] 4-$VOLUME-dht: /$dir01/.git/config: gfid differs on subvolume $VOLUME-client-11, gfid local = e8894c42-63f8-4f66-bdc2-ffef3e2eaef4, gfid node = 39edeb37-ebf8-4ac3-8f93-dda51de5783f
[2018-04-18 13:52:31.894042] W [fuse-bridge.c:540:fuse_entry_cbk] 0-glusterfs-fuse: 526595853: LOOKUP() /$dir01/.git/config => -1 (Stale file handle)
[2018-04-18 13:52:31.896231] W [fuse-bridge.c:540:fuse_entry_cbk] 0-glusterfs-fuse: 526595876: LOOKUP() /$dir01/.git/config => -1 (Stale file handle)
[2018-04-18 13:52:31.895040] W [MSGID: 109009] [dht-common.c:2210:dht_lookup_linkfile_cbk] 4-$VOLUME-dht: /$dir01/.git/config: gfid different on data file on $VOLUME-client-2, gfid local = 00000000-0000-0000-0000-000000000000, gfid node = e8894c42-63f8-4f66-bdc2-ffef3e2eaef4
[2018-04-18 13:52:31.895555] W [MSGID: 109009] [dht-common.c:1949:dht_lookup_everywhere_cbk] 4-$VOLUME-dht: /$dir01/.git/config: gfid differs on subvolume $VOLUME-client-2, gfid local = 39edeb37-ebf8-4ac3-8f93-dda51de5783f, gfid node = e8894c42-63f8-4f66-bdc2-ffef3e2eaef4
[2018-04-18 13:52:31.895992] W [MSGID: 109009] [dht-common.c:1949:dht_lookup_everywhere_cbk] 4-$VOLUME-dht: /$dir01/.git/config: gfid differs on subvolume $VOLUME-client-11, gfid local = e8894c42-63f8-4f66-bdc2-ffef3e2eaef4, gfid node = 39edeb37-ebf8-4ac3-8f93-dda51de5783f
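
To get a rough overview of how many paths are affected, the client log can be grepped for these messages (a sketch, assuming the default log directory /var/log/glusterfs/ and the $volume.log mentioned above):

$ grep -c "Stale file handle" /var/log/glusterfs/$volume.log
$ grep "gfid differs" /var/log/glusterfs/$volume.log | grep -o ' /[^:]*' | sort | uniq -c | sort -rn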

Strangely, the file exists twice (once on each server), and the permissions on one of the copies look very unusual. This is the state of the two files (username, group & path replaced):

On gluster01:
$ ls -lah /srv/glusterfs/bricks/DATA113/data/$dir01/.git/
total 44K
drwxr-sr-x  8 $user $group 4.0K May 11  2017 .
drwxr-sr-x  7 $user $group   93 May 11  2017 ..
drwxr-sr-x  2 $user $group   10 May 11  2017 branches
---------T  2 root    $group    0 May 11  2017 config
drwxr-sr-x  2 $user $group   10 Feb  1 21:27 hooks
drwxr-sr-x  2 $user $group   10 Feb  1 21:27 info
drwxr-sr-x  3 $user $group   25 May 11  2017 logs
drwxr-sr-x 19 $user $group 4.0K May 11  2017 objects
drwxr-sr-x  5 $user $group   59 May 11  2017 refs

On gluster02:
$ ls -lah /srv/glusterfs/bricks/DATA203/data/$dir01/.git/
total 48K
drwxr-sr-x  8 $user $group 4.0K Sep 26  2017 .
drwxr-sr-x  7 $user $group  112 Feb  2 19:28 ..
drwxr-sr-x  2 $user $group   10 May 11  2017 branches
-rwxr--r--  2 $user $group  293 May 11  2017 config
drwxr-sr-x  2 $user $group   10 Sep 26  2017 hooks
drwxr-sr-x  2 $user $group   10 May 11  2017 info
drwxr-sr-x  3 $user $group   25 May 11  2017 logs
drwxr-sr-x 19 $user $group 4.0K May 11  2017 objects
drwxr-sr-x  5 $user $group   59 May 11  2017 refs
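
The zero-byte "---------T" entry on gluster01 looks like a DHT linkfile. Whether it really is one, and whether its gfid matches the data file on gluster02, could be checked on the bricks with something like this (a sketch using the paths from above; trusted.glusterfs.dht.linkto names the subvolume that is supposed to hold the data file):

On gluster01 (suspected linkfile):
# getfattr -n trusted.glusterfs.dht.linkto -e text /srv/glusterfs/bricks/DATA113/data/$dir01/.git/config
# getfattr -n trusted.gfid -e hex /srv/glusterfs/bricks/DATA113/data/$dir01/.git/config

On gluster02 (data file):
# getfattr -n trusted.gfid -e hex /srv/glusterfs/bricks/DATA203/data/$dir01/.git/config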

We use a distributed volume with 2 hardware servers (& 2 test machines within the same peer group) and 23 bricks. The results of "gluster volume status" and "gluster volume info" are at the end of this post.

Version-Release number of selected component (if applicable):
We use version 3.12.7 on Debian Stretch (9 / stable) from the gluster.org repository.

$ gluster --version
glusterfs 3.12.7

$ dpkg -l | grep glusterfs-server
ii  glusterfs-server               3.12.7-1                       amd64        clustered file-system (server package)

$ cat /etc/apt/sources.list.d/gluster.list
deb [arch=amd64] https://download.gluster.org/pub/gluster/glusterfs/3.12/LATEST/Debian/stretch/amd64/apt stretch main

$ uname -a
Linux gluster01 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) x86_64 GNU/Linux

How reproducible:
The error appears infrequently and in different directories. Usually a second try of the same command doesn't produce the error message, so it is difficult to reproduce.
Git repositories seem to be affected in particular.
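
Since a retry usually succeeds, one crude way to catch an occurrence in the act is to loop over an affected command until it fails and then retry immediately (a sketch; /mnt/$REPO stands for any git checkout on the volume):

$ cd /mnt/$REPO
$ while git status > /dev/null 2> stale.err; do sleep 60; done
$ date; cat stale.err
$ git status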

Actual results:
The error message:
fatal: unable to access '.git/config': Stale file handle

Expected results:
No error message.

Additional info:
I'm not sure if it's relevant, but the same volume is affected by this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1564071

$ gluster volume status $VOLUME
Status of volume: $VOLUME
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluster02:/srv/glusterfs/bricks/DATA2
01/data                                     49152     0          Y       4090 
Brick gluster02:/srv/glusterfs/bricks/DATA2
02/data                                     49153     0          Y       4098 
Brick gluster02:/srv/glusterfs/bricks/DATA2
03/data                                     49154     0          Y       4106 
Brick gluster02:/srv/glusterfs/bricks/DATA2
04/data                                     49155     0          Y       4115 
Brick gluster02:/srv/glusterfs/bricks/DATA2
05/data                                     49156     0          Y       4129 
Brick gluster02:/srv/glusterfs/bricks/DATA2
06/data                                     49157     0          Y       4128 
Brick gluster02:/srv/glusterfs/bricks/DATA2
07/data                                     49158     0          Y       4142 
Brick gluster02:/srv/glusterfs/bricks/DATA2
08/data                                     49159     0          Y       4152 
Brick gluster01:/srv/glusterfs/bricks/DATA1
10/data                                     49152     0          Y       4419 
Brick gluster01:/srv/glusterfs/bricks/DATA1
11/data                                     49153     0          Y       4427 
Brick gluster01:/srv/glusterfs/bricks/DATA1
12/data                                     49154     0          Y       4435 
Brick gluster01:/srv/glusterfs/bricks/DATA1
13/data                                     49155     0          Y       4445 
Brick gluster01:/srv/glusterfs/bricks/DATA1
14/data                                     49156     0          Y       4491 
Brick gluster02:/srv/glusterfs/bricks/DATA2
09/data                                     49160     0          Y       4158 
Brick gluster01:/srv/glusterfs/bricks/DATA1
01/data                                     49157     0          Y       4456 
Brick gluster01:/srv/glusterfs/bricks/DATA1
02/data                                     49158     0          Y       4498 
Brick gluster01:/srv/glusterfs/bricks/DATA1
03/data                                     49159     0          Y       4468 
Brick gluster01:/srv/glusterfs/bricks/DATA1
04/data                                     49160     0          Y       4510 
Brick gluster01:/srv/glusterfs/bricks/DATA1
05/data                                     49161     0          Y       4479 
Brick gluster01:/srv/glusterfs/bricks/DATA1
06/data                                     49162     0          Y       4499 
Brick gluster01:/srv/glusterfs/bricks/DATA1
07/data                                     49163     0          Y       4516 
Brick gluster01:/srv/glusterfs/bricks/DATA1
08/data                                     49164     0          Y       4522 
Brick gluster01:/srv/glusterfs/bricks/DATA1
09/data                                     49165     0          Y       4534 
Quota Daemon on localhost                   N/A       N/A        Y       4403 
Quota Daemon on gluster04.omics.uni-luebeck
.de                                         N/A       N/A        Y       800  
Quota Daemon on gluster03.omics.uni-luebeck
.de                                         N/A       N/A        Y       689  
Quota Daemon on gluster05.omics.uni-luebeck
.de                                         N/A       N/A        Y       2890 
Quota Daemon on gluster02.omics.uni-luebeck
.de                                         N/A       N/A        Y       4066

$ gluster volume info $VOLUME
 
Volume Name: $VOLUME
Type: Distribute
Volume ID: 0d210c70-e44f-46f1-862c-ef260514c9f1
Status: Started
Snapshot Count: 0
Number of Bricks: 23
Transport-type: tcp
Bricks:
Brick1: gluster02:/srv/glusterfs/bricks/DATA201/data
Brick2: gluster02:/srv/glusterfs/bricks/DATA202/data
Brick3: gluster02:/srv/glusterfs/bricks/DATA203/data
Brick4: gluster02:/srv/glusterfs/bricks/DATA204/data
Brick5: gluster02:/srv/glusterfs/bricks/DATA205/data
Brick6: gluster02:/srv/glusterfs/bricks/DATA206/data
Brick7: gluster02:/srv/glusterfs/bricks/DATA207/data
Brick8: gluster02:/srv/glusterfs/bricks/DATA208/data
Brick9: gluster01:/srv/glusterfs/bricks/DATA110/data
Brick10: gluster01:/srv/glusterfs/bricks/DATA111/data
Brick11: gluster01:/srv/glusterfs/bricks/DATA112/data
Brick12: gluster01:/srv/glusterfs/bricks/DATA113/data
Brick13: gluster01:/srv/glusterfs/bricks/DATA114/data
Brick14: gluster02:/srv/glusterfs/bricks/DATA209/data
Brick15: gluster01:/srv/glusterfs/bricks/DATA101/data
Brick16: gluster01:/srv/glusterfs/bricks/DATA102/data
Brick17: gluster01:/srv/glusterfs/bricks/DATA103/data
Brick18: gluster01:/srv/glusterfs/bricks/DATA104/data
Brick19: gluster01:/srv/glusterfs/bricks/DATA105/data
Brick20: gluster01:/srv/glusterfs/bricks/DATA106/data
Brick21: gluster01:/srv/glusterfs/bricks/DATA107/data
Brick22: gluster01:/srv/glusterfs/bricks/DATA108/data
Brick23: gluster01:/srv/glusterfs/bricks/DATA109/data
Options Reconfigured:
diagnostics.brick-sys-log-level: WARNING
nfs.addr-namelookup: off
transport.address-family: inet
nfs.disable: on
diagnostics.brick-log-level: WARNING
performance.readdir-ahead: on
auth.allow: $IP-RANGE
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on

Comment 1 g.amedick 2018-06-20 12:51:31 UTC
Hi,

according to our users, this exclusively affects git projects. Today I found something interesting in the logs of our rebalances. They contain lines like the following (the path varies):


[2018-04-25 16:44:45.443553] E [MSGID: 109023] [dht-rebalance.c:2669:gf_defrag_migrate_single_file] 0-$vol-dht: Migrate file failed: /$somepath/.git/config lookup failed [Stale file handle]
[2018-06-19 22:50:55.621752] I [dht-rebalance.c:3437:gf_defrag_process_dir] 0-$vol-dht: Migration operation on dir /$somepath/.git took 0.03 secs

Side note: this does not count as a failure in the rebalance status. I'm not sure whether that is intended, since the entry is classified as an error.
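
For reference, the counter I mean is the "failures" column reported by the rebalance status command; it stays at 0 even though the migrate-file error above is logged:

$ gluster volume rebalance $VOLUME status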

The path varies, but it's always something inside a .git folder; mostly the config file, sometimes other files.

This goes back to February. The logs from before February contain messages like this:

[2017-12-15 23:53:02.919615] W [MSGID: 109023] [dht-rebalance.c:2503:gf_defrag_get_entry] 0-$vol-dht: lookup failed for file:/$somepath/.git/config [Stale file handle]
[2018-06-19 22:50:55.621752] I [dht-rebalance.c:3437:gf_defrag_process_dir] 0-$vol-dht: Migration operation on dir /$somepath/.git took 0.03 secs

Similar, but different. In February we migrated from 3.8 to 3.12, so perhaps the two versions reported the same underlying issue slightly differently.

Searching the rebalance log for one of those files showed the following (both before and after the upgrade to 3.12):

$ zgrep "$somepath/.git/config" $vol-rebalance.log.5
[2018-05-01 12:01:00.633768] W [MSGID: 109009] [dht-common.c:2210:dht_lookup_linkfile_cbk] 0-$vol-dht: /$somepath/.git/config: gfid different on data file on $vol-client-22, gfid local = f61909e4-cfbf-4802-af00-70ffb9e79d28, gfid node = f61909e4-cfbf-4802-af00-70ffb9e79d28 
[2018-05-01 12:01:00.634506] W [MSGID: 109009] [dht-common.c:1949:dht_lookup_everywhere_cbk] 0-$vol-dht: /$somepath/.git/config: gfid differs on subvolume $vol-client-22, gfid local = 070f37a1-de0c-439b-97dd-ce5311988737, gfid node = f61909e4-cfbf-4802-af00-70ffb9e79d28
[2018-05-01 12:01:00.645447] E [MSGID: 109023] [dht-rebalance.c:2669:gf_defrag_migrate_single_file] 0-$vol-dht: Migrate file failed: /$somepath/.git/config lookup failed [Stale file handle]


It looks like the files somehow have two different gfids. dht_lookup_linkfile_cbk actually reports identical gfids, but dht_lookup_everywhere_cbk finds two different ones.

Could that be the reason for the stale file handles? And if so, why does it happen (and why only to .git files), and how can I fix it?

Comment 2 g.amedick 2018-06-20 15:49:05 UTC
Additional info: I dug around a bit with the gfids, and this is the result (these are the gfids found in the log file above):

# getfattr -n trusted.glusterfs.pathinfo -e text /mnt/.gfid/f61909e4-cfbf-4802-af00-70ffb9e79d28
# file: mnt/.gfid/f61909e4-cfbf-4802-af00-70ffb9e79d28
trusted.glusterfs.pathinfo="(<DISTRIBUTE:$vol-dht> <POSIX(/srv/glusterfs/bricks/DATA109/data):gluster01:/srv/glusterfs/bricks/DATA109/data/.glusterfs/f6/19/f61909e4-cfbf-4802-af00-70ffb9e79d28>)"

# getfattr -n trusted.glusterfs.pathinfo -e text /mnt/.gfid/070f37a1-de0c-439b-97dd-ce5311988737
getfattr: mnt/.gfid/070f37a1-de0c-439b-97dd-ce5311988737: No such file or directory
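
Since the mount can't resolve that gfid, it could still be looked up directly on the bricks (a sketch, assuming the usual .glusterfs/<aa>/<bb>/<gfid> layout on each brick; to be run on both servers):

# for b in /srv/glusterfs/bricks/DATA*/data; do ls -l "$b/.glusterfs/07/0f/070f37a1-de0c-439b-97dd-ce5311988737" 2>/dev/null; done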

# getfattr -n glusterfs.gfid.string /mnt/$somepath
# file: mnt/$somepath
glusterfs.gfid.string="1d4a5ccc-986f-4051-aee9-d31dd2ca3775"

# getfattr -n glusterfs.gfid.string /mnt/$somepath/.git
# file: mnt/$somepath/.git
glusterfs.gfid.string="ca2dfd69-7466-4949-911a-b8874c364987"

# getfattr -n glusterfs.gfid.string /mnt/$somepath/.git/config
getfattr: /mnt/$somepath/.git/config: Stale file handle

# cat /mnt/$somepath/.git/config
cat: /mnt/$somepath/.git/config: Stale file handle

# cd /mnt/$somepath/.git/
/mnt/$somepath/.git# ls
branches  COMMIT_EDITMSG  config  description  FETCH_HEAD  HEAD  hooks  index  info  logs  objects  ORIG_HEAD  packed-refs  refs

/mnt/$somepath/.git# cat config 
[core]
	repositoryformatversion = 0
	filemode = true
	bare = false
	logallrefupdates = true
[remote "origin"]
...
[branch "master"]
	remote = origin
	merge = refs/heads/master

/mnt/$somepath/.git# getfattr -n glusterfs.gfid.string /mnt/$somepath/.git/config
# file: mnt/$somepath/.git/config
glusterfs.gfid.string="f61909e4-cfbf-4802-af00-70ffb9e79d28"

/mnt/$somepath/.git# cd 
# getfattr -n glusterfs.gfid.string /mnt/$somepath/.git/config
# file: mnt/$somepath/.git/config
glusterfs.gfid.string="f61909e4-cfbf-4802-af00-70ffb9e79d28"

# getfattr -n trusted.glusterfs.pathinfo -e text /mnt/.gfid/f61909e4-cfbf-4802-af00-70ffb9e79d28
# file: mnt/.gfid/f61909e4-cfbf-4802-af00-70ffb9e79d28
trusted.glusterfs.pathinfo="(<DISTRIBUTE:$vol-dht> <POSIX(/srv/glusterfs/bricks/DATA109/data):gluster01:/srv/glusterfs/bricks/DATA109/data/$somepath/.git/config>)"

It seems like cat-ing the file with a relative path name fixes the gfid.

Comment 3 g.amedick 2018-07-19 12:31:11 UTC
So I went and looked at where those two gfids come from. All of those config files have a linkfile, and the linkfile has a different gfid than the original file.
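
For comparison, the other linkfile candidates on a brick can be listed like this (a sketch; linkfiles show up as zero-byte files whose mode is only the sticky bit, like the "---------T" entry in the description, and the brick path is just an example):

# find /srv/glusterfs/bricks/DATA113/data -path '*/.glusterfs' -prune -o -type f -size 0 -perm 1000 -print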

I checked a few other linkfiles and they seem to have the same gfid as their original files.
Can I fix my broken files? If so, how?

Comment 4 Amar Tumballi 2018-09-18 09:23:31 UTC
> Can I fix my broken files? If so, how?

I recommend removing the linkfile; it can be auto-created again by glusterfs when needed.
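
Roughly, as a sketch (brick path and file taken from the description above; the gfid comes from the linkfile's trusted.gfid xattr, and removing the matching .glusterfs hardlink as well avoids leaving an orphan behind):

# F=/srv/glusterfs/bricks/DATA113/data/$dir01/.git/config
# getfattr -n trusted.gfid -e hex "$F"
# rm "$F"
# rm /srv/glusterfs/bricks/DATA113/data/.glusterfs/$AA/$BB/$GFID
# ls -l /mnt/$dir01/.git/config

Here $GFID is the dashed form of that gfid and $AA/$BB are its first two byte pairs; the final lookup from a client re-creates the linkfile if it is still needed.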

Comment 5 Nithya Balachandran 2018-09-19 04:23:56 UTC
Are you still seeing this problem?

Comment 6 g.amedick 2018-09-20 14:42:50 UTC
Unknown. The previously "staled" files were still stale. Removing the linkfiles fixed them, thanks for the tip. I checked the gfids; they now match.

Our users haven't reported another stale file handle, which doesn't necessarily mean there isn't one. The only way I know of to make sure is a rebalance, and there hasn't been a need for one in a while.
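
That said, a client-side sweep might also surface stale entries; a minimal sketch (assuming the volume is mounted at /mnt; it stats every file, so it would be slow on a volume this size):

$ find /mnt -xdev -type f -exec stat {} + > /dev/null 2> /tmp/stale-scan.err
$ grep -i "stale file handle" /tmp/stale-scan.err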

We'll switch from a pure distributed volume to a distributed dispersed volume soon. I'm not sure whether that influences this behaviour, but if it does, the problem might solve itself for us.

Comment 7 Nithya Balachandran 2018-09-20 16:03:43 UTC
(In reply to g.amedick from comment #6)
> Unknown. The previously "staled" files were still stale. Removing the
> linkfiles fixed them, thanks for the tip. I checked the gfids; they now
> match.
> 
> Our users haven't reported another stale file handle, which doesn't
> necessarily mean there isn't one. The only way I know of to make sure is a
> rebalance, and there hasn't been a need for one in a while.
> 
> We'll switch from a pure distributed volume to a distributed dispersed
> volume soon. I'm not sure whether that influences this behaviour, but if it
> does, the problem might solve itself for us.

Thanks. Please let us know if you see this again.

Comment 8 Nithya Balachandran 2018-10-08 04:43:22 UTC
Can I close this BZ? We can reopen it if you see it again.

Comment 9 g.amedick 2018-10-10 10:01:36 UTC
Hi,

yes, this can be closed

Comment 10 Red Hat Bugzilla 2023-09-14 04:27:01 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days