Bug 1480314
Summary: | [Distribute]: File migration failures with ESTALE and ENOENT errors during a remove-brick operation | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Ambarish <asoman> |
Component: | distribute | Assignee: | Susant Kumar Palai <spalai> |
Status: | CLOSED WORKSFORME | QA Contact: | Sayalee <saraut> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | rhgs-3.3 | CC: | amukherj, nbalacha, rhinduja, rhs-bugs, saraut, sraut, storage-qa-internal, tdesala |
Target Milestone: | --- | Keywords: | Reopened, Triaged, ZStream |
Target Release: | --- | Flags: | tdesala: needinfo- |
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2019-11-05 05:21:23 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Ambarish
2017-08-10 16:30:23 UTC
One clarification here - rebalance itself has not failed. Some files failed to be migrated.

(In reply to Nithya Balachandran from comment #2)
> One clarification here - rebalance itself has not failed. Some files failed
> to be migrated.

Remove-brick status does show non-zero failures on each of the nodes:

[root@gqas013 ~]# gluster v remove-brick butcher gqas008.sbu.lab.eng.bos.redhat.com:/bricks6/brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks6/brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks6/brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks6/brick status

Node                                  Rebalanced-files      size   scanned   failures   skipped      status   run time in h:m:s
---------                                  -----------   -------   -------   --------   -------   ---------   -----------------
localhost                                       552180    31.5GB   2558412          1         0   completed             9:13:16
gqas005.sbu.lab.eng.bos.redhat.com              548099   239.5GB   2554745          4         0   completed             9:13:15
gqas006.sbu.lab.eng.bos.redhat.com              506016   238.6GB   2571464          1         0   completed             9:13:15
gqas008.sbu.lab.eng.bos.redhat.com              508081   394.7GB   2570975          1         0   completed             9:13:15

That is what I mean by file migration failures. If the rebalance itself had failed, it would no longer be running and the status would show "failed" instead of "in progress".

Update: we don't have the RCA yet, but here are some observations for the file /d1/dir16/linux-4.13-rc4/drivers/staging/rtl8188eu/hal/odm.c. (Similar observations were found for /d1/dir12/linux-4.13-rc4/fs/overlayfs/overlayfs.h; I will confirm whether all the failures are related.)

Activities recorded for odm.c (node 8 represents gqas008, and so on):

t0 - the file was present on rep-1.
t1 - node 8 tried to migrate it from rep-1 to rep-0 (12:21:09.556685).
t2 - node 5 tried to migrate it from rep-1 to rep-8 (12:21:09.631189).
t3 - node 8 deleted the linkto file created by the migration from node 5 (attempt at 12:21:09.639110, deletion confirmed at 12:21:09.640050).
t4 - node 8 migrated the file from rep-1 to rep-0 (12:21:09.640131).
t5 - node 5's migration failed because the destination file on rep-8 had been deleted by node 8 (lookup failure at 12:21:09.640901).

So from the logs it looks like the file was initially on rep-1. rep-1 has two bricks: 1) 006/b1 and 2) 008/b1. Interestingly, migration was triggered from nodes 005 and 008. Given that both migration logs point to rep-1 as the source, and 005 is not part of rep-1, I am wondering what caused 005 to take up migration of the file; this is where the bug lies. The migration on 008 deletes the linkto file created by the migration on 005, leading to a migration failure on 005, but itself successfully migrates the file to rep-0.
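For reference, a DHT linkto file on a brick is a zero-byte file with only the sticky bit set (the "1000 permission") plus a trusted.glusterfs.dht.linkto xattr naming the subvolume that holds the data. A hypothetical check, had the stale linkto survived on a brick (the path is odm.c's and the output is illustrative, not captured from this setup):

# stat -c '%A %s %n' /bricks1/brick/d1/dir16/linux-4.13-rc4/drivers/staging/rtl8188eu/hal/odm.c
---------T 0 /bricks1/brick/d1/dir16/linux-4.13-rc4/drivers/staging/rtl8188eu/hal/odm.c
# getfattr -n trusted.glusterfs.dht.linkto -e text /bricks1/brick/d1/dir16/linux-4.13-rc4/drivers/staging/rtl8188eu/hal/odm.c
# file: bricks1/brick/d1/dir16/linux-4.13-rc4/drivers/staging/rtl8188eu/hal/odm.c
trusted.glusterfs.dht.linkto="butcher-replicate-1"

During migration the destination linkto points back at the source subvolume (here, rep-1) until the data move completes; the ls checks below only look for the migrated data file.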
File location post migration:

[root@gqas005 ~]# ssh root@gqas005 ls -l /bricks*/brick/d1/dir16/linux-4.13-rc4/drivers/staging/rtl8188eu/hal/odm.c
-rw-rw-r-- 2 root root 44436 Aug 6 21:44 /bricks1/brick/d1/dir16/linux-4.13-rc4/drivers/staging/rtl8188eu/hal/odm.c
[root@gqas005 ~]# ssh root@gqas006 ls -l /bricks*/brick/d1/dir16/linux-4.13-rc4/drivers/staging/rtl8188eu/hal/odm.c
ls: cannot access /bricks1/brick/d1/dir16/linux-4.13-rc4/drivers/staging/rtl8188eu/hal/odm.c: No such file or directory
[root@gqas005 ~]# ssh root@gqas008 ls -l /bricks*/brick/d1/dir16/linux-4.13-rc4/drivers/staging/rtl8188eu/hal/odm.c
ls: cannot access /bricks1/brick/d1/dir16/linux-4.13-rc4/drivers/staging/rtl8188eu/hal/odm.c: No such file or directory
[root@gqas005 ~]# ssh root@gqas013 ls -l /bricks*/brick/d1/dir16/linux-4.13-rc4/drivers/staging/rtl8188eu/hal/odm.c
-rw-rw-r-- 2 root root 44436 Aug 6 21:44 /bricks1/brick/d1/dir16/linux-4.13-rc4/drivers/staging/rtl8188eu/hal/odm.c

I will confirm in one more comment whether all the failures are related.

Susant

Similar observations for the file d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h (and the rest of the failures). Pasting the logs in case someone wants to take a look.

[root@gqas005 ~]# ssh root@gqas005 cat /var/log/glusterfs/butcher-rebalance.log | grep "d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h"
[root@gqas005 ~]# ssh root@gqas006 cat /var/log/glusterfs/butcher-rebalance.log | grep "d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h"
[2017-08-10 12:31:29.311198] I [dht-rebalance.c:1570:dht_migrate_file] 0-butcher-dht: /d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h: attempting to move from butcher-replicate-9 to butcher-replicate-0
[2017-08-10 12:31:29.315337] E [MSGID: 109023] [dht-rebalance.c:748:__dht_rebalance_create_dst_file] 0-butcher-dht: failed to open /d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h on butcher-replicate-0 [No such file or directory]
[2017-08-10 12:31:29.315391] E [MSGID: 0] [dht-rebalance.c:1688:dht_migrate_file] 0-butcher-dht: Create dst failed on - butcher-replicate-0 for file - /d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h
[2017-08-10 12:31:29.315958] E [MSGID: 109023] [dht-rebalance.c:2757:gf_defrag_migrate_single_file] 0-butcher-dht: migrate-data failed for /d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h [No such file or directory]
[root@gqas005 ~]# ssh root@gqas008 cat /var/log/glusterfs/butcher-rebalance.log | grep "d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h"
[root@gqas005 ~]# ssh root@gqas013 cat /var/log/glusterfs/butcher-rebalance.log | grep "d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h"
[2017-08-10 12:31:29.287690] I [dht-rebalance.c:1570:dht_migrate_file] 0-butcher-dht: /d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h: attempting to move from butcher-replicate-10 to butcher-replicate-9
[2017-08-10 12:31:29.312523] I [MSGID: 109045] [dht-common.c:2015:dht_lookup_everywhere_cbk] 0-butcher-dht: attempting deletion of stale linkfile /d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h on butcher-replicate-0 (hashed subvol is butcher-replicate-9)
[2017-08-10 12:31:29.315078] I [MSGID: 109069] [dht-common.c:1327:dht_lookup_unlink_cbk] 0-butcher-dht: lookup_unlink returned with op_ret -> 0 and op-errno -> 0 for /d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h
[2017-08-10 12:31:29.315191] I [MSGID: 109022] [dht-rebalance.c:2199:dht_migrate_file] 0-butcher-dht: completed migration of /d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h from subvolume butcher-replicate-10 to butcher-replicate-9

The file is present on rep-9:

[root@gqas005 ~]# ssh root@gqas005 ls -l /bricks*/brick/d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h
ls: cannot access /bricks*/brick/d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h: No such file or directory
[root@gqas005 ~]# ssh root@gqas006 ls -l /bricks*/brick/d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h
-rw-rw-r-- 2 root root 478 Aug 6 21:44 /bricks5/brick/d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h
[root@gqas005 ~]# ssh root@gqas007 ls -l /bricks*/brick/d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h
ls: cannot access /bricks*/brick/d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h: No such file or directory
[root@gqas005 ~]# ssh root@gqas008 ls -l /bricks*/brick/d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h
-rw-rw-r-- 2 root root 478 Aug 6 21:44 /bricks5/brick/d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h
[root@gqas005 ~]# ssh root@gqas013 ls -l /bricks*/brick/d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h
ls: cannot access /bricks*/brick/d1/dir18/linux-4.13-rc4/arch/cris/include/arch-v32/arch/swab.h: No such file or directory

This is more of a theory; I cannot confirm it, as I don't think we have sufficient logs to indicate it. The file got migrated from replicate-1 to replicate-0. The first attempt was at 12:21:09.556685, from rep-1 to rep-0, on node 8 (gqas008):

[2017-08-10 12:21:09.556685] I [dht-rebalance.c:1570:dht_migrate_file] 0-butcher-dht: /d1/dir16/linux-4.13-rc4/drivers/staging/rtl8188eu/hal/odm.c: attempting to move from butcher-replicate-1 to butcher-replicate-0

The second log about the file is from node 5, where dht_lookup_everywhere_cbk found the same file on two different subvolumes, rep-1 and rep-0. Now (a fair assumption) node 5 could have fetched the file from rep-0 (its local subvol) in a state where the file had the 1000 permission but not the linkto xattr yet, making it think that the file is present on two different subvolumes:

[2017-08-10 12:21:09.628679] W [MSGID: 109007] [dht-common.c:1950:dht_lookup_everywhere_cbk] 0-butcher-dht: multiple subvolumes (butcher-replicate-1 and butcher-replicate-0) have file /d1/dir16/linux-4.13-rc4/drivers/staging/rtl8188eu/hal/odm.c (preferably rename the file in the backend,and do a fresh lookup)

Will update if similar behavior is seen for the other failures as well.

There was an observation above (comment 7) that the same file got hashed to two different subvols from two different nodes. The point to note is that the layout should have been the same at that point, since the migrations were triggered at almost the same time. We are not logging the layout information anymore, hence I need to reproduce the issue on my dev setup before drawing any conclusion. Will update if I get any evidence.
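Since the layout is no longer logged, on a reproducer the layout each node acted on can only be compared by reading the DHT layout xattr straight off the brick directories while the migrations run. A hypothetical check (trusted.glusterfs.dht is the standard directory layout xattr; the hex value shown is illustrative):

# getfattr -n trusted.glusterfs.dht -e hex /bricks5/brick/d1/dir18
# file: bricks5/brick/d1/dir18
trusted.glusterfs.dht=0x0000000100000000aaaaaaaabfffffff

The last two 32-bit words of the value are the start and end of the hash range assigned to that brick's subvolume; if the per-brick ranges agree across bricks, both nodes should hash the file to the same subvolume, and differing ranges would explain the two different destinations seen above.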
- Susant

Tried to reproduce the layout issue we suspected in comment 14. Created a distributed-replicate volume:

[root@vm1 glusterfs]# gvi
Volume Name: test1
Type: Distributed-Replicate
Volume ID: c392466e-aa80-49d6-99ee-dfa8ef372cec
Status: Started
Snapshot Count: 0
Number of Bricks: 5 x 2 = 10
Transport-type: tcp
Bricks:
Brick1: vm1:/brick/1
Brick2: vm2:/brick/1
Brick3: vm1:/brick/2
Brick4: vm2:/brick/2
Brick5: vm1:/brick/3
Brick6: vm2:/brick/3
Brick7: vm1:/brick/4
Brick8: vm2:/brick/4

Created two clients, one on vm1 and another on vm2, with one dir on the root, and issued fix-layout from each (sketched below):

command: setfattr -n "distribute.fix.layout" -v "1" /mnt/dir

The layouts remained unchanged. Added a replicate pair to trigger a layout change; even then the layout was not changed after the first fix-layout.

Note: even if the layout had not changed in the current bug, having the file picked up for migration from two subvolumes leads to a false failure, which is the current bug. I will post what should be done from the DHT point of view, regardless of whether it is intrusive or not; after that we can take a decision.

- Susant
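The reproduction attempt boils down to a handful of commands. A sketch, assuming /mnt/dir is the same mount path on both clients as in the command above (the per-brick verification loop is my addition, reading the standard DHT layout xattr):

[root@vm1 ~]# mkdir /mnt/dir
[root@vm1 ~]# setfattr -n "distribute.fix.layout" -v "1" /mnt/dir    # fix-layout from the vm1 client
[root@vm2 ~]# setfattr -n "distribute.fix.layout" -v "1" /mnt/dir    # repeat from the vm2 client
[root@vm1 ~]# for b in /brick/1 /brick/2 /brick/3 /brick/4; do getfattr -n trusted.glusterfs.dht -e hex $b/dir; done

Identical trusted.glusterfs.dht values for each brick after both fix-layout calls would match the "layouts remained unchanged" observation above.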