Bug 1460629

Summary: [Stress] : Multi-threaded self-heal stops progressing after some time during replace-brick on an arbitrated volume with heavy I/O from FUSE mounts.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Ambarish <asoman>
Component: arbiter
Assignee: Ravishankar N <ravishankar>
Status: CLOSED WONTFIX
QA Contact: Karan Sandha <ksandha>
Severity: high
Docs Contact:
Priority: low
Version: rhgs-3.3
CC: amukherj, bturner, pmulay, ravishankar, rhinduja, rhs-bugs, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
If there is a pending entry self-heal that involves purging files from a sink brick, and an ‘rm -rf’ is executed on the parent directory before the self-heal runs, the directory and the files awaiting heal may not be removed from the sink brick. This is because the readdir for the ‘rm -rf’ is served from the source brick, so the files awaiting entry heal are never deleted from the sink brick. In addition, any pending data or metadata heal on such a file is displayed in heal-info output until the issue is fixed. Workaround: If the files and parent directory are not present on the other bricks, delete them from the sink brick. This ensures that they are no longer listed in the next heal-info output.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-11-19 06:05:17 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1417153    

Description Ambarish 2017-06-12 09:17:17 UTC
Description of problem:
-----------------------

10*(2+1) volume, mounted via FUSE on multiple clients.

Each client ran a different workload generator - Bonnie, dbench, kernel untar, iozone, etc.

Replaced ten data bricks on one of the nodes to trigger a multi-threaded self-heal (MTSH).
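
For reference, each brick was replaced with a command of the following form; this is a minimal sketch, with an illustrative brick path and "_new" suffix rather than the exact ones used in this run:

# Replace a data brick with a fresh one; the self-heal daemon (multi-threaded here,
# since cluster.shd-max-threads > 1) then rebuilds the new brick from its replica.
gluster volume replace-brick butcher \
    gqas013.sbu.lab.eng.bos.redhat.com:/bricks1/brick1 \
    gqas013.sbu.lab.eng.bos.redhat.com:/bricks1/brick1_new \
    commit force

# Track progress:
gluster volume heal butcher info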

Heal seems to be stuck somewhere. heal info has been showing 2 entries pending heal since yesterday:


<snip>

[root@gqas013 glusterfs]# gluster v heal butcher info
Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks1/brick1
Status: Connected
Number of entries: 0

Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Status: Connected
Number of entries: 0

Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Status: Connected
Number of entries: 0

Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks2/brick1
Status: Connected
Number of entries: 0

Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Status: Connected
Number of entries: 0

Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Status: Connected
Number of entries: 0

Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks3/brick1
Status: Connected
Number of entries: 0

Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Status: Connected
Number of entries: 0

Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Status: Connected
Number of entries: 0

Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks4/brick1
Status: Connected
Number of entries: 0

Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Status: Connected
Number of entries: 0

Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Status: Connected
Number of entries: 0

Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks5/brick1
Status: Connected
Number of entries: 0

Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Status: Connected
Number of entries: 0

Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Status: Connected
Number of entries: 0

Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks6/brick1
Status: Connected
Number of entries: 0

Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Status: Connected
Number of entries: 0

Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Status: Connected
Number of entries: 0

Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks7/brick1
Status: Connected
Number of entries: 0

Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Status: Connected
Number of entries: 0

Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Status: Connected
Number of entries: 0

Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks8/brick1
Status: Connected
Number of entries: 0

Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks8/A1
<gfid:7486c2e2-2021-44ce-a3a2-a4dffa5e9ac1> 
<gfid:8384ce29-1ea6-4564-8d70-4efd4f3a4dfb> 
Status: Connected
Number of entries: 2

Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Status: Connected
Number of entries: 0

Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks9/brick1
Status: Connected
Number of entries: 0

Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Status: Connected
Number of entries: 0

Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Status: Connected
Number of entries: 0

Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks10/brick1
Status: Connected
Number of entries: 0

Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Status: Connected
Number of entries: 0

Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Status: Connected
Number of entries: 0

[root@gqas013 glusterfs]# 

</snip>





Version-Release number of selected component (if applicable):
-------------------------------------------------------------

3.8.4-27

How reproducible:
-----------------

1/1


Additional info:
----------------

[root@gqas013 glusterfs]# gluster v info
 
Volume Name: butcher
Type: Distributed-Replicate
Volume ID: 602030c8-6926-4024-a188-2bf83fadcc1e
Status: Started
Snapshot Count: 0
Number of Bricks: 10 x (2 + 1) = 30
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks1/brick1
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks1/A1 (arbiter)
Brick4: gqas013.sbu.lab.eng.bos.redhat.com:/bricks2/brick1
Brick5: gqas005.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick6: gqas006.sbu.lab.eng.bos.redhat.com:/bricks2/A1 (arbiter)
Brick7: gqas013.sbu.lab.eng.bos.redhat.com:/bricks3/brick1
Brick8: gqas005.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick9: gqas006.sbu.lab.eng.bos.redhat.com:/bricks3/A1 (arbiter)
Brick10: gqas013.sbu.lab.eng.bos.redhat.com:/bricks4/brick1
Brick11: gqas005.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick12: gqas006.sbu.lab.eng.bos.redhat.com:/bricks4/A1 (arbiter)
Brick13: gqas013.sbu.lab.eng.bos.redhat.com:/bricks5/brick1
Brick14: gqas005.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick15: gqas006.sbu.lab.eng.bos.redhat.com:/bricks5/A1 (arbiter)
Brick16: gqas013.sbu.lab.eng.bos.redhat.com:/bricks6/brick1
Brick17: gqas005.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick18: gqas006.sbu.lab.eng.bos.redhat.com:/bricks6/A1 (arbiter)
Brick19: gqas013.sbu.lab.eng.bos.redhat.com:/bricks7/brick1
Brick20: gqas005.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick21: gqas006.sbu.lab.eng.bos.redhat.com:/bricks7/A1 (arbiter)
Brick22: gqas013.sbu.lab.eng.bos.redhat.com:/bricks8/brick1
Brick23: gqas005.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick24: gqas006.sbu.lab.eng.bos.redhat.com:/bricks8/A1 (arbiter)
Brick25: gqas013.sbu.lab.eng.bos.redhat.com:/bricks9/brick1
Brick26: gqas005.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick27: gqas006.sbu.lab.eng.bos.redhat.com:/bricks9/A1 (arbiter)
Brick28: gqas013.sbu.lab.eng.bos.redhat.com:/bricks10/brick1
Brick29: gqas005.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Brick30: gqas006.sbu.lab.eng.bos.redhat.com:/bricks10/A1 (arbiter)
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 50000
server.event-threads: 4
client.event-threads: 4
cluster.shd-max-threads: 64
cluster.shd-wait-qlength: 655536
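
For reference, the multi-threaded self-heal tuning shown above corresponds to volume set commands of this form (values taken from the output above):

# Allow up to 64 parallel heal threads per self-heal daemon and a deeper wait queue:
gluster volume set butcher cluster.shd-max-threads 64
gluster volume set butcher cluster.shd-wait-qlength 655536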

Comment 2 Ambarish 2017-06-12 09:21:16 UTC
[root@gqas005 86]# getfattr -d -m . -e hex 7486c2e2-2021-44ce-a3a2-a4dffa5e9ac1
# file: 7486c2e2-2021-44ce-a3a2-a4dffa5e9ac1
trusted.afr.dirty=0x000000010000000000000000
trusted.gfid=0x7486c2e2202144cea3a2a4dffa5e9ac1

[root@gqas005 86]# 



[root@gqas005 84]# getfattr -d -m . -e hex 8384ce29-1ea6-4564-8d70-4efd4f3a4dfb
# file: 8384ce29-1ea6-4564-8d70-4efd4f3a4dfb
trusted.afr.dirty=0x000000010000000000000000
trusted.gfid=0x8384ce291ea645648d704efd4f3a4dfb

[root@gqas005 84]#

Comment 4 Ambarish 2017-06-12 09:42:00 UTC
(In reply to Ambarish from comment #2)
> [root@gqas005 86]# getfattr -d -m . -e hex
> 7486c2e2-2021-44ce-a3a2-a4dffa5e9ac1
> # file: 7486c2e2-2021-44ce-a3a2-a4dffa5e9ac1
> trusted.afr.dirty=0x000000010000000000000000
> trusted.gfid=0x7486c2e2202144cea3a2a4dffa5e9ac1
> 
> [root@gqas005 86]# 
> 
> 
> 
> [root@gqas005 84]# getfattr -d -m . -e hex
> 8384ce29-1ea6-4564-8d70-4efd4f3a4dfb
> # file: 8384ce29-1ea6-4564-8d70-4efd4f3a4dfb
> trusted.afr.dirty=0x000000010000000000000000
> trusted.gfid=0x8384ce291ea645648d704efd4f3a4dfb
> 
> [root@gqas005 84]#

Data heal appears to be pending.
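
For context, a rough way to read the values above, assuming the usual AFR changelog layout of three big-endian 32-bit counters (data, metadata, entry):

# trusted.afr.dirty = 0x 00000001 00000000 00000000
#                          data=1  metadata=0  entry=0   -> a data operation is still pending
getfattr -n trusted.afr.dirty -e hex 7486c2e2-2021-44ce-a3a2-a4dffa5e9ac1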

Comment 5 Ravishankar N 2017-06-14 09:58:09 UTC
Examined the setup and noted the following:
1. The 2nd brick in the replica pair consisting of brick-8 is the one which has the pending heals for 2 gfids (as seen from the heal info output).
I have taken one file (gfid) as an example for the steps below.

2. The file *and* its parent dir are present only on the 2nd brick:

[root@gqas005 A1]# find . -samefile .glusterfs/74/86/7486c2e2-2021-44ce-a3a2-a4dffa5e9ac1
./.glusterfs/74/86/7486c2e2-2021-44ce-a3a2-a4dffa5e9ac1
./run3223/clients/client6/~dmtmp/PWRPNT/NEWTIPS.PPT

They are not present on the 1st and arbiter bricks. Their getfattrs on the 2nd brick are:

[root@gqas005 A1]# g ./run3223/clients/client6/~dmtmp/PWRPNT/NEWTIPS.PPT
# file: run3223/clients/client6/~dmtmp/PWRPNT/NEWTIPS.PPT
trusted.afr.dirty=0x000000010000000000000000
trusted.gfid=0x7486c2e2202144cea3a2a4dffa5e9ac1

[root@gqas005 A1]# g ./run3223/clients/client6/~dmtmp/PWRPNT
# file: run3223/clients/client6/~dmtmp/PWRPNT
trusted.gfid=0x0c2c29a6eea64e07b7e1989580d7681f
trusted.glusterfs.dht=0x00000001000000004ccccccb66666663

3) It is being shown in heal-info output because the gfid is present in .glusterfs/indices/dirty (as opposed to .glusterfs/indices/xattrop).
This, along with the dirty bit being set above, means the pre-op happened and then the client lost its connection to the brick during a write transaction.
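
A quick way to check this on the brick itself (a sketch; the brick root shown is the one from this report):

# Index entries created via the 'dirty' xattr (pre-op recorded, post-op never ran),
# as opposed to the regular xattrop index used for normal pending heals:
ls /bricks8/A1/.glusterfs/indices/dirty/
ls /bricks8/A1/.glusterfs/indices/xattrop/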

I see a lot of disconnect messages to this brick (i.e. butcher-client-22) in the 4 FUSE mount logs (and also to many other bricks), and corresponding messages in the brick log:

bricks/bricks8-A1.log:994:[2017-06-11 11:45:05.920177] W [inodelk.c:399:pl_inodelk_log_cleanup] 0-butcher-server: releasing lock on 7486c2e2-2021-44ce-a3a2-a4dffa5e9ac1 held by {client=0x7f8b64ff7c90, pid=3257 lk-owner=a05e46b8827f0000}
bricks/bricks8-A1.log:1008:[2017-06-11 11:45:05.920602] I [MSGID: 115013] [server-helpers.c:296:do_fd_cleanup] 0-butcher-server: fd cleanup on /run3223/clients/client6/~dmtmp/PWRPNT/NEWTIPS.PPT

4) The file (NEWTIPS.PPT) is then deleted by the FUSE client, and this too succeeds only on bricks 1 and 3.

5) Now an rm -rf comes on the parent directory; the readdir for it is served from b1 or b3, where NEWTIPS.PPT is not present.
Consequently, when the rmdir comes on PWRPNT, it succeeds on bricks 1 and 3 but fails on brick 2:
bricks/bricks8-A1.log:1080:[2017-06-11 11:45:16.723007] I [MSGID: 115056] [server-rpc-fops.c:460:server_rmdir_cbk] 0-butcher-server: 474130: RMDIR /run3223/clients/client6/~dmtmp/PWRPNT (41a0f073-e019-4f8e-83cb-29de67a5d1bc/PWRPNT) ==> (Directory not empty) [Directory not empty]

In both 4) and 5), the application gets success because quorum is met. Ambarish confirmed that the file/dir in question is a temp file that is deleted by fssanity, and hence the unlink/rm -rf are expected operations.

Thus self-heal is not able to heal the files because neither the file nor its parent is present on the other two bricks (the 1st data brick and the arbiter).

This looks like a day-one issue to me, i.e. if there are pending entry heals involving files to be deleted from the sink brick and an rmdir is performed on the parent dir, the dir and the file won't get removed from the sink.
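
Following the workaround captured in the Doc Text, the stale entries can be cleaned up by hand. A minimal sketch, using the path and gfid from this report, and assuming the file and directory have first been confirmed absent on the other two bricks:

# On the sink brick (gqas005:/bricks8/A1) only:
cd /bricks8/A1
rm -f ./run3223/clients/client6/~dmtmp/PWRPNT/NEWTIPS.PPT
rm -f .glusterfs/74/86/7486c2e2-2021-44ce-a3a2-a4dffa5e9ac1   # gfid hard link of the file
rmdir ./run3223/clients/client6/~dmtmp/PWRPNT                 # now-empty parent dir
# (the directory's own gfid symlink under .glusterfs/0c/2c/ may also need removing)

# The entries should then no longer be listed by the next 'gluster v heal butcher info'.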

Comment 8 Pratik Mulay 2017-08-14 10:49:14 UTC
Hi Team,

I've edited the Doc Text so that it can be included in the Release Notes.

Please review it and let me know if there are any concerns.

If no changes are required, please provide your approval.

Comment 9 Ravishankar N 2017-08-16 04:01:26 UTC
Looks good to me, Pratik.