Bug 1662206

Summary: Directory pending heal in heal info output
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Anees Patel <anepatel>
Component: arbiter
Assignee: Mohammed Rafi KC <rkavunga>
Status: CLOSED DEFERRED
QA Contact: Tamil <tmuthami>
Severity: medium
Docs Contact:
Priority: unspecified
Version: rhgs-3.4
CC: amukherj, bmekala, nchilaka, pasik, ravishankar, rhs-bugs, rkavunga, sheggodu, storage-qa-internal
Target Milestone: ---
Keywords: Reopened, ZStream
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard: shd-multiplexing
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1727256 (view as bug list)
Environment:
Last Closed: 2020-01-20 07:59:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1727256    

Description Anees Patel 2018-12-27 04:00:26 UTC
Description of problem:

While verifying bz#1645480, hit an issue where a directory remains pending heal in the heal info output.

Version-Release number of selected component (if applicable):
# rpm -qa | grep gluster
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-rdma-3.12.2-34.el7rhgs.x86_64
glusterfs-server-3.12.2-34.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-34.el7rhgs.x86_64
glusterfs-fuse-3.12.2-34.el7rhgs.x86_64
glusterfs-events-3.12.2-34.el7rhgs.x86_64


How reproducible:

1/1

Steps to Reproduce:
1. Create a 2 x (2 + 1) arbiter volume using brick{1..6}.
2. Start the volume and fuse mount it.
3. Create a few files (about 200) inside a few directories, along with hardlinks and softlinks.
4. Bring brick 2 and brick 5 down.
5. Run continuous metadata operations such as rename, chgrp, and chown.
6. Occasionally bring brick 1 and brick 4 down, and bring the bricks taken down in step 4 back up.
7. A few "transport endpoint is not connected" errors are seen, since at times the only good brick is down; this is expected.
8. Bring brick 2 and brick 5 down again, and brick 1 and brick 4 back up.
9. All of the above brick up/down steps are performed while the I/O from step 5 is ongoing (a rough shell sketch of these steps follows the list).
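
A minimal shell sketch of this flow, assuming hypothetical hostnames (host1..host6), a mount point of /mnt/test, and an illustrative metadata loop rather than the exact test harness:

# create and start a 2 x (2 + 1) arbiter volume; hosts and brick paths are placeholders
gluster volume create test replica 3 arbiter 1 \
    host1:/bricks/brick1/testing host2:/bricks/brick1/testing host3:/bricks/brick1/testing \
    host4:/bricks/brick1/testing host5:/bricks/brick1/testing host6:/bricks/brick1/testing
gluster volume start test
mount -t glusterfs host1:/test /mnt/test

# populate a directory with ~200 files plus hardlinks and softlinks
mkdir -p /mnt/test/level00
for i in $(seq 1 200); do
    echo data > /mnt/test/level00/file.$i
    ln /mnt/test/level00/file.$i /mnt/test/level00/hlink.$i
    ln -s file.$i /mnt/test/level00/slink.$i
done

# continuous metadata operations (rename, chgrp, chown) in the background
while true; do
    for i in $(seq 1 200); do
        mv /mnt/test/level00/file.$i /mnt/test/level00/file.$i.tmp 2>/dev/null
        mv /mnt/test/level00/file.$i.tmp /mnt/test/level00/file.$i 2>/dev/null
        chgrp root /mnt/test/level00/file.$i 2>/dev/null
        chown root /mnt/test/level00/file.$i 2>/dev/null
    done
done &

# bricks are taken down by killing the corresponding brick process and brought
# back with "gluster volume start test force"; alternate which data brick of
# each replica pair is down while the loop above keeps running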

Actual results:
Initial heal info output:
# gluster v heal test info
Brick 10.70.46.55:/bricks/brick1/testing
Status: Connected
Number of entries: 0

Brick 10.70.47.184:/bricks/brick1/testing
Status: Connected
Number of entries: 0

Brick 10.70.46.193:/bricks/brick1/testing
Status: Connected
Number of entries: 0

Brick 10.70.47.67:/bricks/brick1/testing
<gfid:725ab1d7-bed9-4a25-b1a0-95e2c15605b7> 
Status: Connected
Number of entries: 1

Brick 10.70.46.169:/bricks/brick1/testing
<gfid:725ab1d7-bed9-4a25-b1a0-95e2c15605b7> 
Status: Connected
Number of entries: 1

Brick 10.70.47.122:/bricks/brick1/testing
Status: Connected
Number of entries: 0
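
For reference, the gfid shown above can be mapped to a path on a brick via the .glusterfs backend directory. A rough sketch, assuming the standard brick layout (for a directory the gfid entry is a symlink whose target encodes the parent gfid and the basename):

# run on a brick that lists the gfid, e.g. 10.70.47.67
readlink /bricks/brick1/testing/.glusterfs/72/5a/725ab1d7-bed9-4a25-b1a0-95e2c15605b7
# a target such as ../../00/00/00000000-0000-0000-0000-000000000001/level00
# would indicate the entry is /level00 directly under the volume root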

After a few minutes, the heal info output started listing the directory /level00 as pending heal:

~]# gluster v heal test info
Brick 10.70.46.55:/bricks/brick1/testing
Status: Connected
Number of entries: 0

Brick 10.70.47.184:/bricks/brick1/testing
Status: Connected
Number of entries: 0

Brick 10.70.46.193:/bricks/brick1/testing
Status: Connected
Number of entries: 0

Brick 10.70.47.67:/bricks/brick1/testing
/level00 
Status: Connected
Number of entries: 1

Brick 10.70.46.169:/bricks/brick1/testing
/level00 
Status: Connected
Number of entries: 1

Brick 10.70.47.122:/bricks/brick1/testing
Status: Connected
Number of entries: 0


Expected results:
No files or directories should be pending heal.
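
As a generic diagnostic/workaround sketch (not something stated as attempted in this report), heal can be launched manually and the output re-checked:

# index heal: crawl only the entries recorded as pending
gluster volume heal test
# full heal: crawl the entire volume
gluster volume heal test full
# re-check pending entries afterwards
gluster volume heal test info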

Additional info:
Volume Info
~]# gluster v info test
 
Volume Name: test
Type: Distributed-Replicate
Volume ID: 11ee3f35-f99d-49ce-95c7-bbee829bc6f1
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.46.55:/bricks/brick1/testing
Brick2: 10.70.47.184:/bricks/brick1/testing
Brick3: 10.70.46.193:/bricks/brick1/testing (arbiter)
Brick4: 10.70.47.67:/bricks/brick1/testing
Brick5: 10.70.46.169:/bricks/brick1/testing
Brick6: 10.70.47.122:/bricks/brick1/testing (arbiter)
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off


Also attaching the getfattr output for the directory level00; on the second sub-volume, client-3 blames client-4 and vice versa (a decode of the non-zero changelog values follows the dumps):

-55 ~]# getfattr -d -m . -e hex /bricks/brick1/testing/level00
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testing/level00
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.test-client-1=0x000000000000000000000000
trusted.gfid=0x725ab1d7bed94a25b1a095e2c15605b7
trusted.glusterfs.dht=0x000000000000000000000000aaac18d5
trusted.glusterfs.dht.mds=0x00000000


184 ~]# getfattr -d -m . -e hex /bricks/brick1/testing/level00
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testing/level00
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.test-client-0=0x000000000000000000000000
trusted.gfid=0x725ab1d7bed94a25b1a095e2c15605b7
trusted.glusterfs.dht=0x000000000000000000000000aaac18d5
trusted.glusterfs.dht.mds=0x00000000


193 ~]# getfattr -d -m . -e hex /bricks/brick1/testing/level00
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testing/level00
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.test-client-0=0x000000000000000000000000
trusted.afr.test-client-1=0x000000000000000000000000
trusted.gfid=0x725ab1d7bed94a25b1a095e2c15605b7
trusted.glusterfs.dht=0x000000000000000000000000aaac18d5
trusted.glusterfs.dht.mds=0x00000000


67 ~]# getfattr -d -m . -e hex /bricks/brick1/testing/level00
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testing/level00
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.test-client-4=0x000000000000000000000192
trusted.gfid=0x725ab1d7bed94a25b1a095e2c15605b7
trusted.glusterfs.dht=0x0000000000000000aaac18d6ffffffff


169 ~]# getfattr -d -m . -e hex /bricks/brick1/testing/level00
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testing/level00
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.test-client-3=0x0000000000000000000001a2
trusted.gfid=0x725ab1d7bed94a25b1a095e2c15605b7
trusted.glusterfs.dht=0x0000000000000000aaac18d6ffffffff


-122 ~]# getfattr -d -m . -e hex /bricks/brick1/testing/level00
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testing/level00
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.test-client-3=0x000000000000000000000000
trusted.afr.test-client-4=0x000000000000000000000000
trusted.gfid=0x725ab1d7bed94a25b1a095e2c15605b7
trusted.glusterfs.dht=0x0000000000000000aaac18d6ffffffff
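
If the AFR changelog xattr layout is read as three big-endian 32-bit counters (data, metadata, entry pending operations), the non-zero values above decode as entry-pending counts; a small sketch of the arithmetic:

# trusted.afr.test-client-4 on 10.70.47.67 is 0x00000000 00000000 00000192
printf 'entry pending blamed on client-4: %d\n' 0x00000192    # 402
# trusted.afr.test-client-3 on 10.70.46.169 is 0x00000000 00000000 000001a2
printf 'entry pending blamed on client-3: %d\n' 0x000001a2    # 418
# i.e. the two data bricks of the second sub-volume accuse each other of pending
# entry operations on level00, while the arbiter (10.70.47.122) records zeros,
# matching the directory showing up as pending heal on both data bricks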

Comment 13 Mohammed Rafi KC 2019-07-05 12:08:48 UTC
*** Bug 1727257 has been marked as a duplicate of this bug. ***

Comment 14 Mohammed Rafi KC 2019-07-05 15:11:20 UTC
upstream patch: https://review.gluster.org/#/c/glusterfs/+/23005/1

Comment 18 Yaniv Kaul 2019-12-17 07:27:39 UTC
(In reply to Mohammed Rafi KC from comment #14)
> upstream patch: https://review.gluster.org/#/c/glusterfs/+/23005/1

This was merged upstream in August, how come it's not part of 3.5.1?

Comment 19 Mohammed Rafi KC 2019-12-24 07:41:14 UTC
(In reply to Yaniv Kaul from comment #18)
> (In reply to Mohammed Rafi KC from comment #14)
> > upstream patch: https://review.gluster.org/#/c/glusterfs/+/23005/1
> 
> This was merged upstream in August, how come it's not part of 3.5.1?

It was a patch that came with the shd multiplexing feature; this bug does not exist without that feature. Since shd multiplexing was reverted from 3.5.0, the patch is not part of the 3.5 branches yet.