Description of problem:
While verifying bz#1645480, hit an issue where a directory is stuck in pending heal.
Version-Release number of selected component (if applicable):
# rpm -qa | grep gluster
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-rdma-3.12.2-34.el7rhgs.x86_64
glusterfs-server-3.12.2-34.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-34.el7rhgs.x86_64
glusterfs-fuse-3.12.2-34.el7rhgs.x86_64
glusterfs-events-3.12.2-34.el7rhgs.x86_64
How reproducible:
1/1
Steps to Reproduce:
1. Create a 2 x (2 + 1) arbiter volume using brick{1..6}.
2. Start the volume and fuse mount it.
3. Create a few files (about 200) inside a few directories, along with hardlinks and softlinks.
4. Bring brick 2 and brick 5 down.
5. Run continuous metadata operations (rename, chgrp, chown) on the files.
6. Occasionally bring brick 1 and brick 4 down, and bring the bricks taken down in step 4 back up.
7. A few "transport endpoint is not connected" errors are seen, since at times the only good brick is down; this is expected.
8. Again bring brick 2 and brick 5 down, and bring brick 1 and brick 4 up.
9. All the brick up/down steps above are performed while the I/O from step 5 is ongoing.
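The steps above can be sketched roughly as follows. This is a sketch, not the exact commands used in this run: the hostnames (host1..host6), the mount point /mnt/test, and the brick-kill mechanics in the comments are assumptions.

```shell
#!/bin/bash
# Sketch of the reproducer. The volume setup (steps 1-2) would look like
# the following; host1..host6 and /mnt/test are placeholders:
#
#   gluster volume create test replica 3 arbiter 1 \
#       host{1..6}:/bricks/brick1/testing
#   gluster volume start test
#   mount -t glusterfs host1:/test /mnt/test

# Step 3: populate a directory with files, hardlinks and softlinks.
populate() {
    local dir=$1 count=$2
    mkdir -p "$dir"
    for i in $(seq 1 "$count"); do
        touch "$dir/file$i"
        ln "$dir/file$i" "$dir/hard$i"   # hardlink
        ln -s "file$i" "$dir/soft$i"     # softlink
    done
}

# Step 5: one pass of the metadata operations (rename, chgrp, chown).
# In the actual test this runs in a loop while bricks are cycled, e.g.:
#   kill -9 <brick-pid>               # brick pid from 'gluster volume status test'
#   gluster volume start test force   # restart the downed bricks
metadata_ops() {
    local dir=$1 f
    for f in "$dir"/file*; do
        mv "$f" "$f.tmp" && mv "$f.tmp" "$f"   # rename there and back
        chgrp "$(id -gn)" "$f"                 # chgrp to caller's group
        chown "$(id -un)" "$f"                 # chown to caller's user
    done
}
```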
Actual results:
Initial heal info output:
# gluster v heal test info
Brick 10.70.46.55:/bricks/brick1/testing
Status: Connected
Number of entries: 0
Brick 10.70.47.184:/bricks/brick1/testing
Status: Connected
Number of entries: 0
Brick 10.70.46.193:/bricks/brick1/testing
Status: Connected
Number of entries: 0
Brick 10.70.47.67:/bricks/brick1/testing
<gfid:725ab1d7-bed9-4a25-b1a0-95e2c15605b7>
Status: Connected
Number of entries: 1
Brick 10.70.46.169:/bricks/brick1/testing
<gfid:725ab1d7-bed9-4a25-b1a0-95e2c15605b7>
Status: Connected
Number of entries: 1
Brick 10.70.47.122:/bricks/brick1/testing
Status: Connected
Number of entries: 0
After a few minutes, the heal info output started showing the directory /level00 as pending heal:
~]# gluster v heal test info
Brick 10.70.46.55:/bricks/brick1/testing
Status: Connected
Number of entries: 0
Brick 10.70.47.184:/bricks/brick1/testing
Status: Connected
Number of entries: 0
Brick 10.70.46.193:/bricks/brick1/testing
Status: Connected
Number of entries: 0
Brick 10.70.47.67:/bricks/brick1/testing
/level00
Status: Connected
Number of entries: 1
Brick 10.70.46.169:/bricks/brick1/testing
/level00
Status: Connected
Number of entries: 1
Brick 10.70.47.122:/bricks/brick1/testing
Status: Connected
Number of entries: 0
Expected results:
No files or directories should remain in pending heal.
Additional info:
Volume Info
~]# gluster v info test
Volume Name: test
Type: Distributed-Replicate
Volume ID: 11ee3f35-f99d-49ce-95c7-bbee829bc6f1
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.46.55:/bricks/brick1/testing
Brick2: 10.70.47.184:/bricks/brick1/testing
Brick3: 10.70.46.193:/bricks/brick1/testing (arbiter)
Brick4: 10.70.47.67:/bricks/brick1/testing
Brick5: 10.70.46.169:/bricks/brick1/testing
Brick6: 10.70.47.122:/bricks/brick1/testing (arbiter)
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
Also, the getfattr output for the directory level00 on all bricks; on the second subvolume, test-client-3 blames test-client-4 and vice versa:
-55 ~]# getfattr -d -m . -e hex /bricks/brick1/testing/level00
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testing/level00
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.test-client-1=0x000000000000000000000000
trusted.gfid=0x725ab1d7bed94a25b1a095e2c15605b7
trusted.glusterfs.dht=0x000000000000000000000000aaac18d5
trusted.glusterfs.dht.mds=0x00000000
184 ~]# getfattr -d -m . -e hex /bricks/brick1/testing/level00
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testing/level00
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.test-client-0=0x000000000000000000000000
trusted.gfid=0x725ab1d7bed94a25b1a095e2c15605b7
trusted.glusterfs.dht=0x000000000000000000000000aaac18d5
trusted.glusterfs.dht.mds=0x00000000
193 ~]# getfattr -d -m . -e hex /bricks/brick1/testing/level00
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testing/level00
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.test-client-0=0x000000000000000000000000
trusted.afr.test-client-1=0x000000000000000000000000
trusted.gfid=0x725ab1d7bed94a25b1a095e2c15605b7
trusted.glusterfs.dht=0x000000000000000000000000aaac18d5
trusted.glusterfs.dht.mds=0x00000000
67 ~]# getfattr -d -m . -e hex /bricks/brick1/testing/level00
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testing/level00
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.test-client-4=0x000000000000000000000192
trusted.gfid=0x725ab1d7bed94a25b1a095e2c15605b7
trusted.glusterfs.dht=0x0000000000000000aaac18d6ffffffff
169 ~]# getfattr -d -m . -e hex /bricks/brick1/testing/level00
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testing/level00
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.test-client-3=0x0000000000000000000001a2
trusted.gfid=0x725ab1d7bed94a25b1a095e2c15605b7
trusted.glusterfs.dht=0x0000000000000000aaac18d6ffffffff
-122 ~]# getfattr -d -m . -e hex /bricks/brick1/testing/level00
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testing/level00
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.test-client-3=0x000000000000000000000000
trusted.afr.test-client-4=0x000000000000000000000000
trusted.gfid=0x725ab1d7bed94a25b1a095e2c15605b7
trusted.glusterfs.dht=0x0000000000000000aaac18d6ffffffff
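Reading the trusted.afr.* values above with AFR's changelog layout (three big-endian 32-bit counters: data pending, metadata pending, entry pending), both bricks of the second subvolume hold nonzero entry-pending counts against each other. A minimal decoding sketch; the layout interpretation is the assumption here:

```shell
#!/bin/bash
# Decode a trusted.afr.* changelog value into its three counters:
# bytes 0-3 data pending, 4-7 metadata pending, 8-11 entry pending.
decode_afr() {
    local hexval=${1#0x}
    echo "data=$((16#${hexval:0:8})) metadata=$((16#${hexval:8:8})) entry=$((16#${hexval:16:8}))"
}

decode_afr 0x000000000000000000000192   # brick 4's blame on test-client-4
# -> data=0 metadata=0 entry=402
decode_afr 0x0000000000000000000001a2   # brick 5's blame on test-client-3
# -> data=0 metadata=0 entry=418
```

So each of the two data bricks accuses the other of hundreds of pending entry operations on level00, while the arbiter shows all-zero changelogs.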
Comment 13, Mohammed Rafi KC, 2019-07-05 12:08:48 UTC
*** Bug 1727257 has been marked as a duplicate of this bug. ***
Comment 14, Mohammed Rafi KC, 2019-07-05 15:11:20 UTC
Comment 19, Mohammed Rafi KC, 2019-12-24 07:41:14 UTC
(In reply to Yaniv Kaul from comment #18)
> (In reply to Mohammed Rafi KC from comment #14)
> > upstream patch: https://review.gluster.org/#/c/glusterfs/+/23005/1
>
> This was merged upstream in August, how come it's not part of 3.5.1?
The patch is part of the shd multiplex feature, and this bug does not exist without that feature. Since the shd multiplex feature was reverted from 3.5.0, the patch is not part of the 3.5 branches yet.