Bug 1658870

Summary: Files pending heal in Distribute-Replicate (Arbiter) volume.
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: arbiter
Version: rhgs-3.4
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Keywords: ZStream
Reporter: Anees Patel <anepatel>
Assignee: Ravishankar N <ravishankar>
QA Contact: Anees Patel <anepatel>
CC: anepatel, nchilaka, rhs-bugs, sankarshan, storage-qa-internal
Type: Bug
Last Closed: 2018-12-13 09:57:17 UTC

Description Anees Patel 2018-12-13 04:30:19 UTC
Description of problem:

The automated test case fails with files pending heal.
Test-case name: test_entry_self_heal_heal_command
Protocol used: NFS
Volume type: 2 x (2 + 1) (Distributed-Replicate with arbiter)

Retried the automated test run on a local cluster and observed the same result: heal is pending for a few files.

Version-Release number of selected component (if applicable):
# rpm -qa | grep gluster
glusterfs-client-xlators-3.12.2-31.el7rhgs.x86_64
glusterfs-debuginfo-3.12.2-31.el7rhgs.x86_64
glusterfs-cli-3.12.2-31.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-10.el7_6.3.x86_64
glusterfs-libs-3.12.2-31.el7rhgs.x86_64
glusterfs-api-3.12.2-31.el7rhgs.x86_64


How reproducible:
2/2

Steps to Reproduce:
1. Create a 2 x (2 + 1) volume
2. NFS-mount the volume
3. Disable client-side healing (metadata, data and entry)
4. Write directories and files from the mount point
5. Set self-heal-daemon to off
6. Bring down one brick from each replica set, for example b2 and b4
7. Modify data from the client (create, mv and cp)
8. Bring all the bricks back up
9. Set self-heal-daemon to on
10. Check that all bricks are up and that the self-heal daemon is running on all nodes
11. Issue heal
12. Heal should complete, with no files pending heal and no files in split-brain (a rough CLI sketch of these steps follows below)
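
For reference, a rough CLI sketch of the steps above, assuming gNFS (NFSv3); the volume name, hostnames, brick paths and mount point here are placeholders, not the exact ones from the test run:

# gluster volume create testvol replica 3 arbiter 1 \
      h1:/bricks/b0 h2:/bricks/b1 h3:/bricks/b2 h4:/bricks/b3 h5:/bricks/b4 h6:/bricks/b5
# gluster volume set testvol nfs.disable off
# gluster volume start testvol
# mount -t nfs -o vers=3 h1:/testvol /mnt/testvol

Disable client-side heals (step 3) and the self-heal daemon (step 5):
# gluster volume set testvol cluster.data-self-heal off
# gluster volume set testvol cluster.metadata-self-heal off
# gluster volume set testvol cluster.entry-self-heal off
# gluster volume set testvol cluster.self-heal-daemon off

Kill one brick per replica set (step 6; brick PIDs from 'gluster volume status testvol'), modify data from the client, then bring the bricks back and heal (steps 8-12):
# kill -9 <brick-pid>
# gluster volume start testvol force
# gluster volume set testvol cluster.self-heal-daemon on
# gluster volume status testvol
# gluster volume heal testvol
# gluster volume heal testvol info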


Actual results:

At Step 12, heal is pending for a few files.

Expected results:

Heal should complete for all files/dirs.

Additional info:

# gluster v info testvol_distributed-replicated
 
Volume Name: testvol_distributed-replicated
Type: Distributed-Replicate
Volume ID: 2dad8909-862f-42c9-923d-8eafdfd1e50c
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.43.62:/bricks/brick5/testvol_distributed-replicated_brick0
Brick2: 10.70.42.103:/bricks/brick5/testvol_distributed-replicated_brick1
Brick3: 10.70.41.187:/bricks/brick5/testvol_distributed-replicated_brick2 (arbiter)
Brick4: 10.70.41.216:/bricks/brick9/testvol_distributed-replicated_brick3
Brick5: 10.70.42.104:/bricks/brick9/testvol_distributed-replicated_brick4
Brick6: 10.70.43.64:/bricks/brick9/testvol_distributed-replicated_brick5 (arbiter)
Options Reconfigured:
cluster.self-heal-daemon: on
cluster.data-self-heal: off
cluster.metadata-self-heal: off
cluster.entry-self-heal: off
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: off
performance.client-io-threads: off
cluster.server-quorum-ratio: 51


# gluster v heal testvol_distributed-replicated info
Brick 10.70.43.62:/bricks/brick5/testvol_distributed-replicated_brick0
Status: Connected
Number of entries: 0

Brick 10.70.42.103:/bricks/brick5/testvol_distributed-replicated_brick1
Status: Connected
Number of entries: 0

Brick 10.70.41.187:/bricks/brick5/testvol_distributed-replicated_brick2
Status: Connected
Number of entries: 0

Brick 10.70.41.216:/bricks/brick9/testvol_distributed-replicated_brick3
/files/user2_a/dir0_a/dir0_a
/files/user2_a/dir0_a
Status: Connected
Number of entries: 2

Brick 10.70.42.104:/bricks/brick9/testvol_distributed-replicated_brick4
<gfid:d00340d5-f1a1-4e5d-8e78-a8fe7ad93e78>/user2_a/dir0_a/dir0_a
<gfid:d00340d5-f1a1-4e5d-8e78-a8fe7ad93e78>/user2_a/dir0_a
Status: Connected
Number of entries: 2

Brick 10.70.43.64:/bricks/brick9/testvol_distributed-replicated_brick5
/files/user2_a/dir0_a/dir0_a
/files/user2_a/dir0_a
Status: Connected
Number of entries: 2
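
For completeness, the entries above can also be checked against the split-brain listing (same heal-info family of commands; output was not captured during this run):

# gluster v heal testvol_distributed-replicated info split-brain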


Change-logs for directory dir0_a:

[root@dhcp43-64 ~]# getfattr -de hex -m . /bricks/brick9/testvol_distributed-replicated_brick5/files/user2_a/dir0_a/
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick9/testvol_distributed-replicated_brick5/files/user2_a/dir0_a/
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol_distributed-replicated-client-4=0x000000000000000e00000001
trusted.gfid=0xd5e24a6225434db68e46b9e261641e9c
trusted.glusterfs.dht=0x00000000000000007fffffffffffffff
trusted.glusterfs.dht.mds=0x00000000

[root@dhcp41-216 ~]# getfattr -de hex -m . /bricks/brick9/testvol_distributed-replicated_brick3/files/user2_a/dir0_a/
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick9/testvol_distributed-replicated_brick3/files/user2_a/dir0_a/
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol_distributed-replicated-client-4=0x000000000000000e00000001
trusted.gfid=0xd5e24a6225434db68e46b9e261641e9c
trusted.glusterfs.dht=0x00000000000000007fffffffffffffff
trusted.glusterfs.dht.mds=0x00000000

[root@dhcp42-104 ~]# getfattr -de hex -m . /bricks/brick9/testvol_distributed-replicated_brick4/files/user2_a/dir0_a/
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick9/testvol_distributed-replicated_brick4/files/user2_a/dir0_a/
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.gfid=0xd5e24a6225434db68e46b9e261641e9c

Change-logs for directory dir0_a/dir0_a:

[root@dhcp43-64 ~]# getfattr -de hex -m . /bricks/brick9/testvol_distributed-replicated_brick5/files/user2_a/dir0_a/dir0_a/
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick9/testvol_distributed-replicated_brick5/files/user2_a/dir0_a/dir0_a/
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol_distributed-replicated-client-4=0x000000000000000e00000001
trusted.gfid=0x197c24ca87ad49668d2773e8af8e8684
trusted.glusterfs.dht=0x0000000000000000000000007ffffffe


[root@dhcp41-216 ~]# getfattr -de hex -m . /bricks/brick9/testvol_distributed-replicated_brick3/files/user2_a/dir0_a/dir0_a/
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick9/testvol_distributed-replicated_brick3/files/user2_a/dir0_a/dir0_a/
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol_distributed-replicated-client-4=0x000000000000000600000001
trusted.gfid=0x197c24ca87ad49668d2773e8af8e8684
trusted.glusterfs.dht=0x0000000000000000000000007ffffffe

No entry is present for directory dir0_a/dir0_a on the back-end brick on 10.70.42.104 (data brick).
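
For reference when reading the xattrs above: the trusted.afr.<volname>-client-<N> value is three 32-bit counters which, per the upstream AFR documentation, hold the data, metadata and entry pending counts in that order. An illustrative way to split the value seen on the good bricks:

# x=000000000000000e00000001
# echo "data=0x${x:0:8} metadata=0x${x:8:8} entry=0x${x:16:8}"
data=0x00000000 metadata=0x0000000e entry=0x00000001

The non-zero entry counter blames client-4, which (assuming the usual client-index-to-brick mapping) is the 10.70.42.104 data brick, matching the heal-info output.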

Comment 3 Ravishankar N 2018-12-13 05:01:57 UTC
Hi Anees,
The AFR pending xattrs indicate that entry self-heal is pending towards '10.70.42.104:/bricks/brick9/testvol_distributed-replicated_brick4', yet heal is not proceeding. Can you check whether this is a duplicate of BZ 1640148, where entry heal cannot proceed because of a missing gfid symlink for the directory inside .glusterfs? If yes, we can close this as a duplicate.
-Ravi
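
A possible way to verify the missing-gfid-symlink theory from the comment above, using the gfid of dir0_a/dir0_a from the getfattr output (0x197c24ca... formats as 197c24ca-87ad-4966-8d27-73e8af8e8684) and the brick roots from the volume info:

# ls -l /bricks/brick9/testvol_distributed-replicated_brick3/.glusterfs/19/7c/197c24ca-87ad-4966-8d27-73e8af8e8684
# ls -l /bricks/brick9/testvol_distributed-replicated_brick5/.glusterfs/19/7c/197c24ca-87ad-4966-8d27-73e8af8e8684

For a directory this should be a symlink into the parent directory's gfid path under .glusterfs; if it is absent on the source bricks, entry heal cannot proceed, which is the BZ 1640148 symptom.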

Comment 4 Anees Patel 2018-12-13 09:57:17 UTC
Closing as a duplicate

*** This bug has been marked as a duplicate of bug 1640148 ***