Description of problem:
Healing is unable to complete on arbiter bricks; the test case test_entry_self_heal_heal_command fails with files pending heal.

Version-Release number of selected component (if applicable):
Discovered in 3.4 BU4 Async

# rpm -qa | grep gluster
glusterfs-libs-3.12.2-47.2.el7rhgs.x86_64
python2-gluster-3.12.2-47.2.el7rhgs.x86_64
glusterfs-devel-3.12.2-47.2.el7rhgs.x86_64
glusterfs-events-3.12.2-47.2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-3.12.2-47.2.el7rhgs.x86_64

Also reproduced in 3.5

# rpm -qa | grep gluster
glusterfs-libs-6.0-2.el7rhgs.x86_64
glusterfs-geo-replication-6.0-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-cli-6.0-2.el7rhgs.x86_64
glusterfs-client-xlators-6.0-2.el7rhgs.

How reproducible:
2 out of 5 runs.

Steps to Reproduce:
The test case is automated, with the following steps (a CLI sketch of these steps follows the expected results below):
1. Create a distributed-arbiter (2 x (2+1)) volume
2. Set the client-side heal options to off:
   "metadata-self-heal": "off"
   "entry-self-heal": "off"
   "data-self-heal": "off"
3. Write I/O from the mount point
4. Set the volume option "self-heal-daemon": "off"
5. Bring down the arbiter brick of each subvolume
6. Modify the data written in step 3
7. List the files from the mount point
8. Bring the bricks back online
9. Set the option "self-heal-daemon": "on"
10. Trigger heal once all volume processes and self-heal daemons are online:
    gluster volume heal <vol-name>
11. Wait for heal completion

Actual results:
At step 11, heal is unable to complete and files are left pending heal.

Expected results:
Heal should complete with no files pending heal.
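For reference, a rough CLI sketch of the automated steps above (a sketch only; the volume name and brick paths are taken from the 'gluster v info' output under Additional info below, and killing the brick process by PID is just one common way of bringing a brick down, not necessarily what the automated TC does):

Step 1: create and start the 2 x (2+1) volume
# gluster volume create testvol_distributed-replicated replica 3 arbiter 1 \
    10.70.35.50:/bricks/brick2/testvol_distributed-replicated_brick0 \
    10.70.46.132:/bricks/brick1/testvol_distributed-replicated_brick1 \
    10.70.46.216:/bricks/brick2/testvol_distributed-replicated_brick2 \
    10.70.46.42:/bricks/brick1/testvol_distributed-replicated_brick3 \
    10.70.47.41:/bricks/brick1/testvol_distributed-replicated_brick4 \
    10.70.46.231:/bricks/brick2/testvol_distributed-replicated_brick5
# gluster volume start testvol_distributed-replicated

Step 2: disable client-side heals
# gluster volume set testvol_distributed-replicated cluster.metadata-self-heal off
# gluster volume set testvol_distributed-replicated cluster.entry-self-heal off
# gluster volume set testvol_distributed-replicated cluster.data-self-heal off

Step 3: write I/O from a client mount.

Step 4: disable the self-heal daemon
# gluster volume set testvol_distributed-replicated cluster.self-heal-daemon off

Step 5: bring down the arbiter brick of each subvolume (brick PIDs from 'gluster volume status')
# gluster volume status testvol_distributed-replicated
# kill -9 <pid-of-Brick3-arbiter> <pid-of-Brick6-arbiter>

Steps 6-7: modify the data written in step 3 and list the files from the mount point.

Step 8: bring the downed bricks back online
# gluster volume start testvol_distributed-replicated force

Steps 9-11: re-enable the self-heal daemon, trigger heal, and wait for completion
# gluster volume set testvol_distributed-replicated cluster.self-heal-daemon on
# gluster volume heal testvol_distributed-replicated
# gluster volume heal testvol_distributed-replicated info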
Additional info:

# gluster v info testvol_distributed-replicated

Volume Name: testvol_distributed-replicated
Type: Distributed-Replicate
Volume ID: 9f335738-25fc-4e2e-ad69-3d3b25212491
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.35.50:/bricks/brick2/testvol_distributed-replicated_brick0
Brick2: 10.70.46.132:/bricks/brick1/testvol_distributed-replicated_brick1
Brick3: 10.70.46.216:/bricks/brick2/testvol_distributed-replicated_brick2 (arbiter)
Brick4: 10.70.46.42:/bricks/brick1/testvol_distributed-replicated_brick3
Brick5: 10.70.47.41:/bricks/brick1/testvol_distributed-replicated_brick4
Brick6: 10.70.46.231:/bricks/brick2/testvol_distributed-replicated_brick5 (arbiter)
Options Reconfigured:
cluster.self-heal-daemon: on
cluster.data-self-heal: off
cluster.metadata-self-heal: off
cluster.entry-self-heal: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Files pending heal:

# gluster v heal testvol_distributed-replicated info
Brick 10.70.35.50:/bricks/brick2/testvol_distributed-replicated_brick0
Status: Connected
Number of entries: 0

Brick 10.70.46.132:/bricks/brick1/testvol_distributed-replicated_brick1
Status: Connected
Number of entries: 0

Brick 10.70.46.216:/bricks/brick2/testvol_distributed-replicated_brick2
Status: Connected
Number of entries: 0

Brick 10.70.46.42:/bricks/brick1/testvol_distributed-replicated_brick3
/files/user2_a/dir0_a/dir0_a
/files/user2_a/dir0_a
Status: Connected
Number of entries: 2

Brick 10.70.47.41:/bricks/brick1/testvol_distributed-replicated_brick4
/files/user2_a/dir0_a/dir0_a
/files/user2_a/dir0_a
Status: Connected
Number of entries: 2

Brick 10.70.46.231:/bricks/brick2/testvol_distributed-replicated_brick5
<gfid:05893f69-b962-48c3-8838-523857703ce3>/user2_a/dir0_a/dir0_a
<gfid:05893f69-b962-48c3-8838-523857703ce3>/user2_a/dir0_a
Status: Connected
Number of entries: 2

It is noticed that every time the issue is hit, the same set of files is pending heal (always only 2 entries), viz. <>/user2_a/dir0_a/dir0_a and <>/user2_a/dir0_a. It is always the same files; hence it can be tough to reproduce this issue manually.

Data brick of subvol-2:

[root@dhcp47-41 dir0_a]# getfattr -m. -d -e hex /bricks/brick1/testvol_distributed-replicated_brick4/files/user2_a/dir0_a/dir0_a
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testvol_distributed-replicated_brick4/files/user2_a/dir0_a/dir0_a
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol_distributed-replicated-client-5=0x000000000000000e00000001
trusted.gfid=0x71d4d0b120944636ad95d802e15be1f2
trusted.glusterfs.dht=0x0000000000000000000000007ffffffe

The other data brick of subvol-2:

[root@dhcp46-42 x86_64]# getfattr -m. -d -e hex /bricks/brick1/testvol_distributed-replicated_brick3/files/user2_a/dir0_a/dir0_a/
getfattr: Removing leading '/' from absolute path names
# file: bricks/brick1/testvol_distributed-replicated_brick3/files/user2_a/dir0_a/dir0_a/
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.testvol_distributed-replicated-client-5=0x000000000000000e00000001
trusted.gfid=0x71d4d0b120944636ad95d802e15be1f2
trusted.glusterfs.dht=0x0000000000000000000000007ffffffe

No entries are present on the arbiter brick:

[root@dhcp46-231 dir0_a]# getfattr -m. -d -e hex /bricks/brick2/testvol_distributed-replicated_brick5/files/user2_a/dir0_a/dir0_a/
getfattr: /bricks/brick2/testvol_distributed-replicated_brick5/files/user2_a/dir0_a/dir0_a/: No such file or directory

[root@dhcp46-231 dir0_a]# ls /bricks/brick2/testvol_distributed-replicated_brick5/files/user2_a/dir0_a/
<empty dir>

whereas entries are present on the other data bricks:

[root@dhcp46-42 ~]# ls /bricks/brick1/testvol_distributed-replicated_brick3/files/user2_a/dir0_a/dir0_a/
testfile0_a.txt   testfile11_a.txt  testfile13_a.txt  testfile15_a.txt  testfile17_a.txt  testfile2_a.txt  testfile4_a.txt  testfile6_a.txt  testfile9_a.txt
testfile10_a.txt  testfile12_a.txt  testfile14_a.txt  testfile16_a.txt  testfile19_a.txt  testfile3_a.txt  testfile5_a.txt  testfile7_a.txt
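Note on the xattrs above (an interpretation, assuming the standard AFR changelog xattr layout of three big-endian 32-bit counters for data, metadata and entry operations): the value recorded on both data bricks of subvol-2,

trusted.afr.testvol_distributed-replicated-client-5=0x000000000000000e00000001

splits into 00000000 (data) / 0000000e (metadata) / 00000001 (entry), i.e. 0 data, 14 metadata and 1 entry operation pending against client-5. Client-5 maps to Brick6, the arbiter brick of the second subvolume, which is consistent with dir0_a/dir0_a and its contents being absent from that arbiter brick while present on the two data bricks.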
*** This bug has been marked as a duplicate of bug 1640148 ***