Description of problem:

While running automation, we found that healing does not complete on a Distributed-Replicated (Arbiter) volume.

Version-Release number of selected component (if applicable):
glusterfs-3.12.2-18.1.el7rhgs.x86_64

How reproducible: Always

Steps to Reproduce:
1) Create a distributed-replicated volume (Arbiter: 2 x (2 + 1)) and mount it
2) Disable client-side heals
3) Write IO using the script below:
# python /usr/share/glustolibs/io/scripts/file_dir_ops.py create_deep_dirs_with_files --dir-length 2 --dir-depth 2 --max-num-of-dirs 2 --num-of-files 20 /mnt/testvol_distributed-replicated_glusterfs/files
4) Disable the self-heal daemon
5) Bring one brick from each replica set offline (brick2 and brick3)
6) Create files from the mount point:
# python /usr/share/glustolibs/io/scripts/file_dir_ops.py create_files -f 20 /mnt/testvol_distributed-replicated_glusterfs/files
7) Bring the bricks online
8) Enable the self-heal daemon
9) Issue volume heal
10) Wait for heal to complete
11) Disable the self-heal daemon
12) Bring one brick from each replica set offline (brick0 and brick5)
13) Modify data:
# python /usr/share/glustolibs/io/scripts/file_dir_ops.py mv /mnt/testvol_distributed-replicated_glusterfs/files
14) Bring the bricks online
15) Enable the self-heal daemon
16) Issue volume heal
17) Wait for heal to complete

(A hedged shell sketch of these steps is given after the volume status output below.)

Actual results:

After step 17, heal info still shows pending entries:

[root@rhsauto039 ~]# gluster vol heal testvol_distributed-replicated info
Brick rhsauto039.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed-replicated_brick0
Status: Connected
Number of entries: 0

Brick rhsauto045.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed-replicated_brick1
Status: Connected
Number of entries: 0

Brick rhsauto025.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed-replicated_brick2
Status: Connected
Number of entries: 0

Brick rhsauto047.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed-replicated_brick3
/files/user2_a/dir0_a/dir0_a
/files/user2_a/dir0_a
Status: Connected
Number of entries: 2

Brick rhsauto040.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed-replicated_brick4
/files/user2_a/dir0_a/dir0_a
/files/user2_a/dir0_a
Status: Connected
Number of entries: 2

Brick rhsauto026.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed-replicated_brick5
<gfid:950fbe25-b5a1-4999-a718-f3424100189a>/user2_a/dir0_a/dir0_a
<gfid:950fbe25-b5a1-4999-a718-f3424100189a>/user2_a/dir0_a
Status: Connected
Number of entries: 2

[root@rhsauto039 ~]#

Expected results:

Healing should complete.

Additional info:

[root@rhsauto039 ~]# gluster vol info

Volume Name: testvol_distributed-replicated
Type: Distributed-Replicate
Volume ID: 521dc7f1-0b1f-46f8-b802-6894a1828b32
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: rhsauto039.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed-replicated_brick0
Brick2: rhsauto045.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed-replicated_brick1
Brick3: rhsauto025.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed-replicated_brick2 (arbiter)
Brick4: rhsauto047.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed-replicated_brick3
Brick5: rhsauto040.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed-replicated_brick4
Brick6: rhsauto026.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed-replicated_brick5 (arbiter)
Options Reconfigured:
cluster.self-heal-daemon: on
cluster.data-self-heal: off
cluster.metadata-self-heal: off
cluster.entry-self-heal: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
[root@rhsauto039 ~]#

[root@rhsauto039 ~]# gluster vol status
Status of volume: testvol_distributed-replicated
Gluster process                                                                                TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhsauto039.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed-replicated_brick0   49153     0          Y       18060
Brick rhsauto045.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed-replicated_brick1   49152     0          Y       21012
Brick rhsauto025.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed-replicated_brick2   49152     0          Y       21449
Brick rhsauto047.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed-replicated_brick3   49152     0          Y       20558
Brick rhsauto040.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed-replicated_brick4   49152     0          Y       20536
Brick rhsauto026.lab.eng.blr.redhat.com:/bricks/brick0/testvol_distributed-replicated_brick5   49153     0          Y       22118
Self-heal Daemon on localhost                                                                  N/A       N/A        Y       18083
Self-heal Daemon on rhsauto047.lab.eng.blr.redhat.com                                          N/A       N/A        Y       20889
Self-heal Daemon on rhsauto040.lab.eng.blr.redhat.com                                          N/A       N/A        Y       21114
Self-heal Daemon on rhsauto045.lab.eng.blr.redhat.com                                          N/A       N/A        Y       21593
Self-heal Daemon on rhsauto025.lab.eng.blr.redhat.com                                          N/A       N/A        Y       21733
Self-heal Daemon on rhsauto026.lab.eng.blr.redhat.com                                          N/A       N/A        Y       22141

Task Status of Volume testvol_distributed-replicated
------------------------------------------------------------------------------
There are no active volume tasks

[root@rhsauto039 ~]#

SOS Reports, health-report and State Dumps:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/vavuthu/arbiter_heal_issue/

> The same scenario is passing on a plain Arbiter volume.
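For anyone reproducing this outside the glusto framework, here is a minimal shell sketch of steps 2-17. The volume name and mount path are taken from the report; killing the brick process and restarting it via 'volume start force' is one common way to take bricks offline/online and is an assumption here — the automation may use a different mechanism.

VOL=testvol_distributed-replicated
MNT=/mnt/testvol_distributed-replicated_glusterfs

# Step 2: disable client-side heals (matches "Options Reconfigured" above)
gluster volume set $VOL cluster.data-self-heal off
gluster volume set $VOL cluster.metadata-self-heal off
gluster volume set $VOL cluster.entry-self-heal off

# Step 3: initial IO
python /usr/share/glustolibs/io/scripts/file_dir_ops.py \
    create_deep_dirs_with_files --dir-length 2 --dir-depth 2 \
    --max-num-of-dirs 2 --num-of-files 20 $MNT/files

# Steps 4-5: disable shd, then take one brick per replica set offline
gluster volume set $VOL cluster.self-heal-daemon off
kill -15 <pid-of-brick2> <pid-of-brick3>   # PIDs from 'gluster vol status'

# Step 6: create files while the bricks are down
python /usr/share/glustolibs/io/scripts/file_dir_ops.py \
    create_files -f 20 $MNT/files

# Steps 7-9: restart the bricks, re-enable shd, trigger heal
gluster volume start $VOL force
gluster volume set $VOL cluster.self-heal-daemon on
gluster volume heal $VOL

# Step 10: wait until 'heal info' reports 0 entries on every brick
while gluster volume heal $VOL info | grep -q 'Number of entries: [1-9]'; do
    sleep 10
done

# Steps 11-17: repeat with brick0 and brick5 down and a rename workload
gluster volume set $VOL cluster.self-heal-daemon off
kill -15 <pid-of-brick0> <pid-of-brick5>
python /usr/share/glustolibs/io/scripts/file_dir_ops.py mv $MNT/files
gluster volume start $VOL force
gluster volume set $VOL cluster.self-heal-daemon on
gluster volume heal $VOL
# With this build, the heal never drains on bricks 3, 4 and 5 (see above).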
Also note, this was hit by a user in the community, and I would propose it as a 'Blocker', as this is a common activity across all scenarios: OCS, RHHI, and RHGS.
Vijay tells me that manual testing did not find any issues with the scratch build. While he runs the automated tests, I am moving the BZ to POST. Upstream patch: https://review.gluster.org/#/c/21380/
*** Bug 1638947 has been marked as a duplicate of this bug. ***
Can this be prevented by using the 'sdfs' feature (serializing directory entry ops)?
Verified the fix on build glusterfs-libs-3.12.2-23.el7rhgs.x86_64. There is no heal hang issue observed anymore, but a pending heal remains and is tracked in bug 1640148; hence, setting this BZ to verified state.
(In reply to Amar Tumballi from comment #17)
> Can this be prevented by using the 'sdfs' feature (serializing directory
> entry ops)?

Without serializing all entry ops irrespective of the parent directory on which the fop arrives, I don't think it is possible, and doing that would lead to very bad performance. So for the moment I will try to fix it in AFR/EC, as those xlators are doing things that posix is not well equipped to handle.
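For reference, this is roughly what turning on sdfs would look like, assuming the 'features.sdfs' volume option as exposed in later upstream releases (the option name is an assumption and is not available on the 3.12 branch this bug was filed against); per the comment above, its per-parent-directory serialization would not be enough to prevent this bug.

# Assumption: 'features.sdfs' loads the features/sdfs xlator, which
# serializes entry fops per parent directory only.
gluster volume set testvol_distributed-replicated features.sdfs on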
Vijay,
    For all upgrade/healing tests, can we have an extra step after each upgrade completes, where we add a fresh mount and a way to create a new file in an existing directory and add data to existing files? This is the only way to ensure that this bug doesn't repeat in the future.

Pranith
*** Bug 1635967 has been marked as a duplicate of this bug. ***
(In reply to Pranith Kumar K from comment #21)
> Vijay,
>     For all upgrade/healing tests, can we have an extra step after each
> upgrade completes, where we add a fresh mount and a way to create a new file
> in an existing directory and add data to existing files? This is the only way
> to ensure that this bug doesn't repeat in the future.
>
> Pranith

Sure, Pranith. We will include that in our upgrade testing.
(In reply to Vijay Avuthu from comment #23)
> Sure, Pranith. We will include that in our upgrade testing.

Forgot to mention: even for EC volumes, the tests should be modified in a similar fashion.
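A minimal sketch of the post-upgrade check being discussed, with hypothetical names throughout (server1, testvol, existing_dir, existing_file are placeholders, not values from this bug):

# After each upgrade step: take a fresh mount, create a new file in a
# directory that existed before the upgrade, and append to a pre-existing file.
mkdir -p /mnt/fresh_mount
mount -t glusterfs server1:/testvol /mnt/fresh_mount

# New entry in an existing directory (the op that regressed in this bug)
touch /mnt/fresh_mount/existing_dir/new_file_post_upgrade

# Data write to an existing file
echo "post-upgrade write" >> /mnt/fresh_mount/existing_dir/existing_file

# Both must succeed, and heal info must stay clean afterwards
gluster volume heal testvol info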
Changing doc text to be identical to BZ 1638026
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:3432
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days