+++ This bug was initially created as a clone of Bug #1402841 +++

Description of problem:

1. Create a 1x2 replica vol using a 2 node cluster.
2. Fuse mount the vol and create 2000 files.
3. Bring one brick down and write to those files, leading to 2000 pending data heals.
4. Bring back the brick and launch index heal.
5. The shd log on the source brick prints completed heals for the processed files.
6. Before the heal completes, do a `gluster vol set volname self-heal-daemon off`.
7. The heal stops as expected.
8. Re-enable the shd: `gluster vol set volname self-heal-daemon on`.
9. Observe the shd log: we don't see any files getting healed.
10. Launching index heal manually also has no effect.

The only workaround is to restart shd with a `volume start force`. (A shell sketch of these steps follows the review comments below.)

--- Additional comment from Worker Ant on 2016-12-08 07:55:33 EST ---

REVIEW: http://review.gluster.org/16073 (syncop: fix conditional wait bug in parallel dir scan) posted (#1) for review on master by Ravishankar N (ravishankar)

--- Additional comment from Worker Ant on 2016-12-09 00:27:13 EST ---

REVIEW: http://review.gluster.org/16073 (syncop: fix conditional wait bug in parallel dir scan) posted (#2) for review on master by Ravishankar N (ravishankar)
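For quick reference, a minimal shell sketch of the reproduction steps above. The volume name (testvol), node names (node1/node2), brick path and mount point are illustrative assumptions, not taken from the original report:

# assumed names: volume "testvol", nodes node1/node2, brick /bricks/brick0, mount /mnt/testvol
gluster volume create testvol replica 2 node1:/bricks/brick0/testvol node2:/bricks/brick0/testvol
gluster volume start testvol
mount -t glusterfs node1:/testvol /mnt/testvol
for i in $(seq 1 2000); do echo data > /mnt/testvol/file.$i; done
# bring one brick down (e.g. kill its glusterfsd process on node2), then write again
for i in $(seq 1 2000); do echo moredata >> /mnt/testvol/file.$i; done
# bring the brick back and launch index heal
gluster volume start testvol force
gluster volume heal testvol
# before the heal completes, disable and then re-enable shd
gluster volume set testvol self-heal-daemon off
gluster volume set testvol self-heal-daemon on
gluster volume heal testvol        # bug: no files are picked up after re-enabling
# workaround: restart shd
gluster volume start testvol force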
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/92588/
Validation: I have run the test on 3.8.4-10 and the fix is working.

1. Create a 1x2 replica vol using a 2 node cluster.
2. Fuse mount the vol and create 2000 files.
3. Bring one brick down and write to those files, leading to 2000 pending data heals.
4. Bring back the brick and launch index heal.
5. The shd log on the source brick prints completed heals for the processed files.
6. Before the heal completes, do a `gluster vol set volname self-heal-daemon off`.
7. The heal stops as expected.
8. Re-enable the shd: `gluster vol set volname self-heal-daemon on`.
9. Observe the shd log: the heal resumes and the shd log gets populated with heal entries (see the monitoring commands sketched below).

Moving to verified. While verifying, I hit bz 1409084 - heal enable/disable is restarting the self-heal daemon.
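A minimal sketch of the commands used to monitor the heal after re-enabling shd; the volume name (testvol) is an assumption, the actual volume used in the run below is nagtest:

# watch pending heal entries drain after shd is re-enabled
gluster volume heal testvol info
# follow the self-heal daemon log on the source node
tail -f /var/log/glusterfs/glustershd.log
# check the shd processes and their PIDs (they restart on enable/disable, see bz 1409084)
gluster volume status testvol shd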
[root@dhcp46-131 ~]# gluster v status nagtest
Status of volume: nagtest
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.111:/bricks/brick5/nagtest   49157     0          Y       24148
Brick 10.70.46.115:/bricks/brick5/nagtest   49157     0          Y       22323
Brick 10.70.46.139:/bricks/brick5/nagtest   49157     0          Y       29066
Brick 10.70.46.124:/bricks/brick5/nagtest   49152     0          Y       21470
Self-heal Daemon on localhost               N/A       N/A        Y       25456
Self-heal Daemon on dhcp46-152.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       20233
Self-heal Daemon on dhcp46-124.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       21505
Self-heal Daemon on dhcp46-139.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       7467
Self-heal Daemon on dhcp46-111.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       24181
Self-heal Daemon on dhcp46-115.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       32083

Task Status of Volume nagtest
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp46-131 ~]# gluster v info nagtest

Volume Name: nagtest
Type: Distributed-Replicate
Volume ID: df313590-6db1-47ff-ab4e-6167d681ee80
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.46.111:/bricks/brick5/nagtest
Brick2: 10.70.46.115:/bricks/brick5/nagtest
Brick3: 10.70.46.139:/bricks/brick5/nagtest
Brick4: 10.70.46.124:/bricks/brick5/nagtest
Options Reconfigured:
ganesha.enable: on
cluster.self-heal-daemon: enable
cluster.use-compound-fops: on
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html