Bug 1403120 - Files remain unhealed forever if shd is disabled and re-enabled while healing is in progress.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: replicate
Version: rhgs-3.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Ravishankar N
QA Contact: nchilaka
URL:
Whiteboard:
Depends On: 1402841 1403187 1403192
Blocks: 1351528
 
Reported: 2016-12-09 06:11 UTC by Ravishankar N
Modified: 2017-03-23 05:55 UTC
CC: 6 users

Fixed In Version: glusterfs-3.8.4-9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1402841
Environment:
Last Closed: 2017-03-23 05:55:23 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:0486 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 09:18:45 UTC

Description Ravishankar N 2016-12-09 06:11:30 UTC
+++ This bug was initially created as a clone of Bug #1402841 +++

Description of problem:

1. Create a 1x2 replica vol using a 2 node cluster.
2. Fuse mount the vol and create 2000 files
3. Bring one brick down, write to those files, leading to 2000 pending data heals.
4. Bring back the brick and launch index heal
5. The shd log on the source brick prints completed heals for the processed files.
6. Before the heal completes, do a `gluster vol set volname self-heal-daemon off`
7. The heal stops as expected.
8. Re-enable the shd: `gluster vol set volname self-heal-daemon on`
9. Observe the shd log, we don't see any files getting healed.
10. Launching index heal manually also has no effect.

The only workaround is to restart shd with a `volume start force`.

--- Additional comment from Worker Ant on 2016-12-08 07:55:33 EST ---

REVIEW: http://review.gluster.org/16073 (syncop: fix conditional wait bug in parallel dir scan) posted (#1) for review on master by Ravishankar N (ravishankar@redhat.com)

--- Additional comment from Worker Ant on 2016-12-09 00:27:13 EST ---

REVIEW: http://review.gluster.org/16073 (syncop: fix conditional wait bug in parallel dir scan) posted (#2) for review on master by Ravishankar N (ravishankar@redhat.com)

Comment 2 Ravishankar N 2016-12-09 11:31:40 UTC
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/92588/

Comment 6 nchilaka 2016-12-29 11:58:03 UTC
validation:
I have run the test on 3.8.4-10 and the fix is working

1. Create a 1x2 replica vol using a 2 node cluster.
2. Fuse mount the vol and create 2000 files
3. Bring one brick down, write to those files, leading to 2000 pending data heals.
4. Bring back the brick and launch index heal
5. The shd log on the source brick prints completed heals for the processed files.
6. Before the heal completes, do a `gluster vol set volname self-heal-daemon off`
7. The heal stops as expected.
8. Re-enable the shd: `gluster vol set volname self-heal-daemon on`
9. Observe the shd log, the heal started to work and shd log gets populated with heal info

moving to verified

While verifying, I hit bz 1409084 - heal enable/disable is restarting the self-heal daemon.

Comment 7 nchilaka 2016-12-29 13:02:07 UTC
[root@dhcp46-131 ~]# gluster v status nagtest
Status of volume: nagtest
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.111:/bricks/brick5/nagtest   49157     0          Y       24148
Brick 10.70.46.115:/bricks/brick5/nagtest   49157     0          Y       22323
Brick 10.70.46.139:/bricks/brick5/nagtest   49157     0          Y       29066
Brick 10.70.46.124:/bricks/brick5/nagtest   49152     0          Y       21470
Self-heal Daemon on localhost               N/A       N/A        Y       25456
Self-heal Daemon on dhcp46-152.lab.eng.blr.redhat.com  N/A  N/A  Y  20233
Self-heal Daemon on dhcp46-124.lab.eng.blr.redhat.com  N/A  N/A  Y  21505
Self-heal Daemon on dhcp46-139.lab.eng.blr.redhat.com  N/A  N/A  Y  7467
Self-heal Daemon on dhcp46-111.lab.eng.blr.redhat.com  N/A  N/A  Y  24181
Self-heal Daemon on dhcp46-115.lab.eng.blr.redhat.com  N/A  N/A  Y  32083
 
Task Status of Volume nagtest
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@dhcp46-131 ~]# gluster v info nagtest
 
Volume Name: nagtest
Type: Distributed-Replicate
Volume ID: df313590-6db1-47ff-ab4e-6167d681ee80
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.46.111:/bricks/brick5/nagtest
Brick2: 10.70.46.115:/bricks/brick5/nagtest
Brick3: 10.70.46.139:/bricks/brick5/nagtest
Brick4: 10.70.46.124:/bricks/brick5/nagtest
Options Reconfigured:
ganesha.enable: on
cluster.self-heal-daemon: enable
cluster.use-compound-fops: on
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable

Comment 9 errata-xmlrpc 2017-03-23 05:55:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

