Bug 1471742 - Optimize by not stopping (restarting) the self-heal daemon (shd) when a volume is stopped, unless it is the last volume
Status: CLOSED WONTFIX
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: replicate
Version: 3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Assigned To: Ravishankar N
QA Contact: nchilaka
Depends On:
Blocks:
Reported: 2017-07-17 07:33 EDT by nchilaka
Modified: 2017-07-18 07:44 EDT
CC List: 4 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-07-18 05:37:10 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description nchilaka 2017-07-17 07:33:18 EDT
Description of problem:
=======================
When there is more than one volume in a cluster, stopping a single volume also stops and restarts the self-heal daemons (shd) on all nodes.
This is completely unnecessary and a waste of resources.
Because other volumes are still available, shd is restarted immediately. This momentarily interrupts the heal of any file that was being healed, so that file takes a bit more time to heal.

Before stopping the heal daemon as part of a volume stop, we should check whether any other volumes are still associated with it, and avoid stopping and regenerating a new shd process when it is not required. A sketch of this check is shown below.
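
As an illustration only, the proposed check could conceptually look like the following. This is a hypothetical sketch expressed against gluster volume info output rather than against glusterd internals (the real change would have to live in glusterd's volume-stop path); the volume name v1 is a placeholder.

#!/bin/sh
# Hypothetical sketch (not glusterd code): shd only needs to be stopped/restarted
# if the volume being stopped is the last started replicate-type volume.
VOL_TO_STOP="v1"   # placeholder: the volume that is about to be stopped

# List other started replicate-type volumes by parsing gluster volume info.
others=$(gluster volume info | awk -v skip="$VOL_TO_STOP" '
    /^Volume Name:/ { name = $3 }
    /^Type:/        { repl = ($2 ~ /Replicate/) }
    /^Status:/      { if ($2 == "Started" && repl && name != skip) print name }')

if [ -n "$others" ]; then
    echo "Other replicate volumes still started ($others); leave shd running."
else
    echo "No other replicate volumes started; stopping shd is acceptable."
fi
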
Version-Release number of selected component (if applicable):


How reproducible:
========
always


Steps to Reproduce:
1. Create three volumes, all 1x3 replicate.
2. Note down the shd PIDs on all nodes.
3. Now stop v1.
4. Check the shd PIDs; it can be seen that they are restarted.
5. Now stop v2; again it can be seen that the shd PIDs are restarted. (Concrete commands are sketched below.)
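
For reference, the steps above correspond roughly to the following commands, run from one node of the cluster. The host names (server1..server3), brick paths and volume names are placeholders.

# Create and start three 1x3 replicate volumes
# (brick paths are placeholders; append "force" if the bricks are on the root filesystem).
for v in v1 v2 v3; do
    gluster volume create $v replica 3 server1:/bricks/$v server2:/bricks/$v server3:/bricks/$v
    gluster volume start $v
done

# Note the self-heal daemon PID on this node.
pgrep -f glustershd

# Stop one volume and check again; the PID changes, i.e. shd was restarted.
gluster --mode=script volume stop v1
pgrep -f glustershd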

Actual results:
===========
shd is stopped and restarted on all nodes every time any volume is stopped, even while other replicate volumes are still started.
Expected results:
Do not stop/restart the shds unless the volume being stopped is the last volume.
Comment 2 Ravishankar N 2017-07-18 05:37:10 EDT
This behaviour is by design. Not killing and restarting shd with the new graph (containing only the active volumes) would mean we need to perform a graph switch in shd. Unlike other processes (say a FUSE mount or a brick process), which have an inode table maintained by the topmost xlator, in glustershd each AFR instance maintains its own inode table, so we would need to migrate all of them, etc., which is not worth it.

Also, there is not much loss of state due to a restart, because shd only restarts heals of files that were not completed earlier.
Comment 3 nchilaka 2017-07-18 06:08:20 EDT
I still see this as an issue. For example, take a case where a 100 GB file was 99 GB healed; in this case we scan and heal it from the beginning.
If the fix is not worth the effort, then we need to move this to WONTFIX or CANTFIX instead of "not a bug".
Comment 4 Pranith Kumar K 2017-07-18 06:32:14 EDT
(In reply to nchilaka from comment #3)
> I still see this as an issue. For example, take a case where a 100 GB file
> was 99 GB healed; in this case we scan and heal it from the beginning.
> If the fix is not worth the effort, then we need to move this to WONTFIX or
> CANTFIX instead of "not a bug".

Yeah, I agree, let's move it to WONTFIX. It is a design tradeoff.
Comment 5 Ravishankar N 2017-07-18 06:34:57 EDT
(In reply to nchilaka from comment #3)
> I still see this as an issue. For example, take a case where a 100 GB file
> was 99 GB healed; in this case we scan and heal it from the beginning.

For the record, this is something which can be fixed by introducing granular data-self-heal.
Comment 6 Pranith Kumar K 2017-07-18 06:59:49 EDT
(In reply to Ravishankar N from comment #5)
> (In reply to nchilaka from comment #3)
> > I still see this as an issue. For example, take a case where a 100 GB file
> > was 99 GB healed; in this case we scan and heal it from the beginning.
> 
> For the record, this is something which can be fixed by introducing granular
> data-self-heal.

Or supporting general purpose sharding, which seems to be needed for DHT as well.
Comment 7 Ravishankar N 2017-07-18 07:10:23 EDT
(In reply to Pranith Kumar K from comment #6)
> Or supporting general purpose sharding, which seems to be needed for DHT as
> well.

Yes, that would work too. I guess whether we should *not* be doing granular data self-heal just because we can piggyback on files being striped is a discussion we need to have upstream. I can imagine users who would not want to stripe their files just because a faster heal time is a by-product of the striping, if granular data self-heal can guarantee faster heal times without it.
