Bug 1229914

Summary: glusterfs self heal takes too long following node outage
Product: [Community] GlusterFS
Component: replicate
Reporter: Paul Cuzner <pcuzner>
Assignee: Ravishankar N <ravishankar>
Status: CLOSED EOL
Severity: medium
Priority: medium
Version: 3.7.1
CC: amukherj, bkunal, bugs, sasundar, smohan, tcole
Keywords: Triaged
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Environment: RHEL 7.1, glusterfs 3.7.1, replica 3 volume, 3 hypervisor/glusterfs converged nodes
Cloned To: 1318792 (view as bug list)
Bug Blocks: 1318792
Last Closed: 2017-03-08 10:52:33 UTC
Type: Bug

Description Paul Cuzner 2015-06-09 22:52:21 UTC
Description of problem:
Using glusterfs 3.7.1 on RHEL 7.1 with oVirt providing the virtualization layer. Taking a node down for maintenance causes the bulk of the active VMs' disk images to be tracked for self heal once the node comes back.

Current mechanisms for self heal (full and diff) address the whole vdisk.
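For reference, the data self-heal algorithm is selectable per volume; a minimal sketch, assuming a hypothetical volume named "vmstore" (not from this report):

    # "vmstore" is a hypothetical volume name
    # "diff" heals only blocks whose checksums differ; "full" copies the whole file
    gluster volume set vmstore cluster.data-self-heal-algorithm diff
    # confirm the reconfigured options
    gluster volume info vmstore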

In a managed test, I had only 10 vms active and introduced ~34 GB of change to the vms to measure the time taken to return the environment to a consistent state.

It took nearly 2 hours before the self heal was complete - and this was dedicated time i.e. without further vm load/changes.

2 hours for 34 GB of change is too long for admins to wait, and it represents a big window during which the loss of a further node could send the data into split-brain.

As I understand it, all VMs will write to their disks during the outage, which means that regardless of how much data change is injected into the system, self heal has to look at each and every vdisk. For example, in my test I used a separate data disk for the changes, but the self heal list showed that both the data disks and the OS disks needed healing.


Version-Release number of selected component (if applicable):
glusterfs 3.7.1
ovirt 3.5
rhel 7.1

How reproducible:
every time

Steps to Reproduce:
1. establish a virt environment with 10-30 VMs (full copies, not clones)
2. take one gluster/ovirt node down
3. add data to a number of the vms
4. bring the node back online
5. record the time taken for the self heal to complete (a timing sketch follows this list)
6. note the cpu consumption during the diff heal and its impact on running vms
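A rough way to time step 5 from the command line, assuming a hypothetical volume named "vmstore" (not from this report): poll heal info until every brick reports zero pending entries.

    # rough timing sketch; "vmstore" is a hypothetical volume name
    start=$(date +%s)
    # loop until all bricks report "Number of entries: 0"
    while [ "$(gluster volume heal vmstore info \
              | awk '/Number of entries:/ {sum += $4} END {print sum + 0}')" -gt 0 ]; do
        sleep 30
    done
    echo "self heal completed in $(( $(date +%s) - start )) seconds"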

Actual results:
The environment had 34 GB of change and returned to full redundancy in 2 hours.

Problems:
1. Even though it doesn't reflect all of the work undertaken, looked at simplistically a 2 hour recovery for 34 GB of change is < 5 MB/s.
2. cpu consumption of the diff self heal is high and reduces the cpu available to running VMs. Most of the cpu consumed during self heal is within the glusterfsd (brick) process, not the glusterfs self-heal daemon (shd), and it is mostly kernel time (sys%) rather than user time, which further impacts cpu availability for the VMs.
3. during self heal, even with cpu available, vdisks being healed send the owning VM into a non-responsive state ('?' symbol).
4. if the VM running the oVirt manager goes non-responsive, you lose management of the cluster.
5. Waiting hours to return to full redundancy is problematic for administration and maintenance of an ovirt/gluster environment.


Expected results:
1. full redundancy should be restored within half an hour as a goal, or a specific recovery rate should be offered (i.e. XX MB/s, with a sensible default)
2. the amount of change (self heal work) should be visible to the admin - worst case at the cli, best case in the ovirt ui (i.e. XX GB to heal, YY GB done, plus the files affected) - see the sketch after this list for what is visible today
3. self heal should not constrain the cpu available to running VMs
4. VMs should never go into a non-responsive state because of self heal
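For context, a couple of existing AFR tunables already influence heal visibility and aggressiveness; the sketch below is illustrative only (the volume name "vmstore" and the values are assumptions, and none of this is the explicit MB/s rate limit or GB-level progress reporting requested above):

    # "vmstore" and the values below are hypothetical examples
    # per-brick count of entries still pending heal (coarse visibility: counts, not GB)
    gluster volume heal vmstore statistics heal-count

    # blocks per file healed in parallel by data self-heal (lower = gentler)
    gluster volume set vmstore cluster.self-heal-window-size 1

    # number of files a client will heal in the background before blocking
    gluster volume set vmstore cluster.background-self-heal-count 8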



Additional info:

Comment 1 SATHEESARAN 2015-06-10 05:58:04 UTC
Changing the component to REPLICATE as this is a self-heal related issue

Comment 3 Ravishankar N 2016-03-18 01:04:26 UTC
For VM use cases, the recommendation as of today is to enable sharding to improve self-heal performance: http://blog.gluster.org/2015/12/introducing-shard-translator/

There is work in progress on master to improve self-heal performance via multi-threaded self-heal (BZ 1221737) and granular entry self-heal (BZ 1269461).
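A minimal sketch of enabling sharding on an existing volume, assuming a hypothetical volume named "vmstore" (the 64MB block size is only an example; sharding applies to files created after it is enabled, so existing vdisks would need to be copied back in):

    # "vmstore" and the block size are examples, not from this report
    gluster volume set vmstore features.shard on
    gluster volume set vmstore features.shard-block-size 64MB

    # on releases where the work referenced above has landed (not 3.7.1),
    # the corresponding options are assumed to be:
    gluster volume set vmstore cluster.shd-max-threads 4
    gluster volume set vmstore cluster.granular-entry-heal on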

Comment 4 Kaushal 2017-03-08 10:52:33 UTC
This bug is getting closed because GlusterFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.