Bug 1196433

Summary:	[RFE] [HC] entry into maintenance mode should consider whether self-heal is ongoing
Product:	[oVirt] ovirt-engine	Reporter:	Paul Cuzner <pcuzner>
Component:	RFEs	Assignee:	Ramesh N <rnachimu>
Status:	CLOSED CURRENTRELEASE	QA Contact:	RamaKasturi <knarra>
Severity:	medium	Docs Contact:
Priority:	high
Version:	---	CC:	bmcclain, bugs, eheftman, gklein, lsurette, pkarampu, rbalakri, rnachimu, sabose, srevivo, ykaul
Target Milestone:	ovirt-4.1.0-beta	Keywords:	FutureFeature, Improvement
Target Release:	4.1.0.2	Flags:	rule-engine: ovirt-4.1+ bmcclain: planning_ack+ sabose: devel_ack+ rule-engine: testing_ack+
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	ovirt-engine-4.1.0-0.4	Doc Type:	Enhancement
Doc Text:	Previously, in GlusterFS, if a node went down and then returned, GlusterFS would automatically initiate a self-heal operation. During this operation, which could be time-consuming, a subsequent maintenance mode action within the same GlusterFS replica set could result in a split brain scenario. In this release, if a Gluster host is performing a self-heal activity, administrators will not be able to move it into maintenance mode. In extreme cases, administrators can use the force option to forcefully move a host into maintenance mode.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-03-27 11:07:53 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	Gluster	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1205641
Bug Blocks:	1177771, 1277939, 1415664

Description Paul Cuzner 2015-02-26 00:16:19 UTC

Description of problem:
In the hyperconverged use case with glusterfs, when a node goes down and then returns, glusterfs initiates an automatic self-heal. This operation may take time to bring the bricks back into sync, and during that time a subsequent maintenance mode action within the same glusterfs replica set could result in a split-brain scenario.

This RFE seeks to link the status of the volume to the maintenance mode workflow.
i.e.
- if the volume is not in self-heal, maintenance mode continues as before
- if the volume is healing and the maintenance mode request is for a node taking part in self-heal, the request should be denied with a message to the admin
- if the volume is healing but the maintenance mode request is for a node not participating in self-heal operations, the request can continue
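
The three cases above can be sketched as a small decision function. This is only an illustration of the requested behaviour, not oVirt engine code: the `Volume` class, its `healing_hosts` field, and `can_enter_maintenance` are hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    """Hypothetical view of a Gluster volume (not an oVirt engine class)."""
    name: str
    # Hosts whose bricks currently take part in a self-heal (assumed input).
    healing_hosts: set = field(default_factory=set)


def can_enter_maintenance(host, volumes):
    """Return (allowed, message) for a maintenance request on `host`."""
    # Only volumes where this host participates in a heal block the request.
    blocking = sorted(v.name for v in volumes if host in v.healing_hosts)
    if blocking:
        message = (
            f"Cannot move {host} to maintenance: "
            f"self-heal in progress on {', '.join(blocking)}"
        )
        return False, message
    # No heal at all, or a heal that does not involve this host.
    return True, ""
```

A host that appears in no `healing_hosts` set is allowed through even while other nodes are healing, which matches the third case.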

Background operations like self-heal and rebalance need greater visibility in the oVirt UI.


Comment 1 Sahina Bose 2015-03-03 07:34:18 UTC

Is there a way to know when self-heal is going on for a volume?

Using "gluster volume status all tasks", we know when rebalance/remove-brick is going on. Is there something similar for self-heal?
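
One way such a check could be approached is to count pending heal entries per brick from the output of `gluster volume heal <VOLNAME> info`. The sketch below only parses text that has already been captured. It assumes that each brick section starts with a `Brick <host>:<path>` line and ends with a `Number of entries: <N>` line; it is not what the engine actually does, and the function names are hypothetical.

```python
def unsynced_entries(heal_info_output):
    """Map each brick to its pending heal entry count (assumed layout)."""
    counts = {}
    brick = None
    for raw in heal_info_output.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            brick = line[len("Brick "):]
        elif line.startswith("Number of entries:") and brick is not None:
            value = line.split(":", 1)[1].strip()
            # Skip non-numeric counts, e.g. for a brick that is unreachable.
            if value.isdigit():
                counts[brick] = int(value)
            brick = None
    return counts


def heal_pending(heal_info_output):
    """True if any brick reports at least one unsynced entry."""
    return any(n > 0 for n in unsynced_entries(heal_info_output).values())
```

Keeping the parser separate from the command invocation makes it easy to test against captured output without a running Gluster cluster.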

Comment 2 Sahina Bose 2015-03-25 11:23:16 UTC

Added Pranith's reply -
"I think gluster volume heal statistics command tells whether self-heal is going on or not. But once every 10 minutes it will show in-progress."

Comment 3 Red Hat Bugzilla Rules Engine 2015-10-19 11:00:01 UTC

Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 4 Sahina Bose 2016-03-31 14:12:33 UTC

*** Bug 1196438 has been marked as a duplicate of this bug. ***

Comment 5 Sandro Bonazzola 2016-05-02 10:07:49 UTC

Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has been already released and bug is not ON_QA.

Comment 6 Yaniv Lavi 2016-05-23 13:22:26 UTC

oVirt 4.0 beta has been released, moving to RC milestone.

Comment 7 Yaniv Lavi 2016-05-23 13:25:08 UTC

oVirt 4.0 beta has been released, moving to RC milestone.

Comment 8 Emma Heftman 2017-01-22 10:02:09 UTC

Hi Ramesh
Will this feature affect the UI in any way? Does this change need to be described in the Administration Guide, possibly in the Gluster chapter:

https://access.redhat.com/documentation/en/red-hat-virtualization/4.0/single/administration-guide/#sect-Cluster_Utilization

Comment 9 Ramesh N 2017-01-23 05:19:36 UTC

(In reply to emma heftman from comment #8)
> Hi Ramesh
> Will this feature affect the UI in any way? Does this change need to be
> described in the Administration Guide, possibly in the Gluster chapter:
> 
> https://access.redhat.com/documentation/en/red-hat-virtualization/4.0/single/
> administration-guide/#sect-Cluster_Utilization

Yes, it does affect the UI. You will see the following new options in the host maintenance dialog box. They are shown only when the host supports Gluster services.

1. Ignore Gluster Quorum and Self-Heal validations
   By default, oVirt/RHEV-M checks that gluster quorum is not lost when you move the host to maintenance. It also checks that no self-heal activity would be affected by moving the host to maintenance. The user can skip these checks by selecting this option. It should be used only in rare situations when there is no other way to do maintenance activity on the node.

2. Stop Gluster service
   This option can be used if the user wants to stop all gluster services while moving the host to maintenance.

Comment 10 RamaKasturi 2017-03-21 13:44:27 UTC

Verified and works fine with build Red Hat Virtualization Manager Version: 4.1.1.2-0.1.el7

oVirt does not allow the host to be moved to maintenance if there are any unsynced entries present in its bricks. It shows the following error: "Error while executing action: Cannot switch the following Host(s) to Maintenance mode: host_name.
Unsynced entries present in following gluster bricks: [<gluster_ip>:/gluster_bricks/data/data, <gluster_ip>:/gluster_bricks/engine/engine, <gluster_ip>:/gluster_bricks/vmstore/vmstore]."

When one of the bricks in the volume is down and the user tries to move another node to maintenance with glusterd services stopped, oVirt displays the error "Error while executing action: Cannot switch the following Host(s) to Maintenance mode: <hostname>.Gluster quorum will be lost for the following Volumes: data,vmstore,engine."

When one of the nodes is already in maintenance with glusterd services stopped, oVirt does not allow you to move another node into maintenance, since quorum would be lost for the volumes.

oVirt allows the user to move more than one node to maintenance without stopping glusterd services, since all the bricks remain up on the nodes and quorum for the volumes is not lost in this case.

oVirt allows the user to move a node to maintenance even though self-heal is going on, if the user bypasses the quorum and self-heal validations by checking "Ignore quorum and self-heal validations".
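
The verified behaviour can be summarised in a simplified sketch. It is an illustration under stated assumptions, not the engine's implementation: the `validate_maintenance` function and its data layout are hypothetical, quorum is modelled as a plain majority of the bricks in a single replica set, and the unsynced-entries check is assumed to apply whether or not gluster services are stopped.

```python
def validate_maintenance(host, volumes, stop_gluster=True, ignore_validations=False):
    """Return a list of error strings; an empty list means the move is allowed.

    `volumes` maps a volume name to a dict with (hypothetical layout):
      "bricks_up": {host: bool}  one brick per host in a single replica set
      "unsynced":  set of hosts whose bricks hold unsynced entries
    """
    if ignore_validations:
        # Corresponds to the "Ignore quorum and self-heal validations" option.
        return []
    errors = []
    for name, vol in sorted(volumes.items()):
        if host in vol["unsynced"]:
            errors.append(f"Unsynced entries present in bricks of volume {name}")
        # Bricks go down only if gluster services are stopped on the host.
        if stop_gluster and host in vol["bricks_up"]:
            remaining = sum(up for h, up in vol["bricks_up"].items() if h != host)
            needed = len(vol["bricks_up"]) // 2 + 1  # assumed: simple majority
            if remaining < needed:
                errors.append(f"Gluster quorum will be lost for volume {name}")
    return errors
```

With one brick of a three-way replica already down, stopping gluster on a second host leaves one brick of three, so the quorum check fails. The same request passes when gluster services keep running or when the validations are ignored.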