Bug 1196433 - [RFE] [HC] entry into maintenance mode should consider whether self-heal is ongoing
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: RFEs
Version: ---
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: medium
Target Milestone: ovirt-4.1.0-beta
Target Release: 4.1.0.2
Assignee: Ramesh N
QA Contact: RamaKasturi
URL:
Whiteboard:
Duplicates: 1196438 (view as bug list)
Depends On: 1205641
Blocks: Generic_Hyper_Converged_Host Gluster-HC-2 1415664
 
Reported: 2015-02-26 00:16 UTC by Paul Cuzner
Modified: 2017-03-27 11:07 UTC (History)
CC: 11 users

Fixed In Version: ovirt-engine-4.1.0-0.4
Doc Type: Enhancement
Doc Text:
Previously, in GlusterFS, if a node went down and then returned, GlusterFS would automatically initiate a self-heal operation. During this operation, which could be time-consuming, a subsequent maintenance mode action within the same GlusterFS replica set could result in a split brain scenario. In this release, if a Gluster host is performing a self-heal activity, administrators will not be able to move it into maintenance mode. In extreme cases, administrators can use the force option to forcefully move a host into maintenance mode.
Clone Of:
Environment:
Last Closed: 2017-03-27 11:07:53 UTC
oVirt Team: Gluster
rule-engine: ovirt-4.1+
bmcclain: planning_ack+
sabose: devel_ack+
rule-engine: testing_ack+




Links
System ID Priority Status Summary Last Updated
oVirt gerrit 43773 master MERGED engine: check gluster params while moving Host to maintenance 2016-09-20 05:48:27 UTC
oVirt gerrit 59102 master NEW webadmin: Enable force option in host maintenance 2016-09-23 13:14:32 UTC

Description Paul Cuzner 2015-02-26 00:16:19 UTC
Description of problem:
In the hyperconverged use case with GlusterFS, when a node goes down and then returns, GlusterFS initiates an automatic self-heal. This operation may take time to bring the bricks back into sync, and during that window a subsequent maintenance mode action within the same GlusterFS replica set could result in a split-brain scenario.

This RFE seeks to link the status of the volume to the maintenance mode workflow, i.e.:
- if the volume is not in self-heal, maintenance mode continues as before
- if the volume is healing, and the maintenance mode request is for a node taking part in the self-heal, the request should be denied with a message to the admin
- if the volume is healing, but the maintenance mode request is for a node not participating in self-heal operations, the request can continue
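The three rules above amount to a simple decision function. A minimal sketch (hypothetical data model; `healing` and `healing_nodes` stand in for whatever gluster heal query the engine would actually consult):

```python
from dataclasses import dataclass

@dataclass
class Volume:
    # Hypothetical model: whether any self-heal is running, and which
    # nodes hold bricks taking part in that heal.
    healing: bool
    healing_nodes: set

def can_enter_maintenance(host, volume):
    """Apply the three rules above; returns (allowed, message)."""
    if not volume.healing:
        # Rule 1: no self-heal in progress, maintenance proceeds as before.
        return True, ""
    if host in volume.healing_nodes:
        # Rule 2: the host participates in the heal, deny with a message.
        return False, "self-heal in progress on %s, retry after heal completes" % host
    # Rule 3: the volume is healing but this host is not involved.
    return True, ""
```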

Background operations like self-heal and rebalance need greater visibility in the oVirt UI.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Sahina Bose 2015-03-03 07:34:18 UTC
Is there a way to know when self-heal is in progress for a volume?

Using "gluster volume status all tasks" we know when rebalance/remove-brick is in progress. Is there something similar for self-heal?

Comment 2 Sahina Bose 2015-03-25 11:23:16 UTC
Added Pranith's reply -
"I think gluster volume heal statistics command tells whether self-heal is going on or not. But once every 10 minutes it will show in-progress."
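Another signal, and the one the eventual fix reports on, is the per-brick count of unsynced entries from `gluster volume heal <VOL> info`. A rough sketch of scraping that output (text layout assumed from typical GlusterFS releases; a real integration would more likely consume gluster's XML output):

```python
def parse_heal_info(output):
    """Map each brick to its unsynced-entry count from
    `gluster volume heal <VOL> info` text output (layout assumed)."""
    counts = {}
    brick = None
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Brick "):
            # e.g. "Brick server1:/gluster_bricks/data/data"
            brick = line[len("Brick "):]
        elif line.startswith("Number of entries:") and brick is not None:
            counts[brick] = int(line.rsplit(":", 1)[1])
            brick = None
    return counts

def heal_in_progress(output):
    # Any brick with unsynced entries means a heal is still pending.
    return any(parse_heal_info(output).values())
```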

Comment 3 Red Hat Bugzilla Rules Engine 2015-10-19 11:00:01 UTC
Target release should be set once a package build is known to fix an issue. Since this bug has not been modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

Comment 4 Sahina Bose 2016-03-31 14:12:33 UTC
*** Bug 1196438 has been marked as a duplicate of this bug. ***

Comment 5 Sandro Bonazzola 2016-05-02 10:07:49 UTC
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has been already released and bug is not ON_QA.

Comment 6 Yaniv Lavi 2016-05-23 13:22:26 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 7 Yaniv Lavi 2016-05-23 13:25:08 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 8 Emma Heftman 2017-01-22 10:02:09 UTC
Hi Ramesh
Will this feature affect the UI in any way? Does this change need to be described in the Administration Guide, possibly in the Gluster chapter:

https://access.redhat.com/documentation/en/red-hat-virtualization/4.0/single/administration-guide/#sect-Cluster_Utilization

Comment 9 Ramesh N 2017-01-23 05:19:36 UTC
(In reply to emma heftman from comment #8)
> Hi Ramesh
> Will this feature affect the UI in any way? Does this change need to be
> described in the Administration Guide, possibly in the Gluster chapter:
> 
> https://access.redhat.com/documentation/en/red-hat-virtualization/4.0/single/
> administration-guide/#sect-Cluster_Utilization

Yes, it does affect the UI. You will see the following new options in the host maintenance dialog box. They are shown only when the host supports Gluster services.

1. Ignore Gluster Quorum and Self-Heal validations
   By default, oVirt/RHEV-M checks that Gluster quorum is not lost when you move the host to maintenance. It also checks that no self-heal activity would be affected by moving the host to maintenance. The user can skip these checks by selecting this option. It should be used only in the rare situation where there is no other way to perform maintenance on the node.

2. Stop Gluster service
   This option can be used if the user wants to stop all Gluster services while moving the host to maintenance.
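The quorum part of the first check can be pictured with a small sketch. This is a simplified fixed-majority model for illustration only; the real engine consults gluster's own server-quorum settings (e.g. cluster.server-quorum-type) rather than this rule:

```python
def quorum_lost(replica_set_bricks, up_hosts, host_to_stop):
    """Return True if stopping `host_to_stop` would leave this replica set
    without a strict majority of its bricks up.

    Bricks are 'host:/path' strings; `up_hosts` is the set of hosts whose
    glusterd is currently running. Simplified model for illustration.
    """
    remaining = [b for b in replica_set_bricks
                 if b.split(":")[0] in up_hosts
                 and b.split(":")[0] != host_to_stop]
    return len(remaining) <= len(replica_set_bricks) // 2
```

With a replica-3 set and all hosts up, stopping one host keeps 2 of 3 bricks and quorum survives; with one brick already down, stopping a second host leaves 1 of 3 and the check fires.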

Comment 10 RamaKasturi 2017-03-21 13:44:27 UTC
Verified and works fine with build Red Hat Virtualization Manager Version: 4.1.1.2-0.1.el7.

oVirt does not allow the host to be moved to maintenance if there are any unsynced entries present in its bricks. It throws the following error: "Error while executing action: Cannot switch the following Host(s) to Maintenance mode: host_name.
Unsynced entries present in following gluster bricks: [<gluster_ip>:/gluster_bricks/data/data, <gluster_ip>:/gluster_bricks/engine/engine, <gluster_ip>:/gluster_bricks/vmstore/vmstore]."

When one of the bricks in the volume is down and the user tries to move another node to maintenance by stopping glusterd services, oVirt displays the error: "Error while executing action: Cannot switch the following Host(s) to Maintenance mode: <hostname>. Gluster quorum will be lost for the following Volumes: data,vmstore,engine."

When one of the nodes is already in maintenance with glusterd services stopped, oVirt does not allow you to move another node into maintenance, since quorum would be lost for the volumes.

oVirt allows the user to move more than one node to maintenance without stopping glusterd services, as all the bricks remain up on the nodes and quorum for the volumes is not lost in this case.

oVirt allows the user to move a node to maintenance even though self-heal is in progress if the user skips the quorum and self-heal validations by checking "Ignore quorum and self-heal validations".

