Bug 1826908

Summary: [OCP4.4][NMO] Need to provide a good UI experience when putting nodes into maintenance mode fails
Product: OpenShift Container Platform
Reporter: mlammon
Component: Console Metal3 Plugin
Assignee: Jiri Tomasek <jtomasek>
Status: CLOSED CURRENTRELEASE
QA Contact: mlammon
Severity: low
Priority: low
Version: 4.4
CC: abeekhof, achernet, aos-bugs, gharden, jtomasek, msluiter, rawagner, tjelinek
Keywords: Triaged
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Cloned As: 1826914 (view as bug list)
Bug Depends On: 1826914
Last Closed: 2020-10-01 16:46:40 UTC
Attachments:
- Screen Image of warning
- Failed maintenance on second master
- newer message CNV 2.5 NMO v0.7.0

Description mlammon 2020-04-22 18:20:57 UTC
Description of problem:
[NMO] Placing more than one master node into maintenance should be prevented. The current workflow
lets the user attempt to put a second master node into maintenance, but the operation never
completes because the etcd quorum guard blocks it to prevent etcd quorum loss.

This should probably be prevented in the UI (this bug) and/or in the API. A duplicate bug
will be filed for the API side.

Tested version:

[root@sealusa6 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-04-21-210658   True        False         98m     Cluster version is 4.4.0-0.nightly-2020-04-21-210658  

Steps to reproduce:
1. Deploy a cluster with CNV (which includes the node-maintenance-operator)
2. Select Compute -> Nodes
3. Select the 3-dot menu on the master-0-0 node and choose "Start Maintenance"

< wait for master-0-0 to successfully go Under maintenance >

4. Select the 3-dot menu on the master-0-1 node and choose "Start Maintenance"

< master-0-1 will not go into maintenance because of the etcd-quorum-guard >
The understanding is that the quorum guard blocks the drain; see the sketch below for roughly what the UI action does at the API level.
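For reference, a minimal sketch of what "Start Maintenance" corresponds to on the API side, assuming the NodeMaintenance CRD shipped by the node-maintenance-operator (the API group/version and field names below are assumptions and may differ between NMO releases):

# Hypothetical equivalent of clicking "Start Maintenance" on master-0-1
cat <<EOF | oc apply -f -
apiVersion: nodemaintenance.kubevirt.io/v1beta1
kind: NodeMaintenance
metadata:
  name: nodemaintenance-master-0-1
spec:
  nodeName: master-0-1
  reason: "Manual maintenance from the console"
EOF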

Actual results:
The console attempts to put the master-0-1 node into maintenance and eventually throws the warning
"Workloads failing to move
drain did not complete after 1m0s" < see attached image >

This is because etcd-quorum-guard-xxxxxx is preventing the drain.
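One way to confirm that the quorum guard is what blocks the drain (a sketch; the namespace and resource names are assumptions based on the 4.4 layout referenced in [1]):

# The quorum guard's PodDisruptionBudget allows zero disruptions once a master is already down,
# which is what makes the node drain hang.
oc get deployment etcd-quorum-guard -n openshift-machine-config-operator
oc get pdb etcd-quorum-guard -n openshift-machine-config-operator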


Expected results:
The UI (this bug) and/or the API should prevent the user from attempting to put a second or third master node into maintenance.



Additional info:

Looking at the docs (the quorum guard has since moved to the machine-config-operator [1]):

Taken from [2]
"The etcd Quorum Guard ensures that quorum is maintained for etcd for OpenShift.

For the etcd cluster to remain usable, we must maintain quorum, which is a majority of all etcd members. For example, an etcd cluster with 3 members (i.e. a 3 master deployment) must have at least 2 healthy etcd members to meet the quorum limit.

There are situations where 2 etcd members could be down at once:

- a master has gone offline and the MachineConfig Controller (MCC) tries to roll out a new MachineConfig (MC) by rebooting masters
- the MCC is doing a MachineConfig rollout and doesn't wait for the etcd on the previous master to become healthy again before rebooting the next master

In short, we need a way to ensure that a drain on a master is not allowed to proceed if the reboot of the master would cause etcd quorum loss."


[1] https://github.com/openshift/machine-config-operator/blob/master/install/0000_80_machine-config-operator_07_etcdquorumguard_deployment.yaml
[2] https://github.com/openshift/etcd-quorum-guard

Comment 1 mlammon 2020-04-22 18:26:17 UTC
Created attachment 1680968 [details]
Screen Image of warning

Comment 2 Andrew Beekhof 2020-04-23 00:23:49 UTC
This should be enforced at the NMO level, not the UI.

Comment 3 mlammon 2020-04-23 14:03:47 UTC
If there is a use case for more than (3) masters, perhaps we need to implement a formula for how many masters can be put into maintenance at once, e.g. allow no more than (total - 2).
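For reference, the quorum arithmetic from [2] worked out as a small sketch (plain arithmetic, not an agreed-upon rule for the operator):

quorum              = floor(total_members / 2) + 1
safe_in_maintenance = total_members - quorum

3 masters: quorum = 2, so at most 1 master can be in maintenance
5 masters: quorum = 3, so at most 2 masters can be in maintenance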

Comment 4 Andrew Beekhof 2020-05-13 12:39:09 UTC
Changing the subject: I don't believe there are any other ways that maintenance mode can "fail" other than taking longer than expected.
So it would be worth making sure we have UI in place to handle that gracefully.

Comment 5 Jiri Tomasek 2020-05-15 10:26:38 UTC
(In reply to Andrew Beekhof from comment #4)
> Changing the subject: I don't believe there are any other ways that
> maintenance mode can "fail" other than taking longer than expected.
> So it would be worth making sure we have UI in place to handle that gracefully.

Right, IIUC this relates to proper maintenance progress tracking (https://bugzilla.redhat.com/show_bug.cgi?id=1812354); as mentioned there, NMO does not report pod counts frequently enough.

Comment 7 Tomas Jelinek 2020-05-29 08:30:03 UTC
Unfortunately there was no capacity this sprint to do this. Moving to upcoming.

Comment 8 Andrew Beekhof 2020-06-10 13:00:41 UTC
Let's talk about what we can do in this area for 4.6.

Comment 9 Jiri Tomasek 2020-07-10 08:46:17 UTC
Moving to upcoming sprint as there is no action available until dependent bug is fixed.

Comment 10 mlammon 2020-09-10 16:03:40 UTC
I think this might have been implemented? It appears that in 4.6 the UI now prevents a user with (3) masters from putting more than (1) into maintenance.

Comment 11 Rastislav Wagner 2020-09-24 08:58:03 UTC
@Andrew, what do you expect the UI to do here? Should we disable the Start Maintenance action if there is already a master in maintenance?

Comment 12 Rastislav Wagner 2020-09-24 08:59:29 UTC
Created attachment 1716255 [details]
Failed maintenance on second master

I'm attaching a screenshot so you can see what the UI looks like when a user tries to start maintenance on a second master.

Comment 13 Andrew Beekhof 2020-09-25 04:42:51 UTC
I like the new warning, looks good!

(In reply to Rastislav Wagner from comment #11)
> @Andrew what do you expect from UI to do here ? Should we disable Start
> maintenance action if there's already some master in maintenance ?

In later builds the NMO will intercept and reject the create operation (webhook).  
So the ask for the UI is to look for and handle that rejection "somehow" (a rough sketch of what that rejection might look like is below).
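For illustration, a sketch of the kind of rejection the console would need to surface once the webhook lands (the webhook name and message text here are assumptions, not actual NMO output; only the generic "admission webhook ... denied the request" shape comes from Kubernetes itself):

# Hypothetical attempt to start maintenance on a second master once the validating webhook exists
$ oc create -f nodemaintenance-master-0-1.yaml
Error from server (Forbidden): error when creating "nodemaintenance-master-0-1.yaml":
admission webhook "nodemaintenance-validation.kubevirt.io" denied the request:
putting master-0-1 into maintenance would risk etcd quorum loss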

Comment 14 Tomas Jelinek 2020-09-25 07:42:59 UTC
Based on Comment 13, moving to 4.7

Comment 17 mlammon 2020-10-01 12:52:46 UTC
Created attachment 1718164 [details]
newer message CNV 2.5 NMO v0.7.0

Comment 18 mlammon 2020-10-01 12:54:07 UTC
This is no longer an issue, and the latest message seems appropriate with CNV 2.5 and NMO v0.7.0.
We can close/verify.

Comment 19 Tomas Jelinek 2020-10-01 16:46:40 UTC
As per comment 18, closing.