Bug 1826908

Summary: [OCP4.4][NMO] Need to provide a good UI experience when putting nodes into maintenance mode fails
Product: OpenShift Container Platform
Reporter: mlammon
Component: Console Metal3 Plugin
Assignee: Jiri Tomasek <jtomasek>
Status: CLOSED CURRENTRELEASE
QA Contact: mlammon
Severity: low
Priority: low
Version: 4.4
CC: abeekhof, achernet, aos-bugs, gharden, jtomasek, msluiter, rawagner, tjelinek
Keywords: Triaged
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Cloned As: 1826914 (view as bug list)
Bug Depends On: 1826914
Last Closed: 2020-10-01 16:46:40 UTC
Attachments:
- Screen Image of warning
- Failed maintenance on second master
- newer message CNV 2.5 NMO v0.7.0

Description mlammon 2020-04-22 18:20:57 UTC
Description of problem:
[NMO] Placing more than one master node into maintenance should be prevented. The current workflow
lets the user attempt to put a second master node into maintenance, but the operation never
completes because the etcd quorum guard blocks it to prevent etcd quorum loss.

This should probably be prevented in the UI (this bug) and/or in the API. A duplicate bug
will be filed for the API side.

Tested version:

[root@sealusa6 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-04-21-210658   True        False         98m     Cluster version is 4.4.0-0.nightly-2020-04-21-210658  

Steps to reproduce:
1. Deploy a cluster with CNV (which includes the node-maintenance-operator)
2. Select Compute -> Nodes
3. Select the 3-dot menu on the master-0-0 node and choose "Start Maintenance"

< wait for master-0-0 to successfully go Under maintenance >

4. Select the 3-dot menu on the master-0-1 node and choose "Start Maintenance"

< master-0-1 will not go into maintenance because of the etcd-quorum-guard >
The understanding is that the quorum guard blocks the drain; see the sketch below for roughly what the UI action does at the API level.
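For reference, a minimal sketch of what "Start Maintenance" corresponds to on the API side, assuming the NodeMaintenance CRD shipped by the node-maintenance-operator (the API group/version and field names below are assumptions and may differ between NMO releases):

# Hypothetical equivalent of clicking "Start Maintenance" on master-0-1
cat <<EOF | oc apply -f -
apiVersion: nodemaintenance.kubevirt.io/v1beta1
kind: NodeMaintenance
metadata:
  name: nodemaintenance-master-0-1
spec:
  nodeName: master-0-1
  reason: "Manual maintenance from the console"
EOF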

Actual results:
The console attempts to put the master-0-1 node into maintenance and eventually throws the warning
"Workloads failing to move
drain did not complete after 1m0s" < see attached image >

This is because etcd-quorum-guard-xxxxxx is preventing the drain.
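One way to confirm that the quorum guard is what blocks the drain (a sketch; the namespace and resource names are assumptions based on the 4.4 layout referenced in [1]):

# The quorum guard's PodDisruptionBudget allows zero disruptions once a master is already down,
# which is what makes the node drain hang.
oc get deployment etcd-quorum-guard -n openshift-machine-config-operator
oc get pdb etcd-quorum-guard -n openshift-machine-config-operator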


Expected results:
The UI (this bug) and/or the API should prevent the user from attempting to put a second or third master node into maintenance.



Additional info:

Looking at the docs (the quorum guard has since moved to the machine-config-operator [1]):

Taken from [2]
"The etcd Quorum Guard ensures that quorum is maintained for etcd for OpenShift.

For the etcd cluster to remain usable, we must maintain quorum, which is a majority of all etcd members. For example, an etcd cluster with 3 members (i.e. a 3 master deployment) must have at least 2 healthy etcd members to meet the quorum limit.

There are situations where 2 etcd members could be down at once:

- a master has gone offline and the MachineConfig Controller (MCC) tries to roll out a new MachineConfig (MC) by rebooting masters
- the MCC is doing a MachineConfig rollout and doesn't wait for the etcd on the previous master to become healthy again before rebooting the next master

In short, we need a way to ensure that a drain on a master is not allowed to proceed if the reboot of the master would cause etcd quorum loss."


[1] https://github.com/openshift/machine-config-operator/blob/master/install/0000_80_machine-config-operator_07_etcdquorumguard_deployment.yaml
[2] https://github.com/openshift/etcd-quorum-guard

Comment 1 mlammon 2020-04-22 18:26:17 UTC
Created attachment 1680968 [details]
Screen Image of warning

Comment 2 Andrew Beekhof 2020-04-23 00:23:49 UTC
This should be enforced at the NMO level, not the UI.

Comment 3 mlammon 2020-04-23 14:03:47 UTC
If there is a use case for more than (3) masters, perhaps we need to implement a formula for how many masters can be put into maintenance at once, e.g. allow no more than (total - 2).
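For reference, the quorum arithmetic from [2] worked out as a small sketch (plain arithmetic, not an agreed-upon rule for the operator):

quorum              = floor(total_members / 2) + 1
safe_in_maintenance = total_members - quorum

3 masters: quorum = 2, so at most 1 master can be in maintenance
5 masters: quorum = 3, so at most 2 masters can be in maintenance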

Comment 4 Andrew Beekhof 2020-05-13 12:39:09 UTC
Changing the subject: I don't believe there are any other ways that maintenance mode can "fail" other than taking longer than expected.
So it would be worth making sure we have UI in place to handle that gracefully.

Comment 5 Jiri Tomasek 2020-05-15 10:26:38 UTC
(In reply to Andrew Beekhof from comment #4)
> Changing the subject: I don't believe there are any other ways that
> maintenance mode can "fail" other than taking longer than expected.
> So it would be worth making sure we have UI in place to handle that gracefully.

Right, IIUC this relates to proper maintenance progress tracking (https://bugzilla.redhat.com/show_bug.cgi?id=1812354); as mentioned there, NMO does not report pod counts frequently enough.

Comment 7 Tomas Jelinek 2020-05-29 08:30:03 UTC
Unfortunately there was no capacity this sprint to do this. Moving to upcoming.

Comment 8 Andrew Beekhof 2020-06-10 13:00:41 UTC
Let's talk about what we can do in this area for 4.6.

Comment 9 Jiri Tomasek 2020-07-10 08:46:17 UTC
Moving to upcoming sprint as there is no action available until dependent bug is fixed.

Comment 10 mlammon 2020-09-10 16:03:40 UTC
I think this might have been implemented? It appears that in 4.6 the UI now prevents a user with (3) masters from putting more than (1) into maintenance.

Comment 11 Rastislav Wagner 2020-09-24 08:58:03 UTC
@Andrew, what do you expect the UI to do here? Should we disable the Start Maintenance action if there is already a master in maintenance?

Comment 12 Rastislav Wagner 2020-09-24 08:59:29 UTC
Created attachment 1716255 [details]
Failed maintenance on second master

I'm attaching a screenshot so you can see what the UI looks like when a user tries to start maintenance on a second master.

Comment 13 Andrew Beekhof 2020-09-25 04:42:51 UTC
I like the new warning, looks good!

(In reply to Rastislav Wagner from comment #11)
> @Andrew what do you expect from UI to do here ? Should we disable Start
> maintenance action if there's already some master in maintenance ?

In later builds the NMO will intercept and reject the create operation (webhook).  
So the ask for the UI is to look for and handle that rejection "somehow" (a rough sketch of what that rejection might look like is below).
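For illustration, a sketch of the kind of rejection the console would need to surface once the webhook lands (the webhook name and message text here are assumptions, not actual NMO output; only the generic "admission webhook ... denied the request" shape comes from Kubernetes itself):

# Hypothetical attempt to start maintenance on a second master once the validating webhook exists
$ oc create -f nodemaintenance-master-0-1.yaml
Error from server (Forbidden): error when creating "nodemaintenance-master-0-1.yaml":
admission webhook "nodemaintenance-validation.kubevirt.io" denied the request:
putting master-0-1 into maintenance would risk etcd quorum loss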

Comment 14 Tomas Jelinek 2020-09-25 07:42:59 UTC
Based on Comment 13, moving to 4.7

Comment 17 mlammon 2020-10-01 12:52:46 UTC
Created attachment 1718164 [details]
newer message CNV 2.5 NMO v0.7.0

Comment 18 mlammon 2020-10-01 12:54:07 UTC
This is no longer an issue, and the latest message seems appropriate with CNV 2.5 and NMO v0.7.0.
We can close/verify.

Comment 19 Tomas Jelinek 2020-10-01 16:46:40 UTC
As per comment 18, closing.