Description of problem:
[NMO] Placing more than one master node into maintenance should be prevented or not allowed. The current workflow lets the user attempt to place more than one master node into maintenance. The node will not actually go into maintenance, because the etcd-quorum-guard prevents etcd quorum loss, but the attempt itself should be blocked. It probably needs to be prevented in the UI (this bug) and/or in the API; a duplicate bug will be filed for the API.

Version:
[root@sealusa6 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-04-21-210658   True        False         98m     Cluster version is 4.4.0-0.nightly-2020-04-21-210658

Steps to Reproduce:
1. Deploy a cluster with CNV (which includes the node-maintenance-operator).
2. Select Compute -> Nodes.
3. Select the 3-dot menu on the master-0-0 node and choose "Start Maintenance". Wait until master-0-0 is successfully placed under maintenance.
4. Select the 3-dot menu on the master-0-1 node and choose "Start Maintenance". It will not go into maintenance; the understanding is that this is because of the etcd-quorum-guard.

Actual results:
The UI attempts to put the master-0-1 node into maintenance and eventually throws the warning "Workloads failing to move drain did not complete after 1m0s" (see attached image). This is because the etcd-quorum-guard-xxxxxx pod is preventing the drain. (A CLI/API equivalent of this flow is sketched below.)

Expected results:
The UI/API should prevent the user from trying to place a second and/or third master node into maintenance.

Additional info:
From the etcd-quorum-guard documentation (even though the component has moved to the machine-config-operator [1]), taken from [2]:

"The etcd Quorum Guard ensures that quorum is maintained for etcd for OpenShift. For the etcd cluster to remain usable, we must maintain quorum, which is a majority of all etcd members. For example, an etcd cluster with 3 members (i.e. a 3 master deployment) must have at least 2 healthy etcd members to meet the quorum limit. There are situations where 2 etcd members could be down at once:
- a master has gone offline and the MachineConfig Controller (MCC) tries to roll out a new MachineConfig (MC) by rebooting masters
- the MCC is doing a MachineConfig rollout and doesn't wait for etcd on the previous master to become healthy again before rebooting the next master
In short, we need a way to ensure that a drain on a master is not allowed to proceed if the reboot of the master would cause etcd quorum loss."

[1] https://github.com/openshift/machine-config-operator/blob/master/install/0000_80_machine-config-operator_07_etcdquorumguard_deployment.yaml
[2] https://github.com/openshift/etcd-quorum-guard
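For reference, the same flow can be attempted against the API directly by creating a NodeMaintenance CR for the node. This is a minimal sketch based on my understanding of the NMO API, not taken from this bug; the apiVersion in particular may differ between NMO releases (older builds used kubevirt.io/v1alpha1).

# Sketch only: request maintenance on master-0-0 via the NodeMaintenance CRD
cat <<'EOF' | oc apply -f -
apiVersion: nodemaintenance.kubevirt.io/v1beta1
kind: NodeMaintenance
metadata:
  name: maintenance-master-0-0
spec:
  nodeName: master-0-0
  reason: "Testing maintenance of a master node"
EOF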
Created attachment 1680968 [details] Screen Image of warning
This should be enforced at the NMO level, not the UI.
If there is a use case for more than (3) masters, perhaps we need a formula for how many can be put into maintenance at once, so we allow no more than (total - 2). A sketch of that arithmetic is below.
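To make that concrete, here is a minimal sketch of the quorum arithmetic (my own illustration, not part of the NMO implementation). It uses floor(N/2)+1 as the etcd quorum size, which coincides with the (total - 2) suggestion for a 3-master cluster:

# Sketch only: how many masters could be in maintenance at once
# without risking etcd quorum loss.
N=3                              # total master/etcd members
QUORUM=$(( N / 2 + 1 ))          # etcd quorum: 2 for a 3-master cluster
MAX_IN_MAINTENANCE=$(( N - QUORUM ))
echo "masters=$N quorum=$QUORUM max_in_maintenance=$MAX_IN_MAINTENANCE"   # -> 1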
Changing the subject, I don't believe there are any other ways that maintenance mode can "fail" other than taking longer than expected. So it would be worth making sure we have UI in place to handle it gracefully.
(In reply to Andrew Beekhof from comment #4)
> Changing the subject, I don't believe there are any other ways that
> maintenance mode can "fail" other than taking longer than expected.
> So it would be worth making sure we have UI in place to handle it gracefully.

Right, IIUC this relates to proper maintenance progress tracking (https://bugzilla.redhat.com/show_bug.cgi?id=1812354) - as mentioned there, NMO does not report pod counts frequently enough.
Unfortunately there was no capacity this sprint to do this. Moving to upcoming.
Let's talk about what we can do in this area for 4.6.
Moving to upcoming sprint as there is no action available until dependent bug is fixed.
I think this might have been implemented? It appears that in 4.6 a user with (3) masters is now prevented from putting more than (1) into maintenance.
@Andrew, what do you expect the UI to do here? Should we disable the Start Maintenance action if there's already some master in maintenance?
Created attachment 1716255 [details] Failed maintenance on second master

I'm attaching a screenshot so you can take a look at how the UI looks when a user tries to start maintenance on a second master.
I like the new warning, looks good!

(In reply to Rastislav Wagner from comment #11)
> @Andrew, what do you expect the UI to do here? Should we disable the Start
> Maintenance action if there's already some master in maintenance?

In later builds the NMO will intercept and reject the create operation (webhook). So the ask for the UI is to look for and handle that rejection "somehow".
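For whoever wires up the UI handling, here is a rough client-side equivalent of the check the webhook would enforce. This is a sketch under my own assumptions; the nodemaintenances resource name and the label selector are not confirmed in this bug and may differ per release.

# Sketch only: refuse to request maintenance for a second master if another
# master already has a NodeMaintenance CR.
NODE=master-0-1
MASTERS=$(oc get nodes -l node-role.kubernetes.io/master -o name)
IN_MAINTENANCE=$(oc get nodemaintenances -o jsonpath='{.items[*].spec.nodeName}')
for n in $IN_MAINTENANCE; do
  if echo "$MASTERS" | grep -q "node/$n"; then
    echo "Master $n is already under maintenance; not requesting maintenance on $NODE" >&2
    exit 1
  fi
done
echo "No master currently under maintenance; safe to request maintenance on $NODE"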
Based on Comment 13, moving to 4.7
Created attachment 1718164 [details] newer message CNV 2.5 NMO v0.7.0
This is no longer an issue, and the latest message seems appropriate with CNV 2.5 and NMO v0.7.0. We can close/verify.
As per comment 18, closing.