Bug 1826914
Summary: | [OCP4.4][NMO] [API] Putting more than one master node into maintenance should be prevented or not allowed. | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | mlammon |
Component: | Node Maintenance Operator | Assignee: | Marc Sluiter <msluiter> |
Status: | CLOSED ERRATA | QA Contact: | mlammon |
Severity: | medium | Docs Contact: | |
Priority: | high | ||
Version: | 4.4 | CC: | abeekhof, aos-bugs, gharden, jokerman, jtomasek, mfojtik, msluiter, scuppett, spadgett, ukalifon |
Target Milestone: | --- | Keywords: | Triaged |
Target Release: | 4.6.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
URL: | https://github.com/kubevirt/node-maintenance-operator/pull/76 | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Enhancement | |
Doc Text: |
Feature: Validate maintenance requests for master nodes.
Reason: Prevent master (etcd) quorum violation.
Result: Master nodes can only be set into maintenance if the etcd-quorum-guard PDB allows it.
|
Story Points: | --- |
Clone Of: | 1826908 | Environment: | |
Last Closed: | 2020-10-27 15:58:27 UTC | Type: | --- |
Bug Blocks: | 1826908 |
Description
mlammon
2020-04-22 18:30:40 UTC
Setting target release to current development version (4.5) for investigation. Where fixes (if any) are required/requested for prior versions, cloned BZs will be created when appropriate.

We should fix this at the operator level so that the same logic applies to CLI-driven changes.

I think that the following check should be added to the reconcile loop, prior to proceeding with taking a node to maintenance mode:

    if node_is_a_master_node(nodeName) {
        if (number_of_active_master_nodes() - 1 < quorum_size()) {
            report error, don't allow reconcile to proceed.
        }
    }

    bool node_is_a_master_node(nodeName)
        node object for nodeName has label node-role.kubernetes.io/master

    int quorum_size()
        return number_of_master_nodes() / 2 + 1

    int number_of_master_nodes()
        return number of nodes with label node-role.kubernetes.io/master

    int number_of_active_master_nodes()
        return number of nodes with label node-role.kubernetes.io/master and node.Spec.Unschedulable == False

A PR for this issue was added, https://github.com/kubevirt/node-maintenance-operator/pull/76, and is under review.

During review of the PR, Nir identified an issue: if a baremetal machine turns unhealthy, the node object for that node is deleted. This skews the number of master nodes counted and therefore alters the calculated quorum size. We must therefore also count the master nodes that are currently unhealthy or not running.

Andrew suggested using the etcd client for that task, but I was not able to copy the etcd client3 library into the vendored dependencies of node-maintenance-operator: the currently vendored Google protobuf package conflicts with the one required by cluster-etcd-operator. In addition, it is not clear how to get the certificates required to operate the etcd client3. This approach is therefore blocked by the task of transitioning to Go module (mod) packaging.
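
For illustration, the check proposed in the pseudocode, and the skew identified during review, can be sketched in Go. This is a minimal sketch under stated assumptions, not the operator's actual code: the `Node` struct, `newMaster`, and `checkMaintenance` are hypothetical names that stand in for the real Kubernetes node objects and reconcile logic, and the five-master scenario is an invented example.

```go
package main

import "fmt"

// masterLabel is the label named in the proposed check.
const masterLabel = "node-role.kubernetes.io/master"

// Node is a hypothetical, simplified stand-in for a Kubernetes node object.
type Node struct {
	Name          string
	Labels        map[string]string
	Unschedulable bool
}

// newMaster is a hypothetical helper that builds a master node for the demo.
func newMaster(name string, unschedulable bool) Node {
	return Node{
		Name:          name,
		Labels:        map[string]string{masterLabel: ""},
		Unschedulable: unschedulable,
	}
}

// quorumSize returns the majority of the given number of masters.
func quorumSize(masters int) int {
	return masters/2 + 1
}

// checkMaintenance returns an error when putting target into maintenance
// would leave fewer schedulable masters than the quorum size.
func checkMaintenance(nodes []Node, target string) error {
	total, active, targetIsMaster := 0, 0, false
	for _, n := range nodes {
		if _, ok := n.Labels[masterLabel]; !ok {
			continue // not a master, irrelevant for quorum
		}
		total++
		if !n.Unschedulable {
			active++
		}
		if n.Name == target {
			targetIsMaster = true
		}
	}
	if !targetIsMaster {
		return nil // workers are not subject to the quorum check
	}
	if quorum := quorumSize(total); active-1 < quorum {
		return fmt.Errorf("refusing maintenance for %s: %d active masters, quorum is %d",
			target, active, quorum)
	}
	return nil
}

func main() {
	healthy := []Node{newMaster("m0", false), newMaster("m1", false), newMaster("m2", false)}
	fmt.Println("all schedulable:", checkMaintenance(healthy, "m0"))

	oneCordoned := []Node{newMaster("m0", false), newMaster("m1", true), newMaster("m2", false)}
	fmt.Println("one already cordoned:", checkMaintenance(oneCordoned, "m0"))

	// Skew: a five-master cluster whose two unhealthy node objects were deleted
	// yields exactly the same node list as the healthy three-master cluster,
	// so the check computes a quorum of 2 instead of the real 3.
	skewed := healthy
	fmt.Println("five masters, two deleted:", checkMaintenance(skewed, "m0"))
}
```

The last case shows why counting node objects alone is not enough: from the node list, the degraded five-master cluster cannot be told apart from a healthy three-master one, so the request is accepted although only two of the five etcd members would remain.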
There is an alternative solution, which is to get the number of disabled master nodes from the pod disruption budget, but that is an ongoing discussion that has not been concluded.

Fixed in https://github.com/kubevirt/node-maintenance-operator/pull/107 and https://github.com/kubevirt/node-maintenance-operator/pull/111

This is resolved. We can move to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
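
For reference, the pod-disruption-budget approach discussed in the comments, which the Doc Text describes as the resulting behavior ("Master nodes can only be set into maintenance if the etcd-quorum-guard PDB allows it"), might look roughly like the Go sketch below. It is an assumption about the shape of the check, not the code from the linked PRs: `pdbStatus` is a hypothetical local type modelled on the `disruptionsAllowed` field of a PodDisruptionBudget status, and `masterMaintenanceAllowed` is an invented name.

```go
package main

import "fmt"

// pdbStatus is a hypothetical, trimmed-down mirror of the status of a
// PodDisruptionBudget; only the field used by this sketch is modelled.
type pdbStatus struct {
	DisruptionsAllowed int32 // number of pods that may currently be disrupted
}

// masterMaintenanceAllowed reports whether a node may enter maintenance.
// Non-master nodes are always allowed; a master is allowed only while the
// (assumed) etcd-quorum-guard budget still permits at least one disruption.
func masterMaintenanceAllowed(isMaster bool, status pdbStatus) bool {
	if !isMaster {
		return true
	}
	return status.DisruptionsAllowed >= 1
}

func main() {
	fmt.Println("budget has room:", masterMaintenanceAllowed(true, pdbStatus{DisruptionsAllowed: 1}))
	fmt.Println("budget exhausted:", masterMaintenanceAllowed(true, pdbStatus{DisruptionsAllowed: 0}))
	fmt.Println("worker node:", masterMaintenanceAllowed(false, pdbStatus{DisruptionsAllowed: 0}))
}
```

Relying on the budget avoids counting node objects, so a deleted node object for an unhealthy master no longer changes the outcome of the check.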