Bug 1864352
Summary: | Duplicate Machine Controller Deployed if Control Plane Unreachable | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Michael McCune <mimccune> |
Component: | Cloud Compute | Assignee: | Michael McCune <mimccune> |
Cloud Compute sub component: | Other Providers | QA Contact: | Milind Yadav <miyadav> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | high | ||
Priority: | high | CC: | agarcial, mgugino, miyadav, mwoodson |
Version: | 4.5 | Keywords: | NeedsTestCase |
Target Milestone: | --- | ||
Target Release: | 4.5.z | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
Cause: If a control plane kubelet becomes unreachable while its pods are still running (e.g., something stops the kubelet or otherwise prevents it from communicating with the cluster), the machine-api pods that were running on that node are rescheduled to another node.
Consequence: Multiple machine-api pods exist at once and compete to control machine-api resources in the cluster. This can result in excess instances being created and can also cause the machine-api controllers to leak instances, requiring manual intervention.
Fix: Leader election has been added to all machine-api controllers.
Result: Leader election for machine-api controllers ensures that only a single instance of each controller is allowed to manage machine-api resources. With only a single leader for each controller, excess instances are no longer created or leaked. (A minimal sketch of the leader-election pattern follows the metadata table below.)
|
Story Points: | --- |
Clone Of: | 1861896 | Environment: | |
Last Closed: | 2020-10-19 14:54:24 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1861896 | ||
Bug Blocks: |
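
The Doc Text above describes the fix in terms of leader election. As an illustration only, the sketch below shows how a controller manager can opt into leader election with sigs.k8s.io/controller-runtime; the election ID here is a placeholder and the exact wiring in the machine-api components may differ, so treat this as a minimal sketch of the pattern rather than the actual fix.

```go
// Minimal sketch (not the actual machine-api code): enabling leader election
// for a controller manager with sigs.k8s.io/controller-runtime. Only the
// replica that acquires the lock runs its controllers, so a duplicate pod
// rescheduled from an unreachable node waits instead of acting on the same
// Machine resources.
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "example-machine-controller-leader", // placeholder ID
		LeaderElectionNamespace: "openshift-machine-api",
	})
	if err != nil {
		os.Exit(1)
	}

	// Controllers would be registered with mgr here; they only start
	// reconciling after this replica wins the election.

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

With this option set, the manager takes a resource lock (a ConfigMap or Lease, depending on configuration and controller-runtime version) in the given namespace before starting its controllers; a second replica simply blocks waiting for the lock rather than reconciling the same resources.
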
Description
Michael McCune
2020-08-03 19:18:49 UTC
I am working on backporting the leader election changes to 4.5.

I have been backporting changes to address this issue on 4.5. The changes are currently on hold while we work to get a few final patches in place. I also need to figure out the proper way to catalog this in Bugzilla, as the tooling is not recognizing the current pull requests.

I think we are just waiting on patches to merge; all the backports have been proposed.

@Michael McCune, one of the PRs seems to be marked as draft, any reason we were holding that?

(In reply to Joel Speed from comment #5)
> @Michael McCune, one of the PRs seems to be marked as draft, any reason we
> were holding that?

Just a fail on my part, I meant to go back and remove draft after everything else had passed but forgot.

Validated on:

[miyadav@miyadav ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-10-09-042011   True        False         67m     Cluster version is 4.5.0-0.nightly-2020-10-09-042011

Steps:

1. Make the master node that is running the machine-controller unreachable.

[miyadav@miyadav ~]$ oc project openshift-machine-api
Now using project "openshift-machine-api" on server "https://api.miyadav-bug.qe.azure.devcluster.openshift.com:6443".

[miyadav@miyadav ~]$ oc get po -o wide
NAME                                           READY   STATUS    RESTARTS   AGE   IP            NODE                         NOMINATED NODE   READINESS GATES
cluster-autoscaler-operator-7d97c59bc7-6zs9h   2/2     Running   0          36m   10.128.0.31   miyadav-bug-qjbhm-master-2   <none>           <none>
machine-api-controllers-749d688988-wdr7j       4/4     Running   0          36m   10.129.0.22   miyadav-bug-qjbhm-master-0   <none>           <none>
machine-api-operator-7c87745bbb-5vxhj          2/2     Running   0          36m   10.129.0.16   miyadav-bug-qjbhm-master-0   <none>           <none>

[miyadav@miyadav ~]$ oc get nodes
NAME                                            STATUS     ROLES    AGE   VERSION
miyadav-bug-qjbhm-master-0                      NotReady   master   48m   v1.18.3+06da727
miyadav-bug-qjbhm-master-1                      Ready      master   48m   v1.18.3+06da727
miyadav-bug-qjbhm-master-2                      Ready      master   48m   v1.18.3+06da727
miyadav-bug-qjbhm-worker-northcentralus-6tv5g   Ready      worker   33m   v1.18.3+06da727
miyadav-bug-qjbhm-worker-northcentralus-m85r9   Ready      worker   33m   v1.18.3+06da727
miyadav-bug-qjbhm-worker-northcentralus-rkw97   Ready      worker   33m   v1.18.3+06da727

2. The machine-controller moves to a new master node.

[miyadav@miyadav ~]$ oc get pods -o wide
NAME                                           READY   STATUS        RESTARTS   AGE   IP            NODE                         NOMINATED NODE   READINESS GATES
cluster-autoscaler-operator-7d97c59bc7-6zs9h   2/2     Running       0          43m   10.128.0.31   miyadav-bug-qjbhm-master-2   <none>           <none>
machine-api-controllers-749d688988-v9wq5       4/4     Running       0          95s   10.130.0.52   miyadav-bug-qjbhm-master-1   <none>           <none>
machine-api-controllers-749d688988-wdr7j       4/4     Terminating   0          43m   10.129.0.22   miyadav-bug-qjbhm-master-0   <none>           <none>
machine-api-operator-7c87745bbb-5vxhj          2/2     Terminating   0          43m   10.129.0.16   miyadav-bug-qjbhm-master-0   <none>           <none>
machine-api-operator-7c87745bbb-g55qt          2/2     Running       0          95s   10.130.0.51   miyadav-bug-qjbhm-master-1   <none>           <none>

3. Scale the machineset.

[miyadav@miyadav ~]$ oc scale machineset miyadav-bug-qjbhm-worker-northcentralus --replicas 5

[miyadav@miyadav ~]$ oc get machines
NAME                                            PHASE     TYPE              REGION           ZONE   AGE
miyadav-bug-qjbhm-master-0                      Running   Standard_D8s_v3   northcentralus          67m
miyadav-bug-qjbhm-master-1                      Running   Standard_D8s_v3   northcentralus          67m
miyadav-bug-qjbhm-master-2                      Running   Standard_D8s_v3   northcentralus          67m
miyadav-bug-qjbhm-worker-northcentralus-6tv5g   Running   Standard_D2s_v3   northcentralus          55m
miyadav-bug-qjbhm-worker-northcentralus-hkspk   Running   Standard_D2s_v3   northcentralus          11m
miyadav-bug-qjbhm-worker-northcentralus-ltg5j   Running   Standard_D2s_v3   northcentralus          11m
miyadav-bug-qjbhm-worker-northcentralus-m85r9   Running   Standard_D2s_v3   northcentralus          55m
miyadav-bug-qjbhm-worker-northcentralus-rkw97   Running   Standard_D2s_v3   northcentralus          55m

Result: Actual and expected match - the machines scaled to the number of replicas specified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.15 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4228
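
As a supplementary check during verification (not one of the documented steps above), one could confirm that exactly one replica holds leadership by inspecting the leader-election lock. The sketch below lists coordination.k8s.io Lease objects in the openshift-machine-api namespace and prints their holders; it assumes a Lease-based lock, whereas a ConfigMap-based lock would instead record the holder in the ConfigMap's control-plane.alpha.kubernetes.io/leader annotation.

```go
// Illustrative helper (assumption: the election uses Lease-based locks in the
// openshift-machine-api namespace): print the holder of each Lease so you can
// see which pod currently leads.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Build a client from the local kubeconfig (the same credentials oc uses).
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// List Lease objects in the machine-api namespace and print who holds each.
	leases, err := client.CoordinationV1().Leases("openshift-machine-api").List(
		context.TODO(), metav1.ListOptions{})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, l := range leases.Items {
		holder := "<none>"
		if l.Spec.HolderIdentity != nil {
			holder = *l.Spec.HolderIdentity
		}
		fmt.Printf("%s held by %s\n", l.Name, holder)
	}
}
```
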