Bug 1864352
Summary: | Duplicate Machine Controller Deployed if Control Plane Unreachable | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Michael McCune <mimccune> |
Component: | Cloud Compute | Assignee: | Michael McCune <mimccune> |
Cloud Compute sub component: | Other Providers | QA Contact: | Milind Yadav <miyadav> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | high | ||
Priority: | high | CC: | agarcial, mgugino, miyadav, mwoodson |
Version: | 4.5 | Keywords: | NeedsTestCase |
Target Milestone: | --- | ||
Target Release: | 4.5.z | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
Cause: If a control plane kubelet becomes unreachable while its pods are still running (e.g., something stops the kubelet or otherwise prevents it from communicating with the cluster), the machine-api pods that were running on that node are rescheduled to another node.
Consequence: Multiple machine-api pods exist at once and compete to control machine-api resources in the cluster. This can result in excess instances being created and can also cause the machine-api controllers to leak instances, requiring manual intervention.
Fix: Leader election has been added to all machine-api controllers.
Result: Leader election for machine-api controllers ensures that only a single instance of each controller is allowed to manage machine-api resources. With only a single leader for each controller, excess instances are no longer created or leaked. (A minimal sketch of the leader-election pattern follows the metadata table below.)
|
Story Points: | --- |
Clone Of: | 1861896 | Environment: | |
Last Closed: | 2020-10-19 14:54:24 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1861896 | ||
Bug Blocks: |
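
The Doc Text above describes the fix in terms of leader election. As an illustration only, the sketch below shows how a controller manager can opt into leader election with sigs.k8s.io/controller-runtime; the election ID here is a placeholder and the exact wiring in the machine-api components may differ, so treat this as a minimal sketch of the pattern rather than the actual fix.

```go
// Minimal sketch (not the actual machine-api code): enabling leader election
// for a controller manager with sigs.k8s.io/controller-runtime. Only the
// replica that acquires the lock runs its controllers, so a duplicate pod
// rescheduled from an unreachable node waits instead of acting on the same
// Machine resources.
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "example-machine-controller-leader", // placeholder ID
		LeaderElectionNamespace: "openshift-machine-api",
	})
	if err != nil {
		os.Exit(1)
	}

	// Controllers would be registered with mgr here; they only start
	// reconciling after this replica wins the election.

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

With this option set, the manager takes a resource lock (a ConfigMap or Lease, depending on configuration and controller-runtime version) in the given namespace before starting its controllers; a second replica simply blocks waiting for the lock rather than reconciling the same resources.
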
Description
Michael McCune
2020-08-03 19:18:49 UTC
I am working on backporting the leader election changes to 4.5.

I have been backporting changes to address this issue on 4.5. The changes are currently on hold while we work to get a few final patches in place. I also need to figure out the proper way to catalog this in Bugzilla, as the tooling is not recognizing the current pull requests.

I think we are just waiting on patches to merge; all the backports have been proposed.

@Michael McCune, one of the PRs seems to be marked as draft, any reason we were holding that?

(In reply to Joel Speed from comment #5)
> @Michael McCune, one of the PRs seems to be marked as draft, any reason we
> were holding that?

Just a fail on my part, I meant to go back and remove draft after everything else had passed but forgot.

Validated on:

[miyadav@miyadav ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-10-09-042011   True        False         67m     Cluster version is 4.5.0-0.nightly-2020-10-09-042011

Steps:

1. Make the master node that is running the machine-controller unreachable.

[miyadav@miyadav ~]$ oc project openshift-machine-api
Now using project "openshift-machine-api" on server "https://api.miyadav-bug.qe.azure.devcluster.openshift.com:6443".

[miyadav@miyadav ~]$ oc get po -o wide
NAME                                           READY   STATUS    RESTARTS   AGE   IP            NODE                         NOMINATED NODE   READINESS GATES
cluster-autoscaler-operator-7d97c59bc7-6zs9h   2/2     Running   0          36m   10.128.0.31   miyadav-bug-qjbhm-master-2   <none>           <none>
machine-api-controllers-749d688988-wdr7j       4/4     Running   0          36m   10.129.0.22   miyadav-bug-qjbhm-master-0   <none>           <none>
machine-api-operator-7c87745bbb-5vxhj          2/2     Running   0          36m   10.129.0.16   miyadav-bug-qjbhm-master-0   <none>           <none>

[miyadav@miyadav ~]$ oc get nodes
NAME                                            STATUS     ROLES    AGE   VERSION
miyadav-bug-qjbhm-master-0                      NotReady   master   48m   v1.18.3+06da727
miyadav-bug-qjbhm-master-1                      Ready      master   48m   v1.18.3+06da727
miyadav-bug-qjbhm-master-2                      Ready      master   48m   v1.18.3+06da727
miyadav-bug-qjbhm-worker-northcentralus-6tv5g   Ready      worker   33m   v1.18.3+06da727
miyadav-bug-qjbhm-worker-northcentralus-m85r9   Ready      worker   33m   v1.18.3+06da727
miyadav-bug-qjbhm-worker-northcentralus-rkw97   Ready      worker   33m   v1.18.3+06da727

2. The machine-controller moves to a new master node.

[miyadav@miyadav ~]$ oc get pods -o wide
NAME                                           READY   STATUS        RESTARTS   AGE   IP            NODE                         NOMINATED NODE   READINESS GATES
cluster-autoscaler-operator-7d97c59bc7-6zs9h   2/2     Running       0          43m   10.128.0.31   miyadav-bug-qjbhm-master-2   <none>           <none>
machine-api-controllers-749d688988-v9wq5       4/4     Running       0          95s   10.130.0.52   miyadav-bug-qjbhm-master-1   <none>           <none>
machine-api-controllers-749d688988-wdr7j       4/4     Terminating   0          43m   10.129.0.22   miyadav-bug-qjbhm-master-0   <none>           <none>
machine-api-operator-7c87745bbb-5vxhj          2/2     Terminating   0          43m   10.129.0.16   miyadav-bug-qjbhm-master-0   <none>           <none>
machine-api-operator-7c87745bbb-g55qt          2/2     Running       0          95s   10.130.0.51   miyadav-bug-qjbhm-master-1   <none>           <none>

3. Scale the machineset.

[miyadav@miyadav ~]$ oc scale machineset miyadav-bug-qjbhm-worker-northcentralus --replicas 5

[miyadav@miyadav ~]$ oc get machines
NAME                                            PHASE     TYPE              REGION           ZONE   AGE
miyadav-bug-qjbhm-master-0                      Running   Standard_D8s_v3   northcentralus          67m
miyadav-bug-qjbhm-master-1                      Running   Standard_D8s_v3   northcentralus          67m
miyadav-bug-qjbhm-master-2                      Running   Standard_D8s_v3   northcentralus          67m
miyadav-bug-qjbhm-worker-northcentralus-6tv5g   Running   Standard_D2s_v3   northcentralus          55m
miyadav-bug-qjbhm-worker-northcentralus-hkspk   Running   Standard_D2s_v3   northcentralus          11m
miyadav-bug-qjbhm-worker-northcentralus-ltg5j   Running   Standard_D2s_v3   northcentralus          11m
miyadav-bug-qjbhm-worker-northcentralus-m85r9   Running   Standard_D2s_v3   northcentralus          55m
miyadav-bug-qjbhm-worker-northcentralus-rkw97   Running   Standard_D2s_v3   northcentralus          55m

Result: Actual and expected match - the machines scaled to the number of replicas specified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.15 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4228
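
As a supplementary check during verification (not one of the documented steps above), one could confirm that exactly one replica holds leadership by inspecting the leader-election lock. The sketch below lists coordination.k8s.io Lease objects in the openshift-machine-api namespace and prints their holders; it assumes a Lease-based lock, whereas a ConfigMap-based lock would instead record the holder in the ConfigMap's control-plane.alpha.kubernetes.io/leader annotation.

```go
// Illustrative helper (assumption: the election uses Lease-based locks in the
// openshift-machine-api namespace): print the holder of each Lease so you can
// see which pod currently leads.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Build a client from the local kubeconfig (the same credentials oc uses).
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// List Lease objects in the machine-api namespace and print who holds each.
	leases, err := client.CoordinationV1().Leases("openshift-machine-api").List(
		context.TODO(), metav1.ListOptions{})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, l := range leases.Items {
		holder := "<none>"
		if l.Spec.HolderIdentity != nil {
			holder = *l.Spec.HolderIdentity
		}
		fmt.Printf("%s held by %s\n", l.Name, holder)
	}
}
```
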