1861896 – Duplicate Machine Controller Deployed if Control Plane Unreachable

Bug 1861896 - Duplicate Machine Controller Deployed if Control Plane Unreachable

Summary: Duplicate Machine Controller Deployed if Control Plane Unreachable

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Michael McCune
QA Contact:	Milind Yadav
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1864352 1868102 1887373
TreeView+	depends on / blocked

Reported:	2020-07-29 20:20 UTC by Michael Gugino
Modified:	2020-10-27 16:21 UTC (History)
CC List:	5 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: In rare circumstances (for example, if a control plane node becomes unhealthy, or a machine-api controller exits unexpectedly) it is possible for multiple Machine API controller instances to be deployed where only a single instance should be running. Consequence: With multiple Machine API controllers running duplicate machines are created and are not fully joined into the cluster. This condition can result in machines being leaked and becoming inaccessible from within the cluster. Fix: Leader election mechanisms have been added to all Machine API components to ensure the proper number of running instances. Result: Machine API controllers will only run the prescribed number of instances and will not create duplicate controllers.
Clone Of:
Clones:	1864352 1868102 (view as bug list)
Environment:
Last Closed:	2020-10-27 16:21:22 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-api-provider-aws pull 315	None	closed	[OCPCLOUD-492]; Run all machine api Controllers using leader election	2021-01-25 10:43:50 UTC
Github	openshift cluster-api-provider-azure pull 122	None	closed	OCPCLOUD-492: Run all machine api Controllers using leader election	2021-01-25 10:43:51 UTC
Github	openshift cluster-api-provider-baremetal pull 81	None	closed	Change leader election flags for MAO support	2021-01-25 10:43:51 UTC
Github	openshift cluster-api-provider-gcp pull 85	None	closed	[OCPCLOUD-492] Implement leader election for gcp provider	2021-01-25 10:43:51 UTC
Github	openshift cluster-api-provider-openstack pull 108	None	closed	Add Leader election flags for machine controller	2021-01-25 10:43:51 UTC
Github	openshift cluster-api-provider-ovirt pull 55	None	closed	Implement leader election for manager	2021-01-25 10:43:51 UTC
Github	openshift machine-api-operator pull 571	None	closed	OCPCLOUD-492: Leader election for machineSets, nodelink, vsphere and healthchecks	2021-01-25 10:43:52 UTC
Red Hat Product Errata	RHBA-2020:4196	None	None	None	2020-10-27 16:21:49 UTC

Description Michael Gugino 2020-07-29 20:20:12 UTC

Description of problem:

In 4.5 and below, if the control plane kubelet goes unreachable but pods are still running (eg, someone/something causes the kubelet to stop or otherwise the kubelet is prevented from communicating with the cluster), machine-api pods that were running on that node will be rescheduled to another.

If this happens, essentially you have duplicate machine-api controllers running.

The effect is bad.

[mgugino@mguginop50 4.5-nightly]$ ./oc get machines -A
NAMESPACE NAME PHASE TYPE REGION ZONE AGE
openshift-machine-api mgugino-deva2-pgdsh-worker-us-west-2b-ljzsj Running m5.large us-west-2 us-west-2b 9m12s
openshift-machine-api mgugino-deva2-pgdsh-worker-us-west-2b-r9wv7 Running m5.large us-west-2 us-west-2b 9m12s

(From AWS console)
i-029a2e8f1a6fa7f79 (mgugino-deva2-pgdsh-worker-us-west-2b-ljzsj), i-080add2aa273b8aec (mgugino-deva2-pgdsh-worker-us-west-2b-4pctx), i-057b15daa3fcb3ab8 (mgugino-deva2-pgdsh-worker-us-west-2b-ljzsj), i-022b14a051a7320fe (mgugino-deva2-pgdsh-worker-us-west-2b-r9wv7), i-0c24f513eeeec5212 (mgugino-deva2-pgdsh-worker-us-west-2b-r9wv7)

As you can see, I have 5 instances where I should have two. This is a result of scaling to 2 from 0 after stopping the kubelet on the node where machine-api components are running.

First, the machinesets over provision machines (ended up with 3 machines temporarily instead of 2). Then, each machine controller races to create an instance. So, we can see we have two duplicates and an extra instance from the machine that was immediately terminated (but the machine-controller doing the delete didn't know about the instance the other machine-controller created).

Version-Release number of selected component (if applicable):
4.1+

How reproducible:
100%

Steps to Reproduce:
1. Identify what node machine-api controllers are running on.
2. Stop kubelet on that host.
3. Wait for several minutes until pods are rescheduled onto another host.
4. Scale up a machineset.

Actual results:
Too many instances and machines created, and machines are leaked.

Expected results:
Extra instances and machines should not be created and leaked.

Additional info:
We need to come up with a plan to make an advisory as there is no way to detect this condition in-cluster.

Comment 1 Joel Speed 2020-07-30 10:27:24 UTC

Have you tried this in 4.6? I assume because of the leader election that has been added, this is not a problem from 4.6 onwards?

Comment 2 Michael Gugino 2020-07-30 13:03:16 UTC

(In reply to Joel Speed from comment #1)
> Have you tried this in 4.6? I assume because of the leader election that has
> been added, this is not a problem from 4.6 onwards?

I have not tried it in 4.6.  I tried it in 4.5  I'm assuming it does not happen in 4.6 due to leader election, but definitely we should verify.

One indication that you may have this problem is excess CSRs being generated.  This may or may not happen depending on if the instances boot successfully.  If there were any problems with your machinesets/machine that would have caused them to not boot, then there would be no excess CSRs (this sound extremely unlikely as it's an edge case of an edge case).

Comment 3 Joel Speed 2020-08-03 13:45:44 UTC

> I have not tried it in 4.6.  I tried it in 4.5  I'm assuming it does not happen in 4.6 due to leader election, but definitely we should verify.

I have just verified that this isn't a problem in 4.6.
When disabling kubelet on the master, this does not cause any issue for the running pod and as such it keeps the leader election lease up to date, preventing the secondary controller from starting.

Comment 4 Michael Gugino 2020-08-03 14:23:01 UTC

(In reply to Joel Speed from comment #3)
> > I have not tried it in 4.6.  I tried it in 4.5  I'm assuming it does not happen in 4.6 due to leader election, but definitely we should verify.
> 
> I have just verified that this isn't a problem in 4.6.
> When disabling kubelet on the master, this does not cause any issue for the
> running pod and as such it keeps the leader election lease up to date,
> preventing the secondary controller from starting.

Thanks for verifying this.

Any good suggestions for making sure we don't regress on the machine-controller in this area?  The other components I'm not as worried about, but the machine-controller leaking instances is obviously really bad.

Comment 5 Joel Speed 2020-08-03 15:29:56 UTC

This is quite a tough one to test, I can't really think of a way to check that we don't regress that doesn't involve testing the implementation details, ie, is leader election working.
We could write a test that takes the leader election lock and verifies that the running controller restarts (since it's lost its lease), then create a machine and verify that nothing happens because there is no running machine controller (it being blocked from starting by the test holding the lease)

Comment 6 Michael McCune 2020-08-03 19:30:19 UTC

i've just created bz 1864352 to track the backport, i am working on bringing the leader election patches back to 4.5.

Comment 7 Joel Speed 2020-08-11 10:05:07 UTC

@mimmcune I think we can close this bug as "NOTABUG" since we know this is working in 4.6, that will unblock the 4.5 cherry-pick. I would suggest cloning from this for a separate 4.6 targeted bug if we want to track adding some tests

Comment 8 Michael Gugino 2020-08-11 13:24:28 UTC

I would say "Current Release" for this bug, but we still need something to track the work for ensuring we don't regress.  That doesn't need to block the backports though, so that can happen in Jira.

Comment 9 Joel Speed 2020-08-11 14:27:20 UTC

Ahh thanks Mike, I always forget which options there are, "Current Release" is a better resolution, agreed, separate bug or Jira to ensure we don't regress is a must

Comment 10 Michael McCune 2020-08-11 17:07:10 UTC

after discussion with the team, we are closing this bug for 4.6 as it is currently fixed. 2 other bugs have been added, #1868102 to track the testing progress, and #1864352 to track the backports for 4.5.

Comment 11 Michael McCune 2020-08-20 17:11:41 UTC

i reopened this issue to help with the backports we are doing for 4.5. the attached github PRs have all been merged and address the issue in this report. i am hoping to get this validated through QE so that we can enable the related backports.

Comment 12 Michael McCune 2020-08-31 13:46:11 UTC

just wondering if there is any update from QA on this?

we have some backports that are dependent on this bug getting verified.

Comment 13 Milind Yadav 2020-09-04 05:49:27 UTC

Hi Michael , I have also VERIIFIED this on 4.6 , does not have the problem of duplicate machine controller deployment because of leader election , we have added a test case on 4.6 for this - https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-33455

Since it is in POST status I havent moved this to VERIFIED , let me know if you think anything else is needed to make it move forward ..

Comment 14 Michael McCune 2020-09-04 12:52:00 UTC

thanks for the update Milind, i guess i could use some advice at this point. we would like to backport the fixes for this into 4.5, and have this dependent bug https://bugzilla.redhat.com/show_bug.cgi?id=1864352 , but i am not sure the best way forward here since the CI bot tooling is rejecting this bugzilla for those changes (see https://github.com/openshift/cluster-api-provider-aws/pull/342#issuecomment-687119180).

any advice on how to proceed to get those backports confirmed?

Comment 15 Joel Speed 2020-09-07 10:24:16 UTC

None of the relevant PRs for this PR had the BZ automation attached to them, however, all of the attached PRs are now merged, this should be on modified

@milind, sounds like you've already verified this? If that's the case, please post your results and move to verified as normal, I don't think there's anything more to do here

Comment 16 Milind Yadav 2020-09-07 13:01:11 UTC

Validated at - [miyadav@miyadav aws]$ oc get clusterversion --config aws
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-07-065523   True        False         19m     Cluster version is 4.6.0-0.nightly-2020-09-07-065523

Steps :
1.Node running machine-controller pods , was made Not-Ready 
[miyadav@miyadav aws]$ oc get pods --config aws
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-6c78cf4b55-twb4b   2/2     Running   0          11m
machine-api-controllers-784c4ff84d-lgq4k       7/7     Running   1          11m
machine-api-operator-7548866bdc-2q76f          2/2     Running   0          11m
[miyadav@miyadav aws]$ oc get machines -o wide --config aws
NAME                                              PHASE     TYPE        REGION      ZONE         AGE   NODE                                         PROVIDERID                              STATE
oc46-miyadav-0709-s8xrw-master-0                  Running   m5.xlarge   us-east-2   us-east-2a   47m   ip-10-0-158-44.us-east-2.compute.internal    aws:///us-east-2a/i-0b7247ed5594a4bc9   running
oc46-miyadav-0709-s8xrw-master-1                  Running   m5.xlarge   us-east-2   us-east-2b   47m   ip-10-0-171-227.us-east-2.compute.internal   aws:///us-east-2b/i-0214916379e3b82fa   running
oc46-miyadav-0709-s8xrw-master-2                  Running   m5.xlarge   us-east-2   us-east-2c   47m   ip-10-0-210-212.us-east-2.compute.internal   aws:///us-east-2c/i-0a06607aabc36546e   running
oc46-miyadav-0709-s8xrw-worker-us-east-2a-zz25r   Running   m5.large    us-east-2   us-east-2a   33m   ip-10-0-150-47.us-east-2.compute.internal    aws:///us-east-2a/i-06adf5d5ca8c5935e   running
oc46-miyadav-0709-s8xrw-worker-us-east-2b-blpm9   Running   m5.large    us-east-2   us-east-2b   33m   ip-10-0-167-238.us-east-2.compute.internal   aws:///us-east-2b/i-06b957c422f0f988a   running
oc46-miyadav-0709-s8xrw-worker-us-east-2c-4mggl   Running   m5.large    us-east-2   us-east-2c   33m   ip-10-0-199-168.us-east-2.compute.internal   aws:///us-east-2c/i-0a381c8d6966970cf   running
[miyadav@miyadav aws]$ oc get nodes --config aws
NAME                                         STATUS     ROLES    AGE   VERSION
ip-10-0-150-47.us-east-2.compute.internal    Ready      worker   28m   v1.19.0-rc.2+068702d
ip-10-0-158-44.us-east-2.compute.internal    Ready      master   40m   v1.19.0-rc.2+068702d
ip-10-0-167-238.us-east-2.compute.internal   Ready      worker   28m   v1.19.0-rc.2+068702d
ip-10-0-171-227.us-east-2.compute.internal   Ready      master   39m   v1.19.0-rc.2+068702d
ip-10-0-199-168.us-east-2.compute.internal   Ready      worker   28m   v1.19.0-rc.2+068702d
ip-10-0-210-212.us-east-2.compute.internal   NotReady   master   40m   v1.19.0-rc.2+068702d


2.Machine-controller pods moved to other master node.
[miyadav@miyadav aws]$ oc scale machineset oc46-miyadav-0709-s8xrw-worker-us-east-2b --replicas 3 --config aws
machineset.machine.openshift.io/oc46-miyadav-0709-s8xrw-worker-us-east-2b scaled
[miyadav@miyadav aws]$ oc get pods -o wide --config aws
NAME                                           READY   STATUS              RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
cluster-autoscaler-operator-6c78cf4b55-twb4b   2/2     Running             0          12m   10.129.0.23   ip-10-0-210-212.us-east-2.compute.internal   <none>           <none>
machine-api-controllers-784c4ff84d-lgq4k       7/7     Terminating         1          12m   10.129.0.24   ip-10-0-210-212.us-east-2.compute.internal   <none>           <none>
machine-api-controllers-784c4ff84d-tlkx5       0/7     ContainerCreating   0          5s    <none>        ip-10-0-171-227.us-east-2.compute.internal   <none>           <none>
machine-api-operator-7548866bdc-2q76f          2/2     Terminating         0          12m   10.129.0.22   ip-10-0-210-212.us-east-2.compute.internal   <none>           <none>
machine-api-operator-7548866bdc-7ptqq          0/2     ContainerCreating   0          6s    <none>        ip-10-0-158-44.us-east-2.compute.internal    <none>           <none>


3.Scale the machineset and monitor controller logs
oc get machineset --config aws
NAME                                        DESIRED   CURRENT   READY   AVAILABLE   AGE
oc46-miyadav-0709-s8xrw-worker-us-east-2a   1         1         1       1           47m
oc46-miyadav-0709-s8xrw-worker-us-east-2b   1         1         1       1           47m
oc46-miyadav-0709-s8xrw-worker-us-east-2c   1         1         1       1           47m
[miyadav@miyadav aws]$ oc scale machineset oc46-miyadav-0709-s8xrw-worker-us-east-2b --replicas 3 --config aws
machineset.machine.openshift.io/oc46-miyadav-0709-s8xrw-worker-us-east-2b scaled
.
..machine-controller logs
I0907 12:44:58.243180       1 request.go:557] Throttling request took 2.314776795s, request: GET:https://172.30.0.1:443/apis/tuned.openshift.io/v1?timeout=32s
I0907 12:44:58.256064       1 leaderelection.go:242] attempting to acquire leader lease  openshift-machine-api/cluster-api-provider-aws-leader...
I0907 12:44:58.256406       1 internal.go:393] controller-runtime/manager "msg"="starting metrics server"  "path"="/metrics"
machineset-controller logs:
..
I0907 12:44:57.807291       1 leaderelection.go:242] attempting to acquire leader lease  openshift-machine-api/cluster-api-provider-machineset-leader...
I0907 12:45:32.946345       1 machine_webhook.go:432] Validate webhook called for Machine: oc46-miyadav-0709-s8xrw-worker-us-east-2b-4r8vc
I0907 12:45:32.946434       1 machine_webhook.go:553] Validating AWS providerSpec


The logs are as expected after the machine-controller kubelet is killed and the number of machines scaled as per replicas scaled

Additional info:
Moving to verified

Comment 19 errata-xmlrpc 2020-10-27 16:21:22 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Note You need to log in before you can comment on or make changes to this bug.