Bug 1868102 - Machine API component should have automated tests for leader election
Summary: Machine API component should have automated tests for leader election
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.7.0
Assignee: Michael McCune
QA Contact: Milind Yadav
URL:
Whiteboard:
Depends On: 1861896
Blocks:
 
Reported: 2020-08-11 17:03 UTC by Michael McCune
Modified: 2020-11-12 13:37 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1861896
Environment:
Last Closed: 2020-11-12 13:37:52 UTC
Target Upstream Version:
Embargoed:



Description Michael McCune 2020-08-11 17:03:12 UTC
This bug was created as a clone of #1861896 to address concerns around automated tests for the leader election pieces of the Machine API components (machine-api-operator, machinehealthcheck, machineset controller, and all provider controllers).

The original bug was caused by an absence of leader election. To ensure we do not regress this behavior, we should add tests that exercise leader election, and we should enable leader election by default.
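For reference, a minimal sketch of what "enabled by default" would look like for a controller built on sigs.k8s.io/controller-runtime is shown below. The lock name, namespace, and timing values are illustrative assumptions, not the exact settings used by the Machine API components:

package main

import (
	"os"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Illustrative timings; the real components may use different values.
	leaseDuration := 120 * time.Second
	renewDeadline := 110 * time.Second
	retryPeriod := 20 * time.Second

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// Leader election on by default: only one replica of the controller
		// manager holds the lock and runs its reconcilers at any time.
		LeaderElection:          true,
		LeaderElectionID:        "machine-api-leader-lock", // hypothetical lock name
		LeaderElectionNamespace: "openshift-machine-api",
		LeaseDuration:           &leaseDuration,
		RenewDeadline:           &renewDeadline,
		RetryPeriod:             &retryPeriod,
	})
	if err != nil {
		os.Exit(1)
	}

	// Controllers would be registered with mgr here. Start blocks until the
	// manager loses its lease or the signal handler fires.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}

A test could then compete for, or inspect, the same lock object rather than poking at controller internals.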


+++ This bug was initially created as a clone of Bug #1861896 +++

Description of problem:

In 4.5 and below, if a control plane kubelet goes unreachable but pods are still running (e.g., someone or something stops the kubelet or otherwise prevents it from communicating with the cluster), machine-api pods that were running on that node will be rescheduled to another node.

If this happens, you essentially have duplicate machine-api controllers running.

The effect is severe: the duplicate controllers race against each other and leak cloud instances.

[mgugino@mguginop50 4.5-nightly]$ ./oc get machines -A
NAMESPACE               NAME                                          PHASE         TYPE        REGION      ZONE         AGE
openshift-machine-api   mgugino-deva2-pgdsh-worker-us-west-2b-ljzsj   Running       m5.large    us-west-2   us-west-2b   9m12s
openshift-machine-api   mgugino-deva2-pgdsh-worker-us-west-2b-r9wv7   Running       m5.large    us-west-2   us-west-2b   9m12s

(From AWS console)
i-029a2e8f1a6fa7f79 (mgugino-deva2-pgdsh-worker-us-west-2b-ljzsj), i-080add2aa273b8aec (mgugino-deva2-pgdsh-worker-us-west-2b-4pctx), i-057b15daa3fcb3ab8 (mgugino-deva2-pgdsh-worker-us-west-2b-ljzsj), i-022b14a051a7320fe (mgugino-deva2-pgdsh-worker-us-west-2b-r9wv7), i-0c24f513eeeec5212 (mgugino-deva2-pgdsh-worker-us-west-2b-r9wv7)

As you can see, I have five instances where I should have two.  This is a result of scaling a machineset from 0 to 2 after stopping the kubelet on the node where the machine-api components were running.

First, the machinesets over-provision machines (I temporarily ended up with 3 machines instead of 2).  Then each machine controller races to create an instance.  As a result, we have two duplicated instances plus an extra instance from the machine that was immediately terminated (the machine-controller doing the delete did not know about the instance the other machine-controller had created).

Version-Release number of selected component (if applicable):
4.1+

How reproducible:
100%


Steps to Reproduce:
1. Identify what node machine-api controllers are running on.
2. Stop kubelet on that host.
3. Wait for several minutes until pods are rescheduled onto another host.
4. Scale up a machineset.

Actual results:
Too many instances and machines are created, and machines are leaked.

Expected results:
Extra instances and machines should not be created and leaked.

Additional info:
We need to come up with a plan for an advisory, as there is no way to detect this condition in-cluster.

--- Additional comment from Joel Speed on 2020-07-30 10:27:24 UTC ---

Have you tried this in 4.6? I assume because of the leader election that has been added, this is not a problem from 4.6 onwards?

--- Additional comment from Michael Gugino on 2020-07-30 13:03:16 UTC ---

(In reply to Joel Speed from comment #1)
> Have you tried this in 4.6? I assume because of the leader election that has
> been added, this is not a problem from 4.6 onwards?

I have not tried it in 4.6.  I tried it in 4.5.  I'm assuming it does not happen in 4.6 due to leader election, but we should definitely verify.

One indication that you may have this problem is excess CSRs being generated.  This may or may not happen depending on whether the instances boot successfully.  If there were any problems with your machinesets/machines that would have caused them not to boot, then there would be no excess CSRs (this sounds extremely unlikely, as it's an edge case of an edge case).

--- Additional comment from Joel Speed on 2020-08-03 13:45:44 UTC ---

> I have not tried it in 4.6.  I tried it in 4.5  I'm assuming it does not happen in 4.6 due to leader election, but definitely we should verify.

I have just verified that this isn't a problem in 4.6.
Disabling the kubelet on the master does not cause any issue for the running pod, so it keeps the leader election lease up to date, preventing the secondary controller from starting.
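For anyone re-verifying this, the renewal behaviour can be checked directly by reading the election lock. Below is a minimal client-go sketch; the lease name and namespace are assumptions, and older controller-runtime defaults store the lock in a ConfigMap rather than a coordination.k8s.io Lease:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Hypothetical lock name; the real one is defined by the controller's flags.
	lease, err := client.CoordinationV1().Leases("openshift-machine-api").
		Get(context.TODO(), "machine-api-leader-lock", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// While the leader's kubelet is stopped, HolderIdentity should stay the same
	// and RenewTime should keep advancing, which is what blocks the rescheduled
	// second copy of the controller from starting its reconcilers.
	if lease.Spec.HolderIdentity != nil && lease.Spec.RenewTime != nil {
		fmt.Printf("holder=%s renewed=%s\n", *lease.Spec.HolderIdentity, lease.Spec.RenewTime.Time)
	}
}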

--- Additional comment from Michael Gugino on 2020-08-03 14:23:01 UTC ---

(In reply to Joel Speed from comment #3)
> > I have not tried it in 4.6.  I tried it in 4.5  I'm assuming it does not happen in 4.6 due to leader election, but definitely we should verify.
> 
> I have just verified that this isn't a problem in 4.6.
> When disabling kubelet on the master, this does not cause any issue for the
> running pod and as such it keeps the leader election lease up to date,
> preventing the secondary controller from starting.

Thanks for verifying this.

Any good suggestions for making sure we don't regress on the machine-controller in this area?  The other components I'm not as worried about, but the machine-controller leaking instances is obviously really bad.

--- Additional comment from Joel Speed on 2020-08-03 15:29:56 UTC ---

This is quite a tough one to test; I can't really think of a way to check that we don't regress that doesn't involve testing the implementation details, i.e., whether leader election is working.
We could write a test that takes the leader election lock and verifies that the running controller restarts (since it has lost its lease), then creates a machine and verifies that nothing happens because there is no running machine controller (it being blocked from starting by the test holding the lease).
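A rough Go sketch of that approach, using client-go to take over and keep renewing the lock under a test identity (the lease name, namespace, and holder identity below are assumptions, and the lock is assumed to be a coordination.k8s.io Lease):

package leaderelection_test

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Assumed names, for illustration only.
const (
	leaseNamespace = "openshift-machine-api"
	leaseName      = "machine-api-leader-lock"
	testHolder     = "leader-election-e2e-test"
)

// stealLease overwrites the holder of the controller's leader election Lease and
// keeps renewing it until ctx is cancelled, so the running controller steps down
// and no replica can (re)acquire leadership while the test runs. The test would
// then create a machine and assert that no cloud instance appears, and finally
// cancel ctx and assert that the controller re-acquires the lock and reconciles.
func stealLease(ctx context.Context, client kubernetes.Interface) error {
	leases := client.CoordinationV1().Leases(leaseNamespace)
	for {
		lease, err := leases.Get(ctx, leaseName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		now := metav1.NewMicroTime(time.Now())
		holder := testHolder
		lease.Spec.HolderIdentity = &holder
		lease.Spec.AcquireTime = &now
		lease.Spec.RenewTime = &now
		// A real test would retry on update conflicts with the controller.
		if _, err := leases.Update(ctx, lease, metav1.UpdateOptions{}); err != nil {
			return err
		}

		select {
		case <-ctx.Done():
			return nil
		case <-time.After(10 * time.Second): // renew well inside the lease duration
		}
	}
}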

--- Additional comment from Michael McCune on 2020-08-03 19:30:19 UTC ---

i've just created bz 1864352 to track the backport, i am working on bringing the leader election patches back to 4.5.

--- Additional comment from Joel Speed on 2020-08-11 10:05:07 UTC ---

@mimmcune I think we can close this bug as "NOTABUG" since we know this is working in 4.6; that will unblock the 4.5 cherry-pick. I would suggest cloning from this for a separate 4.6-targeted bug if we want to track adding some tests.

--- Additional comment from Michael Gugino on 2020-08-11 13:24:28 UTC ---

I would say "Current Release" for this bug, but we still need something to track the work for ensuring we don't regress.  That doesn't need to block the backports though, so that can happen in Jira.

--- Additional comment from Joel Speed on 2020-08-11 14:27:20 UTC ---

Ahh, thanks Mike, I always forget which options there are. "Current Release" is a better resolution, agreed. A separate bug or Jira issue to ensure we don't regress is a must.

Comment 1 Michael McCune 2020-08-11 17:10:11 UTC
just to add a little more context for this bug.

although we need to test that leader election is enabled, we would also like to confirm that we do not have the same breakage that was reported in bz #1868102. ideally, we would confirm that multiple machine controllers will not create excess machines and that machines will not be silently lost.




Comment 4 Michael McCune 2020-09-10 17:55:03 UTC
adding the upcomingsprint tag; this bug will take some time to properly fix. we have some ideas about creating the necessary tests for this condition, but have not scheduled the work yet. stay tuned ;)

Comment 5 Michael McCune 2020-09-30 18:54:03 UTC
the team has some ideas about how to test this condition; it will most likely get picked up during the 4.7 time frame.

Comment 6 Joel Speed 2020-11-12 13:37:52 UTC
This is going to be a larger piece of work, so we are tracking it in Jira with its own epic: https://issues.redhat.com/browse/OCPCLOUD-1009

