2115308 – Kube API server operator should not update replicas when Machine/Node is being removed

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.12
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.12.0
Assignee:	Joel Speed
QA Contact:	sunzhaohua
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-08-04 11:35 UTC by Joel Speed
Modified:	2023-01-17 19:54 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2023-01-17 19:54:14 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-api-operator pull 1051	0	None	open	Bug 2115308: Ensure failed drains are subject to exponential backoff	2022-08-08 11:11:51 UTC
Red Hat Product Errata	RHSA-2022:7399	0	None	None	None	2023-01-17 19:54:35 UTC

Description Joel Speed 2022-08-04 11:35:32 UTC

Description of problem:

I was running some initial testing with the ControlPlaneMachineSet and ran into an issue during a rolling update.

For some reason the KASO decided to update the KAS pods, so it started running the installer to upgrade from version 10 to 11.

Because the node was going away (i.e. it was unschedulable) and the Machine API operator was trying to remove the Machine, it started evicting the installer pod as soon as it was created. This led to an infinite battle between MAO trying to drain the node and KASO trying to update the KAS pod on a node that was going away.


Version-Release number of selected component (if applicable):


How reproducible:

Not sure. I've hit this once in the 4 times I've tried to do a rolling update.


Steps to Reproduce:
1. Create a cluster on AWS
2. Install the CPMSO from https://github.com/openshift/cluster-control-plane-machine-set-operator/tree/main/manifests
3. Create a ControlPlaneMachineSet manifest
4. Make an update to perform a rolling update

Actual results:
The rollout got stuck because KASO was blocking the Machine from going away


Expected results:
KASO should not prevent us from removing a Control Plane Machine when a replacement machine has already been created and etcd has moved on from this node

Additional info:
Must Gather: https://drive.google.com/file/d/1dA6800u7isofSsqGROg-PQB1oDqnvnNM/view?usp=sharing

Comment 1 Joel Speed 2022-08-04 11:38:00 UTC

I managed to get the rollout to complete by scaling down CVO, MAO and the Machine API Controllers, preventing the drain operation from happening and allowing the installer pod to complete. Once it completed, it seemed to move on. Scaling back up the Machine API Controllers continued the drain and the rollout completed.

Comment 2 Joel Speed 2022-08-04 16:56:21 UTC

Discussed this on Slack with David, who pointed out that if we used exponential back-off in the drain controller, this wouldn't have been an issue. I think this is something we should probably fix on the MAPI drain controller side.

https://coreos.slack.com/archives/CB48XQ4KZ/p1659613173353759

Comment 4 sunzhaohua 2022-08-09 14:46:36 UTC

Tried several times and could not reproduce this again; the machine can be deleted.

Comment 7 errata-xmlrpc 2023-01-17 19:54:14 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399
