Bug 2115308

Summary: Kube API server operator should not update replicas when Machine/Node is being removed
Product: OpenShift Container Platform
Component: Cloud Compute
Sub Component: Other Providers
Version: 4.12
Target Release: 4.12.0
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Joel Speed <jspeed>
Assignee: Joel Speed <jspeed>
QA Contact: sunzhaohua <zhsun>
CC: mfojtik, xxia
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2023-01-17 19:54:14 UTC

Description Joel Speed 2022-08-04 11:35:32 UTC
Description of problem:

I was running some initial testing with the ControlPlaneMachineSet and ran into an issue during a rolling update.

For some reason the KASO decided to update the KAS pods, so it started running the installer to upgrade from version 10 to 11.

Because the node was going away (i.e. it was unschedulable) and the Machine API operator was trying to remove the Machine, the drain evicted the installer pod as soon as it was created. This led to an infinite battle between the MAO trying to drain the node and the KASO trying to update the KAS pod on a node that was going away.


Version-Release number of selected component (if applicable):


How reproducible:

Not sure; I've hit this once out of the 4 times I've tried to do a rolling update.


Steps to Reproduce:
1. Create a cluster on AWS
2. Install the CPMSO from https://github.com/openshift/cluster-control-plane-machine-set-operator/tree/main/manifests
3. Create a ControlPlaneMachineSet manifest
4. Make an update to perform a rolling update

Actual results:
The rollout got stuck because KASO was blocking the Machine from going away


Expected results:
KASO should not prevent us from removing a Control Plane Machine when a replacement machine has already been created and etcd has moved on from this node
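A rough sketch of the kind of guard being described here, purely illustrative and not the actual KASO code (the helper name isNodeGoingAway is made up for this sketch; the fields used are from the core Kubernetes Node API): the operator would skip creating installer pods for a node that is cordoned or whose Node object is already being deleted.

package installer

import corev1 "k8s.io/api/core/v1"

// isNodeGoingAway reports whether a new installer pod should not be scheduled
// on this node because the node is already being drained or removed.
func isNodeGoingAway(node *corev1.Node) bool {
    if node.DeletionTimestamp != nil {
        // The Node object itself is being deleted.
        return true
    }
    if node.Spec.Unschedulable {
        // The node has been cordoned, e.g. by a drain in progress.
        return true
    }
    return false
}

Checking the DeletionTimestamp as well as the cordon covers both halves of the race described above.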

Additional info:
Must Gather: https://drive.google.com/file/d/1dA6800u7isofSsqGROg-PQB1oDqnvnNM/view?usp=sharing

Comment 1 Joel Speed 2022-08-04 11:38:00 UTC
I managed to get the rollout to complete by scaling down the CVO, the MAO, and the Machine API controllers, preventing the drain operation from happening and allowing the installer pod to complete. Once the installer pod completed, the rollout seemed to move on. Scaling the Machine API controllers back up resumed the drain, and the rollout completed.
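For anyone wanting to reproduce this workaround programmatically, a minimal client-go sketch of scaling a Deployment to zero replicas follows. It is illustrative only; the helper name scaleToZero is made up, and in practice the same thing can be done with oc scale against the CVO, MAO, and machine-api-controllers Deployments in their respective namespaces (assuming the usual names).

package workaround

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// scaleToZero pauses a controller by setting its Deployment to zero replicas.
// Restoring the original replica count afterwards is left to the caller.
func scaleToZero(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
    scale, err := client.AppsV1().Deployments(namespace).GetScale(ctx, name, metav1.GetOptions{})
    if err != nil {
        return err
    }
    scale.Spec.Replicas = 0
    _, err = client.AppsV1().Deployments(namespace).UpdateScale(ctx, name, scale, metav1.UpdateOptions{})
    return err
}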

Comment 2 Joel Speed 2022-08-04 16:56:21 UTC
Discussed this on Slack with David, who pointed out that if we used exponential back-off in the drain controller, this wouldn't have been an issue. I think this is something we should probably fix on the MAPI drain controller side.

https://coreos.slack.com/archives/CB48XQ4KZ/p1659613173353759
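For reference, a minimal sketch of what exponential back-off between drain attempts could look like, using the standard client-go workqueue rate limiter together with controller-runtime. This is illustrative only and not the actual MAPI drain controller code; the names drainBackoff, requeueFailedDrain, and forgetDrain are made up for the sketch.

package drain

import (
    "time"

    "k8s.io/client-go/util/workqueue"
    ctrl "sigs.k8s.io/controller-runtime"
)

// drainBackoff grows the retry delay from 1s up to a 5m cap for each
// machine whose drain keeps failing.
var drainBackoff = workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 5*time.Minute)

// requeueFailedDrain asks the controller to retry the drain later, backing off
// progressively instead of fighting other controllers in a tight loop.
func requeueFailedDrain(machineKey string) (ctrl.Result, error) {
    return ctrl.Result{RequeueAfter: drainBackoff.When(machineKey)}, nil
}

// forgetDrain resets the back-off for a machine once its drain succeeds.
func forgetDrain(machineKey string) {
    drainBackoff.Forget(machineKey)
}

With something like this in place, a failed drain would be retried after roughly 1s, 2s, 4s, and so on up to the 5 minute cap, giving the installer pod time to complete instead of being evicted immediately on every attempt.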

Comment 4 sunzhaohua 2022-08-09 14:46:36 UTC
Tried several times and didn't hit this again; the machine can be deleted.

Comment 7 errata-xmlrpc 2023-01-17 19:54:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399