Bug 2115308 - Kube API server operator should not update replicas when Machine/Node is being removed
Summary: Kube API server operator should not update replicas when Machine/Node is being removed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-08-04 11:35 UTC by Joel Speed
Modified: 2023-01-17 19:54 UTC (History)
2 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-17 19:54:14 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-api-operator pull 1051 0 None open Bug 2115308: Ensure failed drains are subject to exponential backoff 2022-08-08 11:11:51 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:54:35 UTC

Description Joel Speed 2022-08-04 11:35:32 UTC
Description of problem:

I was running some initial testing with the ControlPlaneMachineSet and ran into an issue during a rolling update.

For some reason the KASO decided to update the KAS pods, so it started running the installer to upgrade from version 10 to 11.

Because the node was going away (i.e. it was unschedulable) and the Machine API operator was trying to remove the Machine, it started evicting the installer pod as soon as it was created. This led to an infinite battle between the MAO trying to drain the node and the KASO trying to update the KAS pod on a node that was going away.


Version-Release number of selected component (if applicable):


How reproducible:

Not sure; I've hit this once in the four times I've tried to do a rolling update.


Steps to Reproduce:
1. Create a cluster on AWS
2. Install the CPMSO from https://github.com/openshift/cluster-control-plane-machine-set-operator/tree/main/manifests
3. Create a ControlPlaneMachineSet manifest
4. Make an update to perform a rolling update

Actual results:
The rollout got stuck because KASO was blocking the Machine from going away


Expected results:
KASO should not prevent us from removing a Control Plane Machine when a replacement machine has already been created and etcd has moved on from this node

Additional info:
Must Gather: https://drive.google.com/file/d/1dA6800u7isofSsqGROg-PQB1oDqnvnNM/view?usp=sharing

Comment 1 Joel Speed 2022-08-04 11:38:00 UTC
I managed to get the rollout to complete by scaling down the CVO, the MAO, and the Machine API controllers, which prevented the drain operation from happening and allowed the installer pod to complete. Once it completed, the rollout seemed to move on. Scaling the Machine API controllers back up resumed the drain, and the rollout completed.

Comment 2 Joel Speed 2022-08-04 16:56:21 UTC
Discussed this on Slack with David, who pointed out that if we had used exponential back-off in the drain controller, this wouldn't have been an issue. I think this is something we should fix on the MAPI drain controller side.

https://coreos.slack.com/archives/CB48XQ4KZ/p1659613173353759

Comment 4 sunzhaohua 2022-08-09 14:46:36 UTC
Tried several times and didn't encounter this again; the machine can be deleted.

Comment 7 errata-xmlrpc 2023-01-17 19:54:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

