Description of problem:
I was running some initial testing with the ControlPlaneMachineSet and ran into an issue during a rolling update. For some reason the KASO decided to update the KAS pods, so it started running the installer to upgrade from version 10 to 11. Because the node was going away (i.e. it was unschedulable) and the Machine API operator was trying to remove the Machine, it started evicting the installer pod as soon as it was created. This led to an endless battle between MAO trying to drain the node and KASO trying to update the KAS pod on a node that was going away.

Version-Release number of selected component (if applicable):

How reproducible:
Not sure. I've managed to hit this once in the 4 times I've tried to do a rolling update.

Steps to Reproduce:
1. Create a cluster on AWS
2. Install the CPMSO from https://github.com/openshift/cluster-control-plane-machine-set-operator/tree/main/manifests
3. Create a ControlPlaneMachineSet manifest
4. Make an update to perform a rolling update

Actual results:
The rollout got stuck because KASO was blocking the Machine from going away.

Expected results:
KASO should not prevent us from removing a Control Plane Machine when a replacement machine has already been created and etcd has moved on from this node.

Additional info:
Must Gather: https://drive.google.com/file/d/1dA6800u7isofSsqGROg-PQB1oDqnvnNM/view?usp=sharing
I managed to get the rollout to complete by scaling down CVO, MAO and the Machine API Controllers, preventing the drain operation from happening and allowing the installer pod to complete. Once it completed, it seemed to move on. Scaling back up the Machine API Controllers continued the drain and the rollout completed.
Discussed this on Slack with David, who pointed out that if we used exponential back-off in the drain controller, this wouldn't have been an issue. I think this is something we should probably fix on the MAPI drain controller side. https://coreos.slack.com/archives/CB48XQ4KZ/p1659613173353759
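To illustrate the back-off idea, here is a minimal sketch in Go of how the wait between drain attempts could grow after each failed attempt. This is not the actual Machine API drain controller code: the function name drainBackoff, the 5 second base delay and the 5 minute cap are all assumptions picked for illustration. The point is only that a growing gap between eviction rounds would leave a window in which the installer pod can run to completion.

```go
package main

import (
	"fmt"
	"time"
)

// drainBackoff is a hypothetical helper (not part of the Machine API code base).
// It returns how long to wait before the next drain attempt after `failures`
// consecutive failed attempts: base, 2*base, 4*base, ... capped at max.
func drainBackoff(base, max time.Duration, failures int) time.Duration {
	if failures <= 0 {
		// Nothing has failed yet, so retry immediately.
		return 0
	}
	delay := base
	for i := 1; i < failures; i++ {
		// Stop doubling once the next step would reach or exceed the cap;
		// this also avoids overflowing the duration for large failure counts.
		if delay >= max/2 {
			return max
		}
		delay *= 2
	}
	if delay > max {
		return max
	}
	return delay
}

func main() {
	// Assumed values for illustration only.
	base := 5 * time.Second
	max := 5 * time.Minute

	// Print the wait that would precede each retry of the drain.
	for failures := 1; failures <= 8; failures++ {
		fmt.Printf("after %d failed drain attempt(s): wait %s\n",
			failures, drainBackoff(base, max, failures))
	}
}
```

In a real controller the delay would more likely come from the work queue's rate limiter (for example client-go's workqueue.NewItemExponentialFailureRateLimiter) than from a hand-written helper; whether the drain controller can be wired up that way is not something this report establishes.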
Tried several times and did not hit this again; the machine can be deleted.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399