In 4.11, when we switched the MCO over to controller-driven drains, we stopped evicting the MCC pod during a drain so that the controller could finish the node it was working on. In practice this means the master node the MCC runs on retains the MCC pod, and the pod won't get restarted until that master node has finished its OS update and rebooted. In some setups, this could stall the MCO for a long time while it waits for that master node to come back up. This may get resolved with graceful shutdown (?), or maybe I am misunderstanding how pod scheduling works alongside shutdown, but I think in the short term the MCC pod should get drained like any other pod. With John's recent fix to leader elections, the new controller will pick up where the old one left off.
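One way to observe that handoff is to watch where the MCC pod lands during a drain and then check the replacement's logs for leadership acquisition. This is a sketch, not part of the original report; it assumes the controller uses standard client-go leader election and that the deployment is named `machine-config-controller`, so the exact log wording may differ:

```shell
# Watch the MCC pod get evicted from the draining master and rescheduled
# elsewhere (-o wide shows the hosting node, -w streams changes).
oc get pods -n openshift-machine-config-operator -o wide -w | grep machine-config-controller

# Assumption: client-go leader election logs lease acquisition; the new pod
# should acquire the lease before resuming work the old controller left behind.
oc logs -n openshift-machine-config-operator deployment/machine-config-controller | grep -i lease
```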
verified on 4.12.0-0.nightly-2022-07-07-092951

1. create an MC to trigger an update on the master nodes

$ oc create -f change-masters-chrony-configuration.yaml
machineconfig.machineconfiguration.openshift.io/change-masters-chrony-configuration created

$ cat change-masters-chrony-configuration.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
>>  machineconfiguration.openshift.io/role: master
  name: change-masters-chrony-configuration
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 3.2.0
    networkd: {}
    passwd: {}
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,cG9vbCAwLnJoZWwucG9vbC5udHAub3JnIGlidXJzdApkcmlmdGZpbGUgL3Zhci9saWIvY2hyb255L2RyaWZ0Cm1ha2VzdGVwIDEuMCAzCnJ0Y3N5bmMKbG9nZGlyIC92YXIvbG9nL2Nocm9ueQo=
        mode: 420
        overwrite: true
        path: /etc/chrony.conf
  osImageURL: ""

2. check the node name of the MCC pod

$ oc get pod -n openshift-machine-config-operator
NAME                                         READY   STATUS    RESTARTS   AGE
machine-config-controller-7fbd48c6fc-lwttm   2/2     Running   0          21m
...

$ oc get pod/machine-config-controller-7fbd48c6fc-lwttm -n openshift-machine-config-operator -o yaml | yq -y '.spec.nodeName'
>> ip-10-0-193-82.ec2.internal

3. make sure a node drain happened on this node

$ oc get events -n default --field-selector involvedObject.name=ip-10-0-193-82.ec2.internal,type!=Warning
LAST SEEN   TYPE     REASON   OBJECT   MESSAGE
...
23m   Normal   Cordon   node/ip-10-0-193-82.ec2.internal   Cordoned node to apply update
>> 23m   Normal   Drain   node/ip-10-0-193-82.ec2.internal   Draining node to update config.
23m   Normal   NodeNotSchedulable   node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeNotSchedulable
22m   Normal   OSUpdateStarted   node/ip-10-0-193-82.ec2.internal
22m   Normal   OSUpgradeSkipped   node/ip-10-0-193-82.ec2.internal   OS upgrade skipped; new MachineConfig (rendered-master-5beee16903bea1f4444aaa5362b5cca8) has same OS image (quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bbeb2b82e57a51be17177daaffb33b7d557ea7595db2c52c83e8806cc0104100) as old MachineConfig (rendered-master-1341b162472e3d65bde2fa96a3a7e1a8)
22m   Normal   OSUpdateStaged   node/ip-10-0-193-82.ec2.internal   Changes to OS staged
22m   Normal   PendingConfig   node/ip-10-0-193-82.ec2.internal   Written pending config rendered-master-5beee16903bea1f4444aaa5362b5cca8
>> 22m   Normal   Reboot   node/ip-10-0-193-82.ec2.internal   Node will reboot into config rendered-master-5beee16903bea1f4444aaa5362b5cca8
21m   Normal   NodeNotReady   node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeNotReady
19m   Normal   Starting   node/ip-10-0-193-82.ec2.internal   Starting kubelet.
19m   Normal   NodeAllocatableEnforced   node/ip-10-0-193-82.ec2.internal   Updated Node Allocatable limit across pods
19m   Normal   NodeHasSufficientMemory   node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeHasSufficientMemory
19m   Normal   NodeHasNoDiskPressure   node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeHasNoDiskPressure
19m   Normal   NodeHasSufficientPID   node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeHasSufficientPID
19m   Normal   NodeReady   node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeReady
19m   Normal   NodeNotSchedulable   node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeNotSchedulable
19m   Normal   Starting   node/ip-10-0-193-82.ec2.internal   openshift-sdn done initializing node networking.
19m   Normal   NodeDone   node/ip-10-0-193-82.ec2.internal   Setting node ip-10-0-193-82.ec2.internal, currentConfig rendered-master-5beee16903bea1f4444aaa5362b5cca8 to Done
19m   Normal   NodeSchedulable   node/ip-10-0-193-82.ec2.internal   Node ip-10-0-193-82.ec2.internal status is now: NodeSchedulable
19m   Normal   Uncordon   node/ip-10-0-193-82.ec2.internal   Update completed for config rendered-master-5beee16903bea1f4444aaa5362b5cca8 and node has been uncordoned
19m   Normal   ConfigDriftMonitorStarted   node/ip-10-0-193-82.ec2.internal   Config Drift Monitor started, watching against rendered-master-5beee16903bea1f4444aaa5362b5c

4. check the pods running on the above node once the update has completed on the master pool; no MCC pod is found

$ oc get mcp/master
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-5beee16903bea1f4444aaa5362b5cca8   True      False      False      3              3                   3                     0                      40m

$ oc get pod -n openshift-machine-config-operator --field-selector spec.nodeName=ip-10-0-193-82.ec2.internal
NAME                                       READY   STATUS    RESTARTS   AGE
machine-config-daemon-7nv52                2/2     Running   2          41m
machine-config-operator-7b567bfc64-lbc6x   1/1     Running   0          12m
machine-config-server-bl2sb                1/1     Running   1          39m

5. the MCC pod is scheduled on another node

$ oc get pod -n openshift-machine-config-operator | grep machine-config-controller
machine-config-controller-7fbd48c6fc-bmwf2   2/2     Running   0          9m49s

$ oc get pod machine-config-controller-7fbd48c6fc-bmwf2 -n openshift-machine-config-operator -o yaml | yq -y '.spec.nodeName'
>> ip-10-0-144-46.ec2.internal
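The chrony settings in the MachineConfig from step 1 travel base64-encoded inside the Ignition data URL; decoding that string locally is a quick way to confirm what the MCD writes to /etc/chrony.conf on each master:

```shell
# Decode the Ignition data URL payload from the MachineConfig in step 1.
# The output is the exact chrony.conf content applied to the nodes.
echo 'cG9vbCAwLnJoZWwucG9vbC5udHAub3JnIGlidXJzdApkcmlmdGZpbGUgL3Zhci9saWIvY2hyb255L2RyaWZ0Cm1ha2VzdGVwIDEuMCAzCnJ0Y3N5bmMKbG9nZGlyIC92YXIvbG9nL2Nocm9ueQo=' | base64 -d
```

The first decoded line is `pool 0.rhel.pool.ntp.org iburst`, matching the intent of the test MC.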
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399