Related to https://bugzilla.redhat.com/show_bug.cgi?id=2053255

When a static pod fails to roll out, there is no detection in the static pod operator library today. This makes it impossible to determine an accurate reliability number for static pod rollouts. Without that information, we cannot assess whether a bug like https://bugzilla.redhat.com/show_bug.cgi?id=2053255 is a stop-ship problem, and it is harder to identify and debug.

Working from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26831/pull-ci-openshift-origin-master-e2e-aws-fips/1491798226607542272 as an example.

TIMELINE OF EVENTS:
0. "Kubelet version" kubeletVersion="v1.23.3+f14faf2" - contains rphillips' fix for static pods
1. kcm revision/7 node/215 installer (normal pod) started at 16:22:03Z, finished at 16:22:57Z
2. kcm revision/7 node/215 static pod file created before 16:22:57. Pod in API created at 16:23:10Z, kcm container started at 16:23:10Z
3. kcm-o observes node/215 as available at 16:23:30
4. kcm revision/7 node/177 installer (normal pod) started at 16:23:48Z, finished at 16:24:21Z
5. kcm revision/7 node/177 static pod file created before 16:24:21Z. Pod in API not updated by 16:42:35

This detection could be built by checking the latest installer pod for each node. If the installer pod was successful, if it has been longer than terminationGracePeriodSeconds+10s since the installer pod completed successfully, and if the static pod is not at the correct revision, the controller should go degraded. It should also emit an event for detection in CI (a rough sketch of this check follows this comment).

Since this gap blocks our ability to identify the problem, assess its frequency, and decide whether the kubelet bug is a blocker, this becomes a blocker.

The fix will apply to the following operators and should be tested by QE on all of them:
1. kube-apiserver-operator
2. kube-controller-manager-operator
3. kube-scheduler-operator
4. etcd-operator
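To make the proposed check concrete, here is a minimal sketch in Go of the per-node detection described above. It is not the actual library-go controller; installerFinishTime and staticPodRevision are illustrative helpers, and the revision lookup is only a placeholder:

// Sketch only: hypothetical detection for a MissingStaticPod-style check.
package missingstaticpod

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
)

// extra slack on top of terminationGracePeriodSeconds before calling the rollout stuck
const gracePeriodFudge = 10 * time.Second

// checkNode returns a degraded message when the installer for targetRevision finished
// long enough ago but the node's static pod is still not at that revision.
func checkNode(nodeName string, targetRevision int32, installerPod, staticPod *corev1.Pod,
	terminationGracePeriod time.Duration, now time.Time) (string, bool) {

	if installerPod == nil || installerPod.Status.Phase != corev1.PodSucceeded {
		// Installer hasn't succeeded; installer failures are reported elsewhere.
		return "", false
	}
	finishedAt := installerFinishTime(installerPod)
	if finishedAt.IsZero() || now.Sub(finishedAt) < terminationGracePeriod+gracePeriodFudge {
		return "", false // still within the allowed window
	}
	if staticPod != nil && staticPodRevision(staticPod) == targetRevision {
		return "", false // static pod reached the expected revision
	}
	// Caller would set Degraded and emit an event here for detection in CI.
	return fmt.Sprintf("static pod for revision %d on node %q didn't show up after installer completed at %s",
		targetRevision, nodeName, finishedAt.Format(time.RFC3339)), true
}

// installerFinishTime extracts the completion time of the installer container.
func installerFinishTime(pod *corev1.Pod) time.Time {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Terminated != nil {
			return cs.State.Terminated.FinishedAt.Time
		}
	}
	return time.Time{}
}

// staticPodRevision would read the revision the installer stamped on the static pod
// manifest (illustrative; the real label/annotation name may differ).
func staticPodRevision(pod *corev1.Pod) int32 {
	_ = pod
	return 0
}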
We need this to determine the scope of our problem. If we ship 4.10.0 with a problem that prevents upgrading to 4.10.z+1, we have a serious problem. Marking blocker.
** A NOTE ABOUT USING URGENT **

This BZ has been set to urgent severity and priority. When a BZ is marked urgent priority, engineers are asked to stop whatever they are doing, putting everything else on hold. Please be prepared to have reasonable justification ready to discuss, and ensure your own and engineering management are aware and agree this BZ is urgent. Keep in mind, urgent bugs are very expensive and have maximal management visibility.

NOTE: This bug was automatically assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.

** INFORMATION REQUIRED **

Please answer these questions before escalation to engineering:

1. Has a link to must-gather output been provided in this BZ? We cannot work without it. If must-gather fails to run, attach all relevant logs and provide the error message of must-gather.
2. Give the output of "oc get clusteroperators -o yaml".
3. In case of degraded/unavailable operators, have all their logs and the logs of the operands been analyzed? [yes/no]
4. List the top 5 relevant errors from the logs of the operators and operands in (3).
5. Order the list of degraded/unavailable operators according to which is likely the cause of the failure of the others, root cause at the top.
6. Explain why (5) is likely the right order and list the information used for that assessment.
7. Explain why Engineering is necessary to make progress.
Setting to blocker+ based on convo with David.
https://github.com/openshift/origin/pull/26837 is now associated with this, but it is only the easy HALF of the solution. Once the event exists, that test will fail, but the operators still need to produce the events.
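For reference, the CI-side half essentially amounts to scanning the operator namespaces for the new event. A rough client-go sketch (not the actual origin test; it assumes the event reason ends up being MissingStaticPod, which is what the verification output later in this bug shows):

// Sketch only: scan operator namespaces for MissingStaticPod events.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Namespaces of the static pod operators covered by this bug.
	for _, ns := range []string{
		"openshift-kube-apiserver-operator",
		"openshift-kube-controller-manager-operator",
		"openshift-kube-scheduler-operator",
		"openshift-etcd-operator",
	} {
		events, err := client.CoreV1().Events(ns).List(context.TODO(),
			metav1.ListOptions{FieldSelector: "reason=MissingStaticPod"})
		if err != nil {
			panic(err)
		}
		for _, ev := range events.Items {
			// In CI this would fail/flake the job; here we just print the finding.
			fmt.Printf("%s: %s\n", ns, ev.Message)
		}
	}
}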
> This detection could be built by checking the latest installer pods for each node. If the installer pod was successful, and if it has been longer than the terminationGracePeriodSeconds+10seconds since the installer pod completed successfully, and if the static pod is not at the correct revision, this controller should go degraded. It should also emit an event for detection in CI.

kubeapiserver.operator.openshift.io/v1 today has this in its status:

nodeStatuses:
- currentRevision: 0
  nodeName: ip-10-0-210-17.us-west-1.compute.internal
  targetRevision: 3

This came from the static pod installer controller, AFAIK. If we extend this to report 'installedRevision: 2' - which would actually be reported by the installer pod itself once it finishes copying the static pod manifests into the target kubelet directory - would that make debugging this easier? In addition, we can have a controller that watches updates to this CRD and detects and reports time delays (if currentRevision=1 but installedRevision=2 for more than 60s - emit an event, go degraded, etc.).
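A rough sketch of what the proposed extension and the delay check could look like. InstalledRevision and InstalledAt are hypothetical fields that do not exist in operator.openshift.io/v1 today, and rolloutStalled is only illustrative:

// Sketch only: proposed nodeStatuses extension plus a stall check.
package nodestatus

import "time"

// NodeStatus mirrors today's nodeStatuses entry, with the proposed additions.
type NodeStatus struct {
	NodeName          string
	CurrentRevision   int32
	TargetRevision    int32
	InstalledRevision int32     // proposed: written by the installer pod after copying manifests
	InstalledAt       time.Time // proposed: when the installer finished copying
}

// rolloutStalled reports whether the installer finished a newer revision more than
// maxDelay ago while the kubelet still has not started it; a watching controller
// would emit an event and/or go degraded when this returns true.
func rolloutStalled(s NodeStatus, maxDelay time.Duration, now time.Time) bool {
	return s.InstalledRevision > s.CurrentRevision &&
		!s.InstalledAt.IsZero() &&
		now.Sub(s.InstalledAt) > maxDelay
}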
*** Bug 2053620 has been marked as a duplicate of this bug. ***
*** Bug 2053618 has been marked as a duplicate of this bug. ***
*** Bug 2053616 has been marked as a duplicate of this bug. ***
Moving it to ASSIGNED; the 4 operator PRs (kas, scheduler, etcd, kcm) are on their way.
I am moving this BZ to POST because I want to merge a small improvement.
When I was trying to install a cluster with a Jenkins CI job, the installation got stuck because the kube-controller-manager cluster operator was degraded.

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-02-16-171622   True        False         7h14m   Error while reconciling 4.10.0-0.nightly-2022-02-16-171622: the cluster operator kube-controller-manager is degraded

$ oc get co | grep -v "True.*False.*False"
NAME                      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-controller-manager   4.10.0-0.nightly-2022-02-16-171622   True        True          True       7h29m   MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 7 on node: "kewang-17410g1-vm7zr-master-0.c.openshift-qe.internal" didn't show up, waited: 2m30s

$ oc describe kubecontrollermanagers
Name:         cluster
Namespace:
Labels:       <none>
Annotations:  include.release.openshift.io/ibm-cloud-managed: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
              release.openshift.io/create-only: true
API Version:  operator.openshift.io/v1
Kind:         KubeControllerManager
...
Status:
  Conditions:
    Last Transition Time:  2022-02-17T08:21:01Z
    Reason:                AsExpected
    Status:                False
    Type:                  GuardControllerDegraded
    Last Transition Time:  2022-02-17T08:11:53Z
    Status:                False
    Type:                  InstallerControllerDegraded
    Last Transition Time:  2022-02-17T08:12:57Z
    Message:               3 nodes are active; 1 nodes are at revision 5; 1 nodes are at revision 6; 1 nodes are at revision 7
    Status:                True
    Type:                  StaticPodsAvailable
    Last Transition Time:  2022-02-17T08:09:28Z
    Message:               1 nodes are at revision 5; 1 nodes are at revision 6; 1 nodes are at revision 7
    Status:                True
    Type:                  NodeInstallerProgressing
    Last Transition Time:  2022-02-17T08:09:16Z
    Status:                False
    Type:                  NodeInstallerDegraded
    Last Transition Time:  2022-02-17T08:22:07Z
    Status:                False
    Type:                  StaticPodsDegraded
    Last Transition Time:  2022-02-17T08:09:16Z
    Message:               All master nodes are ready
    Reason:                MasterNodesReady
    Status:                False
    Type:                  NodeControllerDegraded
    Last Transition Time:  2022-02-17T08:09:17Z
    Reason:                NoUnsupportedConfigOverrides
    Status:                True
    Type:                  UnsupportedConfigOverridesUpgradeable
    Last Transition Time:  2022-02-17T08:09:20Z
    Status:                False
    Type:                  CertRotation_CSRSigningCert_Degraded
    Last Transition Time:  2022-02-17T08:09:21Z
    Reason:                AsExpected
    Status:                False
    Type:                  BackingResourceControllerDegraded
    Last Transition Time:  2022-02-17T08:09:23Z
    Status:                False
    Type:                  ResourceSyncControllerDegraded
    Last Transition Time:  2022-02-17T08:09:36Z
    Status:                False
    Type:                  ConfigObservationDegraded
    Last Transition Time:  2022-02-17T08:09:25Z
    Status:                False
    Type:                  InstallerPodPendingDegraded
    Last Transition Time:  2022-02-17T08:09:25Z
    Status:                False
    Type:                  InstallerPodContainerWaitingDegraded
    Last Transition Time:  2022-02-17T08:09:25Z
    Status:                False
    Type:                  InstallerPodNetworkingDegraded
    Last Transition Time:  2022-02-17T08:11:11Z
    Status:                False
    Type:                  RevisionControllerDegraded
    Last Transition Time:  2022-02-17T08:25:29Z
    Message:               static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 7 on node: "kewang-17410g1-vm7zr-master-0.c.openshift-qe.internal" didn't show up, waited: 2m30s
    Reason:                SyncError
    Status:                True
    Type:                  MissingStaticPodControllerDegraded
    Last Transition Time:  2022-02-17T08:09:33Z
    Reason:                AsExpected
    Status:                False
    Type:                  KubeControllerManagerStaticResourcesDegraded
    Last Transition Time:  2022-02-17T08:09:34Z
    Status:                False
    Type:                  SATokenSignerDegraded
    Last Transition Time:  2022-02-17T08:09:51Z
    Status:                True
    Type:                  Upgradeable
    Last Transition Time:  2022-02-17T08:09:51Z
    Status:                True
    Type:                  CloudControllerOwner
    Last Transition Time:  2022-02-17T08:09:51Z
    Status:                False
    Type:                  TargetConfigControllerDegraded
  Latest Available Revision:  7
  Latest Available Revision Reason:
  Node Statuses:
    Current Revision:  7
    Node Name:         kewang-17410g1-vm7zr-master-2.c.openshift-qe.internal
    Current Revision:  5
    Node Name:         kewang-17410g1-vm7zr-master-0.c.openshift-qe.internal
    Target Revision:   7
    Current Revision:  6
    Node Name:         kewang-17410g1-vm7zr-master-1.c.openshift-qe.internal
  Ready Replicas:  0
Events:  <none>

----

The kube-controller-manager static pod is not at the correct revision, so the controller going degraded is indeed expected. However, it has stayed degraded for 7h14m, so there is a potential risk that the upgrade will fail. Is that what we expect?
The following verification steps were confirmed with the devs. All of the following operators should be tested:
1. kube-apiserver-operator
2. kube-controller-manager-operator
3. kube-scheduler-operator
4. etcd-operator

Checkpoints for each operator:
1. During a static pod rollout, the latest installer pod for each node completes successfully.
2. After the kubelet deletes the currently running pod, if for some reason the kubelet doesn't run the static pod at the correct revision, the controller should go degraded and emit an event.

--------------------

For kube-apiserver-operator:

Force a rollout of kube-apiserver in one terminal,

$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "roll-'"$( date --rfc-3339=ns )"'"} ]'

When the new kube-apiserver installer pod is starting,

$ oc get pods -n openshift-kube-apiserver --show-labels | grep Running | grep install
installer-15-xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal   1/1   Running   0   5s   app=installer

In another terminal,

oc debug node/xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal   # log in to the node where kube-apiserver is rolling out
sh-4.4# ip link | grep ens
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
sh-4.4# cat test.sh
ifconfig ens4 down
sleep 600
ifconfig ens4 up
sh-4.4# chmod +x test.sh
# Check the latest installer pod's status; after the installer pod completes successfully, wait about 30s, then run the script
sh-4.4# /tmp/test.sh &

After a while, check the cluster operators,

# oc get co | grep -v "True.*False.*False"
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.10.0-0.nightly-2022-02-17-234353   True        True          False      6h41m   NodeInstallerProgressing: 2 nodes are at revision 13; 1 nodes are at revision 14; 0 nodes have achieved new revision 15

Check if there are events explaining why the static pod cannot be started,

$ masters=$(oc get no -l node-role.kubernetes.io/master | sed '1d' | awk '{print $1}')
$ for node in $masters; do echo $node;oc debug no/$node -- chroot /host bash -c "grep -ir 'static pod lifecycle failure' /var/log/ | grep -v debug";done | tail -5
...
Removing debug pod ...
/var/log/pods/openshift-kube-apiserver-operator_kube-apiserver-operator-68ddc9cc8c-7hmp7_5a922cf8-4725-40aa-952b-9c300d14ce95/kube-apiserver-operator/0.log:2022-02-21T09:44:35.924962534+00:00 stderr F I0221 09:44:35.924611 1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"34684bef-914e-4bf8-b99a-adfdb0d134f5", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Degraded message changed from "MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"kube-apiserver\" in namespace: \"openshift-kube-apiserver\" for revision: 15 on node: \"xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal\" didn't show up, waited: 4m15s" to "MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"kube-apiserver\" in namespace: \"openshift-kube-apiserver\" for revision: 15 on node: \"xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal\" didn't show up, waited: 4m15s\nStaticPodsDegraded: pod/kube-apiserver-xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal container \"kube-apiserver\" started at 2022-02-21 09:42:40 +0000 UTC is still not ready\nStaticPodsDegraded: pod/kube-apiserver-xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal container \"kube-apiserver-check-endpoints\" is waiting: CrashLoopBackOff: back-off 2m40s restarting failed container=kube-apiserver-check-endpoints pod=kube-apiserver-xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal_openshift-kube-apiserver(1e4249e162c276500f1f54ec5bc523f6)"
...
2022-02-21T09:44:35.899450624+00:00 stderr F I0221 09:44:35.898490 1 status_controller.go:211] clusteroperator/kube-apiserver diff {"status":{"conditions":[{"lastTransitionTime":"2022-02-21T09:42:52Z","message":"MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"kube-apiserver\" in namespace: \"openshift-kube-apiserver\" for revision: 15 on node: \"xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal\" didn't show up, waited: 4m15s\nStaticPodsDegraded: pod/kube-apiserver-xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal container \"kube-apiserver\" started at 2022-02-21 09:42:40 +0000 UTC is still not ready\nStaticPodsDegraded: pod/kube-apiserver-xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal container \"kube-apiserver-check-endpoints\" is waiting: CrashLoopBackOff: back-off 2m40s restarting failed container=kube-apiserver-check-endpoints pod=kube-apiserver-xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal_openshift-kube-apiserver(1e4249e162c276500f1f54ec5bc523f6)","reason":"MissingStaticPodController_SyncError::StaticPods_Error","status":"True","type":"Degraded"},{"lastTransitionTime":"2022-02-21T09:18:10Z","message":"NodeInstallerProgressing: 2 nodes are at revision 13; 1 nodes are at revision 14; 0 nodes have achieved new revision 15","reason":"NodeInstaller","status":"True","type":"Progressing"},{"lastTransitionTime":"2022-02-21T02:51:53Z","message":"StaticPodsAvailable: 3 nodes are active; 2 nodes are at revision 13; 1 nodes are at revision 14; 0 nodes have achieved new revision 15","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2022-02-21T02:37:19Z","message":"KubeletMinorVersionUpgradeable: Kubelet and API server minor versions are synced.","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}

Verification of the other operators will follow shortly.
For openshift-etcd-operator:

Force a rollout of etcd in the first terminal,

$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
etcd.operator.openshift.io/cluster patched

When the new etcd installer pod is starting,

$ oc get pod -n openshift-etcd --show-labels | grep Running | grep install
installer-9-xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal   1/1   Running   0   24s   app=installer

In the second terminal,

oc debug node/xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal   # log in to the node where etcd is rolling out
sh-4.4# ip link | grep ens
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
sh-4.4# cat test.sh
ifconfig ens4 down
sleep 600
ifconfig ens4 up
sh-4.4# chmod +x test.sh
# Check the latest installer pod's status; after the installer pod completes successfully, wait about 10s, then run the script
sh-4.4# /tmp/test.sh &

After a while, monitor the etcd cluster operator in the first terminal,

$ while true; do oc get pod -n openshift-etcd --show-labels | grep Running;echo; oc get co | grep -v "True.*False.*False";echo;oc get events -n openshift-etcd-operator | grep -i 'static pod lifecycle failure';sleep 10;done | tee watch-etcd.log
...
etcd   4.10.0-0.nightly-2022-02-17-234353   True   True   False   4h32m   NodeInstallerProgressing: 3 nodes are at revision 8; 0 nodes have achieved new revision 9
...
3s   Normal   MissingStaticPod   deployment/etcd-operator   static pod lifecycle failure - static pod: "etcd" in namespace: "openshift-etcd" for revision: 9 on node: "xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal" didn't show up, waited: 2m30s
2s   Normal   OperatorStatusChanged   deployment/etcd-operator   Status for clusteroperator/etcd changed: Degraded message changed from "EtcdMembersControllerDegraded: configmaps lister not synced\nDefragControllerDegraded: configmaps lister not synced\nNodeControllerDegraded: The master nodes not ready: node \"xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal\" not ready since 2022-02-22 06:19:01 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nEtcdMembersDegraded: No unhealthy members found" to "EtcdMembersControllerDegraded: configmaps lister not synced\nDefragControllerDegraded: configmaps lister not synced\nNodeControllerDegraded: The master nodes not ready: node \"xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal\" not ready since 2022-02-22 06:19:01 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nMissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"etcd\" in namespace: \"openshift-etcd\" for revision: 9 on node: \"xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal\" didn't show up, waited: 2m30s\nEtcdMembersDegraded: No unhealthy members found"
...
etcd   4.10.0-0.nightly-2022-02-17-234353   True   True   True   4h40m   ClusterMemberControllerDegraded: unhealthy members found during reconciling members...

Based on the above, when the kubelet did not run the etcd static pod at the correct revision (because 'The master nodes not ready'), the controller went degraded and emitted an event.
For kube-scheduler-operator:

Force a rollout of kube-scheduler in the first terminal,

$ oc patch kubescheduler cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
kubescheduler.operator.openshift.io/cluster patched

When the new kube-scheduler installer pod is starting,

$ oc get pod -n openshift-kube-scheduler --show-labels | grep Running | grep install
installer-7-xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal   1/1   Running   0   23s   app=installer

In the second terminal,

oc debug node/xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal   # log in to the node where kube-scheduler is rolling out
sh-4.4# ip link | grep ens
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
sh-4.4# cat test.sh
ifconfig ens4 down
sleep 600
ifconfig ens4 up
sh-4.4# chmod +x test.sh
# Check the latest installer pod's status; after the installer pod completes successfully, wait about 10s, then run the script
sh-4.4# /tmp/test.sh &

After a while, monitor the kube-scheduler cluster operator in the first terminal,

$ while true; do oc get pod -n openshift-kube-scheduler --show-labels | grep Running;echo; oc get co | grep -v "True.*False.*False";echo;oc get events -n openshift-kube-scheduler-operator | grep -i 'static pod lifecycle failure';sleep 10;done | tee watch-sche.log
...
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-scheduler   4.10.0-0.nightly-2022-02-17-234353   True        True          False      5h55m   NodeInstallerProgressing: 3 nodes are at revision 6; 0 nodes have achieved new revision 7
...
0s   Normal   MissingStaticPod   deployment/openshift-kube-scheduler-operator   static pod lifecycle failure - static pod: "openshift-kube-scheduler" in namespace: "openshift-kube-scheduler" for revision: 7 on node: "xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal" didn't show up, waited: 2m30s
0s   Normal   OperatorStatusChanged   deployment/openshift-kube-scheduler-operator   Status for clusteroperator/kube-scheduler changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node \"xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal\" not ready since 2022-02-22 07:42:35 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)" to "MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"openshift-kube-scheduler\" in namespace: \"openshift-kube-scheduler\" for revision: 7 on node: \"xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal\" didn't show up, waited: 2m30s\nNodeControllerDegraded: The master nodes not ready: node \"xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal\" not ready since 2022-02-22 07:42:35 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"
...
kube-scheduler   4.10.0-0.nightly-2022-02-17-234353   True   True   True   5h57m   MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "openshift-kube-scheduler" in namespace: "openshift-kube-scheduler" for revision: 7 on node: "xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal" didn't show up, waited: 2m30s...

Based on the above, when the kubelet did not run the kube-scheduler static pod at the correct revision (because 'The master nodes not ready'), the controller went degraded and emitted an event.
For kube-controller-manager-operator:

Force a rollout of kube-controller-manager (KCM) in the first terminal,

$ oc patch kubecontrollermanager cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
kubecontrollermanager.operator.openshift.io/cluster patched

When the new KCM installer pod is starting,

$ oc get pod -n openshift-kube-controller-manager --show-labels | grep Running | grep install
installer-7-xxxx-22410g1-m6qxd-master-2.c.openshift-qe.internal   1/1   Running   0   22s   app=installer

In the second terminal,

oc debug node/xxxx-22410g1-m6qxd-master-2.c.openshift-qe.internal   # log in to the node where KCM is rolling out
sh-4.4# ip link | grep ens
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
sh-4.4# cat test.sh
ifconfig ens4 down
sleep 600
ifconfig ens4 up
sh-4.4# chmod +x test.sh
# Check the latest installer pod's status; after the installer pod completes successfully, wait about 10s, then run the script
sh-4.4# /tmp/test.sh &

After a while, monitor the KCM cluster operator in the first terminal,

$ while true; do oc get pod -n openshift-kube-controller-manager --show-labels | grep Running;echo; oc get co | grep -v "True.*False.*False";echo;oc get events -n openshift-kube-controller-manager-operator | grep -i 'static pod lifecycle failure';sleep 10;done | tee watch-kcm.log
...
NAME                      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-controller-manager   4.10.0-0.nightly-2022-02-17-234353   True        True          False      6h29m   NodeInstallerProgressing: 3 nodes are at revision 7; 0 nodes have achieved new revision 8
...
6s   Normal   MissingStaticPod   deployment/kube-controller-manager-operator   static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 8 on node: "xxxx-22410g1-m6qxd-master-2.c.openshift-qe.internal" didn't show up, waited: 2m30s
6s   Normal   OperatorStatusChanged   deployment/kube-controller-manager-operator   Status for clusteroperator/kube-controller-manager changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node \"xxxx-22410g1-m6qxd-master-2.c.openshift-qe.internal\" not ready since 2022-02-22 08:17:07 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)" to "MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"kube-controller-manager\" in namespace: \"openshift-kube-controller-manager\" for revision: 8 on node: \"xxxx-22410g1-m6qxd-master-2.c.openshift-qe.internal\" didn't show up, waited: 2m30s\nNodeControllerDegraded: The master nodes not ready: node \"xxxx-22410g1-m6qxd-master-2.c.openshift-qe.internal\" not ready since 2022-02-22 08:17:07 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"
...
kube-controller-manager   4.10.0-0.nightly-2022-02-17-234353   True   True   True   6h32m   MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 8 on node: "xxxx-22410g1-m6qxd-master-2.c.openshift-qe.internal" didn't show up, waited: 2m30s...

Based on the above, when the kubelet did not run the KCM static pod at the correct revision (because 'The master nodes not ready'), the controller went degraded and emitted an event.

In conclusion, the bug fix works as expected, so moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056