*** Bug 2053617 has been marked as a duplicate of this bug. ***
Moving it to ASSIGNED; 4 operator PRs (kas, scheduler, etcd, kcm) are on their way.
I am moving this BZ to POST because I want to merge a small improvement.
The following operators should all be tested:
1. kube-apiserver-operator
2. kube-controller-manager-operator
3. kube-scheduler-operator
4. etcd-operator

Checkpoints for each operator:
1. When the static pod rolls out, the latest installer pod completes successfully on each node.
2. If, after the kubelet deletes the currently running pod, the kubelet for some reason doesn't run the static pod at the correct revision, this controller should go Degraded and emit an event.

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-02-18-121223   True        False         133m    Cluster version is 4.11.0-0.nightly-2022-02-18-121223

For the kube-apiserver operator, force a rollout of the kube-apiserver in the first terminal:
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "roll-'"$( date --rfc-3339=ns )"'"} ]'
kubeapiserver.operator.openshift.io/cluster patched

When the new kube-apiserver installer pod is starting:
$ oc get pod -n openshift-kube-apiserver --show-labels | grep Running | grep install
installer-9-xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal 1/1 Running 0 13s app=installer

In the second terminal:
$ oc debug node/xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal   # logged in to the node where the kube-apiserver is rolling out
sh-4.4# ip link | grep ens
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
sh-4.4# cat test.sh
ifconfig ens4 down
sleep 600
ifconfig ens4 up
sh-4.4# chmod +x test.sh
# Check the status of the latest installer pod; after it has completed successfully, wait about 10s, then run the script
sh-4.4# /tmp/test.sh &

After a while, monitor the kube-apiserver cluster operator in the first terminal:
$ while true; do oc get pod -n openshift-kube-apiserver --show-labels | grep Running;echo; oc get co | grep -v "True.*False.*False";echo;oc get events -n openshift-kube-apiserver-operator | grep -i 'static pod lifecycle failure';sleep 10;done | tee watch-kas.log
...
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.11.0-0.nightly-2022-02-18-121223   True        True          False      114m    NodeInstallerProgressing: 3 nodes are at revision 8; 0 nodes have achieved new revision 9
...
10s   Normal   MissingStaticPod        deployment/kube-apiserver-operator   static pod lifecycle failure - static pod: "kube-apiserver" in namespace: "openshift-kube-apiserver" for revision: 9 on node: "xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal" didn't show up, waited: 4m15s
10s   Normal   OperatorStatusChanged   deployment/kube-apiserver-operator   Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node \"xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal\" not ready since 2022-02-22 10:59:47 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)" to "MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"kube-apiserver\" in namespace: \"openshift-kube-apiserver\" for revision: 9 on node: \"xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal\" didn't show up, waited: 4m15s\nNodeControllerDegraded: The master nodes not ready: node \"xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal\" not ready since 2022-02-22 10:59:47 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"
...
kube-apiserver   4.11.0-0.nightly-2022-02-18-121223   True        True          True       119m    MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-apiserver" in namespace: "openshift-kube-apiserver" for revision: 9 on node: "xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal" didn't show up, waited: 4m15s...

Based on the above, when the kubelet did not run the kube-apiserver static pod at the correct revision (because of 'The master nodes not ready'), the controller went Degraded and emitted an event.
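For reuse against the other three operators below, the same watch loop can be parameterized. This is only a minimal sketch of the procedure above; the variable names (NS, OPERATOR_NS, CO) and the watch-operator.sh file name are my own and not part of the original verification:

$ cat watch-operator.sh
#!/usr/bin/env bash
# Watch a static pod operator namespace for the "static pod lifecycle failure" condition.
# Args: <static pod namespace> <operator namespace> <clusteroperator name>
NS="$1"           # e.g. openshift-kube-apiserver
OPERATOR_NS="$2"  # e.g. openshift-kube-apiserver-operator
CO="$3"           # e.g. kube-apiserver
while true; do
  oc get pod -n "$NS" --show-labels | grep Running
  echo
  oc get co "$CO"
  echo
  oc get events -n "$OPERATOR_NS" | grep -i 'static pod lifecycle failure'
  sleep 10
done | tee "watch-${CO}.log"

$ ./watch-operator.sh openshift-kube-apiserver openshift-kube-apiserver-operator kube-apiserver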
For the KCM operator, force a rollout of the KCM in the first terminal:
$ oc patch kubecontrollermanager cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
kubecontrollermanager.operator.openshift.io/cluster patched

When the new KCM installer pod is starting:
$ oc get pod -n openshift-kube-controller-manager --show-labels | grep Running | grep install
installer-9-xxxx-22411g1-w24n6-master-2.c.openshift-qe.internal

In the second terminal:
$ oc debug node/xxxx-22411g1-w24n6-master-2.c.openshift-qe.internal   # logged in to the node where the KCM is rolling out
sh-4.4# ip link | grep ens
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
sh-4.4# cat test.sh
ifconfig ens4 down
sleep 600
ifconfig ens4 up
sh-4.4# chmod +x test.sh
# Check the status of the latest installer pod; after it has completed successfully, wait about 10s, then run the script
sh-4.4# /tmp/test.sh &

After a while, monitor the KCM cluster operator in the first terminal:
$ while true; do oc get pod -n openshift-kube-controller-manager --show-labels | grep Running;echo; oc get co | grep -v "True.*False.*False";echo;oc get events -n openshift-kube-controller-manager-operator | grep -i 'static pod lifecycle failure';sleep 10;done | tee watch-kcm.log
...
NAME                      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-controller-manager   4.11.0-0.nightly-2022-02-18-121223   True        True          False      5h40m   NodeInstallerProgressing: 3 nodes are at revision 8; 0 nodes have achieved new revision 9
...
9s   Normal   MissingStaticPod        deployment/kube-controller-manager-operator   static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 9 on node: "xxxx-22411g1-w24n6-master-2.c.openshift-qe.internal" didn't show up, waited: 2m30s
9s   Normal   OperatorStatusChanged   deployment/kube-controller-manager-operator   Status for clusteroperator/kube-controller-manager changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node \"xxxx-22411g1-w24n6-master-2.c.openshift-qe.internal\" not ready since 2022-02-22 14:34:45 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)" to "NodeControllerDegraded: The master nodes not ready: node \"xxxx-22411g1-w24n6-master-2.c.openshift-qe.internal\" not ready since 2022-02-22 14:34:45 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nMissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"kube-controller-manager\" in namespace: \"openshift-kube-controller-manager\" for revision: 9 on node: \"xxxx-22411g1-w24n6-master-2.c.openshift-qe.internal\" didn't show up, waited: 2m30s"
...
kube-controller-manager   4.11.0-0.nightly-2022-02-18-121223   True        True          True       5h43m   MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 9 on node: "xxxx-22411g1-w24n6-master-2.c.openshift-qe.internal" didn't show up, waited: 2m30s...

Based on the above, when the kubelet did not run the KCM static pod at the correct revision (because of 'The master nodes not ready'), the controller went Degraded and emitted an event.
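As an aside, instead of grepping the `oc get co` output, the Degraded condition and its message can be read directly from the clusteroperator status. A minimal sketch; the jsonpath expression is my own convenience and not part of the original steps:

$ oc get clusteroperator kube-controller-manager \
    -o jsonpath='{range .status.conditions[?(@.type=="Degraded")]}{.status}{"\t"}{.message}{"\n"}{end}'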
For the kube-scheduler operator, force a rollout of the kube-scheduler in the first terminal:
$ oc patch kubescheduler cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
kubescheduler.operator.openshift.io/cluster patched

When the new kube-scheduler installer pod is starting:
$ oc get pod -n openshift-kube-scheduler --show-labels | grep Running | grep install
installer-8-xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal 1/1 Running 0 22s app=installer

In the second terminal:
$ oc debug node/xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal   # logged in to the node where the kube-scheduler is rolling out
sh-4.4# ip link | grep ens
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
sh-4.4# cat test.sh
ifconfig ens4 down
sleep 600
ifconfig ens4 up
sh-4.4# chmod +x test.sh
# Check the status of the latest installer pod; after it has completed successfully, wait about 10s, then run the script
sh-4.4# /tmp/test.sh &

After a while, monitor the kube-scheduler cluster operator in the first terminal:
$ while true; do oc get pod -n openshift-kube-scheduler --show-labels | grep Running;echo; oc get co | grep -v "True.*False.*False";echo;oc get events -n openshift-kube-scheduler-operator | grep -i 'static pod lifecycle failure';sleep 10;done | tee watch-sche.log
...
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-scheduler   4.11.0-0.nightly-2022-02-18-121223   True        True          False      6h20m   NodeInstallerProgressing: 3 nodes are at revision 7; 0 nodes have achieved new revision 8
...
6s   Normal   MissingStaticPod        deployment/openshift-kube-scheduler-operator   static pod lifecycle failure - static pod: "openshift-kube-scheduler" in namespace: "openshift-kube-scheduler" for revision: 8 on node: "xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal" didn't show up, waited: 2m30s
6s   Normal   OperatorStatusChanged   deployment/openshift-kube-scheduler-operator   Status for clusteroperator/kube-scheduler changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node \"xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal\" not ready since 2022-02-22 15:15:53 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)" to "MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"openshift-kube-scheduler\" in namespace: \"openshift-kube-scheduler\" for revision: 8 on node: \"xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal\" didn't show up, waited: 2m30s\nNodeControllerDegraded: The master nodes not ready: node \"xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal\" not ready since 2022-02-22 15:15:53 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"
...
kube-scheduler   4.11.0-0.nightly-2022-02-18-121223   True        True          True       6h24m   MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "openshift-kube-scheduler" in namespace: "openshift-kube-scheduler" for revision: 8 on node: "xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal" didn't show up, waited: 2m30s...

Based on the above, when the kubelet did not run the kube-scheduler static pod at the correct revision (because of 'The master nodes not ready'), the controller went Degraded and emitted an event.
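If polling is not desired, the check can also block until the operator reports Degraded. A minimal sketch, assuming this oc release supports condition=value matching in oc wait:

$ oc wait clusteroperator/kube-scheduler --for=condition=Degraded=True --timeout=15m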
For the openshift-etcd operator, force a rollout of etcd in the first terminal:
$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
etcd.operator.openshift.io/cluster patched

When the new etcd installer pod is starting:
$ oc get pod -n openshift-etcd --show-labels | grep Running | grep install
installer-9-xxxx-22411g1-w24n6-master-2.c.openshift-qe.internal 1/1 Running 0 37s app=installer

In the second terminal:
$ oc debug node/xxxx-22411g1-w24n6-master-2.c.openshift-qe.internal   # logged in to the node where etcd is rolling out
sh-4.4# ip link | grep ens
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
sh-4.4# cat test.sh
ifconfig ens4 down
sleep 600
ifconfig ens4 up
sh-4.4# chmod +x test.sh
# Check the status of the latest installer pod; after it has completed successfully, wait about 10s, then run the script
sh-4.4# /tmp/test.sh &

After a while, monitor the etcd cluster operator in the first terminal:
$ while true; do oc get pod -n openshift-etcd --show-labels | grep Running;echo; oc get co | grep -v "True.*False.*False";echo;oc get events -n openshift-etcd-operator | grep -i 'static pod lifecycle failure';sleep 10;done | tee watch-etcd.log
...
NAME   VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.11.0-0.nightly-2022-02-18-121223   True        False         True       6h26m   ClusterMemberControllerDegraded: unhealthy members found during reconciling members...
...

Based on the above, when the kubelet did not run the etcd static pod at the correct revision (because of 'The master nodes not ready'), the controller went Degraded and emitted an event. In conclusion, the bug fix works fine, so moving the bug to VERIFIED.
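For the etcd case it can also help to confirm member health from inside one of the running etcd pods. A minimal sketch; the pod name is a placeholder, and this assumes the usual app=etcd label and the etcdctl environment the etcd pods normally provide:

$ oc get pods -n openshift-etcd -l app=etcd
$ oc rsh -n openshift-etcd etcd-xxxx-22411g1-w24n6-master-1.c.openshift-qe.internal
sh-4.4# etcdctl member list -w table   # list members and their started/unstarted state
sh-4.4# etcdctl endpoint health        # report health of each etcd endpoint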
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069
Hello Ke Wang, I ran into the same problem. OCP 4.10.25 was OK; I started an update to 4.10.30 and the etcd operator went Degraded, deciding that one master is unhealthy. All nodes were Ready, but one etcd master pod showed status NodePorts:

etcd-master11.v46ocp4.example.com    0/4   NodePorts   0   30d
etcd-master12.v46ocp4.example.com    4/4   Running     4   30d
etcd-master13.v46ocp4.example.com    4/4   Running     4   30d
etcd-quorum-guard-5f7ddf45b6-wz4ch   1/1   Running     0   30d
etcd-quorum-guard-6656f65c79-2l5ff   1/1   Running     0   2d5h
etcd-quorum-guard-6656f65c79-4jhwq   0/1   Pending     0   2d5h

etcd   4.10.25   True   True   True   314d   ClusterMemberControllerDegraded: unhealthy members found during reconciling members...

Normal MissingStaticPod deployment/etcd-operator static pod lifecycle failure - static pod: "etcd" in namespace: "openshift-etcd" for revision: 37 on node: "master11.v46ocp4.example.com" didn't show up, waited: 2m30s

The patch
oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
didn't help. I have a 3-node OCP cluster. I restarted every node; the last master (master3) I had to restart twice. Then the update continued.
Hi juraj_petras.com, the bug fix addressed the problem that, if the static pod is not at the correct revision, this controller should go Degraded and also emit an event for detection. It did not provide a way to make the cluster healthy again once it has gone Degraded.
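For anyone hitting this, the event the controller now emits can be queried directly. A minimal sketch of one way to check for it; the field selector is my suggestion and not something taken from this BZ:

$ oc get events -n openshift-kube-apiserver-operator --field-selector reason=MissingStaticPod
$ oc get co kube-apiserver   # the DEGRADED column should report True while the condition persists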
Hi Ke Wang, sorry, I forgot to write back to you. The problem was solved: I restarted all 3 nodes and the update to 4.10.30 finished OK. Where was the problem? I use this OCP cluster for testing OCP with CNSA Spectrum Scale 5.1.4.0. After the update to 4.10.30 (restarting all nodes), all CNSA and CSI pods stayed Pending. The CNSA developers told me that OCP 4.10.30 (from 4.10.26 onward) uses ports that CNSA uses on the 3 OCP nodes. After installing the CNSA 5.1.4.1.1 fix, everything was OK and the CNSA and CSI pods were running. At the moment I have OCP 4.11.7 with CNSA 5.1.5.0 and there is no problem with the CNSA/CSI pods or their ports. Thank you very much, Ke Wang.
Hi @Damien Grisonnet, I have a customer who is using OCP 4.11.25 and is facing the same issue.
~~~
message: 'MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-apiserver" in namespace: "openshift-kube-apiserver" for revision: 182 on node: "ip-xxxx.xxxx" didn''t show up, waited: 4m45s'

NAME                           READY   STATUS    RESTARTS   AGE
kube-apiserver-guard-ip-xxxx   0/1     Running   0          20d
~~~
After restarting the kubelet service on the respective master node, the issue got fixed. Could you please help in this case?
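For reference, the kubelet restart that worked around this can be done through a debug pod. A minimal sketch; the node name is a placeholder:

$ oc debug node/<master-node>
sh-4.4# chroot /host
sh-4.4# systemctl restart kubelet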
Hi Team, any update on the above query?
Hi @Damien Grisonnet - I have a customer who is facing this issue on 4.11.28 with the same symptoms.
~~~
MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-apiserver" in namespace: "openshift-kube-apiserver" for revision: 18 on node: "xxx-xxxx.xxx.xxxx.xxx.xxxx.com" didn't show up, waited: 3m15s
StaticPodFallbackRevisionDegraded: a static pod kube-apiserver-hzl-xxx.xxx.xxx.xxx.xxx.xxx was rolled back to revision 18 due to waiting for kube-apiserver static pod to listen on port 6443: Get "https://localhost:6443/healthz/e
~~~
- Restarting the kubelet resolves the issue, but it keeps recurring every 10 minutes. It is a single-node cluster.
- The customer is not happy with this and suggested that there should have been a warning on the upgrade path to highlight it.
- What is the permanent fix for this? In which version is this issue resolved?
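To help diagnose the revision mismatch on that single-node cluster, the operator's revision bookkeeping can be compared against what the kubelet is actually running. A minimal diagnostic sketch; the jsonpath queries and the apiserver=true/revision label selection are my suggestion, not an official procedure:

# Latest revision the operator expects
$ oc get kubeapiserver/cluster -o jsonpath='{.status.latestAvailableRevision}{"\n"}'
# Per-node current/target revisions as tracked by the operator
$ oc get kubeapiserver/cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{"\t"}{.currentRevision}{" -> "}{.targetRevision}{"\n"}{end}'
# Revision label on the static pod the kubelet is actually running
$ oc get pod -n openshift-kube-apiserver -l apiserver=true -L revision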