Bug 2053582 - inability to detect static lifecycle failure
Summary: inability to detect static lifecycle failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.11.0
Assignee: Damien Grisonnet
QA Contact: Ke Wang
URL:
Whiteboard: EmergencyRequest
Duplicates: 2053617
Depends On:
Blocks: 2053268 2053616
 
Reported: 2022-02-11 14:34 UTC by Abu Kashem
Modified: 2023-12-10 04:25 UTC
CC: 17 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 2053268
Environment:
Last Closed: 2022-08-10 10:49:30 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 748 0 None Merged Bug 2053582: Track static pod lifecycle 2022-02-15 18:19:16 UTC
Github openshift cluster-etcd-operator pull 750 0 None Merged Bug 2053582: Track static pod lifecycle 2022-02-15 18:19:17 UTC
Github openshift cluster-kube-apiserver-operator pull 1321 0 None Merged Bug 2053582: Track static pod lifecycle 2022-02-15 18:19:18 UTC
Github openshift cluster-kube-apiserver-operator pull 1323 0 None Merged Bug 2053582: Track static pod lifecycle 2022-02-15 18:19:26 UTC
Github openshift cluster-kube-controller-manager-operator pull 606 0 None Merged Bug 2053582: Track static pod lifecycle 2022-02-15 18:19:19 UTC
Github openshift cluster-kube-controller-manager-operator pull 608 0 None Merged Bug 2053582: Track static pod lifecycle 2022-09-23 03:50:17 UTC
Github openshift cluster-kube-scheduler-operator pull 415 0 None Merged Bug 2053582: Track static pod lifecycle 2022-02-15 18:19:21 UTC
Github openshift cluster-kube-scheduler-operator pull 417 0 None Merged Bug 2053582: Track static pod lifecycle 2022-02-15 18:19:13 UTC
Github openshift library-go pull 1316 0 None Merged Bug 2053582: Track static pod lifecycle 2022-02-15 18:19:12 UTC
Github openshift library-go pull 1320 0 None Merged Bug 2053582: Track static pod lifecycle 2022-02-15 18:19:12 UTC
Github openshift origin pull 26837 0 None Merged bug 2053582: fail on static pod lifecycle failures 2022-02-14 19:45:41 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:50:06 UTC

Comment 1 Wally 2022-02-14 15:25:34 UTC
*** Bug 2053617 has been marked as a duplicate of this bug. ***

Comment 3 Abu Kashem 2022-02-14 20:57:10 UTC
Moving it to ASSIGNED; the four operator PRs (kas, scheduler, etcd, kcm) are on their way.

Comment 4 Lukasz Szaszkiewicz 2022-02-15 15:44:42 UTC
I am moving this BZ to POST because I want to merge a small improvement.

Comment 8 Ke Wang 2022-02-22 11:20:51 UTC
All of the following operators should be tested:
1. kube-apiserver-operator
2. kube-controller-manager-operator
3. kube-scheduler-operator
4. etcd-operator

Checkpoints for each operator:
1. When the static pod rolls out, the latest installer pod completes successfully on each node.
2. After the kubelet deletes the currently running pod, if for some reason the kubelet does not run the static pod at the correct revision, the controller should go degraded and emit an event (see the sketch below).
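
As a minimal sketch, here is one way to check both conditions for, e.g., the kube-apiserver operator (the jsonpath layout is standard oc/kubectl; the MissingStaticPod reason matches the events captured below):

$ oc get clusteroperator kube-apiserver -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}{"\n"}'
# expect "True" while the static pod is missing at the new revision

$ oc get events -n openshift-kube-apiserver-operator --field-selector reason=MissingStaticPod
# expect a "static pod lifecycle failure" event for the missing revision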

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-02-18-121223   True        False         133m    Cluster version is 4.11.0-0.nightly-2022-02-18-121223

For the kube-apiserver operator, 

Force a rollout of the kube-apiserver in the first terminal, 

$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "roll-'"$( date --rfc-3339=ns )"'"} ]'
kubeapiserver.operator.openshift.io/cluster patched

When the new kube-apiserver installer pod is starting, 

$ oc get pod -n openshift-kube-apiserver --show-labels | grep Running | grep install
installer-9-xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal            1/1     Running     0          13s    app=installer

In a second terminal,
$ oc debug node/xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal # log in to the node where the kube-apiserver is being rolled out

sh-4.4# ip link |grep ens 
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000

sh-4.4# cat test.sh 
# Simulate a node outage: take the NIC down for 10 minutes, then bring it back
ifconfig ens4 down
sleep 600
ifconfig ens4 up

sh-4.4# chmod +x test.sh 

# Check the latest installer pod's status; after it completes successfully, wait about 10s, then run the script
sh-4.4# /tmp/test.sh & 

After a while, monitor the kube-apiserver cluster operator in the first terminal, 
$ while true; do oc get pod -n openshift-kube-apiserver --show-labels | grep Running;echo; oc get co | grep -v "True.*False.*False";echo;oc get events -n openshift-kube-apiserver-operator |  grep -i 'static pod lifecycle failure';sleep 10;done | tee watch-kas.log
...
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE    
kube-apiserver                             4.11.0-0.nightly-2022-02-18-121223   True        True          False      114m    NodeInstallerProgressing: 3 nodes are at revision 8; 0 nodes have achieved new revision 9
...
10s         Normal    MissingStaticPod                         deployment/kube-apiserver-operator            static pod lifecycle failure - static pod: "kube-apiserver" in namespace: "openshift-kube-apiserver" for revision: 9 on node: "xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal" didn't show up, waited: 4m15s
10s         Normal    OperatorStatusChanged                    deployment/kube-apiserver-operator            Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node \"xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal\" not ready since 2022-02-22 10:59:47 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)" to "MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"kube-apiserver\" in namespace: \"openshift-kube-apiserver\" for revision: 9 on node: \"xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal\" didn't show up, waited: 4m15s\nNodeControllerDegraded: The master nodes not ready: node \"xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal\" not ready since 2022-02-22 10:59:47 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"

...
kube-apiserver                             4.11.0-0.nightly-2022-02-18-121223   True        True          True       119m    MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-apiserver" in namespace: "openshift-kube-apiserver" for revision: 9 on node: "xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal" didn't show up, waited: 4m15s...


Based on the above, the kubelet did not run the kube-apiserver static pod at the correct revision (reason: 'The master nodes not ready'), and the controller went degraded and emitted an event, as expected.

Comment 9 Ke Wang 2022-02-22 14:43:28 UTC
For the KCM operator, 

Force a rollout of the KCM in the first terminal, 

$ oc patch kubecontrollermanager cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
kubecontrollermanager.operator.openshift.io/cluster patched

When the new KCM installer pod is starting, 

$ oc get pod -n openshift-kube-controller-manager --show-labels | grep Running | grep install
installer-9-xxxx-22411g1-w24n6-master-2.c.openshift-qe.internal 

In a second terminal,
$ oc debug node/xxxx-22411g1-w24n6-master-2.c.openshift-qe.internal # log in to the node where the KCM is being rolled out

sh-4.4# ip link |grep ens 
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000

sh-4.4# cat test.sh 
# Simulate a node outage: take the NIC down for 10 minutes, then bring it back
ifconfig ens4 down
sleep 600
ifconfig ens4 up

sh-4.4# chmod +x test.sh 

# Check the latest installer pod's status; after it completes successfully, wait about 10s, then run the script
sh-4.4# /tmp/test.sh & 

After a while, monitor the KCM cluster operator in the first terminal, 
$ while true; do oc get pod -n openshift-kube-controller-manager --show-labels | grep Running;echo; oc get co | grep -v "True.*False.*False";echo;oc get events -n openshift-kube-controller-manager-operator |  grep -i 'static pod lifecycle failure';sleep 10;done | tee watch-kcm.log
...
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE    
kube-controller-manager                    4.11.0-0.nightly-2022-02-18-121223   True        True          False      5h40m   NodeInstallerProgressing: 3 nodes are at revision 8; 0 nodes have achieved new revision 9
...
9s          Normal   MissingStaticPod            deployment/kube-controller-manager-operator   static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 9 on node: "xxxx-22411g1-w24n6-master-2.c.openshift-qe.internal" didn't show up, waited: 2m30s
9s          Normal   OperatorStatusChanged       deployment/kube-controller-manager-operator   Status for clusteroperator/kube-controller-manager changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node \"xxxx-22411g1-w24n6-master-2.c.openshift-qe.internal\" not ready since 2022-02-22 14:34:45 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)" to "NodeControllerDegraded: The master nodes not ready: node \"xxxx-22411g1-w24n6-master-2.c.openshift-qe.internal\" not ready since 2022-02-22 14:34:45 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nMissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"kube-controller-manager\" in namespace: \"openshift-kube-controller-manager\" for revision: 9 on node: \"xxxx-22411g1-w24n6-master-2.c.openshift-qe.internal\" didn't show up, waited: 2m30s"
...
kube-controller-manager                    4.11.0-0.nightly-2022-02-18-121223   True        True          True       5h43m   MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 9 on node: "xxxx-22411g1-w24n6-master-2.c.openshift-qe.internal" didn't show up, waited: 2m30s...

Based on the above, the kubelet did not run the KCM static pod at the correct revision (reason: 'The master nodes not ready'), and the controller went degraded and emitted an event, as expected.

Comment 10 Ke Wang 2022-02-22 15:23:49 UTC
For the kube-scheduler-operator, 

Force a rollout of the kube-scheduler in the first terminal, 
$ oc patch kubescheduler cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
kubescheduler.operator.openshift.io/cluster patched

When the new kube-scheduler installer pod is starting, 
$ oc get pod -n openshift-kube-scheduler --show-labels | grep Running | grep install
installer-8-xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal                      1/1     Running     0              22s     app=installer

In a second terminal,
$ oc debug node/xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal # log in to the node where the kube-scheduler is being rolled out

sh-4.4# ip link |grep ens 
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000

sh-4.4# cat test.sh 
# Simulate a node outage: take the NIC down for 10 minutes, then bring it back
ifconfig ens4 down
sleep 600
ifconfig ens4 up

sh-4.4# chmod +x test.sh 

# Check the latest installer pod's status; after it completes successfully, wait about 10s, then run the script
sh-4.4# /tmp/test.sh & 

After a while, monitor the kube-scheduler cluster operator in the first terminal, 
$ while true; do oc get pod -n openshift-kube-scheduler --show-labels | grep Running;echo; oc get co | grep -v "True.*False.*False";echo;oc get events -n openshift-kube-scheduler-operator |  grep -i 'static pod lifecycle failure';sleep 10;done | tee watch-sche.log
...
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE    
kube-scheduler                             4.11.0-0.nightly-2022-02-18-121223   True        True          False      6h20m   NodeInstallerProgressing: 3 nodes are at revision 7; 0 nodes have achieved new revision 8
...
6s          Normal    MissingStaticPod            deployment/openshift-kube-scheduler-operator   static pod lifecycle failure - static pod: "openshift-kube-scheduler" in namespace: "openshift-kube-scheduler" for revision: 8 on node: "xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal" didn't show up, waited: 2m30s
6s          Normal    OperatorStatusChanged       deployment/openshift-kube-scheduler-operator   Status for clusteroperator/kube-scheduler changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node \"xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal\" not ready since 2022-02-22 15:15:53 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)" to "MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"openshift-kube-scheduler\" in namespace: \"openshift-kube-scheduler\" for revision: 8 on node: \"xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal\" didn't show up, waited: 2m30s\nNodeControllerDegraded: The master nodes not ready: node \"xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal\" not ready since 2022-02-22 15:15:53 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"
...
kube-scheduler                             4.11.0-0.nightly-2022-02-18-121223   True        True          True       6h24m   MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "openshift-kube-scheduler" in namespace: "openshift-kube-scheduler" for revision: 8 on node: "xxxx-22411g1-w24n6-master-0.c.openshift-qe.internal" didn't show up, waited: 2m30s...

Based on the above, the kubelet did not run the kube-scheduler static pod at the correct revision (reason: 'The master nodes not ready'), and the controller went degraded and emitted an event, as expected.

Comment 11 Ke Wang 2022-02-22 15:44:39 UTC
For the openshift-etcd-operator, 

Force a rollout of etcd in the first terminal, 
$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
etcd.operator.openshift.io/cluster patched

When the new etcd installer pod is starting, 
$ oc get pod -n openshift-etcd --show-labels | grep Running | grep install
installer-9-xxxx-22411g1-w24n6-master-2.c.openshift-qe.internal         1/1     Running     0          37s     app=installer

In a second terminal,
$ oc debug node/xxxx-22411g1-w24n6-master-2.c.openshift-qe.internal # log in to the node where etcd is being rolled out

sh-4.4# ip link |grep ens 
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000

sh-4.4# cat test.sh 
# Simulate a node outage: take the NIC down for 10 minutes, then bring it back
ifconfig ens4 down
sleep 600
ifconfig ens4 up

sh-4.4# chmod +x test.sh 

# Check the latest installer pod's status; after it completes successfully, wait about 10s, then run the script
sh-4.4# /tmp/test.sh & 

After a while, monitor the etcd cluster operator in the first terminal, 
$ while true; do oc get pod -n openshift-etcd --show-labels | grep Running;echo; oc get co | grep -v "True.*False.*False";echo;oc get events -n openshift-etcd-operator |  grep -i 'static pod lifecycle failure';sleep 10;done | tee watch-etcd.log
...
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE    
...
etcd                                       4.11.0-0.nightly-2022-02-18-121223   True        False         True       6h26m   ClusterMemberControllerDegraded: unhealthy members found during reconciling members...

Based on the above, the kubelet did not run the etcd static pod at the correct revision (reason: 'The master nodes not ready'), and the controller went degraded and emitted an event, as expected. In conclusion, the bug fix works fine, so moving the bug to VERIFIED.

Comment 16 errata-xmlrpc 2022-08-10 10:49:30 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Comment 17 juraj_petras 2022-09-12 08:23:40 UTC
Hello Ke Wang

I ran into the same problem.

OCP 4.10.25 was OK. I started an update to 4.10.30. The etcd operator went degraded, having decided one master was unhealthy.

All nodes were ready, but one etcd master pod showed status NodePorts. 

etcd-master11.v46ocp4.example.com                 0/4     NodePorts   0          30d
etcd-master12.v46ocp4.example.com                 4/4     Running     4          30d
etcd-master13.v46ocp4.example.com                 4/4     Running     4          30d
etcd-quorum-guard-5f7ddf45b6-wz4ch                1/1     Running     0          30d
etcd-quorum-guard-6656f65c79-2l5ff                1/1     Running     0          2d5h
etcd-quorum-guard-6656f65c79-4jhwq                0/1     Pending     0          2d5h

etcd                                       4.10.25   True        True          True       314d    ClusterMemberControllerDegraded: unhealthy members found during reconciling members...

Normal    MissingStaticPod      deployment/etcd-operator   static pod lifecycle failure - static pod: "etcd" in namespace: "openshift-etcd" for revision: 37 on node: "master11.v46ocp4.example.com" didn't show up, waited: 2m30s

The patch

$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge

didn't help.

I have a 3-node OCP cluster. I restarted every node; the last one, master3, I had to restart twice. Then the update continued.

Comment 18 Ke Wang 2022-10-18 11:38:13 UTC
Hi juraj_petras, the bug fix addressed detection: if the static pod is not running at the correct revision, the controller goes degraded and emits an event so the failure can be noticed. It did not provide a way to make the cluster healthy again once it has gone degraded.
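
For example, a minimal sketch of reading that detection signal from the operator status (assuming the standard clusteroperator conditions layout); recovering the node remains a manual step:

$ oc get clusteroperator etcd -o jsonpath='{range .status.conditions[?(@.type=="Degraded")]}{.status}: {.message}{"\n"}{end}'
# a MissingStaticPodControllerDegraded message here means the static pod never
# came up at the expected revision; the fix detects this, it does not repair it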

Comment 19 juraj_petras 2022-10-18 12:10:43 UTC
Hi Ke Wang

Sorry, I forgot to write to you. The problem was solved: I restarted all 3 nodes and the update to 4.10.30 finished OK. Where was the problem? I use this OCP cluster for testing with CNSA (Spectrum Scale) 5.1.4.0. After the update to 4.10.30 (restarting all nodes), all CNSA and CSI pods stayed Pending. The CNSA developers told me that OCP 4.10.30 (4.10.26 and later) uses ports that CNSA also uses on the 3 OCP nodes. After installing the CNSA 5.1.4.1.1 fix, everything was OK and the CNSA and CSI pods were running. At the moment I have OCP 4.11.7 with CNSA 5.1.5.0 and no problems with the CNSA/CSI pods or their ports. Thank you very much Ke Wang.

Comment 21 Shubham Khette 2023-04-14 18:08:20 UTC
Hi @Damien Grisonnet

I have a CU (customer) on OCP 4.11.25 who is facing the same issue.

~~~
 message: 'MissingStaticPodControllerDegraded: static pod lifecycle failure - static
    pod: "kube-apiserver" in namespace: "openshift-kube-apiserver" for revision:
    182 on node: "ip-xxxx.xxxx" didn''t
    show up, waited: 4m45s'

NAME                          READY  STATUS     RESTARTS  AGE
kube-apiserver-guard-ip-xxxx   0/1    Running    0         20d
~~~

After restarting the kubelet service on the respective master node, the issue got fixed.
Could you please help with this case?
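
For reference, a minimal sketch of that workaround (the node name is a placeholder):

$ oc debug node/<master-node> -- chroot /host systemctl restart kubelet
# after the kubelet restarts, it should recreate the static pod at the expected
# revision and the Degraded condition should clear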

Comment 22 Shubham Khette 2023-04-17 18:39:40 UTC
Hi Team, 

Any update on the above query?

Comment 23 Vishvranjan Mishra 2023-04-18 11:12:23 UTC
Hi @Damien Grisonnet

- I have a customer facing this issue on 4.11.28, with the same symptoms.


~~~
MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-apiserver" in namespace: "openshift-kube-apiserver" for revision: 18 on node: "xxx-xxxx.xxx.xxxx.xxx.xxxx.com" didn't show up, waited: 3m15s
StaticPodFallbackRevisionDegraded: a static pod kube-apiserver-hzl-xxx.xxx.xxx.xxx.xxx.xxx was rolled back to revision 18 due to waiting for kube-apiserver static pod to listen on port 6443: Get "https://localhost:6443/healthz/e
~~~



- Restarting the kubelet resolves the issue, but it recurs every 10 minutes. It's a single-node cluster.

- The customer is not happy with this and suggested that there should have been a warning on the upgrade path to highlight it.


- What is the permanent fix for this? In which version is this issue resolved?

Comment 25 Erickson Joseph Santos 2023-08-03 04:27:05 UTC
I can see that the bug regarding "prevention of cluster operator degradation due to a static Pod being at an incorrect revision" (or auto-healing of such) is being discussed in this Jira ticket: https://issues.redhat.com/browse/OCPBUGS-2474

Comment 26 Shubham Khette 2023-08-11 05:34:49 UTC
Hi Erickson,

Thank you for the update; this JIRA matches the CU's requirement. We will follow up on this JIRA request.

Comment 27 Red Hat Bugzilla 2023-12-10 04:25:03 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

