Bug 2053268
| Summary: | inability to detect static pod lifecycle failure | | | |
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | David Eads <deads> | |
| Component: | kube-apiserver | Assignee: | Damien Grisonnet <dgrisonn> | |
| Status: | CLOSED ERRATA | QA Contact: | Ke Wang <kewang> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | urgent | |||
| Version: | 4.10 | CC: | akashem, aos-bugs, lszaszki, maszulik, mfojtik, xxia | |
| Target Milestone: | --- | |||
| Target Release: | 4.10.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | EmergencyRequest | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value | |
| Doc Text: | | Story Points: | --- | |
| Clone Of: | ||||
| : | 2053582 2053616 (view as bug list) | Environment: | ||
| Last Closed: | 2022-03-12 04:42:06 UTC | Type: | Bug | |
| Bug Depends On: | 2053582 | |||
| Bug Blocks: | 2053620 | |||
Description
David Eads
2022-02-10 20:02:38 UTC
We need this to determine the scope of our problem. If we ship 4.10.0 with a problem that prevents upgrade to 4.10.z+1, we have a serious problem. Marking blocker.

** A NOTE ABOUT USING URGENT **

This BZ has been set to urgent severity and priority. When a BZ is marked urgent priority, Engineers are asked to stop whatever they are doing, putting everything else on hold. Please be prepared to have reasonable justification ready to discuss, and ensure your own and engineering management are aware and agree this BZ is urgent. Keep in mind, urgent bugs are very expensive and have maximal management visibility.

NOTE: This bug was automatically assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.

** INFORMATION REQUIRED **

Please answer these questions before escalation to engineering:
1. Has a link to must-gather output been provided in this BZ? We cannot work without it. If must-gather fails to run, attach all relevant logs and provide the error message of must-gather.
2. Give the output of "oc get clusteroperators -o yaml".
3. In case of degraded/unavailable operators, have all their logs and the logs of the operands been analyzed? [yes/no]
4. List the top 5 relevant errors from the logs of the operators and operands in (3).
5. Order the list of degraded/unavailable operators according to which is likely the cause of the failure of the others, root cause at the top.
6. Explain why (5) is likely the right order and list the information used for that assessment.
7. Explain why Engineering is necessary to make progress.

Setting to blocker+ based on convo with David.

https://github.com/openshift/origin/pull/26837 is now associated with this, but it is only the easy HALF of the solution. Once the event exists, that test will fail, but the operators still need to produce the events.

> This detection could be built by checking the latest installer pods for each node. If the installer pod was successful, and if it has been longer than terminationGracePeriodSeconds + 10 seconds since the installer pod completed successfully, and if the static pod is not at the correct revision, this controller should go degraded. It should also emit an event for detection in CI.
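Below is a minimal Go sketch of the check proposed in the quote above, assuming the controller already knows whether the latest installer pod for a node succeeded, when it finished, and which revision the node is at. Every name and value in it is hypothetical and only illustrates the shape of the logic; it is not the actual operator code.

```go
// Illustrative sketch only. The inputs would in practice be derived from the
// installer pod and the operator's node status; here they are plain parameters.
package main

import (
	"fmt"
	"time"
)

// shouldGoDegraded reports whether the operator should mark the node degraded:
// the latest installer pod completed successfully, more than
// terminationGracePeriodSeconds+10s have elapsed since it finished, and the
// static pod is still not at the target revision.
func shouldGoDegraded(installerSucceeded bool, installerFinishedAt time.Time,
	gracePeriod time.Duration, currentRevision, targetRevision int32, now time.Time) bool {
	if !installerSucceeded || currentRevision == targetRevision {
		return false
	}
	return now.Sub(installerFinishedAt) > gracePeriod+10*time.Second
}

func main() {
	finishedAt := time.Now().Add(-5 * time.Minute) // installer finished 5 minutes ago
	if shouldGoDegraded(true, finishedAt, 2*time.Minute /* hypothetical grace period */, 7, 8, time.Now()) {
		// The real controller would also emit an event here (the
		// "MissingStaticPod ... static pod lifecycle failure" event seen later
		// in this bug) so that CI can detect the failure.
		fmt.Println("static pod for revision 8 did not show up; going Degraded")
	}
}
```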
kubeapiserver.operator.openshift.io/v1 today has this in its status:
nodeStatuses:
- currentRevision: 0
  nodeName: ip-10-0-210-17.us-west-1.compute.internal
  targetRevision: 3
This came from a static pod installer controller, AFAIK. If we extend this to report 'installedRevision: 2', which would actually be reported by the installer pod itself once it finishes copying the static pod manifests into the target kubelet directory, will that make debugging this easier?
In addition, we can have a controller that watches updates to this CRD and detects and reports time delays (if currentRevision=1 but installedRevision=2 for more than 60s: event/degraded/etc.).
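A rough Go sketch of that idea follows. The installedRevision and installedAt fields are hypothetical (they are the proposed extension, not part of today's API), and the 60s threshold is just the example value mentioned above.

```go
// Hypothetical sketch of the proposal: extend each nodeStatuses entry with an
// installedRevision written by the installer pod once it finishes copying the
// static pod manifests into the kubelet's manifest directory, and have a
// controller flag nodes that sit on an older currentRevision for too long.
package main

import (
	"fmt"
	"time"
)

type nodeStatus struct {
	nodeName          string
	currentRevision   int32     // revision the kubelet is actually running
	targetRevision    int32     // revision the operator is rolling out
	installedRevision int32     // hypothetical: revision installed on disk by the installer pod
	installedAt       time.Time // hypothetical: when the installer finished copying manifests
}

// staleInstall reports whether the kubelet has failed to pick up an already
// installed revision for longer than maxDelay; the watching controller would
// then emit an event and/or set a Degraded condition.
func staleInstall(s nodeStatus, now time.Time, maxDelay time.Duration) bool {
	return s.installedRevision > s.currentRevision && now.Sub(s.installedAt) > maxDelay
}

func main() {
	s := nodeStatus{
		nodeName:          "ip-10-0-210-17.us-west-1.compute.internal",
		currentRevision:   1,
		targetRevision:    2,
		installedRevision: 2,
		installedAt:       time.Now().Add(-90 * time.Second),
	}
	if staleInstall(s, time.Now(), 60*time.Second) {
		fmt.Printf("node %s: revision %d installed but kubelet still at %d\n",
			s.nodeName, s.installedRevision, s.currentRevision)
	}
}
```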
*** Bug 2053620 has been marked as a duplicate of this bug. ***

*** Bug 2053618 has been marked as a duplicate of this bug. ***

*** Bug 2053616 has been marked as a duplicate of this bug. ***

Moving it to ASSIGNED; the four operator PRs (kube-apiserver, kube-scheduler, etcd, kube-controller-manager) are on their way.

I am moving this BZ to POST because I want to merge a small improvement.

When I was trying to install a cluster with a Jenkins CI job, the installation got stuck with the kube-controller-manager cluster operator degraded.
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.0-0.nightly-2022-02-16-171622 True False 7h14m Error while reconciling 4.10.0-0.nightly-2022-02-16-171622: the cluster operator kube-controller-manager is degraded
$ oc get co | grep -v "True.*False.*False"
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
kube-controller-manager 4.10.0-0.nightly-2022-02-16-171622 True True True 7h29m MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 7 on node: "kewang-17410g1-vm7zr-master-0.c.openshift-qe.internal" didn't show up, waited: 2m30s
$ oc describe kubecontrollermanagers
Name: cluster
Namespace:
Labels: <none>
Annotations: include.release.openshift.io/ibm-cloud-managed: true
include.release.openshift.io/self-managed-high-availability: true
include.release.openshift.io/single-node-developer: true
release.openshift.io/create-only: true
API Version: operator.openshift.io/v1
Kind: KubeControllerManager
...
Status:
Conditions:
Last Transition Time: 2022-02-17T08:21:01Z
Reason: AsExpected
Status: False
Type: GuardControllerDegraded
Last Transition Time: 2022-02-17T08:11:53Z
Status: False
Type: InstallerControllerDegraded
Last Transition Time: 2022-02-17T08:12:57Z
Message: 3 nodes are active; 1 nodes are at revision 5; 1 nodes are at revision 6; 1 nodes are at revision 7
Status: True
Type: StaticPodsAvailable
Last Transition Time: 2022-02-17T08:09:28Z
Message: 1 nodes are at revision 5; 1 nodes are at revision 6; 1 nodes are at revision 7
Status: True
Type: NodeInstallerProgressing
Last Transition Time: 2022-02-17T08:09:16Z
Status: False
Type: NodeInstallerDegraded
Last Transition Time: 2022-02-17T08:22:07Z
Status: False
Type: StaticPodsDegraded
Last Transition Time: 2022-02-17T08:09:16Z
Message: All master nodes are ready
Reason: MasterNodesReady
Status: False
Type: NodeControllerDegraded
Last Transition Time: 2022-02-17T08:09:17Z
Reason: NoUnsupportedConfigOverrides
Status: True
Type: UnsupportedConfigOverridesUpgradeable
Last Transition Time: 2022-02-17T08:09:20Z
Status: False
Type: CertRotation_CSRSigningCert_Degraded
Last Transition Time: 2022-02-17T08:09:21Z
Reason: AsExpected
Status: False
Type: BackingResourceControllerDegraded
Last Transition Time: 2022-02-17T08:09:23Z
Status: False
Type: ResourceSyncControllerDegraded
Last Transition Time: 2022-02-17T08:09:36Z
Status: False
Type: ConfigObservationDegraded
Last Transition Time: 2022-02-17T08:09:25Z
Status: False
Type: InstallerPodPendingDegraded
Last Transition Time: 2022-02-17T08:09:25Z
Status: False
Type: InstallerPodContainerWaitingDegraded
Last Transition Time: 2022-02-17T08:09:25Z
Status: False
Type: InstallerPodNetworkingDegraded
Last Transition Time: 2022-02-17T08:11:11Z
Status: False
Type: RevisionControllerDegraded
Last Transition Time: 2022-02-17T08:25:29Z
Message: static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 7 on node: "kewang-17410g1-vm7zr-master-0.c.openshift-qe.internal" didn't show up, waited: 2m30s
Reason: SyncError
Status: True
Type: MissingStaticPodControllerDegraded
Last Transition Time: 2022-02-17T08:09:33Z
Reason: AsExpected
Status: False
Type: KubeControllerManagerStaticResourcesDegraded
Last Transition Time: 2022-02-17T08:09:34Z
Status: False
Type: SATokenSignerDegraded
Last Transition Time: 2022-02-17T08:09:51Z
Status: True
Type: Upgradeable
Last Transition Time: 2022-02-17T08:09:51Z
Status: True
Type: CloudControllerOwner
Last Transition Time: 2022-02-17T08:09:51Z
Status: False
Type: TargetConfigControllerDegraded
Latest Available Revision: 7
Latest Available Revision Reason:
Node Statuses:
Current Revision: 7
Node Name: kewang-17410g1-vm7zr-master-2.c.openshift-qe.internal
Current Revision: 5
Node Name: kewang-17410g1-vm7zr-master-0.c.openshift-qe.internal
Target Revision: 7
Current Revision: 6
Node Name: kewang-17410g1-vm7zr-master-1.c.openshift-qe.internal
Ready Replicas: 0
Events: <none>
----
The kube-controller-manager static pod is not at the correct revision, so this controller should indeed go degraded, but it has stayed degraded for as long as 7h14m, so there is a potential risk that an upgrade will fail. Is that what we expect?
The following verification steps were confirmed with the Devs.
All of the following operators should be tested:
1. kube-apiserver-operator
2. kube-controller-manager-operator
3. kube-scheduler-operator
4. etcd-operator
Checkpoints for each operator:
1. When the static pod rolls out, the latest installer pod for each node completes successfully.
2. After the kubelet deletes the currently running pod, if for some reason the kubelet doesn't run the static pod at the correct revision, this controller should go degraded and emit an event.
--------------------
For kube-apiserver-operator,
Force a rollout of kube-apiserver in one terminal,
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "roll-'"$( date --rfc-3339=ns )"'"} ]'
When the new kube-apiserver installer pod is starting,
$ oc get pods -n openshift-kube-apiserver --show-labels | grep Running | grep install
installer-15-xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal 1/1 Running 0 5s app=installer
In another terminal,
$ oc debug node/xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal   # log in to the node where the new kube-apiserver revision is rolling out
sh-4.4# ip link |grep ens
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
sh-4.4# cat test.sh
ifconfig ens4 down
sleep 600
ifconfig ens4 up
sh-4.4# chmod +x test.sh
# Check the latest installer pod's status; after the installer pod has completed successfully, wait about 30s, then run the script (with the NIC down, the kubelet stops posting node status and the new revision never shows up)
sh-4.4# /tmp/test.sh &
After a while, check the cluster operators,
# oc get co | grep -v "True.*False.*False"
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
kube-apiserver 4.10.0-0.nightly-2022-02-17-234353 True True False 6h41m NodeInstallerProgressing: 2 nodes are at revision 13; 1 nodes are at revision 14; 0 nodes have achieved new revision 15
Check if there are any events about why the static pod cannot be started,
$ masters=$(oc get no -l node-role.kubernetes.io/master | sed '1d' | awk '{print $1}')
$ for node in $masters; do echo $node;oc debug no/$node -- chroot /host bash -c "grep -ir 'static pod lifecycle failure' /var/log/ | grep -v debug";done | tail -5
...
Removing debug pod ...
/var/log/pods/openshift-kube-apiserver-operator_kube-apiserver-operator-68ddc9cc8c-7hmp7_5a922cf8-4725-40aa-952b-9c300d14ce95/kube-apiserver-operator/0.log:2022-02-21T09:44:35.924962534+00:00 stderr F I0221 09:44:35.924611 1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"34684bef-914e-4bf8-b99a-adfdb0d134f5", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Degraded message changed from "MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"kube-apiserver\" in namespace: \"openshift-kube-apiserver\" for revision: 15 on node: \"xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal\" didn't show up, waited: 4m15s" to "MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"kube-apiserver\" in namespace: \"openshift-kube-apiserver\" for revision: 15 on node: \"xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal\" didn't show up, waited: 4m15s\nStaticPodsDegraded: pod/kube-apiserver-xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal container \"kube-apiserver\" started at 2022-02-21 09:42:40 +0000 UTC is still not ready\nStaticPodsDegraded: pod/kube-apiserver-xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal container \"kube-apiserver-check-endpoints\" is waiting: CrashLoopBackOff: back-off 2m40s restarting failed container=kube-apiserver-check-endpoints pod=kube-apiserver-xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal_openshift-kube-apiserver(1e4249e162c276500f1f54ec5bc523f6)"
...
2022-02-21T09:44:35.899450624+00:00 stderr F I0221 09:44:35.898490 1 status_controller.go:211] clusteroperator/kube-apiserver diff {"status":{"conditions":[{"lastTransitionTime":"2022-02-21T09:42:52Z","message":"MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"kube-apiserver\" in namespace: \"openshift-kube-apiserver\" for revision: 15 on node: \"xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal\" didn't show up, waited: 4m15s\nStaticPodsDegraded: pod/kube-apiserver-xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal container \"kube-apiserver\" started at 2022-02-21 09:42:40 +0000 UTC is still not ready\nStaticPodsDegraded: pod/kube-apiserver-xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal container \"kube-apiserver-check-endpoints\" is waiting: CrashLoopBackOff: back-off 2m40s restarting failed container=kube-apiserver-check-endpoints pod=kube-apiserver-xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal_openshift-kube-apiserver(1e4249e162c276500f1f54ec5bc523f6)","reason":"MissingStaticPodController_SyncError::StaticPods_Error","status":"True","type":"Degraded"},{"lastTransitionTime":"2022-02-21T09:18:10Z","message":"NodeInstallerProgressing: 2 nodes are at revision 13; 1 nodes are at revision 14; 0 nodes have achieved new revision 15","reason":"NodeInstaller","status":"True","type":"Progressing"},{"lastTransitionTime":"2022-02-21T02:51:53Z","message":"StaticPodsAvailable: 3 nodes are active; 2 nodes are at revision 13; 1 nodes are at revision 14; 0 nodes have achieved new revision 15","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2022-02-21T02:37:19Z","message":"KubeletMinorVersionUpgradeable: Kubelet and API server minor versions are synced.","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
The other operators will be verified later.
For openshift-etcd-operator,
Force a rollout of the etcd server in the first terminal,
$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
etcd.operator.openshift.io/cluster patched
When the new etcd installer pod is starting,
$ oc get pod -n openshift-etcd --show-labels | grep Running | grep install
installer-9-xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal 1/1 Running 0 24s app=installer
In the second terminal,
$ oc debug node/xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal   # log in to the node where the new etcd revision is rolling out
sh-4.4# ip link |grep ens
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
sh-4.4# cat test.sh
ifconfig ens4 down
sleep 600
ifconfig ens4 up
sh-4.4# chmod +x test.sh
# Check the latest installer pod's status; after the installer pod has completed successfully, wait about 10s, then run the script
sh-4.4# /tmp/test.sh &
After a while, monitor the etcd cluster operator in the first terminal,
$ while true; do oc get pod -n openshift-etcd --show-labels | grep Running;echo; oc get co | grep -v "True.*False.*False";echo;oc get events -n openshift-etcd-operator | grep -i 'static pod lifecycle failure';sleep 10;done | tee watch-etcd.log
...
etcd 4.10.0-0.nightly-2022-02-17-234353 True True False 4h32m NodeInstallerProgressing: 3 nodes are at revision 8; 0 nodes have achieved new revision 9
...
3s Normal MissingStaticPod deployment/etcd-operator static pod lifecycle failure - static pod: "etcd" in namespace: "openshift-etcd" for revision: 9 on node: "xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal" didn't show up, waited: 2m30s 2s Normal OperatorStatusChanged deployment/etcd-operator Status for clusteroperator/etcd changed: Degraded message changed from "EtcdMembersControllerDegraded: configmaps lister not synced\nDefragControllerDegraded: configmaps lister not synced\nNodeControllerDegraded: The master nodes not ready: node \"xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal\" not ready since 2022-02-22 06:19:01 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nEtcdMembersDegraded: No unhealthy members found" to "EtcdMembersControllerDegraded: configmaps lister not synced\nDefragControllerDegraded: configmaps lister not synced\nNodeControllerDegraded: The master nodes not ready: node \"xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal\" not ready since 2022-02-22 06:19:01 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nMissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"etcd\" in namespace: \"openshift-etcd\" for revision: 9 on node: \"xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal\" didn't show up, waited: 2m30s\nEtcdMembersDegraded: No unhealthy members found"
...
etcd 4.10.0-0.nightly-2022-02-17-234353 True True True 4h40m ClusterMemberControllerDegraded: unhealthy members found during reconciling members...
Based on the above, we can see the kubelet doesn't run the etcd static pod at the correct revision due to the reason 'The master nodes not ready'; this controller went degraded and emitted an event in the end.
For kube-scheduler-operator,
Force a rollout of kube-scheduler in the first terminal,
$ oc patch kubescheduler cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
kubescheduler.operator.openshift.io/cluster patched
When the new kube-scheduler installer pod is starting,
$ oc get pod -n openshift-kube-scheduler --show-labels | grep Running | grep install
installer-7-xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal 1/1 Running 0 23s app=installer
In the second terminal,
$ oc debug node/xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal   # log in to the node where the new kube-scheduler revision is rolling out
sh-4.4# ip link |grep ens
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
sh-4.4# cat test.sh
ifconfig ens4 down
sleep 600
ifconfig ens4 up
sh-4.4# chmod +x test.sh
# Check the latest installer pod's status; after the installer pod has completed successfully, wait about 10s, then run the script
sh-4.4# /tmp/test.sh &
After a while, monitor the kube-scheduler cluster operator in the first terminal,
$ while true; do oc get pod -n openshift-kube-scheduler --show-labels | grep Running;echo; oc get co | grep -v "True.*False.*False";echo;oc get events -n openshift-kube-scheduler-operator | grep -i 'static pod lifecycle failure';sleep 10;done | tee watch-sche.log
...
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
kube-scheduler 4.10.0-0.nightly-2022-02-17-234353 True True False 5h55m NodeInstallerProgressing: 3 nodes are at revision 6; 0 nodes have achieved new revision 7
...
0s Normal MissingStaticPod deployment/openshift-kube-scheduler-operator static pod lifecycle failure - static pod: "openshift-kube-scheduler" in namespace: "openshift-kube-scheduler" for revision: 7 on node: "xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal" didn't show up, waited: 2m30s
0s Normal OperatorStatusChanged deployment/openshift-kube-scheduler-operator Status for clusteroperator/kube-scheduler changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node \"xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal\" not ready since 2022-02-22 07:42:35 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)" to "MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"openshift-kube-scheduler\" in namespace: \"openshift-kube-scheduler\" for revision: 7 on node: \"xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal\" didn't show up, waited: 2m30s\nNodeControllerDegraded: The master nodes not ready: node \"xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal\" not ready since 2022-02-22 07:42:35 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"
...
kube-scheduler 4.10.0-0.nightly-2022-02-17-234353 True True True 5h57m MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "openshift-kube-scheduler" in namespace: "openshift-kube-scheduler" for revision: 7 on node: "xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal" didn't show up, waited: 2m30s...
Based on the above, we can see the kubelet doesn't run the kube-scheduler static pod at the correct revision due to the reason 'The master nodes not ready'; this controller went degraded and emitted an event in the end.
For KCM operator,
Force a rollout of KCM in the first terminal,
$ oc patch kubecontrollermanager cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
kubecontrollermanager.operator.openshift.io/cluster patched
When the new KCM installer pod is starting,
$ oc get pod -n openshift-kube-controller-manager --show-labels | grep Running | grep install
installer-7-xxxx-22410g1-m6qxd-master-2.c.openshift-qe.internal 1/1 Running 0 22s app=installer
In the second terminal,
$ oc debug node/xxxx-22410g1-m6qxd-master-2.c.openshift-qe.internal   # log in to the node where the new KCM revision is rolling out
sh-4.4# ip link |grep ens
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
sh-4.4# cat test.sh
ifconfig ens4 down
sleep 600
ifconfig ens4 up
sh-4.4# chmod +x test.sh
# Check the latest installer pod's status; after the installer pod has completed successfully, wait about 10s, then run the script
sh-4.4# /tmp/test.sh &
After a while, monitor the KCM cluster operator in the first terminal,
$ while true; do oc get pod -n openshift-kube-controller-manager --show-labels | grep Running;echo; oc get co | grep -v "True.*False.*False";echo;oc get events -n openshift-kube-controller-manager-operator | grep -i 'static pod lifecycle failure';sleep 10;done | tee watch-kcm.log
...
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
kube-controller-manager 4.10.0-0.nightly-2022-02-17-234353 True True False 6h29m NodeInstallerProgressing: 3 nodes are at revision 7; 0 nodes have achieved new revision 8
...
6s Normal MissingStaticPod deployment/kube-controller-manager-operator static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 8 on node: "xxxx-22410g1-m6qxd-master-2.c.openshift-qe.internal" didn't show up, waited: 2m30s
6s Normal OperatorStatusChanged deployment/kube-controller-manager-operator Status for clusteroperator/kube-controller-manager changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node \"xxxx-22410g1-m6qxd-master-2.c.openshift-qe.internal\" not ready since 2022-02-22 08:17:07 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)" to "MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"kube-controller-manager\" in namespace: \"openshift-kube-controller-manager\" for revision: 8 on node: \"xxxx-22410g1-m6qxd-master-2.c.openshift-qe.internal\" didn't show up, waited: 2m30s\nNodeControllerDegraded: The master nodes not ready: node \"xxxx-22410g1-m6qxd-master-2.c.openshift-qe.internal\" not ready since 2022-02-22 08:17:07 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"
...
kube-controller-manager 4.10.0-0.nightly-2022-02-17-234353 True True True 6h32m MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 8 on node: "xxxx-22410g1-m6qxd-master-2.c.openshift-qe.internal" didn't show up, waited: 2m30s...
Based on the above, we can see the kubelet doesn't run the KCM static pod at the correct revision due to the reason 'The master nodes not ready'; this controller went degraded and emitted an event in the end. In conclusion, the bug fix works fine, so moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056