Bug 2053268
| Summary: | inability to detect static pod lifecycle failure | | | |
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | David Eads <deads> | |
| Component: | kube-apiserver | Assignee: | Damien Grisonnet <dgrisonn> | |
| Status: | CLOSED ERRATA | QA Contact: | Ke Wang <kewang> | |
| Severity: | urgent | Docs Contact: | ||
| Priority: | urgent | |||
| Version: | 4.10 | CC: | akashem, aos-bugs, lszaszki, maszulik, mfojtik, xxia | |
| Target Milestone: | --- | |||
| Target Release: | 4.10.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | EmergencyRequest | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value | |
| Doc Text: | | Story Points: | --- | |
| Clone Of: | ||||
| : | 2053582 2053616 (view as bug list) | Environment: | ||
| Last Closed: | 2022-03-12 04:42:06 UTC | Type: | Bug | |
| Bug Depends On: | 2053582 | |||
| Bug Blocks: | 2053620 | |||
Description
David Eads
2022-02-10 20:02:38 UTC
We need this to determine the scope of our problem. If we ship 4.10.0 with a problem that prevents upgrade to 4.10.z+1, we have a serious problem. Marking blocker.

** A NOTE ABOUT USING URGENT **

This BZ has been set to urgent severity and priority. When a BZ is marked urgent priority, Engineers are asked to stop whatever they are doing, putting everything else on hold. Please be prepared to have reasonable justification ready to discuss, and ensure your own and engineering management are aware and agree this BZ is urgent. Keep in mind, urgent bugs are very expensive and have maximal management visibility.

NOTE: This bug was automatically assigned to an engineering manager with the severity reset to *unspecified* until the emergency is vetted and confirmed. Please do not manually override the severity.

** INFORMATION REQUIRED **

Please answer these questions before escalation to engineering:
1. Has a link to must-gather output been provided in this BZ? We cannot work without it. If must-gather fails to run, attach all relevant logs and provide the error message of must-gather.
2. Give the output of "oc get clusteroperators -o yaml".
3. In case of degraded/unavailable operators, have all their logs and the logs of the operands been analyzed? [yes/no]
4. List the top 5 relevant errors from the logs of the operators and operands in (3).
5. Order the list of degraded/unavailable operators according to which is likely the cause of the failure of the others, root cause at the top.
6. Explain why (5) is likely the right order and list the information used for that assessment.
7. Explain why Engineering is necessary to make progress.

Setting to blocker+ based on convo with David.

https://github.com/openshift/origin/pull/26837 is now associated with this, but it is only the easy HALF of the solution. Once the event exists, that test will fail, but the operators still need to produce the events.

> This detection could be built by checking the latest installer pods for each node. If the installer pod was successful, and if it has been longer than terminationGracePeriodSeconds + 10 seconds since the installer pod completed successfully, and if the static pod is not at the correct revision, this controller should go degraded. It should also emit an event for detection in CI.
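Below is a minimal Go sketch of the check proposed in the quote above, assuming the controller already knows whether the latest installer pod for a node succeeded, when it finished, and which revision the node is at. Every name and value in it is hypothetical and only illustrates the shape of the logic; it is not the actual operator code.

```go
// Illustrative sketch only. The inputs would in practice be derived from the
// installer pod and the operator's node status; here they are plain parameters.
package main

import (
	"fmt"
	"time"
)

// shouldGoDegraded reports whether the operator should mark the node degraded:
// the latest installer pod completed successfully, more than
// terminationGracePeriodSeconds+10s have elapsed since it finished, and the
// static pod is still not at the target revision.
func shouldGoDegraded(installerSucceeded bool, installerFinishedAt time.Time,
	gracePeriod time.Duration, currentRevision, targetRevision int32, now time.Time) bool {
	if !installerSucceeded || currentRevision == targetRevision {
		return false
	}
	return now.Sub(installerFinishedAt) > gracePeriod+10*time.Second
}

func main() {
	finishedAt := time.Now().Add(-5 * time.Minute) // installer finished 5 minutes ago
	if shouldGoDegraded(true, finishedAt, 2*time.Minute /* hypothetical grace period */, 7, 8, time.Now()) {
		// The real controller would also emit an event here (the
		// "MissingStaticPod ... static pod lifecycle failure" event seen later
		// in this bug) so that CI can detect the failure.
		fmt.Println("static pod for revision 8 did not show up; going Degraded")
	}
}
```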
kubeapiserver.operator.openshift.io/v1 today has this in its status:
nodeStatuses:
- currentRevision: 0
  nodeName: ip-10-0-210-17.us-west-1.compute.internal
  targetRevision: 3
This came from a static pod installer controller, AFAIK. If we extend this to report 'installedRevision: 2', which would actually be reported by the installer pod itself once it finishes copying the static pod manifests into the target kubelet directory, will that make debugging this easier?
In addition, we can have a controller that watches updates to this CRD and detects and reports time delays (if currentRevision=1 but installedRevision=2 for more than 60s: event/degraded/etc.).
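A rough Go sketch of that idea follows. The installedRevision and installedAt fields are hypothetical (they are the proposed extension, not part of today's API), and the 60s threshold is just the example value mentioned above.

```go
// Hypothetical sketch of the proposal: extend each nodeStatuses entry with an
// installedRevision written by the installer pod once it finishes copying the
// static pod manifests into the kubelet's manifest directory, and have a
// controller flag nodes that sit on an older currentRevision for too long.
package main

import (
	"fmt"
	"time"
)

type nodeStatus struct {
	nodeName          string
	currentRevision   int32     // revision the kubelet is actually running
	targetRevision    int32     // revision the operator is rolling out
	installedRevision int32     // hypothetical: revision installed on disk by the installer pod
	installedAt       time.Time // hypothetical: when the installer finished copying manifests
}

// staleInstall reports whether the kubelet has failed to pick up an already
// installed revision for longer than maxDelay; the watching controller would
// then emit an event and/or set a Degraded condition.
func staleInstall(s nodeStatus, now time.Time, maxDelay time.Duration) bool {
	return s.installedRevision > s.currentRevision && now.Sub(s.installedAt) > maxDelay
}

func main() {
	s := nodeStatus{
		nodeName:          "ip-10-0-210-17.us-west-1.compute.internal",
		currentRevision:   1,
		targetRevision:    2,
		installedRevision: 2,
		installedAt:       time.Now().Add(-90 * time.Second),
	}
	if staleInstall(s, time.Now(), 60*time.Second) {
		fmt.Printf("node %s: revision %d installed but kubelet still at %d\n",
			s.nodeName, s.installedRevision, s.currentRevision)
	}
}
```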
*** Bug 2053620 has been marked as a duplicate of this bug. ***

*** Bug 2053618 has been marked as a duplicate of this bug. ***

*** Bug 2053616 has been marked as a duplicate of this bug. ***

Moving it to ASSIGNED; the four operator PRs (kube-apiserver, kube-scheduler, etcd, kube-controller-manager) are on their way.

I am moving this BZ to POST because I want to merge a small improvement.

When I was trying to install a cluster with a Jenkins CI job, the installation got stuck with the kube-controller-manager cluster operator degraded.
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.0-0.nightly-2022-02-16-171622 True False 7h14m Error while reconciling 4.10.0-0.nightly-2022-02-16-171622: the cluster operator kube-controller-manager is degraded
$ oc get co | grep -v "True.*False.*False"
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
kube-controller-manager 4.10.0-0.nightly-2022-02-16-171622 True True True 7h29m MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 7 on node: "kewang-17410g1-vm7zr-master-0.c.openshift-qe.internal" didn't show up, waited: 2m30s
$ oc describe kubecontrollermanagers
Name: cluster
Namespace:
Labels: <none>
Annotations: include.release.openshift.io/ibm-cloud-managed: true
include.release.openshift.io/self-managed-high-availability: true
include.release.openshift.io/single-node-developer: true
release.openshift.io/create-only: true
API Version: operator.openshift.io/v1
Kind: KubeControllerManager
...
Status:
Conditions:
Last Transition Time: 2022-02-17T08:21:01Z
Reason: AsExpected
Status: False
Type: GuardControllerDegraded
Last Transition Time: 2022-02-17T08:11:53Z
Status: False
Type: InstallerControllerDegraded
Last Transition Time: 2022-02-17T08:12:57Z
Message: 3 nodes are active; 1 nodes are at revision 5; 1 nodes are at revision 6; 1 nodes are at revision 7
Status: True
Type: StaticPodsAvailable
Last Transition Time: 2022-02-17T08:09:28Z
Message: 1 nodes are at revision 5; 1 nodes are at revision 6; 1 nodes are at revision 7
Status: True
Type: NodeInstallerProgressing
Last Transition Time: 2022-02-17T08:09:16Z
Status: False
Type: NodeInstallerDegraded
Last Transition Time: 2022-02-17T08:22:07Z
Status: False
Type: StaticPodsDegraded
Last Transition Time: 2022-02-17T08:09:16Z
Message: All master nodes are ready
Reason: MasterNodesReady
Status: False
Type: NodeControllerDegraded
Last Transition Time: 2022-02-17T08:09:17Z
Reason: NoUnsupportedConfigOverrides
Status: True
Type: UnsupportedConfigOverridesUpgradeable
Last Transition Time: 2022-02-17T08:09:20Z
Status: False
Type: CertRotation_CSRSigningCert_Degraded
Last Transition Time: 2022-02-17T08:09:21Z
Reason: AsExpected
Status: False
Type: BackingResourceControllerDegraded
Last Transition Time: 2022-02-17T08:09:23Z
Status: False
Type: ResourceSyncControllerDegraded
Last Transition Time: 2022-02-17T08:09:36Z
Status: False
Type: ConfigObservationDegraded
Last Transition Time: 2022-02-17T08:09:25Z
Status: False
Type: InstallerPodPendingDegraded
Last Transition Time: 2022-02-17T08:09:25Z
Status: False
Type: InstallerPodContainerWaitingDegraded
Last Transition Time: 2022-02-17T08:09:25Z
Status: False
Type: InstallerPodNetworkingDegraded
Last Transition Time: 2022-02-17T08:11:11Z
Status: False
Type: RevisionControllerDegraded
Last Transition Time: 2022-02-17T08:25:29Z
Message: static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 7 on node: "kewang-17410g1-vm7zr-master-0.c.openshift-qe.internal" didn't show up, waited: 2m30s
Reason: SyncError
Status: True
Type: MissingStaticPodControllerDegraded
Last Transition Time: 2022-02-17T08:09:33Z
Reason: AsExpected
Status: False
Type: KubeControllerManagerStaticResourcesDegraded
Last Transition Time: 2022-02-17T08:09:34Z
Status: False
Type: SATokenSignerDegraded
Last Transition Time: 2022-02-17T08:09:51Z
Status: True
Type: Upgradeable
Last Transition Time: 2022-02-17T08:09:51Z
Status: True
Type: CloudControllerOwner
Last Transition Time: 2022-02-17T08:09:51Z
Status: False
Type: TargetConfigControllerDegraded
Latest Available Revision: 7
Latest Available Revision Reason:
Node Statuses:
Current Revision: 7
Node Name: kewang-17410g1-vm7zr-master-2.c.openshift-qe.internal
Current Revision: 5
Node Name: kewang-17410g1-vm7zr-master-0.c.openshift-qe.internal
Target Revision: 7
Current Revision: 6
Node Name: kewang-17410g1-vm7zr-master-1.c.openshift-qe.internal
Ready Replicas: 0
Events: <none>
----
The kube-controller-manager static pod is not at the correct revision, so this controller should indeed go degraded, but it has stayed degraded for as long as 7h14m, so there is a potential risk that an upgrade will fail. Is that what we expect?
The following verification steps were confirmed with the Devs.
All of the following operators should be tested:
1. kube-apiserver-operator
2. kube-controller-manager-operator
3. kube-scheduler-operator
4. etcd-operator
Checkpoints for each operator:
1. When the static pod rolls out, the latest installer pod for each node completes successfully.
2. After the kubelet deletes the currently running pod, if for some reason the kubelet doesn't run the static pod at the correct revision, this controller should go degraded and emit an event.
--------------------
For kube-apiserver-operator,
Force a rollout of kube-apiserver in one terminal,
$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "roll-'"$( date --rfc-3339=ns )"'"} ]'
When the new kube-apiserver installer pod is starting,
$ oc get pods -n openshift-kube-apiserver --show-labels | grep Running | grep install
installer-15-xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal 1/1 Running 0 5s app=installer
In another terminal,
$ oc debug node/xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal   # log in to the node where the new kube-apiserver revision is rolling out
sh-4.4# ip link |grep ens
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
sh-4.4# cat test.sh
ifconfig ens4 down
sleep 600
ifconfig ens4 up
sh-4.4# chmod +x test.sh
# Check the latest installer pod's status; after the installer pod has completed successfully, wait about 30s, then run the script (with the NIC down, the kubelet stops posting node status and the new revision never shows up)
sh-4.4# /tmp/test.sh &
After a while, check the cluster operators,
# oc get co | grep -v "True.*False.*False"
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
kube-apiserver 4.10.0-0.nightly-2022-02-17-234353 True True False 6h41m NodeInstallerProgressing: 2 nodes are at revision 13; 1 nodes are at revision 14; 0 nodes have achieved new revision 15
Check if there are any events about why the static pod cannot be started,
$ masters=$(oc get no -l node-role.kubernetes.io/master | sed '1d' | awk '{print $1}')
$ for node in $masters; do echo $node;oc debug no/$node -- chroot /host bash -c "grep -ir 'static pod lifecycle failure' /var/log/ | grep -v debug";done | tail -5
...
Removing debug pod ...
/var/log/pods/openshift-kube-apiserver-operator_kube-apiserver-operator-68ddc9cc8c-7hmp7_5a922cf8-4725-40aa-952b-9c300d14ce95/kube-apiserver-operator/0.log:2022-02-21T09:44:35.924962534+00:00 stderr F I0221 09:44:35.924611 1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"34684bef-914e-4bf8-b99a-adfdb0d134f5", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Degraded message changed from "MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"kube-apiserver\" in namespace: \"openshift-kube-apiserver\" for revision: 15 on node: \"xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal\" didn't show up, waited: 4m15s" to "MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"kube-apiserver\" in namespace: \"openshift-kube-apiserver\" for revision: 15 on node: \"xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal\" didn't show up, waited: 4m15s\nStaticPodsDegraded: pod/kube-apiserver-xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal container \"kube-apiserver\" started at 2022-02-21 09:42:40 +0000 UTC is still not ready\nStaticPodsDegraded: pod/kube-apiserver-xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal container \"kube-apiserver-check-endpoints\" is waiting: CrashLoopBackOff: back-off 2m40s restarting failed container=kube-apiserver-check-endpoints pod=kube-apiserver-xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal_openshift-kube-apiserver(1e4249e162c276500f1f54ec5bc523f6)"
...
2022-02-21T09:44:35.899450624+00:00 stderr F I0221 09:44:35.898490 1 status_controller.go:211] clusteroperator/kube-apiserver diff {"status":{"conditions":[{"lastTransitionTime":"2022-02-21T09:42:52Z","message":"MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"kube-apiserver\" in namespace: \"openshift-kube-apiserver\" for revision: 15 on node: \"xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal\" didn't show up, waited: 4m15s\nStaticPodsDegraded: pod/kube-apiserver-xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal container \"kube-apiserver\" started at 2022-02-21 09:42:40 +0000 UTC is still not ready\nStaticPodsDegraded: pod/kube-apiserver-xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal container \"kube-apiserver-check-endpoints\" is waiting: CrashLoopBackOff: back-off 2m40s restarting failed container=kube-apiserver-check-endpoints pod=kube-apiserver-xxxx-21410g1-bvcjc-master-0.c.openshift-qe.internal_openshift-kube-apiserver(1e4249e162c276500f1f54ec5bc523f6)","reason":"MissingStaticPodController_SyncError::StaticPods_Error","status":"True","type":"Degraded"},{"lastTransitionTime":"2022-02-21T09:18:10Z","message":"NodeInstallerProgressing: 2 nodes are at revision 13; 1 nodes are at revision 14; 0 nodes have achieved new revision 15","reason":"NodeInstaller","status":"True","type":"Progressing"},{"lastTransitionTime":"2022-02-21T02:51:53Z","message":"StaticPodsAvailable: 3 nodes are active; 2 nodes are at revision 13; 1 nodes are at revision 14; 0 nodes have achieved new revision 15","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2022-02-21T02:37:19Z","message":"KubeletMinorVersionUpgradeable: Kubelet and API server minor versions are synced.","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
The other operators will be verified later.
For openshift-etcd-operator,
Force a rollout of the etcd server in the first terminal,
$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
etcd.operator.openshift.io/cluster patched
When the new etcd installer pod is starting,
$ oc get pod -n openshift-etcd --show-labels | grep Running | grep install
installer-9-xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal 1/1 Running 0 24s app=installer
In the second terminal,
$ oc debug node/xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal   # log in to the node where the new etcd revision is rolling out
sh-4.4# ip link |grep ens
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
sh-4.4# cat test.sh
ifconfig ens4 down
sleep 600
ifconfig ens4 up
sh-4.4# chmod +x test.sh
# Check the latest installer pod's status; after the installer pod has completed successfully, wait about 10s, then run the script
sh-4.4# /tmp/test.sh &
After a while, monitor the etcd cluster operator in the first terminal,
$ while true; do oc get pod -n openshift-etcd --show-labels | grep Running;echo; oc get co | grep -v "True.*False.*False";echo;oc get events -n openshift-etcd-operator | grep -i 'static pod lifecycle failure';sleep 10;done | tee watch-etcd.log
...
etcd 4.10.0-0.nightly-2022-02-17-234353 True True False 4h32m NodeInstallerProgressing: 3 nodes are at revision 8; 0 nodes have achieved new revision 9
...
3s Normal MissingStaticPod deployment/etcd-operator static pod lifecycle failure - static pod: "etcd" in namespace: "openshift-etcd" for revision: 9 on node: "xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal" didn't show up, waited: 2m30s 2s Normal OperatorStatusChanged deployment/etcd-operator Status for clusteroperator/etcd changed: Degraded message changed from "EtcdMembersControllerDegraded: configmaps lister not synced\nDefragControllerDegraded: configmaps lister not synced\nNodeControllerDegraded: The master nodes not ready: node \"xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal\" not ready since 2022-02-22 06:19:01 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nEtcdMembersDegraded: No unhealthy members found" to "EtcdMembersControllerDegraded: configmaps lister not synced\nDefragControllerDegraded: configmaps lister not synced\nNodeControllerDegraded: The master nodes not ready: node \"xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal\" not ready since 2022-02-22 06:19:01 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nMissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"etcd\" in namespace: \"openshift-etcd\" for revision: 9 on node: \"xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal\" didn't show up, waited: 2m30s\nEtcdMembersDegraded: No unhealthy members found"
...
etcd 4.10.0-0.nightly-2022-02-17-234353 True True True 4h40m ClusterMemberControllerDegraded: unhealthy members found during reconciling members...
Based on the above, we can see the kubelet doesn't run the etcd static pod at the correct revision due to the reason 'The master nodes not ready'; this controller went degraded and emitted an event in the end.
For kube-scheduler-operator,
Force a rollout of kube-scheduler in the first terminal,
$ oc patch kubescheduler cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
kubescheduler.operator.openshift.io/cluster patched
When the new kube-scheduler installer pod is starting,
$ oc get pod -n openshift-kube-scheduler --show-labels | grep Running | grep install
installer-7-xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal 1/1 Running 0 23s app=installer
In the second terminal,
$ oc debug node/xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal   # log in to the node where the new kube-scheduler revision is rolling out
sh-4.4# ip link |grep ens
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
sh-4.4# cat test.sh
ifconfig ens4 down
sleep 600
ifconfig ens4 up
sh-4.4# chmod +x test.sh
# Check the latest installer pod's status; after the installer pod has completed successfully, wait about 10s, then run the script
sh-4.4# /tmp/test.sh &
After a while, monitor the kube-scheduler cluster operator in the first terminal,
$ while true; do oc get pod -n openshift-kube-scheduler --show-labels | grep Running;echo; oc get co | grep -v "True.*False.*False";echo;oc get events -n openshift-kube-scheduler-operator | grep -i 'static pod lifecycle failure';sleep 10;done | tee watch-sche.log
...
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
kube-scheduler 4.10.0-0.nightly-2022-02-17-234353 True True False 5h55m NodeInstallerProgressing: 3 nodes are at revision 6; 0 nodes have achieved new revision 7
...
0s Normal MissingStaticPod deployment/openshift-kube-scheduler-operator static pod lifecycle failure - static pod: "openshift-kube-scheduler" in namespace: "openshift-kube-scheduler" for revision: 7 on node: "xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal" didn't show up, waited: 2m30s
0s Normal OperatorStatusChanged deployment/openshift-kube-scheduler-operator Status for clusteroperator/kube-scheduler changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node \"xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal\" not ready since 2022-02-22 07:42:35 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)" to "MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"openshift-kube-scheduler\" in namespace: \"openshift-kube-scheduler\" for revision: 7 on node: \"xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal\" didn't show up, waited: 2m30s\nNodeControllerDegraded: The master nodes not ready: node \"xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal\" not ready since 2022-02-22 07:42:35 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"
...
kube-scheduler 4.10.0-0.nightly-2022-02-17-234353 True True True 5h57m MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "openshift-kube-scheduler" in namespace: "openshift-kube-scheduler" for revision: 7 on node: "xxxx-22410g1-m6qxd-master-0.c.openshift-qe.internal" didn't show up, waited: 2m30s...
Based on the above, we can see the kubelet doesn't run the kube-scheduler static pod at the correct revision due to the reason 'The master nodes not ready'; this controller went degraded and emitted an event in the end.
For KCM operator,
Force a rollout of KCM in the first terminal,
$ oc patch kubecontrollermanager cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
kubecontrollermanager.operator.openshift.io/cluster patched
When the new KCM installer pod is starting,
$ oc get pod -n openshift-kube-controller-manager --show-labels | grep Running | grep install
installer-7-xxxx-22410g1-m6qxd-master-2.c.openshift-qe.internal 1/1 Running 0 22s app=installer
In the second terminal,
$ oc debug node/xxxx-22410g1-m6qxd-master-2.c.openshift-qe.internal   # log in to the node where the new KCM revision is rolling out
sh-4.4# ip link |grep ens
2: ens4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
sh-4.4# cat test.sh
ifconfig ens4 down
sleep 600
ifconfig ens4 up
sh-4.4# chmod +x test.sh
# Check the latest installer pod's status; after the installer pod has completed successfully, wait about 10s, then run the script
sh-4.4# /tmp/test.sh &
After a while, monitor the KCM cluster operator in the first terminal,
$ while true; do oc get pod -n openshift-kube-controller-manager --show-labels | grep Running;echo; oc get co | grep -v "True.*False.*False";echo;oc get events -n openshift-kube-controller-manager-operator | grep -i 'static pod lifecycle failure';sleep 10;done | tee watch-kcm.log
...
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
kube-controller-manager 4.10.0-0.nightly-2022-02-17-234353 True True False 6h29m NodeInstallerProgressing: 3 nodes are at revision 7; 0 nodes have achieved new revision 8
...
6s Normal MissingStaticPod deployment/kube-controller-manager-operator static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 8 on node: "xxxx-22410g1-m6qxd-master-2.c.openshift-qe.internal" didn't show up, waited: 2m30s
6s Normal OperatorStatusChanged deployment/kube-controller-manager-operator Status for clusteroperator/kube-controller-manager changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node \"xxxx-22410g1-m6qxd-master-2.c.openshift-qe.internal\" not ready since 2022-02-22 08:17:07 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)" to "MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"kube-controller-manager\" in namespace: \"openshift-kube-controller-manager\" for revision: 8 on node: \"xxxx-22410g1-m6qxd-master-2.c.openshift-qe.internal\" didn't show up, waited: 2m30s\nNodeControllerDegraded: The master nodes not ready: node \"xxxx-22410g1-m6qxd-master-2.c.openshift-qe.internal\" not ready since 2022-02-22 08:17:07 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)"
...
kube-controller-manager 4.10.0-0.nightly-2022-02-17-234353 True True True 6h32m MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 8 on node: "xxxx-22410g1-m6qxd-master-2.c.openshift-qe.internal" didn't show up, waited: 2m30s...
Based on the above, we can see the kubelet doesn't run the KCM static pod at the correct revision due to the reason 'The master nodes not ready'; this controller went degraded and emitted an event in the end. In conclusion, the bug fix works fine, so moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056