Bug 2064024 - SNO OCP upgrade with DU workload stuck at waiting for kube-apiserver static pod
Summary: SNO OCP upgrade with DU workload stuck at waiting for kube-apiserver static pod
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Ryan Phillips
QA Contact: yliu1
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-14 21:38 UTC by yliu1
Modified: 2022-08-10 10:54 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
No docs needed.
Clone Of:
Environment:
Last Closed: 2022-08-10 10:54:06 UTC
Target Upstream Version:
Embargoed:
keyoung: needinfo-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 816 0 None open bug 2064024: Update library-go to 80f9619c2 2022-05-18 07:29:15 UTC
Github openshift cluster-kube-scheduler-operator pull 425 0 None Merged bug 2064024: Update library-go to 80f9619c2 2022-05-18 07:29:14 UTC
Github openshift cluster-kube-scheduler-operator pull 427 0 None Merged bug 2064024: README: fix scheduler configuration formatting 2022-05-18 07:29:14 UTC
Github openshift library-go pull 1347 0 None Merged Bug 2064024: Increase timeout for missing static pods 2022-05-16 11:53:58 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:54:21 UTC

Description yliu1 2022-03-14 21:38:44 UTC
Description of problem:
Attempting to upgrade the spoke cluster OCP from 4.9 to 4.10 with the DU workload application running - the upgrade gets stuck waiting for the kube-apiserver static pod.

###########
ClusterID: 15877094-48af-4e08-a7da-58c14b3c4c2e
ClusterVersion: Updating to "4.10.4" from "4.9.23" for 7 hours: Working towards 4.10.4: 96 of 770 done (12% complete)
ClusterOperators:
	clusteroperator/image-registry is not available (Available: The deployment does not have available replicas
NodeCADaemonAvailable: The daemon set node-ca has available replicas
ImagePrunerAvailable: Pruner CronJob has been created) because Degraded: The deployment does not have available replicas
ImagePrunerDegraded: Job has reached the specified backoff limit
	clusteroperator/kube-apiserver is degraded because MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-apiserver" in namespace: "openshift-kube-apiserver" for revision: 12 on node: "master-2.cluster1.savanna.lab.eng.rdu2.redhat.com" didn't show up, waited: 2m15s
StaticPodFallbackRevisionDegraded: a static pod kube-apiserver-master-2.cluster1.savanna.lab.eng.rdu2.redhat.com was rolled back to revision 12 due to waiting for kube-apiserver static pod kube-apiserver-master-2.cluster1.savanna.lab.eng.rdu2.redhat.com to be running: Pending
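
The same ClusterVersion/ClusterOperator state can be re-checked live against the cluster; a minimal sketch using standard oc queries:

$ oc get clusterversion version -o yaml
$ oc get clusteroperator kube-apiserver image-registry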



Version-Release number of selected component (if applicable):
4.10

How reproducible:
Always, with the RAN test app running

Steps to Reproduce:
1. Deploy a 4.9 SNO DU node
2. Create DU test app on spoke and wait for all pods running
3. Start the OCP upgrade to 4.10.4 (via oc adm upgrade or by patching clusterversion; example commands below)
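
For step 3, a minimal sketch of both trigger methods (the release image/version below follow this report; the patch form assumes 4.10.4 is listed in availableUpdates):

$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release:4.10.4-x86_64 --force --allow-explicit-upgrade
$ oc patch clusterversion version --type merge --patch '{"spec":{"desiredUpdate":{"version":"4.10.4"}}}'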

Actual results:
- Clusterversion stuck at waiting for kube-apiserver static pod after installer-12 was completed. The following error message also indicates the pod is not running, but pod kube-apiserver-master-2.cluster1.savanna.lab.eng.rdu2.redhat.com was actually running.

###
ClusterVersion: Updating to "4.10.4" from "4.9.23" for 7 hours: Working towards 4.10.4: 96 of 770 done (12% complete)

	clusteroperator/kube-apiserver is degraded because MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-apiserver" in namespace: "openshift-kube-apiserver" for revision: 12 on node: "master-2.cluster1.savanna.lab.eng.rdu2.redhat.com" didn't show up, waited: 2m15s
StaticPodFallbackRevisionDegraded: a static pod kube-apiserver-master-2.cluster1.savanna.lab.eng.rdu2.redhat.com was rolled back to revision 12 due to waiting for kube-apiserver static pod kube-apiserver-master-2.cluster1.savanna.lab.eng.rdu2.redhat.com to be running: Pending


Expected results:
Upgrade succeeds.

Additional info:
After I removed the workload app, the cluster upgrade started and proceeded successfully.

Comment 4 Ken Young 2022-03-22 20:37:45 UTC
Stefan,

This bug is currently gating us from declaring a telco certified load on 4.10 for our customers. Have you had a chance to review this? Is there anything we can do to assist here?

/KenY

Comment 5 Abu Kashem 2022-03-24 20:18:27 UTC
keyoung,

> Spoke must-gather can be downloaded from here:
> http://registry.ran-vcl01.ptp.lab.eng.bos.redhat.com:8080/images/mustgather-master-2.tar.gz

Does the must-gather have logs covering the upgrade error? If so, we will take a look at it.

Comment 6 yliu1 2022-03-25 12:45:16 UTC
The server was reinstalled, so the logs are gone. I will reproduce this issue and contact you.

Comment 8 Lukasz Szaszkiewicz 2022-03-29 17:33:20 UTC
The purpose of kube-apiserver-startup-monitor is to monitor the kube-apiserver binary. 
Basically, any new revision will install a new kube-apiserver binary alongside the monitoring application. 

The monitoring app runs a series of checks to ensure that the server at the new revision doesn't have any issues. 
In case of any issues, it rolls back to a previous version/revision. I think the default timeout is 5 minutes.

Comment 9 Lukasz Szaszkiewicz 2022-03-29 17:35:42 UTC
If the kube-apiserver needs more than 5 minutes to become ready, then the monitor will install a previous version.
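
A quick way to see which revision the node actually reached, and whether a revision failed, is the operator's per-node status and the clusteroperator conditions (a sketch; field names per the operator.openshift.io/v1 KubeAPIServer API):

$ oc get kubeapiserver cluster -o jsonpath='{.status.nodeStatuses}'
$ oc get clusteroperator kube-apiserver -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'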

Comment 10 Lukasz Szaszkiewicz 2022-03-29 17:36:40 UTC
I checked the attached must-gather but didn't find any logs from the monitor.

Comment 12 yliu1 2022-03-30 14:00:54 UTC
The pod stayed in PodInitializing for an extended period of time (more than 20 minutes), but did come up eventually.
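
When the pod sits in PodInitializing like this, the container statuses and namespace events usually show what it is waiting on; a sketch (pod name taken from the description above):

$ oc describe pod -n openshift-kube-apiserver kube-apiserver-master-2.cluster1.savanna.lab.eng.rdu2.redhat.com
$ oc get events -n openshift-kube-apiserver --sort-by=.lastTimestamp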

Comment 28 yliu1 2022-04-13 14:39:16 UTC
Encountered the same issue upgrading from 4.9.28 to 4.10.9.

[yliu1@yliu1 ~]$ oc get clusterversions.config.openshift.io 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.28    True        True          20h     Unable to apply 4.10.9: wait has exceeded 40 minutes for these operators: kube-apiserver
[yliu1@yliu1 ~]$ oc get co 
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.9.28    True        False         False      39h     
baremetal                                  4.9.28    True        False         False      41h     
cloud-controller-manager                   4.9.28    True        False         False      41h     
cloud-credential                           4.9.28    True        False         False      41h     
cluster-autoscaler                         4.9.28    True        False         False      41h     
config-operator                            4.9.28    True        False         False      41h     
console                                    4.9.28    True        False         False      41h     
csi-snapshot-controller                    4.9.28    True        False         False      41h     
dns                                        4.9.28    True        False         False      40h     
etcd                                       4.10.9    True        False         False      41h     
image-registry                             4.9.28    True        False         False      41h     
ingress                                    4.9.28    True        False         False      41h     
insights                                   4.9.28    True        False         False      41h     
kube-apiserver                             4.9.28    True        True          True       41h     MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-apiserver" in namespace: "openshift-kube-apiserver" for revision: 11 on node: "test-sno-1.lab.eng.rdu2.redhat.com" didn't show up, waited: 2m15s...
kube-controller-manager                    4.9.28    True        False         False      41h     
kube-scheduler                             4.9.28    True        False         False      41h     
kube-storage-version-migrator              4.9.28    True        False         False      41h     
machine-api                                4.9.28    True        False         False      41h     
machine-approver                           4.9.28    True        False         False      41h     
machine-config                             4.9.28    True        False         False      41h     
marketplace                                4.9.28    True        False         False      41h     
monitoring                                 4.9.28    True        False         False      41h     
network                                    4.9.28    True        False         False      41h     
node-tuning                                4.9.28    True        False         False      40h     
openshift-apiserver                        4.9.28    True        False         False      40h     
openshift-controller-manager               4.9.28    True        False         False      17h     
openshift-samples                          4.9.28    True        False         False      41h     
operator-lifecycle-manager                 4.9.28    True        False         False      41h     
operator-lifecycle-manager-catalog         4.9.28    True        False         False      41h     
operator-lifecycle-manager-packageserver   4.9.28    True        False         False      40h     
service-ca                                 4.9.28    True        False         False      41h     
storage                                    4.9.28    True        False         False      41h     
[yliu1@yliu1 ~]$ oc get pods -n openshift-kube-apiserver
NAME                                                      READY   STATUS      RESTARTS   AGE
installer-10-retry-1-test-sno-1.lab.eng.rdu2.redhat.com   0/1     Completed   0          18h
installer-10-test-sno-1.lab.eng.rdu2.redhat.com           0/1     Completed   0          19h
installer-11-retry-1-test-sno-1.lab.eng.rdu2.redhat.com   0/1     Completed   0          17h
installer-11-retry-2-test-sno-1.lab.eng.rdu2.redhat.com   0/1     Completed   0          16h
installer-11-retry-3-test-sno-1.lab.eng.rdu2.redhat.com   0/1     Completed   0          14h
installer-11-retry-4-test-sno-1.lab.eng.rdu2.redhat.com   0/1     Completed   0          12h
installer-11-retry-5-test-sno-1.lab.eng.rdu2.redhat.com   0/1     Completed   0          10h
installer-11-retry-6-test-sno-1.lab.eng.rdu2.redhat.com   0/1     Completed   0          7h54m
installer-11-retry-7-test-sno-1.lab.eng.rdu2.redhat.com   0/1     Completed   0          5h34m
installer-11-retry-8-test-sno-1.lab.eng.rdu2.redhat.com   0/1     Completed   0          3h14m
installer-11-retry-9-test-sno-1.lab.eng.rdu2.redhat.com   0/1     Completed   0          56m
installer-11-test-sno-1.lab.eng.rdu2.redhat.com           0/1     Completed   0          18h
installer-2-test-sno-1.lab.eng.rdu2.redhat.com            0/1     Completed   0          41h
installer-3-test-sno-1.lab.eng.rdu2.redhat.com            0/1     Completed   0          41h
installer-4-test-sno-1.lab.eng.rdu2.redhat.com            0/1     Completed   0          41h
installer-5-test-sno-1.lab.eng.rdu2.redhat.com            0/1     Completed   0          41h
installer-6-test-sno-1.lab.eng.rdu2.redhat.com            0/1     Completed   0          41h
installer-7-retry-1-test-sno-1.lab.eng.rdu2.redhat.com    0/1     Completed   0          23h
installer-7-test-sno-1.lab.eng.rdu2.redhat.com            0/1     Completed   0          23h
installer-8-retry-1-test-sno-1.lab.eng.rdu2.redhat.com    0/1     Completed   0          22h
installer-8-retry-2-test-sno-1.lab.eng.rdu2.redhat.com    0/1     Completed   0          21h
installer-8-test-sno-1.lab.eng.rdu2.redhat.com            0/1     Completed   0          22h
installer-9-retry-1-test-sno-1.lab.eng.rdu2.redhat.com    0/1     Completed   0          19h
installer-9-test-sno-1.lab.eng.rdu2.redhat.com            0/1     Error       0          20h
kube-apiserver-test-sno-1.lab.eng.rdu2.redhat.com         5/5     Running     1          37m
revision-pruner-10-test-sno-1.lab.eng.rdu2.redhat.com     0/1     Completed   0          19h
revision-pruner-11-test-sno-1.lab.eng.rdu2.redhat.com     0/1     Completed   0          18h
revision-pruner-6-test-sno-1.lab.eng.rdu2.redhat.com      0/1     Completed   0          41h
revision-pruner-7-test-sno-1.lab.eng.rdu2.redhat.com      0/1     Completed   0          23h
revision-pruner-8-test-sno-1.lab.eng.rdu2.redhat.com      0/1     Completed   0          22h
revision-pruner-9-test-sno-1.lab.eng.rdu2.redhat.com      0/1     Completed   0          20h

Comment 29 Ryan Phillips 2022-04-14 21:15:41 UTC
Upgrade graphs on the release dashboard are showing successful upgrades.

The static pod controller gives a 2.5 minute grace period for the static pod to show up [1]. I tried this upgrade path with a clusterbot launch with `launch 4.9.28 single-node`. After installation, I upgraded to 4.10.9 using [2]. The upgrade went smoothly until it got a failed machine-config (IIRC this is a known problem):

machine-config                             4.9.28    False       True          False      12m     Cluster not available for [{operator 4.9.28}]

but every other component is correct:

authentication                             4.10.9    True        False         False      23m
baremetal                                  4.10.9    True        False         False      79m
cloud-controller-manager                   4.10.9    True        False         False      83m
cloud-credential                           4.10.9    True        False         False      88m
cluster-autoscaler                         4.10.9    True        False         False      78m
config-operator                            4.10.9    True        False         False      82m
console                                    4.10.9    True        False         False      23m
csi-snapshot-controller                    4.10.9    True        False         False      39m
dns                                        4.10.9    True        False         False      23m
etcd                                       4.10.9    True        False         False      78m
image-registry                             4.10.9    True        False         False      37m
ingress                                    4.10.9    True        False         False      4m9s
insights                                   4.10.9    True        False         False      78m
kube-apiserver                             4.10.9    True        False         False      77m
kube-controller-manager                    4.10.9    True        False         False      77m
kube-scheduler                             4.10.9    True        False         False      77m
kube-storage-version-migrator              4.10.9    True        False         False      82m
machine-api                                4.10.9    True        False         False      75m
machine-approver                           4.10.9    True        False         False      81m
machine-config                             4.9.28    False       True          False      12m     Cluster not available for [{operator 4.9.28}]
marketplace                                4.10.9    True        False         False      79m
monitoring                                 4.10.9    True        False         False      71m
network                                    4.10.9    True        False         False      83m
node-tuning                                4.10.9    True        False         False      37m
openshift-apiserver                        4.10.9    True        False         False      38m
openshift-controller-manager               4.10.9    True        False         False      73m
openshift-samples                          4.10.9    True        False         False      29m
operator-lifecycle-manager                 4.10.9    True        False         False      81m
operator-lifecycle-manager-catalog         4.10.9    True        False         False      81m
operator-lifecycle-manager-packageserver   4.10.9    True        False         False      76m
service-ca                                 4.10.9    True        False         False      82m
storage                                    4.10.9    True        False         False      33m

====

I believe the issue is environmental to the lab (perhaps the disks or networking are slow). Signs seem to point to disk IO.

1. https://github.com/openshift/library-go/blob/535fc9bdb13be365bce1ce8a14a871ba8de09f0b/pkg/operator/staticpod/controller/missingstaticpod/missing_static_pod_controller.go#L108
2. oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release:4.10.9-x86_64 --force --allow-explicit-upgrade
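
For anyone checking the disk-IO theory on the SNO node, a rough sketch (node name taken from comment 28; the grep patterns are only illustrative):

$ oc adm top node
$ oc debug node/test-sno-1.lab.eng.rdu2.redhat.com -- chroot /host journalctl -u kubelet --since "1 hour ago" | grep -iE 'timeout|slow|deadline'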

Comment 30 Ryan Phillips 2022-04-14 21:20:43 UTC
Right after I posted this comment the cluster completed the upgrade successfully:

authentication                             4.10.9    True        False         False      29m
baremetal                                  4.10.9    True        False         False      85m
cloud-controller-manager                   4.10.9    True        False         False      89m
cloud-credential                           4.10.9    True        False         False      94m
cluster-autoscaler                         4.10.9    True        False         False      84m
config-operator                            4.10.9    True        False         False      88m
console                                    4.10.9    True        False         False      29m
csi-snapshot-controller                    4.10.9    True        False         False      45m
dns                                        4.10.9    True        False         False      29m
etcd                                       4.10.9    True        False         False      84m
image-registry                             4.10.9    True        False         False      43m
ingress                                    4.10.9    True        False         False      10m
insights                                   4.10.9    True        False         False      84m
kube-apiserver                             4.10.9    True        False         False      82m
kube-controller-manager                    4.10.9    True        False         False      83m
kube-scheduler                             4.10.9    True        False         False      83m
kube-storage-version-migrator              4.10.9    True        False         False      88m
machine-api                                4.10.9    True        False         False      81m
machine-approver                           4.10.9    True        False         False      87m
machine-config                             4.10.9    True        False         False      4m38s
marketplace                                4.10.9    True        False         False      85m
monitoring                                 4.10.9    True        False         False      77m
network                                    4.10.9    True        False         False      89m
node-tuning                                4.10.9    True        False         False      43m
openshift-apiserver                        4.10.9    True        False         False      44m
openshift-controller-manager               4.10.9    True        False         False      79m
openshift-samples                          4.10.9    True        False         False      35m
operator-lifecycle-manager                 4.10.9    True        False         False      87m
operator-lifecycle-manager-catalog         4.10.9    True        False         False      87m
operator-lifecycle-manager-packageserver   4.10.9    True        False         False      82m
service-ca                                 4.10.9    True        False         False      88m
storage                                    4.10.9    True        False         False      39m

Comment 31 yliu1 2022-04-19 19:08:39 UTC
As I mentioned in the bz description, if I remove the test workload pods, the upgrade continues without any issue.
This is only observed with the workload pods, which have some exec probes enabled. I can provide an env for debugging if that helps.
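
To list which workload pods define exec probes (the ones suspected of loading the kubelet/CRI-O during the rollout), something like this works, assuming jq is available locally:

$ oc get pods -A -o json | jq -r '.items[] | select(any(.spec.containers[]; .livenessProbe.exec or .readinessProbe.exec or .startupProbe.exec)) | .metadata.namespace + "/" + .metadata.name'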

Comment 32 Ryan Phillips 2022-04-25 13:03:51 UTC
There is a PR over here to address the issue: https://github.com/openshift/library-go/pull/1347

Comment 35 Jan Chaloupka 2022-05-16 11:57:41 UTC
Moving to POST so I can sneak in the components vendoring the change from https://github.com/openshift/library-go/pull/1347

Comment 42 yliu1 2022-07-13 15:14:34 UTC
Verified upgrading from 4.10.20 to 4.11.0-rc.0. The previous failure point was passed, although the upgrade eventually failed at a later stage with a new bz (bz 2102777).

Comment 43 errata-xmlrpc 2022-08-10 10:54:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

