Description of problem:

Attempt to upgrade a spoke cluster from OCP 4.9 to 4.10 with a DU workload application running - the upgrade gets stuck waiting for the kube-apiserver static pod.

###########
ClusterID: 15877094-48af-4e08-a7da-58c14b3c4c2e
ClusterVersion: Updating to "4.10.4" from "4.9.23" for 7 hours: Working towards 4.10.4: 96 of 770 done (12% complete)
ClusterOperators:
	clusteroperator/image-registry is not available (Available: The deployment does not have available replicas
	NodeCADaemonAvailable: The daemon set node-ca has available replicas
	ImagePrunerAvailable: Pruner CronJob has been created) because Degraded: The deployment does not have available replicas
	ImagePrunerDegraded: Job has reached the specified backoff limit
	clusteroperator/kube-apiserver is degraded because MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-apiserver" in namespace: "openshift-kube-apiserver" for revision: 12 on node: "master-2.cluster1.savanna.lab.eng.rdu2.redhat.com" didn't show up, waited: 2m15s
	StaticPodFallbackRevisionDegraded: a static pod kube-apiserver-master-2.cluster1.savanna.lab.eng.rdu2.redhat.com was rolled back to revision 12 due to waiting for kube-apiserver static pod kube-apiserver-master-2.cluster1.savanna.lab.eng.rdu2.redhat.com to be running: Pending

Version-Release number of selected component (if applicable):
4.10

How reproducible:
Always, with the RAN test app running

Steps to Reproduce:
1. Deploy a 4.9 SNO DU node
2. Create the DU test app on the spoke and wait for all pods to be running
3. Start the OCP upgrade to 4.10.4 (via oc adm upgrade or by patching the ClusterVersion; see the sketch after this description)

Actual results:
- ClusterVersion stuck waiting for the kube-apiserver static pod after installer-12 completed. The error message below indicates the pod is not running, but pod kube-apiserver-master-2.cluster1.savanna.lab.eng.rdu2.redhat.com was actually running.

###
ClusterVersion: Updating to "4.10.4" from "4.9.23" for 7 hours: Working towards 4.10.4: 96 of 770 done (12% complete)
clusteroperator/kube-apiserver is degraded because MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-apiserver" in namespace: "openshift-kube-apiserver" for revision: 12 on node: "master-2.cluster1.savanna.lab.eng.rdu2.redhat.com" didn't show up, waited: 2m15s
StaticPodFallbackRevisionDegraded: a static pod kube-apiserver-master-2.cluster1.savanna.lab.eng.rdu2.redhat.com was rolled back to revision 12 due to waiting for kube-apiserver static pod kube-apiserver-master-2.cluster1.savanna.lab.eng.rdu2.redhat.com to be running: Pending

Expected results:
Upgrade succeeds.

Additional info:
After I removed the workload app, the cluster upgrade started and proceeded successfully.
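For reference, the two ways of triggering the upgrade mentioned in step 3 look roughly like this (the version is taken from this report; adjust flags/version for your environment):

# Step 3, sketched out:
# via oc adm upgrade
oc adm upgrade --to=4.10.4
# or by patching the ClusterVersion resource directly
oc patch clusterversion version --type merge \
  -p '{"spec":{"desiredUpdate":{"version":"4.10.4"}}}'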
Stefan, this bug is currently gating us from declaring a telco certified load on 4.10 for our customers. Have you had a chance to review this? Is there anything we can do to assist here? /KenY
keyoung,

> Spoke must-gather can be downloaded from here:
> http://registry.ran-vcl01.ptp.lab.eng.bos.redhat.com:8080/images/mustgather-master-2.tar.gz

Does the must-gather have logs covering the upgrade error? If so, we will take a look at it.
The server was reinstalled, so the logs are gone. I will reproduce this issue and contact you.
The purpose of kube-apiserver-startup-monitor is to monitor the kube-apiserver binary. Basically, any new revision will install a new kube-apiserver binary alongside the monitoring application. The monitoring app runs a series of checks to ensure that the server at the new revision doesn't have any issues. In case of any issues, it rolls back to a previous version/revision. I think the default timeout is 5 minutes.
If the kube-apiserver needs more than 5 minutes to become ready, the monitor will fall back to the previous revision.
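To make that concrete, the fallback behavior is roughly equivalent to the following loop (a sketch only; the real startup monitor is a Go binary, and the endpoint, paths, and 5-minute window here are assumptions based on the description above):

deadline=$(( $(date +%s) + 300 ))          # ~5-minute window
until curl -ksf https://localhost:6443/readyz >/dev/null; do
  if [ "$(date +%s)" -ge "$deadline" ]; then
    echo "new revision never became ready; restoring the last-known-good manifest"
    # the real monitor would copy the previous revision's static pod manifest
    # back into /etc/kubernetes/manifests at this point
    break
  fi
  sleep 5
done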
I checked the attached must-gather but didn't find any logs from the monitor.
The pod stayed in PodInitializing for an extended period of time (more than 20 minutes), but it did come up eventually.
Encountered the same issue upgrading from 4.9.28 to 4.10.9.

[yliu1@yliu1 ~]$ oc get clusterversions.config.openshift.io
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.28    True        True          20h     Unable to apply 4.10.9: wait has exceeded 40 minutes for these operators: kube-apiserver

[yliu1@yliu1 ~]$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.9.28    True        False         False      39h
baremetal                                  4.9.28    True        False         False      41h
cloud-controller-manager                   4.9.28    True        False         False      41h
cloud-credential                           4.9.28    True        False         False      41h
cluster-autoscaler                         4.9.28    True        False         False      41h
config-operator                            4.9.28    True        False         False      41h
console                                    4.9.28    True        False         False      41h
csi-snapshot-controller                    4.9.28    True        False         False      41h
dns                                        4.9.28    True        False         False      40h
etcd                                       4.10.9    True        False         False      41h
image-registry                             4.9.28    True        False         False      41h
ingress                                    4.9.28    True        False         False      41h
insights                                   4.9.28    True        False         False      41h
kube-apiserver                             4.9.28    True        True          True       41h     MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-apiserver" in namespace: "openshift-kube-apiserver" for revision: 11 on node: "test-sno-1.lab.eng.rdu2.redhat.com" didn't show up, waited: 2m15s...
kube-controller-manager                    4.9.28    True        False         False      41h
kube-scheduler                             4.9.28    True        False         False      41h
kube-storage-version-migrator              4.9.28    True        False         False      41h
machine-api                                4.9.28    True        False         False      41h
machine-approver                           4.9.28    True        False         False      41h
machine-config                             4.9.28    True        False         False      41h
marketplace                                4.9.28    True        False         False      41h
monitoring                                 4.9.28    True        False         False      41h
network                                    4.9.28    True        False         False      41h
node-tuning                                4.9.28    True        False         False      40h
openshift-apiserver                        4.9.28    True        False         False      40h
openshift-controller-manager               4.9.28    True        False         False      17h
openshift-samples                          4.9.28    True        False         False      41h
operator-lifecycle-manager                 4.9.28    True        False         False      41h
operator-lifecycle-manager-catalog         4.9.28    True        False         False      41h
operator-lifecycle-manager-packageserver   4.9.28    True        False         False      40h
service-ca                                 4.9.28    True        False         False      41h
storage                                    4.9.28    True        False         False      41h

[yliu1@yliu1 ~]$ oc get pods -n openshift-kube-apiserver
NAME                                                      READY   STATUS      RESTARTS   AGE
installer-10-retry-1-test-sno-1.lab.eng.rdu2.redhat.com   0/1     Completed   0          18h
installer-10-test-sno-1.lab.eng.rdu2.redhat.com           0/1     Completed   0          19h
installer-11-retry-1-test-sno-1.lab.eng.rdu2.redhat.com   0/1     Completed   0          17h
installer-11-retry-2-test-sno-1.lab.eng.rdu2.redhat.com   0/1     Completed   0          16h
installer-11-retry-3-test-sno-1.lab.eng.rdu2.redhat.com   0/1     Completed   0          14h
installer-11-retry-4-test-sno-1.lab.eng.rdu2.redhat.com   0/1     Completed   0          12h
installer-11-retry-5-test-sno-1.lab.eng.rdu2.redhat.com   0/1     Completed   0          10h
installer-11-retry-6-test-sno-1.lab.eng.rdu2.redhat.com   0/1     Completed   0          7h54m
installer-11-retry-7-test-sno-1.lab.eng.rdu2.redhat.com   0/1     Completed   0          5h34m
installer-11-retry-8-test-sno-1.lab.eng.rdu2.redhat.com   0/1     Completed   0          3h14m
installer-11-retry-9-test-sno-1.lab.eng.rdu2.redhat.com   0/1     Completed   0          56m
installer-11-test-sno-1.lab.eng.rdu2.redhat.com           0/1     Completed   0          18h
installer-2-test-sno-1.lab.eng.rdu2.redhat.com            0/1     Completed   0          41h
installer-3-test-sno-1.lab.eng.rdu2.redhat.com            0/1     Completed   0          41h
installer-4-test-sno-1.lab.eng.rdu2.redhat.com            0/1     Completed   0          41h
installer-5-test-sno-1.lab.eng.rdu2.redhat.com            0/1     Completed   0          41h
installer-6-test-sno-1.lab.eng.rdu2.redhat.com            0/1     Completed   0          41h
installer-7-retry-1-test-sno-1.lab.eng.rdu2.redhat.com    0/1     Completed   0          23h
installer-7-test-sno-1.lab.eng.rdu2.redhat.com            0/1     Completed   0          23h
installer-8-retry-1-test-sno-1.lab.eng.rdu2.redhat.com    0/1     Completed   0          22h
installer-8-retry-2-test-sno-1.lab.eng.rdu2.redhat.com    0/1     Completed   0          21h
installer-8-test-sno-1.lab.eng.rdu2.redhat.com            0/1     Completed   0          22h
installer-9-retry-1-test-sno-1.lab.eng.rdu2.redhat.com    0/1     Completed   0          19h
installer-9-test-sno-1.lab.eng.rdu2.redhat.com            0/1     Error       0          20h
kube-apiserver-test-sno-1.lab.eng.rdu2.redhat.com         5/5     Running     1          37m
revision-pruner-10-test-sno-1.lab.eng.rdu2.redhat.com     0/1     Completed   0          19h
revision-pruner-11-test-sno-1.lab.eng.rdu2.redhat.com     0/1     Completed   0          18h
revision-pruner-6-test-sno-1.lab.eng.rdu2.redhat.com      0/1     Completed   0          41h
revision-pruner-7-test-sno-1.lab.eng.rdu2.redhat.com      0/1     Completed   0          23h
revision-pruner-8-test-sno-1.lab.eng.rdu2.redhat.com      0/1     Completed   0          22h
revision-pruner-9-test-sno-1.lab.eng.rdu2.redhat.com      0/1     Completed   0          20h
Upgrade graphs on the release dashboard are showing successful upgrades. The static pod controller gives a 2.5 minute grace period for the static pod to show up [1].

I tried this upgrade path with a clusterbot launch with `launch 4.9.28 single-node`. After installation, I upgraded to 4.10.9 using [2]. The upgrade went smoothly until it hit a failed machine-config (IIRC this is a known problem):

machine-config                             4.9.28    False   True    False   12m   Cluster not available for [{operator 4.9.28}]

but every other component is correct:

authentication                             4.10.9    True    False   False   23m
baremetal                                  4.10.9    True    False   False   79m
cloud-controller-manager                   4.10.9    True    False   False   83m
cloud-credential                           4.10.9    True    False   False   88m
cluster-autoscaler                         4.10.9    True    False   False   78m
config-operator                            4.10.9    True    False   False   82m
console                                    4.10.9    True    False   False   23m
csi-snapshot-controller                    4.10.9    True    False   False   39m
dns                                        4.10.9    True    False   False   23m
etcd                                       4.10.9    True    False   False   78m
image-registry                             4.10.9    True    False   False   37m
ingress                                    4.10.9    True    False   False   4m9s
insights                                   4.10.9    True    False   False   78m
kube-apiserver                             4.10.9    True    False   False   77m
kube-controller-manager                    4.10.9    True    False   False   77m
kube-scheduler                             4.10.9    True    False   False   77m
kube-storage-version-migrator              4.10.9    True    False   False   82m
machine-api                                4.10.9    True    False   False   75m
machine-approver                           4.10.9    True    False   False   81m
machine-config                             4.9.28    False   True    False   12m   Cluster not available for [{operator 4.9.28}]
marketplace                                4.10.9    True    False   False   79m
monitoring                                 4.10.9    True    False   False   71m
network                                    4.10.9    True    False   False   83m
node-tuning                                4.10.9    True    False   False   37m
openshift-apiserver                        4.10.9    True    False   False   38m
openshift-controller-manager               4.10.9    True    False   False   73m
openshift-samples                          4.10.9    True    False   False   29m
operator-lifecycle-manager                 4.10.9    True    False   False   81m
operator-lifecycle-manager-catalog         4.10.9    True    False   False   81m
operator-lifecycle-manager-packageserver   4.10.9    True    False   False   76m
service-ca                                 4.10.9    True    False   False   82m
storage                                    4.10.9    True    False   False   33m

====

I believe the issue is environmental to the lab (perhaps the disks or networking are slow). Signs seem to point to disk I/O.

1. https://github.com/openshift/library-go/blob/535fc9bdb13be365bce1ce8a14a871ba8de09f0b/pkg/operator/staticpod/controller/missingstaticpod/missing_static_pod_controller.go#L108
2. oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release:4.10.9-x86_64 --force --allow-explicit-upgrade
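For anyone hitting the same state, a rough way to compare the revision the operator wants with the revision the running static pod reports (the jsonpath expressions assume the KubeAPIServer operator status field and the revision label the installer puts on the static pod; verify the names on your cluster):

# the revision the operator is rolling out
oc get kubeapiserver cluster -o jsonpath='{.status.latestAvailableRevision}{"\n"}'
# the revision label carried by the running static pod
oc get pods -n openshift-kube-apiserver -l apiserver=true \
  -o jsonpath='{range .items[*]}{.metadata.name}{" revision="}{.metadata.labels.revision}{"\n"}{end}'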
Right after I posted this comment the cluster completed the upgrade successfully:

authentication                             4.10.9    True    False   False   29m
baremetal                                  4.10.9    True    False   False   85m
cloud-controller-manager                   4.10.9    True    False   False   89m
cloud-credential                           4.10.9    True    False   False   94m
cluster-autoscaler                         4.10.9    True    False   False   84m
config-operator                            4.10.9    True    False   False   88m
console                                    4.10.9    True    False   False   29m
csi-snapshot-controller                    4.10.9    True    False   False   45m
dns                                        4.10.9    True    False   False   29m
etcd                                       4.10.9    True    False   False   84m
image-registry                             4.10.9    True    False   False   43m
ingress                                    4.10.9    True    False   False   10m
insights                                   4.10.9    True    False   False   84m
kube-apiserver                             4.10.9    True    False   False   82m
kube-controller-manager                    4.10.9    True    False   False   83m
kube-scheduler                             4.10.9    True    False   False   83m
kube-storage-version-migrator              4.10.9    True    False   False   88m
machine-api                                4.10.9    True    False   False   81m
machine-approver                           4.10.9    True    False   False   87m
machine-config                             4.10.9    True    False   False   4m38s
marketplace                                4.10.9    True    False   False   85m
monitoring                                 4.10.9    True    False   False   77m
network                                    4.10.9    True    False   False   89m
node-tuning                                4.10.9    True    False   False   43m
openshift-apiserver                        4.10.9    True    False   False   44m
openshift-controller-manager               4.10.9    True    False   False   79m
openshift-samples                          4.10.9    True    False   False   35m
operator-lifecycle-manager                 4.10.9    True    False   False   87m
operator-lifecycle-manager-catalog         4.10.9    True    False   False   87m
operator-lifecycle-manager-packageserver   4.10.9    True    False   False   82m
service-ca                                 4.10.9    True    False   False   88m
storage                                    4.10.9    True    False   False   39m
As I mentioned in the bz description, if I remove the test workload pods, the upgrade continues without any issue. This is only observed with the workload pods, which have some exec probes enabled. I can provide an env for debugging if that helps.
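Since the DU app manifests are not attached here, a rough way to confirm which of the workload pods carry exec probes (the namespace name below is a placeholder for wherever the test app actually runs):

oc get pods -n du-test-app -o json | jq -r '
  .items[] as $p
  | $p.spec.containers[]
  | select(.livenessProbe.exec != null or .readinessProbe.exec != null)
  | "\($p.metadata.name)/\(.name): exec probe(s) present"'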
There is a PR over here to address the issue: https://github.com/openshift/library-go/pull/1347
Moving to POST so I can sneak in components vendoring the change in https://github.com/openshift/library-go/pull/1347
Verified from 4.10.20 to 4.11.0-rc.0. The previous failure point was passed, although the upgrade eventually failed at a later stage; that is tracked in a new bz (bz 2102777).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069