Created attachment 1890738 [details]
must-gather logs

Description of problem:

Performing an upgrade from OCP 4.9.38 to 4.9.39 does not succeed on KVM. Several operators remain in a degraded state. I have been able to reproduce this consistently across different machine types (z14/z15/z16). The cluster is using networkType OVNKubernetes. At least one master and one worker node are in Ready,SchedulingDisabled state.

This is the state of the cluster when it is stuck upgrading to OCP 4.9.39:

[root@bastion ~]# oc get co
NAME                                        VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                              4.9.39    True        False         True       128m    APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
baremetal                                   4.9.39    True        False         False      142m
cloud-controller-manager                    4.9.39    True        False         False      146m
cloud-credential                            4.9.39    True        False         False      147m
cluster-autoscaler                          4.9.39    True        False         False      141m
config-operator                             4.9.39    True        False         False      143m
console                                     4.9.39    True        False         False      26m
csi-snapshot-controller                     4.9.39    True        False         False      142m
dns                                         4.9.39    True        True          True       141m    DNS default is degraded
etcd                                        4.9.39    True        False         False      141m
image-registry                              4.9.39    True        False         False      134m
ingress                                     4.9.39    True        False         True       133m    The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-fb8ffdb9c-kt6nl" cannot be scheduled: 0/5 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) were unschedulable. Make sure you have sufficient worker nodes.)
insights                                    4.9.39    True        False         False      136m
kube-apiserver                              4.9.39    True        False         False      138m
kube-controller-manager                     4.9.39    True        False         False      141m
kube-scheduler                              4.9.39    True        False         False      141m
kube-storage-version-migrator               4.9.39    True        False         False      26m
machine-api                                 4.9.39    True        False         False      142m
machine-approver                            4.9.39    True        False         False      142m
machine-config                              4.9.38    True        True          True       138m    Unable to apply 4.9.39: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-c80c4f300f00a7b29f4f6b0222563fc5 expected bd2557a122fbc1831e8c91c28c114382f2d5d44f has d8ba2a02349b4de92a553720101d6b46a6f814b6: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-01b30ccf8739761b484c402dbb6a1dad, retrying
marketplace                                 4.9.39    True        False         False      142m
monitoring                                  4.9.39    False       True          True       9m5s    Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network                                     4.9.39    True        True          True       143m    DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-29tvt is in CrashLoopBackOff State...
node-tuning                                 4.9.39    True        False         False      135m
openshift-apiserver                         4.9.39    True        False         True       135m    APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
openshift-controller-manager                4.9.39    True        False         False      141m
openshift-samples                           4.9.39    True        False         False      49m
operator-lifecycle-manager                  4.9.39    True        False         False      142m
operator-lifecycle-manager-catalog          4.9.39    True        False         False      142m
operator-lifecycle-manager-packageserver    4.9.39    True        False         False      138m
service-ca                                  4.9.39    True        False         False      143m
storage                                     4.9.39    True        False         False      143m

[root@bastion ~]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.38    True        True          79m     Unable to apply 4.9.39: wait has exceeded 40 minutes for these operators: openshift-apiserver

[root@bastion ~]# oc get nodes
NAME                                                    STATUS                     ROLES    AGE    VERSION
master-0.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com   Ready                      master   146m   v1.22.8+f34b40c
master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com   Ready,SchedulingDisabled   master   146m   v1.22.8+f34b40c
master-2.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com   Ready                      master   146m   v1.22.8+f34b40c
worker-0.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com   Ready,SchedulingDisabled   worker   136m   v1.22.8+f34b40c
worker-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com   Ready                      worker   138m   v1.22.8+f34b40c

The upgrade seems to have reached the point where the nodes need to reboot after the updates have been applied, but they never proceed with the restart. In this case, I restarted master-1, and the following error shows up under RHCOS:

[systemd]
Failed Units: 1
  ovs-configuration.service

When I check on the status:

[core@master-1 ~]$ systemctl status ovs-configuration.service
● ovs-configuration.service - Configures OVS with proper host networking configuration
   Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2022-06-16 18:40:40 UTC; 4min 44s ago
 Main PID: 1255 (code=exited, status=1/FAILURE)
      CPU: 524ms

Jun 16 18:40:40 master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1255]: 192.168.79.0/24 dev enc1 proto kernel scope link src 192.168.79.22 metric 100
Jun 16 18:40:40 master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1255]: + ip -6 route show
Jun 16 18:40:40 master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1255]: ::1 dev lo proto kernel metric 256 pref medium
Jun 16 18:40:40 master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1255]: fe80::/64 dev enc1 proto kernel metric 100 pref medium
Jun 16 18:40:40 master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1255]: fe80::/64 dev genev_sys_6081 proto kernel metric 256 pref medium
Jun 16 18:40:40 master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1255]: + exit 1
Jun 16 18:40:40 master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com systemd[1]: ovs-configuration.service: Main process exited, code=exited, status=1/FAILURE
Jun 16 18:40:40 master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com systemd[1]: ovs-configuration.service: Failed with result 'exit-code'.
Jun 16 18:40:40 master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com systemd[1]: Failed to start Configures OVS with proper host networking configuration.
Jun 16 18:40:40 master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com systemd[1]: ovs-configuration.service: Consumed 524ms CPU time

Could this be related to these currently open Bugzillas?
https://bugzilla.redhat.com/show_bug.cgi?id=2095415
https://bugzilla.redhat.com/show_bug.cgi?id=2095264

The must-gather summary is included below:

ClusterID: ca7cf702-df23-4aae-87e0-1df54b120b07
ClusterVersion: Updating to "4.9.39" from "4.9.38" for 2 hours: Working towards 4.9.39: 206 of 738 done (27% complete), waiting up to 40 minutes on openshift-apiserver
ClusterOperators:
  clusteroperator/authentication is degraded because APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()
  OAuthServerDeploymentDegraded: 1 of 3 requested instances are unavailable for oauth-openshift.openshift-authentication ()
  clusteroperator/dns is degraded because DNS default is degraded
  clusteroperator/ingress is degraded because The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-fb8ffdb9c-kt6nl" cannot be scheduled: 0/5 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) were unschedulable. Make sure you have sufficient worker nodes.), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)
  clusteroperator/machine-config is degraded because Unable to apply 4.9.39: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-c80c4f300f00a7b29f4f6b0222563fc5 expected bd2557a122fbc1831e8c91c28c114382f2d5d44f has d8ba2a02349b4de92a553720101d6b46a6f814b6: 1 (ready 1) out of 3 nodes are updating to latest configuration rendered-master-01b30ccf8739761b484c402dbb6a1dad, retrying
  clusteroperator/monitoring is not available (Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.) because Failed to rollout the stack. Error: updating prometheus-adapter: reconciling PrometheusAdapter Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-adapter: got 1 unavailable replicas
  updating thanos querier: reconciling Thanos Querier Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/thanos-querier: got 1 unavailable replicas
  clusteroperator/network is degraded because DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-29tvt is in CrashLoopBackOff State
  DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-66jwb is in CrashLoopBackOff State
  DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2022-06-16T18:52:31Z
  clusteroperator/openshift-apiserver is degraded because APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()

error: gather did not start for pod must-gather-fklz4: timed out waiting for the condition
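Note the gather itself failed to start, possibly because two nodes are unschedulable in this state. If a complete must-gather is needed from a cluster stuck like this, something along these lines should help (a minimal sketch; --timeout support and accepted values may vary slightly by oc version, and the pod/namespace names are placeholders):

# Give the gather pod more time to schedule and run than the default:
oc adm must-gather --timeout=60m
# If the gather pod still never starts, find it and check why it could not schedule:
oc get pods --all-namespaces | grep must-gather
oc describe pod <must-gather-pod-name> -n <its-namespace>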
Version-Release number of selected component (if applicable):
1. OCP 4.9.39
2. RHCOS 4.9.0

How reproducible:

Consistently reproducible.

Steps to Reproduce:
1. Start with a working 4.9.38 OCP cluster.
2. Perform an upgrade to 4.9.39 via the CLI. For example:
   oc adm upgrade --allow-upgrade-with-warnings --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:ce4ada0ef9ea875118138b39260a277cfe08758e24c21b973d6751fe16b9c284
3. Observe the cluster operators upgrade to 4.9.39.
4. When the machine-config operator begins to upgrade, the following operators will begin to go into a degraded state: machine-config, dns, ingress, monitoring, network, openshift-apiserver.
5. Observe that the state of the nodes remains "Ready,SchedulingDisabled".

Actual results:

OCP upgrade from 4.9.38 to 4.9.39 fails.

Expected results:

Upgrading to OCP 4.9.39 succeeds.

Additional info:
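For anyone triaging a similar hang, a short checklist that should narrow down where the rollout is stuck (a sketch, not an exact transcript of what was run here; the pod and node names are taken from the outputs above, and the ovnkube-node container name is an assumption):

# Which MachineConfigPool is stuck, and on which rendered config:
oc get mcp
oc describe mcp master
# Per-node current/desired machineconfig annotations and MCD state:
oc describe node master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com | grep machineconfiguration
# On the node that failed to come back up, the full log of the failing unit:
journalctl -b -u ovs-configuration.service
# Logs from one of the crashlooping OVN pods (container name assumed to be ovnkube-node):
oc -n openshift-ovn-kubernetes logs ovnkube-node-29tvt -c ovnkube-node --previous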
4.9.39 will be dropped per Ramona's latest z-stream program email, so I don't think this bug should be High severity or a Blocker.
I'm 90% sure this is a duplicate of the bug noted in the description: https://bugzilla.redhat.com/show_bug.cgi?id=2095264

Before I close this though, can I get confirmation that the newest builds complete the upgrade without hitting this issue? (That bug is marked verified, so the latest builds should contain the fix.)
Hi Jeremy,

I checked for the latest available 4.9.0 nightly build and found just the one: 4.9.0-0.nightly-s390x-2022-06-15-082850

This was built after 4.9.39 (4.9.0-0.nightly-s390x-2022-06-14-044237), which hit the issue. I installed a new OCP 4.9.38 cluster, tried upgrading to this latest nightly, and it failed the same way. Is there a newer OCP 4.9 nightly that I could try that has the fix?

Thanks,
Phil
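P.S. In case it helps with verification: one way to check whether a given nightly payload actually contains the fix is to compare the commits baked into the release image against the fix's PR. A sketch (the pullspec is a placeholder for whichever nightly is being checked):

# List the component repos/commits that went into a release payload:
oc adm release info <nightly-release-pullspec> --commits | grep machine-config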
Hi Phil - yes - I've jumped the gun on this. Please sit tight for the next build of 4.9. :)
This has been fixed through https://bugzilla.redhat.com/show_bug.cgi?id=2098099. The release team is working on promoting 4.9.40, which should have the fix.
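Once 4.9.40 is promoted, the retest should be the same upgrade path as in the description; for example (the image digest is a placeholder until the 4.9.40 payload is published):

# If 4.9.40 shows up in the cluster's current channel:
oc adm upgrade --to=4.9.40
# Or by explicit release image:
oc adm upgrade --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:<4.9.40-digest>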
*** This bug has been marked as a duplicate of bug 2098099 ***