Bug 2097872 - OCP 4.9.39 upgrade fails on KVM
Summary: OCP 4.9.39 upgrade fails on KVM
Keywords:
Status: CLOSED DUPLICATE of bug 2098099
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Multi-Arch
Version: 4.9
Hardware: s390x
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Jeremy Poulin
QA Contact: Douglas Slavens
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-16 19:14 UTC by Philip Chan
Modified: 2022-06-23 13:41 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-06-23 13:41:36 UTC
Target Upstream Version:
Embargoed:


Attachments
must-gather logs (9.97 MB, application/gzip)
2022-06-16 19:14 UTC, Philip Chan


Links
Red Hat Issue Tracker MULTIARCH-2612 (last updated 2022-06-16 19:20:39 UTC)

Description Philip Chan 2022-06-16 19:14:26 UTC
Created attachment 1890738 [details]
must-gather logs

Description of problem:
Performing an upgrade from OCP 4.9.38 to 4.9.39 does not succeed on KVM; several operators remain in a degraded state. I have been able to reproduce this consistently across different machine types (z14/z15/z16). The cluster is using networkType OVNKubernetes. At least one master and one worker node are stuck in the Ready,SchedulingDisabled state.

This is the state of the cluster while it is stuck upgrading to OCP 4.9.39:

[root@bastion ~]# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.9.39    True        False         True       128m    APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
baremetal                                  4.9.39    True        False         False      142m
cloud-controller-manager                   4.9.39    True        False         False      146m
cloud-credential                           4.9.39    True        False         False      147m
cluster-autoscaler                         4.9.39    True        False         False      141m
config-operator                            4.9.39    True        False         False      143m
console                                    4.9.39    True        False         False      26m
csi-snapshot-controller                    4.9.39    True        False         False      142m
dns                                        4.9.39    True        True          True       141m    DNS default is degraded
etcd                                       4.9.39    True        False         False      141m
image-registry                             4.9.39    True        False         False      134m
ingress                                    4.9.39    True        False         True       133m    The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-fb8ffdb9c-kt6nl" cannot be scheduled: 0/5 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) were unschedulable. Make sure you have sufficient worker nodes.)
insights                                   4.9.39    True        False         False      136m
kube-apiserver                             4.9.39    True        False         False      138m
kube-controller-manager                    4.9.39    True        False         False      141m
kube-scheduler                             4.9.39    True        False         False      141m
kube-storage-version-migrator              4.9.39    True        False         False      26m
machine-api                                4.9.39    True        False         False      142m
machine-approver                           4.9.39    True        False         False      142m
machine-config                             4.9.38    True        True          True       138m    Unable to apply 4.9.39: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-c80c4f300f00a7b29f4f6b0222563fc5 expected bd2557a122fbc1831e8c91c28c114382f2d5d44f has d8ba2a02349b4de92a553720101d6b46a6f814b6: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-01b30ccf8739761b484c402dbb6a1dad, retrying
marketplace                                4.9.39    True        False         False      142m
monitoring                                 4.9.39    False       True          True       9m5s    Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network                                    4.9.39    True        True          True       143m    DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-29tvt is in CrashLoopBackOff State...
node-tuning                                4.9.39    True        False         False      135m
openshift-apiserver                        4.9.39    True        False         True       135m    APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
openshift-controller-manager               4.9.39    True        False         False      141m
openshift-samples                          4.9.39    True        False         False      49m
operator-lifecycle-manager                 4.9.39    True        False         False      142m
operator-lifecycle-manager-catalog         4.9.39    True        False         False      142m
operator-lifecycle-manager-packageserver   4.9.39    True        False         False      138m
service-ca                                 4.9.39    True        False         False      143m
storage                                    4.9.39    True        False         False      143m
[root@bastion ~]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.38    True        True          79m     Unable to apply 4.9.39: wait has exceeded 40 minutes for these operators: openshift-apiserver
[root@bastion ~]# oc get nodes
NAME                                                    STATUS                     ROLES    AGE    VERSION
master-0.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com   Ready                      master   146m   v1.22.8+f34b40c
master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com   Ready,SchedulingDisabled   master   146m   v1.22.8+f34b40c
master-2.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com   Ready                      master   146m   v1.22.8+f34b40c
worker-0.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com   Ready,SchedulingDisabled   worker   136m   v1.22.8+f34b40c
worker-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com   Ready                      worker   138m   v1.22.8+f34b40c
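
For reference, the machine config pool and node rollout state can be inspected further with the commands below (a sketch; this output was not captured for this report):

oc get mcp
oc describe mcp master
oc describe mcp worker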

It seems to have reached the point where the nodes need to reboot after the updates have been applied, but they never proceed with the restart. In this case, I restarted master-1, and the following error shows up under RHCOS:

[systemd]
Failed Units: 1
  ovs-configuration.service

When I check on the status:

[core@master-1 ~]$ systemctl status ovs-configuration.service
● ovs-configuration.service - Configures OVS with proper host networking configuration
   Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2022-06-16 18:40:40 UTC; 4min 44s ago
 Main PID: 1255 (code=exited, status=1/FAILURE)
      CPU: 524ms

Jun 16 18:40:40 master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1255]: 192.168.79.0/24 dev enc1 proto kernel scope link src 192.168.79.22 metric 100
Jun 16 18:40:40 master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1255]: + ip -6 route show
Jun 16 18:40:40 master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1255]: ::1 dev lo proto kernel metric 256 pref medium
Jun 16 18:40:40 master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1255]: fe80::/64 dev enc1 proto kernel metric 100 pref medium
Jun 16 18:40:40 master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1255]: fe80::/64 dev genev_sys_6081 proto kernel metric 256 pref medium
Jun 16 18:40:40 master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com configure-ovs.sh[1255]: + exit 1
Jun 16 18:40:40 master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com systemd[1]: ovs-configuration.service: Main process exited, code=exited, status=1/FAILURE
Jun 16 18:40:40 master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com systemd[1]: ovs-configuration.service: Failed with result 'exit-code'.
Jun 16 18:40:40 master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com systemd[1]: Failed to start Configures OVS with proper host networking configuration.
Jun 16 18:40:40 master-1.pok-236-apr-qemu.ocptest.pok.stglabs.ibm.com systemd[1]: ovs-configuration.service: Consumed 524ms CPU time
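
To dig further into why configure-ovs.sh exited, the full unit and NetworkManager logs from the node can be pulled with something like the following (a sketch; output not attached here):

journalctl -b -u ovs-configuration.service --no-pager
journalctl -b -u NetworkManager --no-pager
nmcli connection show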

Could this be related to current open bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=2095415
https://bugzilla.redhat.com/show_bug.cgi?id=2095264

The must-gather summary is included below:
ClusterID: ca7cf702-df23-4aae-87e0-1df54b120b07
ClusterVersion: Updating to "4.9.39" from "4.9.38" for 2 hours: Working towards 4.9.39: 206 of 738 done (27% complete), waiting up to 40 minutes on openshift-apiserver
ClusterOperators:
	clusteroperator/authentication is degraded because APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()
OAuthServerDeploymentDegraded: 1 of 3 requested instances are unavailable for oauth-openshift.openshift-authentication ()
	clusteroperator/dns is degraded because DNS default is degraded
	clusteroperator/ingress is degraded because The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-fb8ffdb9c-kt6nl" cannot be scheduled: 0/5 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) were unschedulable. Make sure you have sufficient worker nodes.), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)
	clusteroperator/machine-config is degraded because Unable to apply 4.9.39: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-c80c4f300f00a7b29f4f6b0222563fc5 expected bd2557a122fbc1831e8c91c28c114382f2d5d44f has d8ba2a02349b4de92a553720101d6b46a6f814b6: 1 (ready 1) out of 3 nodes are updating to latest configuration rendered-master-01b30ccf8739761b484c402dbb6a1dad, retrying
	clusteroperator/monitoring is not available (Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.) because Failed to rollout the stack. Error: updating prometheus-adapter: reconciling PrometheusAdapter Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-adapter: got 1 unavailable replicas
updating thanos querier: reconciling Thanos Querier Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/thanos-querier: got 1 unavailable replicas
	clusteroperator/network is degraded because DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-29tvt is in CrashLoopBackOff State
DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-66jwb is in CrashLoopBackOff State
DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2022-06-16T18:52:31Z
	clusteroperator/openshift-apiserver is degraded because APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()


error: gather did not start for pod must-gather-fklz4: timed out waiting for the condition
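
Since the must-gather pod itself could not be scheduled (likely because nodes are cordoned), a more targeted gather can be attempted instead; a sketch, with placeholder --dest-dir paths:

oc adm inspect clusteroperator/network clusteroperator/machine-config --dest-dir=./inspect-output
oc adm must-gather --dest-dir=./must-gather-retry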

Version-Release number of selected component (if applicable):
1. OCP 4.9.39
2. RHCOS 4.9.0

How reproducible:
Consistently reproducible.

Steps to Reproduce:
1. Start with a working 4.9.38 OCP Cluster
2. Perform an upgrade to 4.9.39 via CLI.  For example:
oc adm upgrade --allow-upgrade-with-warnings --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:ce4ada0ef9ea875118138b39260a277cfe08758e24c21b973d6751fe16b9c284

3. Observe the cluster operators upgrade to 4.9.39.  
4. When the machine-config operator begins to upgrade, the following operators go into a degraded state:
machine-config
dns
ingress
monitoring
network
openshift-apiserver

5. Observe that the nodes remain in the “Ready,SchedulingDisabled” state (see the monitoring sketch below).
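
A minimal way to monitor the upgrade while it runs (a sketch; the 60-second interval is arbitrary):

watch -n 60 'oc get clusterversion; oc get co; oc get nodes; oc get mcp'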

Actual results:
OCP upgrade from 4.9.38 to 4.9.39 fails.

Expected results:
Upgrading to OCP 4.9.39 succeeds.

Additional info:

Comment 2 Dan Li 2022-06-17 17:27:16 UTC
4.9.39 will be dropped per Ramona's latest z-stream program email, so I don't think this bug should be treated as High severity or a blocker.

Comment 3 Jeremy Poulin 2022-06-17 17:58:12 UTC
I'm 90% sure this is a duplicate of the bug noted in the description.
https://bugzilla.redhat.com/show_bug.cgi?id=2095264

Before I close this though, can I get confirmation that the newest builds apply the upgrade without hitting this issue? (That bug is marked verified, so the latest builds should contain the fix.)

Comment 4 Philip Chan 2022-06-17 23:15:27 UTC
Hi Jeremy,

I checked for the latest available 4.9.0 nightly build and found just the one:
4.9.0-0.nightly-s390x-2022-06-15-082850

This was built after 4.9.39 (4.9.0-0.nightly-s390x-2022-06-14-044237), which hit the issue. I installed a new OCP 4.9.38 cluster and tried upgrading to this latest nightly, and it failed the same way. Was there a newer OCP 4.9 nightly that I could try that has the fix?
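
(For reference, the component commits contained in a given payload can be listed with the sketch below, where <nightly-release-pullspec> is a placeholder for the nightly's full pullspec:)

oc adm release info --commits <nightly-release-pullspec> | grep machine-config-operator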

Thanks,
Phil

Comment 5 Jeremy Poulin 2022-06-21 17:23:19 UTC
Hi Phil - yes - I've jumped the gun on this. Please sit tight for the next build of 4.9. :)

Comment 6 Prashanth Sundararaman 2022-06-23 13:40:21 UTC
This has been fixed through https://bugzilla.redhat.com/show_bug.cgi?id=2098099. The release team is working on promoting 4.9.40, which should have the fix.

Comment 7 Prashanth Sundararaman 2022-06-23 13:41:36 UTC

*** This bug has been marked as a duplicate of bug 2098099 ***

