Description of problem:

While trying to verify bug 1972572 by upgrading [IPI BM virtual simulation] 4.7.12 -> 4.8.0-rc.0 -> 4.9.0-0.nightly-2021-06-24-073147, the upgrade is stuck at 75% with multiple errors.

The 4.7 -> 4.8 upgrade passed; the BMHs showed a "provisioned registration error" and were deprovisioning, as expected per bug 1972426. The 4.8 -> 4.9 upgrade is stuck and the worker nodes are NotReady.

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-rc.0   True        True          42h     Working towards 4.9.0-0.nightly-2021-06-24-073147: 509 of 676 done (75% complete)

[kni@provisionhost-0-0 ~]$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.9.0-0.nightly-2021-06-24-073147   False       False         True       41h     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ocp-edge-cluster-0.qe.lab.redhat.com/healthz": dial tcp 192.168.123.10:443: connect: connection refused
ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 2 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
baremetal                                  4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
cloud-credential                           4.9.0-0.nightly-2021-06-24-073147   True        False         False      2d
cluster-autoscaler                         4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
config-operator                            4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
console                                    4.9.0-0.nightly-2021-06-24-073147   False       False         False      41h     RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ocp-edge-cluster-0.qe.lab.redhat.com/health): Get "https://console-openshift-console.apps.ocp-edge-cluster-0.qe.lab.redhat.com/health": dial tcp 192.168.123.10:443: connect: connection refused
csi-snapshot-controller                    4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
dns                                        4.8.0-rc.0                          True        True          False      45h     DNS "default" reports Progressing=True: "Have 3 available node-resolver pods, want 5."
etcd                                       4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
image-registry                             4.9.0-0.nightly-2021-06-24-073147   False       True          True       41h     Available: The deployment does not have available replicas
NodeCADaemonAvailable: The daemon set node-ca has available replicas
ImagePrunerAvailable: Pruner CronJob has been created
ingress                                    4.9.0-0.nightly-2021-06-24-073147   False       True          True       41h     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
insights                                   4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
kube-apiserver                             4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
kube-controller-manager                    4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
kube-scheduler                             4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
kube-storage-version-migrator              4.9.0-0.nightly-2021-06-24-073147   True        False         False      41h
machine-api                                4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
machine-approver                           4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
machine-config                             4.8.0-rc.0                          False       False         True       41h     Cluster not available for 4.8.0-rc.0
marketplace                                4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
monitoring                                 4.9.0-0.nightly-2021-06-24-073147   False       True          True       41h     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network                                    4.8.0-rc.0                          True        True          True       47h     DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2021-06-28T22:08:29Z
DaemonSet "openshift-multus/multus-additional-cni-plugins" rollout is not making progress - last change 2021-06-28T22:08:30Z
DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2021-06-28T22:08:30Z
node-tuning                                4.9.0-0.nightly-2021-06-24-073147   True        False         False      45h
openshift-apiserver                        4.9.0-0.nightly-2021-06-24-073147   True        False         False      41h
openshift-controller-manager               4.9.0-0.nightly-2021-06-24-073147   True        False         False      24h
openshift-samples                          4.9.0-0.nightly-2021-06-24-073147   True        False         False      41h
operator-lifecycle-manager                 4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
operator-lifecycle-manager-catalog         4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
operator-lifecycle-manager-packageserver   4.9.0-0.nightly-2021-06-24-073147   True        False         False      45h
service-ca                                 4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
storage                                    4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h

[kni@provisionhost-0-0 ~]$ oc get nodes
NAME         STATUS     ROLES    AGE   VERSION
master-0-0   Ready      master   47h   v1.21.0-rc.0+120883f
master-0-1   Ready      master   47h   v1.21.0-rc.0+120883f
master-0-2   Ready      master   47h   v1.21.0-rc.0+120883f
worker-0-0   NotReady   worker   47h   v1.21.0-rc.0+120883f
worker-0-1   NotReady   worker   47h   v1.21.0-rc.0+120883f

Version-Release number of selected component (if applicable):
4.7.12 -> 4.8.0-rc.0 -> 4.9.0-0.nightly-2021-06-24-073147

How reproducible:
Always

Steps to Reproduce:
1. Upgrade from 4.7 to 4.8.
2. Upgrade from 4.8 to 4.9.

Actual results:
The upgrade is stuck at 75%.

Expected results:
The upgrade passes and the BMHs stay provisioned.

Additional info:
Will add a link to the must-gather.
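For reference, a minimal sketch of how such an explicit upgrade is typically driven with oc adm upgrade; the release image pullspecs below are placeholders and not necessarily the exact images used in this run:

# 4.7.12 -> 4.8.0-rc.0: the target image is given explicitly since it is not in availableUpdates
[kni@provisionhost-0-0 ~]$ oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release:4.8.0-rc.0-x86_64 --allow-explicit-upgrade

# 4.8.0-rc.0 -> 4.9 nightly: nightly release images are unsigned, so --force is needed to skip signature verification
[kni@provisionhost-0-0 ~]$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-06-24-073147 --allow-explicit-upgrade --force

# watch the rollout
[kni@provisionhost-0-0 ~]$ oc get clusterversion -w

The must-gather mentioned under "Additional info" can be collected with oc adm must-gather once it finishes (or stalls).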
After debugging the problem by connecting to @Ori Michaeli's remote machine:

oc get bmh -n openshift-machine-api
NAME                   STATE                    CONSUMER                                  ONLINE   ERROR
openshift-master-0-0   externally provisioned   ocp-edge-cluster-0-tgn52-master-0         true
openshift-master-0-1   externally provisioned   ocp-edge-cluster-0-tgn52-master-1         true
openshift-master-0-2   externally provisioned   ocp-edge-cluster-0-tgn52-master-2         true
openshift-worker-0-0   provisioned              ocp-edge-cluster-0-tgn52-worker-0-vh4jh   true
openshift-worker-0-1   provisioned              ocp-edge-cluster-0-tgn52-worker-0-mntx2   true

This indicates that the BMH side of the problem is resolved: the BMH resources healed after the deprovisioning state seen on 4.8. However, as stated in the comment above:

oc get nodes
NAME         STATUS     ROLES    AGE   VERSION
master-0-0   Ready      master   47h   v1.21.0-rc.0+120883f
master-0-1   Ready      master   47h   v1.21.0-rc.0+120883f
master-0-2   Ready      master   47h   v1.21.0-rc.0+120883f
worker-0-0   NotReady   worker   47h   v1.21.0-rc.0+120883f
worker-0-1   NotReady   worker   47h   v1.21.0-rc.0+120883f

The workers have lost connectivity: kubelet stopped working and even ssh returns a "connection refused" error. A screenshot of the worker nodes' consoles, taken by connecting via virt-manager, is attached.
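For anyone hitting the same state, a rough sketch of the checks that can be run when the workers go NotReady and both kubelet and ssh are unreachable; the node and libvirt domain names below are examples for this environment, adjust as needed:

# From the provisioning host: node conditions show the last kubelet heartbeat
[kni@provisionhost-0-0 ~]$ oc describe node worker-0-0

# Kubelet logs over the API only work while kubelet still answers; in this state it fails, but it is worth trying first
[kni@provisionhost-0-0 ~]$ oc adm node-logs worker-0-0 -u kubelet

# On the hypervisor hosting the virtual BM nodes: attach to the VM console directly
# (use whatever domain name libvirt reports for the worker VM)
$ virsh list --all
$ virsh console <worker-0-0-domain>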
Upgrading through a 4.8 build that includes the fix for bug 1972426 works: 4.7.12 -> 4.8.0-rc.1 -> 4.9.0-0.nightly-2021-06-28-221420 passed.
Verified with 4.7.24 -> 4.8.6 -> 4.9.0-fc.0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759