Description of problem:

While trying to verify bug 1972572 by upgrading [IPI BM virtual simulation] 4.7.12 -> 4.8.0-rc.0 -> 4.9.0-0.nightly-2021-06-24-073147, the upgrade is stuck at 75% with multiple errors.

The 4.7 -> 4.8 upgrade passed; the BMHs showed a "provisioned registration error" and were deprovisioning, as expected per bug 1972426. The 4.8 -> 4.9 upgrade is stuck and the worker nodes are NotReady.

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-rc.0   True        True          42h     Working towards 4.9.0-0.nightly-2021-06-24-073147: 509 of 676 done (75% complete)

[kni@provisionhost-0-0 ~]$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.9.0-0.nightly-2021-06-24-073147   False       False         True       41h     OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ocp-edge-cluster-0.qe.lab.redhat.com/healthz": dial tcp 192.168.123.10:443: connect: connection refused
ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 2 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
baremetal                                  4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
cloud-credential                           4.9.0-0.nightly-2021-06-24-073147   True        False         False      2d
cluster-autoscaler                         4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
config-operator                            4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
console                                    4.9.0-0.nightly-2021-06-24-073147   False       False         False      41h     RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ocp-edge-cluster-0.qe.lab.redhat.com/health): Get "https://console-openshift-console.apps.ocp-edge-cluster-0.qe.lab.redhat.com/health": dial tcp 192.168.123.10:443: connect: connection refused
csi-snapshot-controller                    4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
dns                                        4.8.0-rc.0                          True        True          False      45h     DNS "default" reports Progressing=True: "Have 3 available node-resolver pods, want 5."
etcd                                       4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
image-registry                             4.9.0-0.nightly-2021-06-24-073147   False       True          True       41h     Available: The deployment does not have available replicas
NodeCADaemonAvailable: The daemon set node-ca has available replicas
ImagePrunerAvailable: Pruner CronJob has been created
ingress                                    4.9.0-0.nightly-2021-06-24-073147   False       True          True       41h     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
insights                                   4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
kube-apiserver                             4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
kube-controller-manager                    4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
kube-scheduler                             4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
kube-storage-version-migrator              4.9.0-0.nightly-2021-06-24-073147   True        False         False      41h
machine-api                                4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
machine-approver                           4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
machine-config                             4.8.0-rc.0                          False       False         True       41h     Cluster not available for 4.8.0-rc.0
marketplace                                4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
monitoring                                 4.9.0-0.nightly-2021-06-24-073147   False       True          True       41h     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network                                    4.8.0-rc.0                          True        True          True       47h     DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2021-06-28T22:08:29Z
DaemonSet "openshift-multus/multus-additional-cni-plugins" rollout is not making progress - last change 2021-06-28T22:08:30Z
DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2021-06-28T22:08:30Z
node-tuning                                4.9.0-0.nightly-2021-06-24-073147   True        False         False      45h
openshift-apiserver                        4.9.0-0.nightly-2021-06-24-073147   True        False         False      41h
openshift-controller-manager               4.9.0-0.nightly-2021-06-24-073147   True        False         False      24h
openshift-samples                          4.9.0-0.nightly-2021-06-24-073147   True        False         False      41h
operator-lifecycle-manager                 4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
operator-lifecycle-manager-catalog         4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
operator-lifecycle-manager-packageserver   4.9.0-0.nightly-2021-06-24-073147   True        False         False      45h
service-ca                                 4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h
storage                                    4.9.0-0.nightly-2021-06-24-073147   True        False         False      47h

[kni@provisionhost-0-0 ~]$ oc get nodes
NAME         STATUS     ROLES    AGE   VERSION
master-0-0   Ready      master   47h   v1.21.0-rc.0+120883f
master-0-1   Ready      master   47h   v1.21.0-rc.0+120883f
master-0-2   Ready      master   47h   v1.21.0-rc.0+120883f
worker-0-0   NotReady   worker   47h   v1.21.0-rc.0+120883f
worker-0-1   NotReady   worker   47h   v1.21.0-rc.0+120883f

Version-Release number of selected component (if applicable):
4.7.12 -> 4.8.0-rc.0 -> 4.9.0-0.nightly-2021-06-24-073147

How reproducible:
Always

Steps to Reproduce:
1. Upgrade from 4.7 to 4.8.
2. Upgrade from 4.8 to 4.9.

Actual results:
The upgrade is stuck at 75%.

Expected results:
The upgrade passes and the BMHs stay provisioned.

Additional info:
Will add a link to the must-gather.
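For reference, a minimal sketch of how such an explicit upgrade is typically driven with oc adm upgrade; the release image pullspecs below are placeholders and not necessarily the exact images used in this run:

# 4.7.12 -> 4.8.0-rc.0: the target image is given explicitly since it is not in availableUpdates
[kni@provisionhost-0-0 ~]$ oc adm upgrade --to-image=quay.io/openshift-release-dev/ocp-release:4.8.0-rc.0-x86_64 --allow-explicit-upgrade

# 4.8.0-rc.0 -> 4.9 nightly: nightly release images are unsigned, so --force is needed to skip signature verification
[kni@provisionhost-0-0 ~]$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-06-24-073147 --allow-explicit-upgrade --force

# watch the rollout
[kni@provisionhost-0-0 ~]$ oc get clusterversion -w

The must-gather mentioned under "Additional info" can be collected with oc adm must-gather once it finishes (or stalls).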
After debugging the problem by connecting to @Ori Michaeli's remote machine:

oc get bmh -n openshift-machine-api
NAME                   STATE                    CONSUMER                                  ONLINE   ERROR
openshift-master-0-0   externally provisioned   ocp-edge-cluster-0-tgn52-master-0         true
openshift-master-0-1   externally provisioned   ocp-edge-cluster-0-tgn52-master-1         true
openshift-master-0-2   externally provisioned   ocp-edge-cluster-0-tgn52-master-2         true
openshift-worker-0-0   provisioned              ocp-edge-cluster-0-tgn52-worker-0-vh4jh   true
openshift-worker-0-1   provisioned              ocp-edge-cluster-0-tgn52-worker-0-mntx2   true

This indicates that the BMH side of the problem is resolved: the BMH resources healed after the deprovisioning state seen on 4.8. However, as stated in the comment above:

oc get nodes
NAME         STATUS     ROLES    AGE   VERSION
master-0-0   Ready      master   47h   v1.21.0-rc.0+120883f
master-0-1   Ready      master   47h   v1.21.0-rc.0+120883f
master-0-2   Ready      master   47h   v1.21.0-rc.0+120883f
worker-0-0   NotReady   worker   47h   v1.21.0-rc.0+120883f
worker-0-1   NotReady   worker   47h   v1.21.0-rc.0+120883f

The workers have lost connectivity: kubelet stopped working and even ssh returns a "connection refused" error. A screenshot of the worker nodes' consoles, taken by connecting via virt-manager, is attached.
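For anyone hitting the same state, a rough sketch of the checks that can be run when the workers go NotReady and both kubelet and ssh are unreachable; the node and libvirt domain names below are examples for this environment, adjust as needed:

# From the provisioning host: node conditions show the last kubelet heartbeat
[kni@provisionhost-0-0 ~]$ oc describe node worker-0-0

# Kubelet logs over the API only work while kubelet still answers; in this state it fails, but it is worth trying first
[kni@provisionhost-0-0 ~]$ oc adm node-logs worker-0-0 -u kubelet

# On the hypervisor hosting the virtual BM nodes: attach to the VM console directly
# (use whatever domain name libvirt reports for the worker VM)
$ virsh list --all
$ virsh console <worker-0-0-domain>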
Upgrading through a 4.8 build that includes the fix for bug 1972426 works: 4.7.12 -> 4.8.0-rc.1 -> 4.9.0-0.nightly-2021-06-28-221420 passed.
Verified with 4.7.24 -> 4.8.6 -> 4.9.0-fc.0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759