Description of problem:

OCP 4.5.16 was upgraded to 4.6.21 without issues. After the upgrade, however, events in the web console stated that the master nodes couldn't be updated. The annotation `machine.openshift.io/instance-state` shows ERROR for all the master nodes.

The master nodes were originally installed from a glance image named `rhcos-lab`, which doesn't exist anymore. The rest of the nodes were configured to use another RHCOS image, which exists in glance, and those are OK.

The machine-api-controllers pod logs contain the following messages:

~~~
2021-04-15T19:07:15.604191855Z E0415 19:07:15.603326 1 actuator.go:550] Machine error ocpprod-xxxxx-master-0: no image with the name rhcos-lab could be found
2021-04-15T19:07:15.604191855Z E0415 19:07:15.603362 1 controller.go:280] ocpprod-xxxxx-master-0: error updating machine: no image with the name rhcos-lab could be found
2021-04-15T19:07:19.441795240Z E0415 19:07:19.441446 1 actuator.go:550] Machine error ocpprod-xxxxx-master-1: no image with the name rhcos-lab could be found
2021-04-15T19:07:19.441795240Z E0415 19:07:19.441481 1 controller.go:280] ocpprod-xxxxx-master-1: error updating machine: no image with the name rhcos-lab could be found
2021-04-15T19:07:23.577725170Z E0415 19:07:23.577676 1 actuator.go:550] Machine error ocpprod-xxxxx-master-2: no image with the name rhcos-lab could be found
2021-04-15T19:07:23.577937392Z E0415 19:07:23.577906 1 controller.go:280] ocpprod-xxxxx-master-2: error updating machine: no image with the name rhcos-lab could be found
~~~

Comparing the master machine objects with the others shows the difference.

worker-xxx-xxxxx machine definition:
-------------------------------------
~~~
spec:
  metadata: {}
  providerSpec:
    value:
      apiVersion: openstackproviderconfig.openshift.io/v1alpha1
      cloudName: openstack
      cloudsSecret:
        name: openstack-cloud-credentials
        namespace: openshift-machine-api
      flavor: openshift
      image: rhcos-4.5   <<<<<<<<<<<<<<<<<< Here
      kind: OpenstackProviderSpec
~~~

Master machine definition:
---------------------------
~~~
spec:
  metadata: {}
  providerSpec:
    value:
      apiVersion: openstackproviderconfig.openshift.io/v1alpha1
      cloudName: openstack
      cloudsSecret:
        name: openstack-cloud-credentials
        namespace: openshift-machine-api
      flavor: openshift
      image: rhcos-lab   <<<<<<<<<<<<<<<<<< Here
      kind: OpenstackProviderSpec
~~~

As a test, the image name in the master-2 machine resource was changed to the name used by the rest of the nodes. This triggered the deletion process for the master-2 instance, and the node remained in `Not Ready` with `phase: Failed`:

~~~
errorMessage: Can't find created instance.
lastUpdated: '2021-04-19T13:08:59Z'
nodeRef:
  kind: Node
  name: ocpprod-xxxxx-master-2
  uid: e375026b-3bb0-4b33-98d4-0745cb9644f0
phase: Failed
~~~

The VM was gone from OpenStack.

Version-Release number of selected component (if applicable):
- OpenShift 4.6.21 (upgraded from 4.5.16)
- OpenStack 16.1
- Install method: IPI

How reproducible:
Not sure

Steps to Reproduce:
1. Install IPI OCP on OSP
2. Remove the glance image used for master nodes
3. Change image in machine object to point to an existing image in glance

Actual results:
- VM instance disappeared in OSP.
- Master node remains in NotReady status because it cannot be reached.

Expected results:
- Master VM instance to be created in OSP and join the cluster again with the corresponding RHCOS image.

Additional info:
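For anyone wanting to check whether a cluster is in the same state before touching any machine spec, here is a minimal sketch in Python. It is not part of any OpenShift tooling. It assumes the machine objects have been exported as JSON (for example with `oc get machines -n openshift-machine-api -o json`) and that the list of glance image names was collected separately (for example with `openstack image list`). The function name `machines_with_missing_images` and the worker name in the sample data are hypothetical; the field path follows the `spec.providerSpec.value.image` layout shown in the definitions above.

~~~python
def machines_with_missing_images(machines, glance_image_names):
    """Return (machine name, image name) pairs whose image is not in glance.

    `machines` is a list of Machine objects as plain dicts; `glance_image_names`
    is any collection of image names known to exist in glance.
    """
    known = set(glance_image_names)
    missing = []
    for machine in machines:
        name = machine.get("metadata", {}).get("name", "<unnamed>")
        value = machine.get("spec", {}).get("providerSpec", {}).get("value", {})
        image = value.get("image")
        # Only report machines that name an image glance does not have.
        if image and image not in known:
            missing.append((name, image))
    return missing


# Sample data modelled on this report (the worker name is made up).
sample_machines = [
    {
        "metadata": {"name": "ocpprod-xxxxx-master-0"},
        "spec": {"providerSpec": {"value": {"image": "rhcos-lab"}}},
    },
    {
        "metadata": {"name": "ocpprod-xxxxx-worker-0"},
        "spec": {"providerSpec": {"value": {"image": "rhcos-4.5"}}},
    },
]

result = machines_with_missing_images(sample_machines, ["rhcos-4.5"])
for machine_name, image_name in result:
    print(f"{machine_name}: image {image_name!r} not found in glance")
~~~

The check is read-only on purpose: as this report shows, editing the image in a master machine object is what triggered the instance deletion.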
I believe the issue happens because of this:
https://github.com/openshift/cluster-api-provider-openstack/blob/release-4.6/pkg/cloud/openstack/machine/actuator.go#L376-L381

In short, CAPO can't handle changes in the master machine spec; any such change always leads to the error above.

MAO reacts to this error by setting the Failed phase:
https://github.com/openshift/machine-api-operator/blob/master/pkg/controller/machine/controller.go#L342-L344
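To see which machines have already been moved to the Failed phase, a small sketch along these lines may help. It is written in Python against exported machine JSON (for example from `oc get machines -n openshift-machine-api -o json`) and is only an illustration, not code from CAPO or MAO. The function name `failed_machines` is hypothetical; it assumes `phase` and `errorMessage` sit under `status`, as in the status excerpt quoted in the description.

~~~python
def failed_machines(machines):
    """Return (machine name, error message) pairs for machines in phase Failed."""
    failed = []
    for machine in machines:
        status = machine.get("status", {})
        if status.get("phase") == "Failed":
            name = machine.get("metadata", {}).get("name", "<unnamed>")
            failed.append((name, status.get("errorMessage", "")))
    return failed


# Sample data modelled on this report; the "Running" entry is illustrative.
sample = [
    {
        "metadata": {"name": "ocpprod-xxxxx-master-2"},
        "status": {
            "phase": "Failed",
            "errorMessage": "Can't find created instance.",
        },
    },
    {
        "metadata": {"name": "ocpprod-xxxxx-master-0"},
        "status": {"phase": "Running"},
    },
]

failed = failed_machines(sample)
for machine_name, message in failed:
    print(f"{machine_name}: {message}")
~~~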
Verified on OSP16.1 (RHOS-16.1-RHEL-8-20210506.n.1) with the versions below:

openshift_puddle: 4.8.0-0.nightly-2021-05-27-234332
openshift_puddle: 4.7.0-0.nightly-2021-06-04-012633
openshift_puddle: 4.6.0-0.nightly-2021-05-27-163935
openshift_puddle: 4.5.0-0.nightly-2021-05-12-204808

The 'Replacing an unhealthy etcd member' procedure is included in our CI and works fine on the releases listed above.

Furthermore, the known-issue documentation is already merged and available at:
https://github.com/openshift/installer/blob/master/docs/user/openstack/known-issues.md#problems-when-changing-the-machine-spec-for-master-node
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438