Bug 1951713 - [OCP-OSP] After changing image in machine object it enters in Failed - Can't find created instance
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Emilien Macchi
QA Contact: rlobillo
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-04-20 19:21 UTC by Javier Coscia
Modified: 2021-07-27 23:02 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:02:18 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift installer pull 4959 | 0 | None | open | Bug 1951713: docs/openstack/known-issues: add section for machine spec edits | 2021-05-26 02:19:05 UTC
Red Hat Knowledge Base (Solution) | 5998451 | 0 | None | None | None | 2021-04-27 18:14:26 UTC
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 23:02:45 UTC

Description Javier Coscia 2021-04-20 19:21:41 UTC
Description of problem:

OCP 4.5.16 was upgraded to 4.6.21 without issues. After the upgrade, however, events in the web console stated that the master nodes couldn't be updated.

The annotation `machine.openshift.io/instance-state` shows ERROR for all the master nodes.

The master nodes were initially installed using a Glance image named `rhcos-lab`, which no longer exists. The rest of the nodes were configured to use another RHCOS image, which does exist in Glance, and those are OK.



The machine-api-controllers pod logs contain the following messages:

~~~
2021-04-15T19:07:15.604191855Z E0415 19:07:15.603326       1 actuator.go:550] Machine error ocpprod-xxxxx-master-0: no image with the name rhcos-lab could be found
2021-04-15T19:07:15.604191855Z E0415 19:07:15.603362       1 controller.go:280] ocpprod-xxxxx-master-0: error updating machine: no image with the name rhcos-lab could be found
2021-04-15T19:07:19.441795240Z E0415 19:07:19.441446       1 actuator.go:550] Machine error ocpprod-xxxxx-master-1: no image with the name rhcos-lab could be found
2021-04-15T19:07:19.441795240Z E0415 19:07:19.441481       1 controller.go:280] ocpprod-xxxxx-master-1: error updating machine: no image with the name rhcos-lab could be found
2021-04-15T19:07:23.577725170Z E0415 19:07:23.577676       1 actuator.go:550] Machine error ocpprod-xxxxx-master-2: no image with the name rhcos-lab could be found
2021-04-15T19:07:23.577937392Z E0415 19:07:23.577906       1 controller.go:280] ocpprod-xxxxx-master-2: error updating machine: no image with the name rhcos-lab could be found
~~~
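To summarize which machines are affected, the log excerpt above can be scanned for the missing-image error. This is only a minimal sketch: the regular expression is derived from the lines quoted in this report, and the function name `machines_with_missing_image` is made up for illustration. Other releases may word the message differently.

```python
import re

# Pattern derived from the actuator.go lines quoted above (an assumption:
# the exact wording may differ in other releases).
MISSING_IMAGE = re.compile(
    r"actuator\.go:\d+\] Machine error (?P<machine>\S+): "
    r"no image with the name (?P<image>\S+) could be found"
)


def machines_with_missing_image(log_text):
    """Return {machine_name: image_name} for each missing-image error found."""
    result = {}
    for line in log_text.splitlines():
        match = MISSING_IMAGE.search(line)
        if match:
            result[match.group("machine")] = match.group("image")
    return result
```

The controller.go lines repeat the same error and are skipped on purpose, so each machine is reported once.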


Comparing the master machine objects with the others shows that they differ in the `image` field:

worker-xxx-xxxxx machine definition:
-------------------------------------
~~~
spec:
  metadata: {}
  providerSpec:
    value:
      apiVersion: openstackproviderconfig.openshift.io/v1alpha1
      cloudName: openstack
      cloudsSecret:
        name: openstack-cloud-credentials
        namespace: openshift-machine-api
      flavor: openshift
      image: rhcos-4.5                    <<<<<<<<<<<<<<<<<< Here
      kind: OpenstackProviderSpec
~~~


Master machine definition:
---------------------------
~~~
spec:
  metadata: {}
  providerSpec:
    value:
      apiVersion: openstackproviderconfig.openshift.io/v1alpha1
      cloudName: openstack
      cloudsSecret:
        name: openstack-cloud-credentials
        namespace: openshift-machine-api
      flavor: openshift
      image: rhcos-lab                    <<<<<<<<<<<<<<<<<< Here
      kind: OpenstackProviderSpec
~~~
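The comparison above can be automated as a pre-upgrade check. The following is a rough sketch, assuming the Machine objects have already been loaded as dictionaries (for example from the `items` list of `oc get machines -n openshift-machine-api -o json`) and that the Glance image names were collected separately. The function name `machines_with_unknown_image` is hypothetical.

```python
def machines_with_unknown_image(machines, available_images):
    """Return the names of Machine objects whose image is not in Glance.

    machines: list of Machine objects as plain dicts (assumed input shape,
              matching the spec.providerSpec.value layout shown above).
    available_images: set of image names that currently exist in Glance.
    """
    missing = []
    for machine in machines:
        image = machine["spec"]["providerSpec"]["value"].get("image")
        if image not in available_images:
            missing.append(machine["metadata"]["name"])
    return missing
```

With the two definitions above and only `rhcos-4.5` present in Glance, the check would flag the master machines and leave the workers alone.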





As a test, the image name in the machine resource was changed to the same name used by the rest of the nodes. This triggered the deletion process for the master-2 instance, and it remained `Not Ready` with `phase: Failed`:

~~~
  errorMessage: Can't find created instance.
  lastUpdated: '2021-04-19T13:08:59Z'
  nodeRef:
    kind: Node
    name: ocpprod-xxxxx-master-2
    uid: e375026b-3bb0-4b33-98d4-0745cb9644f0
  phase: Failed
~~~
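A status like the one above can be picked out across all machines with a small filter. This is a sketch under the same assumption as before: Machine objects are available as plain dictionaries, with `phase` and `errorMessage` under `status` as shown. The helper name `failed_machines` is made up for illustration.

```python
def failed_machines(machines):
    """Return {machine_name: errorMessage} for machines in the Failed phase."""
    failed = {}
    for machine in machines:
        status = machine.get("status", {})
        if status.get("phase") == "Failed":
            # errorMessage may be absent; fall back to an empty string.
            failed[machine["metadata"]["name"]] = status.get("errorMessage", "")
    return failed
```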


The VM was gone from OpenStack.






Version-Release number of selected component (if applicable):

- OpenShift 4.6.21 (upgraded from 4.5.16)
- OpenStack 16.1
- Install method: IPI

How reproducible:

Not sure

Steps to Reproduce:
1. Install IPI OCP on OSP
2. Remove the glance image used for master nodes
3. Change image in machine object to point to an existing image in glance

Actual results:

- VM instance disappeared in OSP.
- Master node remains in NotReady status because it cannot be reached.

Expected results:

- Master VM instance to be created in OSP and join the cluster again with corresponding RHCOS image.

Additional info:

Comment 4 Mike Fedosin 2021-04-21 12:59:56 UTC
I believe the issue happens because of this: https://github.com/openshift/cluster-api-provider-openstack/blob/release-4.6/pkg/cloud/openstack/machine/actuator.go#L376-L381
In short, CAPO can't handle changes in the master machine spec; any such change always leads to this error.
MAO reacts to this by setting the Failed phase: https://github.com/openshift/machine-api-operator/blob/master/pkg/controller/machine/controller.go#L342-L344

Comment 18 rlobillo 2021-06-09 09:44:49 UTC
Verified on OSP16.1 (RHOS-16.1-RHEL-8-20210506.n.1) with the versions below:

openshift_puddle: 4.8.0-0.nightly-2021-05-27-234332 
openshift_puddle: 4.7.0-0.nightly-2021-06-04-012633
openshift_puddle: 4.6.0-0.nightly-2021-05-27-163935
openshift_puddle: 4.5.0-0.nightly-2021-05-12-204808

The 'Replacing an unhealthy etcd member' procedure is included in our CI and works fine on the releases mentioned above.

Furthermore, the known-issue documentation is already merged and available at:
https://github.com/openshift/installer/blob/master/docs/user/openstack/known-issues.md#problems-when-changing-the-machine-spec-for-master-node

Comment 21 errata-xmlrpc 2021-07-27 23:02:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

