Description of problem:

We have observed that editing machine/MachineSet specs does not work. The changes are not applied, and in some cases the edit leads the machine to go into Failed status.

# Test 1: An update to the MachineSet spec is not applied, specifically when modifying the Cinder AZ.

As a day-2 operation, create a new MachineSet (please check new_machineset.yaml) with a specific Nova and Cinder AZ pairing that will create an extra worker (ostest-pwk7t-worker-1):

$ oc get machineset -A
NAMESPACE               NAME                    DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   ostest-pwk7t-worker-0   3         3         3       3           4h3m
openshift-machine-api   ostest-pwk7t-worker-1   1         1         1       1           39m

$ oc get nodes
NAME                          STATUS   ROLES    AGE     VERSION
ostest-pwk7t-master-0         Ready    master   4h1m    v1.21.0-rc.0+41625cd
ostest-pwk7t-master-1         Ready    master   4h1m    v1.21.0-rc.0+41625cd
ostest-pwk7t-master-2         Ready    master   4h1m    v1.21.0-rc.0+41625cd
ostest-pwk7t-worker-0-g7hdp   Ready    worker   3h39m   v1.21.0-rc.0+41625cd
ostest-pwk7t-worker-0-nw85w   Ready    worker   3h37m   v1.21.0-rc.0+41625cd
ostest-pwk7t-worker-0-rzsb5   Ready    worker   3h35m   v1.21.0-rc.0+41625cd
ostest-pwk7t-worker-1-t2lrm   Ready    worker   12m     v1.21.0-rc.0+41625cd

$ openstack volume show ostest-pwk7t-worker-1-t2lrm -c availability_zone
+-------------------+------------+
| Field             | Value      |
+-------------------+------------+
| availability_zone | cinder_AZ1 |
+-------------------+------------+

Once ready, modify the MachineSet (oc edit), setting a different Cinder AZ (cinder_AZ0). The machine is not modified at all:

$ oc get machineset -n openshift-machine-api ostest-pwk7t-worker-1 -o json | jq .spec.template.spec.providerSpec.value.rootVolume
{
  "availabilityZone": "cinder_AZ0",
  "deviceType": "",
  "diskSize": 25,
  "sourceType": "image",
  "sourceUUID": "ostest-pwk7t-rhcos",
  "volumeType": "tripleo"
}

$ oc get machine -n openshift-machine-api ostest-pwk7t-worker-1-t2lrm -o json | jq .spec.providerSpec.value.rootVolume
{
  "availabilityZone": "cinder_AZ1",
  "deviceType": "",
  "diskSize": 25,
  "sourceType": "image",
  "sourceUUID": "ostest-pwk7t-rhcos",
  "volumeType": "tripleo"
}

And the cluster does not reflect the change (the worker machine is not redeployed):

$ oc get machines -A
NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-pwk7t-master-0         Running   m4.xlarge   regionOne   AZ-0   4h17m
openshift-machine-api   ostest-pwk7t-master-1         Running   m4.xlarge   regionOne   AZ-0   4h17m
openshift-machine-api   ostest-pwk7t-master-2         Running   m4.xlarge   regionOne   AZ-0   4h17m
openshift-machine-api   ostest-pwk7t-worker-0-g7hdp   Running   m4.xlarge   regionOne   AZ-0   4h10m
openshift-machine-api   ostest-pwk7t-worker-0-nw85w   Running   m4.xlarge   regionOne   AZ-0   4h10m
openshift-machine-api   ostest-pwk7t-worker-0-rzsb5   Running   m4.xlarge   regionOne   AZ-0   4h10m
openshift-machine-api   ostest-pwk7t-worker-1-t2lrm   Running   m4.xlarge   regionOne   AZ-0   37m

As a workaround, setting the replicas to 0 and then back to 1 on the MachineSet (to trigger destruction/creation of the VM) recreates the VM with the updated spec:

$ oc get machineset -n openshift-machine-api ostest-pwk7t-worker-1 -o json | jq .spec.template.spec.providerSpec.value.rootVolume
{
  "availabilityZone": "cinder_AZ0",
  "deviceType": "",
  "diskSize": 25,
  "sourceType": "image",
  "sourceUUID": "ostest-pwk7t-rhcos",
  "volumeType": "tripleo"
}

$ oc get machine -n openshift-machine-api ostest-pwk7t-worker-1-s2vhw -o json | jq .spec.providerSpec.value.rootVolume
{
  "availabilityZone": "cinder_AZ0",
  "deviceType": "",
  "diskSize": 25,
  "sourceType": "image",
  "sourceUUID": "ostest-pwk7t-rhcos",
  "volumeType": "tripleo"
}

# Test 2: Editing a machine spec to update the Cinder AZ moves the machine to 'Failed' status.

Once an extra machine resource (please check new_machine.yaml) using compute zone AZ-0 and volume zone cinder_AZ1 is created, modifying its spec (setting cinder_AZ0) moves the machine to Failed status:

(shiftstack) [stack@undercloud-0 ~]$ oc get machine -n openshift-machine-api ostest-pwk7t-worker-2 -o json | jq .spec.providerSpec.value.rootVolume
{
  "availabilityZone": "cinder_AZ0",
  "deviceType": "",
  "diskSize": 25,
  "sourceType": "image",
  "sourceUUID": "ostest-pwk7t-rhcos",
  "volumeType": "tripleo"
}

(shiftstack) [stack@undercloud-0 ~]$ oc get machines -A
NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-pwk7t-master-0         Running   m4.xlarge   regionOne   AZ-0   19h
openshift-machine-api   ostest-pwk7t-master-1         Running   m4.xlarge   regionOne   AZ-0   19h
openshift-machine-api   ostest-pwk7t-master-2         Running   m4.xlarge   regionOne   AZ-0   19h
openshift-machine-api   ostest-pwk7t-worker-0-rzsb5   Running   m4.xlarge   regionOne   AZ-0   19h
openshift-machine-api   ostest-pwk7t-worker-1-s2vhw   Running   m4.xlarge   regionOne   AZ-0   15h
openshift-machine-api   ostest-pwk7t-worker-2         Failed    m4.xlarge   regionOne   AZ-0   15h

As a workaround, deleting the machine resource and applying it again with the updated info recreates the VM with the updated spec.

Version-Release number of selected component (if applicable): 4.8.0-0.nightly-2021-05-13-222446

How reproducible: Always.

Steps to Reproduce:
1. Install an OCP cluster.
2. Create a new MachineSet with replicas: 1, adding a worker with a specific Nova and Cinder AZ.
3. Wait until the worker is added to the cluster.
4. Edit the MachineSet spec.
5. Observe that the machine is not modified and the MachineSet and machine specs do not match.
6. Create a new machine with a specific Nova and Cinder AZ.
7. Wait until the worker is added to the cluster.
8. Edit the machine spec.
9. Observe that the machine goes into 'Failed' status.

Actual results: machine spec changes are not reflected on the system.

Expected results: The worker should be destroyed and recreated with the new attributes, or modifying specs should not be allowed on these resources.

Additional info:
- new_machine.yaml and new_machineset.yaml attached
- must-gather: http://file.rdu.redhat.com/rlobillo/must_gather_machine_changes.tgz
Created attachment 1784768 [details] new_machineset.yaml
Created attachment 1784769 [details] new_machine.yaml
This sounds like it's working as designed. Machines do not mutate the VMs they create; they are responsible only for creation and deletion. If you edit a Machine spec, nothing should happen. If you edit a MachineSet spec, the changes will apply only to new Machines that are created. Changing information on the Machine spec after the VM has been created can cause Machines to go Failed. I suspect in this case that the change you've made means the Machine controller can no longer find the VM and, as such, marks it as Failed as if it had been deleted on the OSP side.
Thanks Joel. I edited a machine spec and it was moved to 'Failed' status. According to your comment, that should not happen, right?
What exactly did you edit? There are certain conditions where this is a valid action. For example, on AWS, if you change the region, the controller can no longer find the instance on the cloud provider side. In that case, it is designed to switch into Failed because the assumption is that the instance has been removed on the cloud provider side; this is unrecoverable in our API.
I changed .spec.providerSpec.value.rootVolume.availabilityZone from 'cinder_AZ1' to 'cinder_AZ0':

spec:
  metadata: {}
  providerSpec:
    value:
      apiVersion: openstackproviderconfig.openshift.io/v1alpha1
      availabilityZone: AZ-0
      cloudName: openstack
      cloudsSecret:
        name: openstack-cloud-credentials
        namespace: openshift-machine-api
      flavor: m4.xlarge
      image: ""
      kind: OpenstackProviderSpec
      metadata:
        creationTimestamp: null
      networks:
      - filter: {}
        subnets:
        - filter:
            name: ostest-pwk7t-nodes
            tags: openshiftClusterID=ostest-pwk7t
      rootVolume:
        availabilityZone: cinder_AZ1   <--
        deviceType: ""
        diskSize: 25
        sourceType: image
        sourceUUID: ostest-pwk7t-rhcos
        volumeType: tripleo
      securityGroups:
      - filter: {}
        name: ostest-pwk7t-worker
      serverMetadata:
        Name: ostest-pwk7t-worker
        openshiftClusterID: ostest-pwk7t
      tags:
      - openshiftClusterID=ostest-pwk7t
      trunk: true
      userDataSecret:
        name: worker-user-data
I wouldn't expect this change to move the machine to Failed. My next step would be to check whether there are any logs in the machine controller that indicate why it moved this machine to Failed.
Hello Joel. You will find the logs attached to the case, inside the must-gather. This is what I found in the machine-controller logs:

2021-05-19T07:33:00.757467824Z I0519 07:33:00.757409       1 controller.go:174] ostest-pwk7t-worker-2: reconciling Machine
2021-05-19T07:33:02.399787013Z I0519 07:33:02.399720       1 controller.go:297] ostest-pwk7t-worker-2: reconciling machine triggers idempotent update
2021-05-19T07:33:03.792588682Z I0519 07:33:03.792542       1 actuator.go:381] re-creating machine ostest-pwk7t-worker-2 for update.
2021-05-19T07:33:07.193971419Z I0519 07:33:07.193900       1 actuator.go:150] Skipped creating a VM that already exists.
2021-05-19T07:33:23.761788118Z I0519 07:33:23.761708       1 actuator.go:406] Successfully updated machine ostest-pwk7t-worker-2
2021-05-19T07:33:45.760076450Z I0519 07:33:45.759104       1 controller.go:174] ostest-pwk7t-worker-2: reconciling Machine
2021-05-19T07:33:52.172110491Z I0519 07:33:52.172034       1 controller.go:482] ostest-pwk7t-worker-2: going into phase "Failed"
2021-05-19T07:33:52.272137502Z I0519 07:33:52.272080       1 controller.go:174] ostest-pwk7t-worker-2: reconciling Machine
2021-05-19T07:33:52.272137502Z W0519 07:33:52.272118       1 controller.go:275] ostest-pwk7t-worker-2: machine has gone "Failed" phase. It won't reconcile
Could you try again to update the machine and let one of the machines go into the failed state? In this case, I'm hoping that it should say on the Machine status itself exactly what went wrong.

Looking at the code path, it looks like the actuator Exists() returned false without returning an error, in which case it went into Failed correctly.

I noticed some other errors as it tried to create the new machine; perhaps these might be related:

2021-05-19T07:39:12.916812569Z E0519 07:39:12.916753       1 controller.go:281] ostest-pwk7t-worker-2: failed to check if machine exists: Error checking if instance exists (machine/actuator.go 346):
2021-05-19T07:39:12.916812569Z Error getting a new instance service from the machine (machine/actuator.go 467): Failed to authenticate provider client: Get "https://overcloud.redhat.local:13000/": x509: certificate signed by unknown authority
2021-05-19T07:39:13.156719620Z E0519 07:39:13.156662       1 controller.go:302] controller-runtime/manager/controller/machine_controller "msg"="Reconciler error" "error"="Error checking if instance exists (machine/actuator.go 346): \nError getting a new instance service from the machine (machine/actuator.go 467): Failed to authenticate provider client: Get \"https://overcloud.redhat.local:13000/\": x509: certificate signed by unknown authority" "name"="ostest-pwk7t-worker-2" "namespace"="openshift-machine-api"

Looks like it might have ended up trying to talk to a different OpenStack cluster?
Hello Joel. Just retried the test and got the logs below:

I0526 15:39:10.337611       1 controller.go:174] ostest-mjg7d-worker-3: reconciling Machine
I0526 15:39:13.109382       1 controller.go:297] ostest-mjg7d-worker-3: reconciling machine triggers idempotent update
I0526 15:39:14.959132       1 actuator.go:381] re-creating machine ostest-mjg7d-worker-3 for update.
I0526 15:39:18.975570       1 actuator.go:150] Skipped creating a VM that already exists.
I0526 15:39:26.930516       1 actuator.go:406] Successfully updated machine ostest-mjg7d-worker-3
I0526 15:40:02.253042       1 controller.go:174] ostest-mjg7d-worker-3: reconciling Machine
I0526 15:40:03.551557       1 controller.go:482] ostest-mjg7d-worker-3: going into phase "Failed"
I0526 15:40:03.599887       1 controller.go:174] ostest-mjg7d-worker-3: reconciling Machine
I0526 15:40:04.833449       1 controller.go:482] ostest-mjg7d-worker-3: going into phase "Failed"
I0526 15:40:04.873813       1 controller.go:174] ostest-mjg7d-worker-3: reconciling Machine
W0526 15:40:04.873849       1 controller.go:275] ostest-mjg7d-worker-3: machine has gone "Failed" phase. It won't reconcile
I0526 15:40:04.877639       1 controller.go:174] ostest-mjg7d-worker-3: reconciling Machine
W0526 15:40:04.877678       1 controller.go:275] ostest-mjg7d-worker-3: machine has gone "Failed" phase. It won't reconcile

OCP version: 4.8.0-0.nightly-2021-05-25-072938
@rlobillo Could you please create and share a must-gather from the cluster that you just created? I would like to see the content of the Machine resources and tie the details to the logs from the machine controller.
Note: we need to look into the must-gather in comment 12 to work out why we're suddenly getting SSL errors connecting to OpenStack after changing the Cinder AZ.

The expected behaviour here would be: ideally, a webhook would have prevented the change in the first place. In the absence of that, no change to the Machine, and a log message in machine-api explaining why the Machine won't be updated.
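The webhook suggested here would, in essence, reject any Machine update that touches the providerSpec. A minimal sketch of that validation rule, with hypothetical names (the real machine-api admission webhook is considerably more involved):

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// validateMachineUpdate sketches the validating-webhook rule proposed above:
// a Machine update is rejected whenever it changes the (immutable)
// providerSpec. Comparing raw bytes is illustrative only; a real webhook
// would compare decoded objects field by field.
func validateMachineUpdate(oldProviderSpec, newProviderSpec []byte) error {
	if !bytes.Equal(oldProviderSpec, newProviderSpec) {
		return errors.New("providerSpec is immutable; delete and recreate the Machine instead")
	}
	return nil
}

func main() {
	oldSpec := []byte(`{"rootVolume":{"availabilityZone":"cinder_AZ1"}}`)
	newSpec := []byte(`{"rootVolume":{"availabilityZone":"cinder_AZ0"}}`)
	fmt.Println(validateMachineUpdate(oldSpec, newSpec)) // rejected
	fmt.Println(validateMachineUpdate(oldSpec, oldSpec)) // allowed
}
```

With such a rule in place, the oc edit in the reproducer would have been refused at admission time rather than silently accepted.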
status:
  addresses:
  - address: 10.196.1.3
    type: InternalIP
  - address: ostest-mjg7d-worker-3
    type: Hostname
  - address: ostest-mjg7d-worker-3
    type: InternalDNS
  conditions:
  - lastTransitionTime: "2021-05-26T15:40:04Z"
    message: Instance not found on provider
    reason: InstanceMissing
    severity: Warning
    status: "False"
    type: InstanceExists
  errorMessage: Can't find created instance.
  lastUpdated: "2021-05-26T15:40:04Z"
  nodeRef:
    kind: Node
    name: ostest-mjg7d-worker-3
    uid: 4e214485-4511-4242-9997-5ecf21a58a24
  phase: Failed
(In reply to Matthew Booth from comment #13)
> Note: need to look into the must-gather in comment 12 to work out why we're
> suddenly getting SSL errors connecting to OpenStack after changing the
> Cinder AZ.

The SSL errors seem to correlate with etcd errors, and they quickly go away. I'm not inclined to go down that rabbit hole. I suspect some weird failure mode results in an incomplete cloud config.
I agree with Joel's assessment above: it seems that the way we got here was that actuator.Exists() returned false without an error. As no error was returned, the machine controller assumes the instance is actually gone, and doesn't make any further attempts to reconcile the machine. It's difficult to guess what's actually going on, but I'm currently suspicious of this code in instanceExists:

https://github.com/openshift/cluster-api-provider-openstack/blob/4f41902b58d95a97d303ff1ee509c0c864250cd5/pkg/cloud/openstack/machine/actuator.go#L650-L651

When we check if an instance exists we're listing instances and filtering by name, image, and flavor. Image and flavor seem superfluous, and I wonder if image specifically is causing a problem. I note that we're setting providerSpec.image to "" above, and setting the actual image on the root volume. I wonder if Nova is setting instance.image to the image of the root volume, and subsequently failing to match an empty image. This is pure speculation!
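To make the speculation concrete, the filtering described would behave roughly like this sketch. The types and names are hypothetical (the real code builds a Nova server-list query), and the follow-up testing below did not bear the hypothesis out:

```go
package main

import "fmt"

// server is a stand-in for what the Nova API reports about an instance.
type server struct {
	name, image, flavor string
}

// instanceMatches sketches the instanceExists filter discussed above:
// matching on name, image and flavor together means that if Nova reported
// a root-volume image while the providerSpec carries image "", the instance
// would never be found, and Exists() would return false with no error.
func instanceMatches(s server, name, image, flavor string) bool {
	return s.name == name && s.image == image && s.flavor == flavor
}

func main() {
	// providerSpec has image "" (boot-from-volume); suppose Nova reported
	// the root volume's image instead:
	s := server{name: "ostest-pwk7t-worker-2", image: "ostest-pwk7t-rhcos", flavor: "m4.xlarge"}
	fmt.Println(instanceMatches(s, "ostest-pwk7t-worker-2", "", "m4.xlarge"))
}
```

Filtering on name alone (or, better, on the stored provider ID) would avoid this entire class of false negative.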
(In reply to Matthew Booth from comment #16)
> When we check if an instance exists we're listing instances and filtering by
> name, image, and flavor. Image and flavor seem superfluous, and I wonder if
> image specifically is causing a problem. I note that we're setting
> providerspec.image to "" above, and setting the actual image on the root
> volume. I wonder if Nova is setting instance.image to the image of the root
> volume, and subsequently failing to match an empty image. This is pure
> speculation!

This isn't borne out in testing. At least on 16.1, instance.image is unset for BFV, and searching on image="" returns the instance.
Possible reproducer. My dev environment doesn't have multiple volume AZs, but I edited the *size* of a worker machine. The result was that the machine was marked Failed after the instance was deleted. The question now is: why was the instance deleted?
I0621 16:30:01.221279       1 controller.go:174] cluster-dsal-qmrmv-worker-0-gl7qf: reconciling Machine
I0621 16:30:03.863134       1 controller.go:297] cluster-dsal-qmrmv-worker-0-gl7qf: reconciling machine triggers idempotent update
I0621 16:30:05.290815       1 actuator.go:381] re-creating machine cluster-dsal-qmrmv-worker-0-gl7qf for update.
I0621 16:30:09.816911       1 actuator.go:150] Skipped creating a VM that already exists.
I0621 16:30:25.242183       1 actuator.go:406] Successfully updated machine cluster-dsal-qmrmv-worker-0-gl7qf
I0621 16:30:53.921183       1 controller.go:174] cluster-dsal-qmrmv-worker-0-gl7qf: reconciling Machine
I0621 16:30:55.696962       1 controller.go:482] cluster-dsal-qmrmv-worker-0-gl7qf: going into phase "Failed"
I0621 16:30:55.752839       1 controller.go:174] cluster-dsal-qmrmv-worker-0-gl7qf: reconciling Machine
I0621 16:30:56.675470       1 controller.go:482] cluster-dsal-qmrmv-worker-0-gl7qf: going into phase "Failed"
I0621 16:30:56.718122       1 controller.go:174] cluster-dsal-qmrmv-worker-0-gl7qf: reconciling Machine
W0621 16:30:56.718366       1 controller.go:275] cluster-dsal-qmrmv-worker-0-gl7qf: machine has gone "Failed" phase. It won't reconcile
@jspeed The instance is deleted because the machine controller called actuator.Update(). Given the earlier discussion about machines being immutable, what is actuator.Update() supposed to do?
Another problem in this code is here:

https://github.com/openshift/cluster-api-provider-openstack/blob/79fa6d04adae132752880c526cc6f002717054e8/pkg/cloud/openstack/machine/actuator.go#L382-L388

The Create() is skipped because:

I0621 16:30:09.816911       1 actuator.go:150] Skipped creating a VM that already exists.

and we then proceed to delete. Even if we really wanted to do this, this would still be weird.
Update() has a couple of purposes; it is effectively there to update the state of the Machine (MAPI Machine) to reflect the state of the VM (or OpenStack instance in this case). It should not be recreating the instance on the OpenStack side, nor should it be making any changes to the instance; instances are considered to be immutable.

Update is supposed to:
- Look up the instance on the cloud provider
- Reconcile any load balancer attachments if appropriate
- Reconcile tags if appropriate (e.g. the Kubernetes cluster tag that denotes the instance is owned by the cluster)
- Set the providerID on the spec of the Machine
- Ensure status information for the instance is up to date on the Machine status, e.g. things like the IP addresses

We consider LB attachment and tags not to be part of the specification of the instance itself, and therefore these are treated as mutable, but things like disk spec changes or image spec changes are definitely not supported.
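Under those rules, an Update implementation would look roughly like the skeleton below. The names and types are hypothetical; this is a sketch of the contract described above, not the actual CAPO code.

```go
package main

import "fmt"

// machineStatus is a hypothetical stand-in for the Machine's recorded state.
type machineStatus struct {
	providerID string
	addresses  []string
}

// reconcileUpdate sketches the Update() contract described above: it never
// recreates or mutates the instance itself; it only refreshes mutable
// bookkeeping (provider ID, addresses, and, in the real thing, tags and LB
// attachments) from what the cloud reports.
func reconcileUpdate(instanceID string, instanceAddrs []string, st *machineStatus) {
	// 1. The caller has looked the instance up on the cloud provider;
	//    instanceID/instanceAddrs stand in for that lookup's result.
	// 2. LB attachments and cluster tags would be reconciled here (mutable).
	// 3. Record the provider ID once; it must never change afterwards,
	//    since there is a strict 1:1 mapping between Machine and instance.
	if st.providerID == "" {
		st.providerID = instanceID
	}
	// 4. Refresh status information such as IP addresses.
	st.addresses = instanceAddrs
	// Deliberately absent: any delete/recreate path, any change to disks,
	// images, flavors, or availability zones.
}

func main() {
	st := &machineStatus{}
	reconcileUpdate("openstack:///abc-123", []string{"10.196.1.3"}, st)
	fmt.Println(st.providerID, st.addresses)
}
```

The bug in the OpenStack actuator was precisely that its Update() contained a delete/recreate path, which this contract forbids.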
Also from Joel via Slack: One of the stipulations of MAPI is that the providerID should never change, ie, there's a 1:1 mapping between Machine and instance, you're currently not doing that.
The expected behaviour is that CAPO should ignore (the unsupported action of) editing a Machine.
Removing the Triaged keyword because: * the QE automation assessment (flag qe_test_coverage) is missing
https://github.com/openshift/cluster-api-provider-openstack/pull/210 will prevent CAPO from attempting to make any changes to an existing server. Changes to an existing machine will be safely ignored.
The fix for this landed: https://github.com/openshift/cluster-api-provider-openstack/pull/210
Verified on 4.10.0-0.nightly-2021-12-14-083101 on top of OSP 16.1 (RHOS-16.1-RHEL-8-20210903.n.0).

On a running cluster, a new machine is created using a rootVolume in the 'cinderAZ0' AZ:

$ oc get nodes && oc get machines -n openshift-machine-api
NAME                          STATUS   ROLES    AGE   VERSION
ostest-kmzqk-master-0         Ready    master   2d    v1.22.1+6859754
ostest-kmzqk-master-1         Ready    master   2d    v1.22.1+6859754
ostest-kmzqk-master-2         Ready    master   2d    v1.22.1+6859754
ostest-kmzqk-new-worker       Ready    worker   22h   v1.22.1+6859754
ostest-kmzqk-worker-0-dh2rw   Ready    worker   47h   v1.22.1+6859754
ostest-kmzqk-worker-0-pqq7p   Ready    worker   47h   v1.22.1+6859754

NAME                          PHASE     TYPE        REGION      ZONE   AGE
ostest-kmzqk-master-0         Running                                  2d
ostest-kmzqk-master-1         Running                                  2d
ostest-kmzqk-master-2         Running                                  2d
ostest-kmzqk-new-worker       Running   m4.xlarge   regionOne   nova   22h
ostest-kmzqk-worker-0-dh2rw   Running   m4.xlarge   regionOne   nova   47h
ostest-kmzqk-worker-0-pqq7p   Running   m4.xlarge   regionOne   nova   47h

$ oc get -n openshift-machine-api machine/ostest-kmzqk-new-worker -o json | jq .spec.providerSpec.value.rootVolume
{
  "availabilityZone": "cinderAZ0",
  "deviceType": "",
  "diskSize": 25,
  "sourceType": "image",
  "sourceUUID": "ostest-kmzqk-rhcos",
  "volumeType": "tripleo"
}

With the fix, editing the spec no longer moves the machine to Failed:

$ oc patch -n openshift-machine-api machine/ostest-kmzqk-new-worker --type merge -p '{"spec":{"providerSpec":{"value":{"rootVolume":{"availabilityZone":"cinderAZ1"}}}}}'
machine.machine.openshift.io/ostest-kmzqk-new-worker patched

$ oc get -n openshift-machine-api machine/ostest-kmzqk-new-worker -o json | jq .spec.providerSpec.value.rootVolume
{
  "availabilityZone": "cinderAZ1",
  "deviceType": "",
  "diskSize": 25,
  "sourceType": "image",
  "sourceUUID": "ostest-kmzqk-rhcos",
  "volumeType": "tripleo"
}

$ oc logs -n openshift-machine-api -l k8s-app=controller -c machine-controller | tail
[...]
I1216 15:38:20.886539       1 controller.go:175] ostest-kmzqk-new-worker: reconciling Machine
I1216 15:38:23.511158       1 controller.go:298] ostest-kmzqk-new-worker: reconciling machine triggers idempotent update

The event is simply ignored.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056