Description of problem:

We have observed that editing machine/MachineSet specs does not work. The changes are not applied, and in some cases the edit leads the machine to go into Failed status.

# Test 1: An update to the MachineSet spec is not applied, specifically when modifying the Cinder AZ.

As a day-2 operation, create a new MachineSet (please check new_machineset.yaml) with a specific Nova and Cinder AZ pairing that will create an extra worker (ostest-pwk7t-worker-1):

$ oc get machineset -A
NAMESPACE               NAME                    DESIRED   CURRENT   READY   AVAILABLE   AGE
openshift-machine-api   ostest-pwk7t-worker-0   3         3         3       3           4h3m
openshift-machine-api   ostest-pwk7t-worker-1   1         1         1       1           39m

$ oc get nodes
NAME                          STATUS   ROLES    AGE     VERSION
ostest-pwk7t-master-0         Ready    master   4h1m    v1.21.0-rc.0+41625cd
ostest-pwk7t-master-1         Ready    master   4h1m    v1.21.0-rc.0+41625cd
ostest-pwk7t-master-2         Ready    master   4h1m    v1.21.0-rc.0+41625cd
ostest-pwk7t-worker-0-g7hdp   Ready    worker   3h39m   v1.21.0-rc.0+41625cd
ostest-pwk7t-worker-0-nw85w   Ready    worker   3h37m   v1.21.0-rc.0+41625cd
ostest-pwk7t-worker-0-rzsb5   Ready    worker   3h35m   v1.21.0-rc.0+41625cd
ostest-pwk7t-worker-1-t2lrm   Ready    worker   12m     v1.21.0-rc.0+41625cd

$ openstack volume show ostest-pwk7t-worker-1-t2lrm -c availability_zone
+-------------------+------------+
| Field             | Value      |
+-------------------+------------+
| availability_zone | cinder_AZ1 |
+-------------------+------------+

Once ready, modify the MachineSet (oc edit), setting a different Cinder AZ (cinder_AZ0). The machine is not modified at all:

$ oc get machineset -n openshift-machine-api ostest-pwk7t-worker-1 -o json | jq .spec.template.spec.providerSpec.value.rootVolume
{
  "availabilityZone": "cinder_AZ0",
  "deviceType": "",
  "diskSize": 25,
  "sourceType": "image",
  "sourceUUID": "ostest-pwk7t-rhcos",
  "volumeType": "tripleo"
}

$ oc get machine -n openshift-machine-api ostest-pwk7t-worker-1-t2lrm -o json | jq .spec.providerSpec.value.rootVolume
{
  "availabilityZone": "cinder_AZ1",
  "deviceType": "",
  "diskSize": 25,
  "sourceType": "image",
  "sourceUUID": "ostest-pwk7t-rhcos",
  "volumeType": "tripleo"
}

And the cluster does not reflect the change (the worker machine is not redeployed):

$ oc get machines -A
NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-pwk7t-master-0         Running   m4.xlarge   regionOne   AZ-0   4h17m
openshift-machine-api   ostest-pwk7t-master-1         Running   m4.xlarge   regionOne   AZ-0   4h17m
openshift-machine-api   ostest-pwk7t-master-2         Running   m4.xlarge   regionOne   AZ-0   4h17m
openshift-machine-api   ostest-pwk7t-worker-0-g7hdp   Running   m4.xlarge   regionOne   AZ-0   4h10m
openshift-machine-api   ostest-pwk7t-worker-0-nw85w   Running   m4.xlarge   regionOne   AZ-0   4h10m
openshift-machine-api   ostest-pwk7t-worker-0-rzsb5   Running   m4.xlarge   regionOne   AZ-0   4h10m
openshift-machine-api   ostest-pwk7t-worker-1-t2lrm   Running   m4.xlarge   regionOne   AZ-0   37m

As a workaround, setting the replicas to 0 and then back to 1 on the MachineSet (to trigger destruction/creation of the VM) recreates the VM with the updated spec:

$ oc get machineset -n openshift-machine-api ostest-pwk7t-worker-1 -o json | jq .spec.template.spec.providerSpec.value.rootVolume
{
  "availabilityZone": "cinder_AZ0",
  "deviceType": "",
  "diskSize": 25,
  "sourceType": "image",
  "sourceUUID": "ostest-pwk7t-rhcos",
  "volumeType": "tripleo"
}

$ oc get machine -n openshift-machine-api ostest-pwk7t-worker-1-s2vhw -o json | jq .spec.providerSpec.value.rootVolume
{
  "availabilityZone": "cinder_AZ0",
  "deviceType": "",
  "diskSize": 25,
  "sourceType": "image",
  "sourceUUID": "ostest-pwk7t-rhcos",
  "volumeType": "tripleo"
}

# Test 2: Editing a machine spec to update the Cinder AZ moves the machine to 'Failed' status.

Once an extra machine resource (please check new_machine.yaml) using compute zone AZ-0 and volume zone cinder_AZ1 is created, modifying its spec (setting cinder_AZ0) moves the machine to Failed status:

(shiftstack) [stack@undercloud-0 ~]$ oc get machine -n openshift-machine-api ostest-pwk7t-worker-2 -o json | jq .spec.providerSpec.value.rootVolume
{
  "availabilityZone": "cinder_AZ0",
  "deviceType": "",
  "diskSize": 25,
  "sourceType": "image",
  "sourceUUID": "ostest-pwk7t-rhcos",
  "volumeType": "tripleo"
}

(shiftstack) [stack@undercloud-0 ~]$ oc get machines -A
NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-pwk7t-master-0         Running   m4.xlarge   regionOne   AZ-0   19h
openshift-machine-api   ostest-pwk7t-master-1         Running   m4.xlarge   regionOne   AZ-0   19h
openshift-machine-api   ostest-pwk7t-master-2         Running   m4.xlarge   regionOne   AZ-0   19h
openshift-machine-api   ostest-pwk7t-worker-0-rzsb5   Running   m4.xlarge   regionOne   AZ-0   19h
openshift-machine-api   ostest-pwk7t-worker-1-s2vhw   Running   m4.xlarge   regionOne   AZ-0   15h
openshift-machine-api   ostest-pwk7t-worker-2         Failed    m4.xlarge   regionOne   AZ-0   15h

As a workaround, deleting the machine resource and applying it again with the updated info recreates the VM with the updated spec.

Version-Release number of selected component (if applicable): 4.8.0-0.nightly-2021-05-13-222446

How reproducible: Always.

Steps to Reproduce:
1. Install an OCP cluster.
2. Create a new MachineSet with replicas: 1, adding a worker with a specific Nova and Cinder AZ.
3. Wait until the worker is added to the cluster.
4. Edit the MachineSet spec.
5. Observe that the machine is not modified and the MachineSet and machine specs do not match.
6. Create a new machine with a specific Nova and Cinder AZ.
7. Wait until the worker is added to the cluster.
8. Edit the machine spec.
9. Observe that the machine goes into 'Failed' status.

Actual results: machine spec changes are not reflected on the system.

Expected results: The worker should be destroyed and recreated with the new attributes, or modifying specs should not be allowed on these resources.

Additional info:
- new_machine.yaml and new_machineset.yaml attached
- must-gather: http://file.rdu.redhat.com/rlobillo/must_gather_machine_changes.tgz
Created attachment 1784768 [details] new_machineset.yaml
Created attachment 1784769 [details] new_machine.yaml
This sounds like it's working as designed. Machines do not mutate the VMs they create; they are responsible only for creation and deletion. If you edit a Machine spec, nothing should happen. If you edit a MachineSet spec, the changes will apply only to new Machines that are created. Changing information on the Machine spec after the VM has been created can cause Machines to go Failed. I suspect in this case that the change you've made means the Machine controller can no longer find the VM and, as such, marks it as Failed as if it had been deleted on the OSP side.
Thanks Joel. I edited a machine spec and it was moved to 'Failed' status. According to your comment, that should not happen, right?
What exactly did you edit? There are certain conditions where this is a valid action. For example, on AWS, if you change the region, the controller can no longer find the instance on the cloud provider side. In that case, it is designed to switch into Failed because the assumption is that the instance has been removed on the cloud provider side; this is unrecoverable in our API.
I changed .spec.providerSpec.value.rootVolume.availabilityZone from 'cinder_AZ1' to 'cinder_AZ0':

spec:
  metadata: {}
  providerSpec:
    value:
      apiVersion: openstackproviderconfig.openshift.io/v1alpha1
      availabilityZone: AZ-0
      cloudName: openstack
      cloudsSecret:
        name: openstack-cloud-credentials
        namespace: openshift-machine-api
      flavor: m4.xlarge
      image: ""
      kind: OpenstackProviderSpec
      metadata:
        creationTimestamp: null
      networks:
      - filter: {}
        subnets:
        - filter:
            name: ostest-pwk7t-nodes
            tags: openshiftClusterID=ostest-pwk7t
      rootVolume:
        availabilityZone: cinder_AZ1   <--
        deviceType: ""
        diskSize: 25
        sourceType: image
        sourceUUID: ostest-pwk7t-rhcos
        volumeType: tripleo
      securityGroups:
      - filter: {}
        name: ostest-pwk7t-worker
      serverMetadata:
        Name: ostest-pwk7t-worker
        openshiftClusterID: ostest-pwk7t
      tags:
      - openshiftClusterID=ostest-pwk7t
      trunk: true
      userDataSecret:
        name: worker-user-data
I wouldn't expect this change to move the machine to Failed. My next step would be to check whether there are any logs in the machine controller that indicate why it moved this machine to Failed.
Hello Joel. You will find the logs attached to the case, inside the must-gather. This is what I found in the machine-controller logs:

2021-05-19T07:33:00.757467824Z I0519 07:33:00.757409       1 controller.go:174] ostest-pwk7t-worker-2: reconciling Machine
2021-05-19T07:33:02.399787013Z I0519 07:33:02.399720       1 controller.go:297] ostest-pwk7t-worker-2: reconciling machine triggers idempotent update
2021-05-19T07:33:03.792588682Z I0519 07:33:03.792542       1 actuator.go:381] re-creating machine ostest-pwk7t-worker-2 for update.
2021-05-19T07:33:07.193971419Z I0519 07:33:07.193900       1 actuator.go:150] Skipped creating a VM that already exists.
2021-05-19T07:33:23.761788118Z I0519 07:33:23.761708       1 actuator.go:406] Successfully updated machine ostest-pwk7t-worker-2
2021-05-19T07:33:45.760076450Z I0519 07:33:45.759104       1 controller.go:174] ostest-pwk7t-worker-2: reconciling Machine
2021-05-19T07:33:52.172110491Z I0519 07:33:52.172034       1 controller.go:482] ostest-pwk7t-worker-2: going into phase "Failed"
2021-05-19T07:33:52.272137502Z I0519 07:33:52.272080       1 controller.go:174] ostest-pwk7t-worker-2: reconciling Machine
2021-05-19T07:33:52.272137502Z W0519 07:33:52.272118       1 controller.go:275] ostest-pwk7t-worker-2: machine has gone "Failed" phase. It won't reconcile
Could you try again to update the machine and let one of the machines go into the failed state? In this case, I'm hoping that it should say on the Machine status itself exactly what went wrong.

Looking at the code path, it looks like the actuator Exists() returned false without returning an error, in which case it went into Failed correctly.

I noticed some other errors as it tried to create the new machine; perhaps these might be related:

2021-05-19T07:39:12.916812569Z E0519 07:39:12.916753       1 controller.go:281] ostest-pwk7t-worker-2: failed to check if machine exists: Error checking if instance exists (machine/actuator.go 346):
2021-05-19T07:39:12.916812569Z Error getting a new instance service from the machine (machine/actuator.go 467): Failed to authenticate provider client: Get "https://overcloud.redhat.local:13000/": x509: certificate signed by unknown authority
2021-05-19T07:39:13.156719620Z E0519 07:39:13.156662       1 controller.go:302] controller-runtime/manager/controller/machine_controller "msg"="Reconciler error" "error"="Error checking if instance exists (machine/actuator.go 346): \nError getting a new instance service from the machine (machine/actuator.go 467): Failed to authenticate provider client: Get \"https://overcloud.redhat.local:13000/\": x509: certificate signed by unknown authority" "name"="ostest-pwk7t-worker-2" "namespace"="openshift-machine-api"

Looks like it might have ended up trying to talk to a different OpenStack cluster?
Hello Joel. Just retried the test and got the logs below:

I0526 15:39:10.337611       1 controller.go:174] ostest-mjg7d-worker-3: reconciling Machine
I0526 15:39:13.109382       1 controller.go:297] ostest-mjg7d-worker-3: reconciling machine triggers idempotent update
I0526 15:39:14.959132       1 actuator.go:381] re-creating machine ostest-mjg7d-worker-3 for update.
I0526 15:39:18.975570       1 actuator.go:150] Skipped creating a VM that already exists.
I0526 15:39:26.930516       1 actuator.go:406] Successfully updated machine ostest-mjg7d-worker-3
I0526 15:40:02.253042       1 controller.go:174] ostest-mjg7d-worker-3: reconciling Machine
I0526 15:40:03.551557       1 controller.go:482] ostest-mjg7d-worker-3: going into phase "Failed"
I0526 15:40:03.599887       1 controller.go:174] ostest-mjg7d-worker-3: reconciling Machine
I0526 15:40:04.833449       1 controller.go:482] ostest-mjg7d-worker-3: going into phase "Failed"
I0526 15:40:04.873813       1 controller.go:174] ostest-mjg7d-worker-3: reconciling Machine
W0526 15:40:04.873849       1 controller.go:275] ostest-mjg7d-worker-3: machine has gone "Failed" phase. It won't reconcile
I0526 15:40:04.877639       1 controller.go:174] ostest-mjg7d-worker-3: reconciling Machine
W0526 15:40:04.877678       1 controller.go:275] ostest-mjg7d-worker-3: machine has gone "Failed" phase. It won't reconcile

OCP version: 4.8.0-0.nightly-2021-05-25-072938
@rlobillo Could you please create and share a must-gather from the cluster that you just created? I would like to see the content of the Machine resources and tie the details to the logs from the machine controller.
Note: we need to look into the must-gather in comment 12 to work out why we're suddenly getting SSL errors connecting to OpenStack after changing the Cinder AZ.

The expected behaviour here would be: ideally, a webhook would have prevented the change in the first place. In the absence of that, no change to the Machine, and a log message in machine-api explaining why the Machine won't be updated.
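The webhook suggested here would, in essence, reject any Machine update that touches the providerSpec. A minimal sketch of that validation rule, with hypothetical names (the real machine-api admission webhook is considerably more involved):

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// validateMachineUpdate sketches the validating-webhook rule proposed above:
// a Machine update is rejected whenever it changes the (immutable)
// providerSpec. Comparing raw bytes is illustrative only; a real webhook
// would compare decoded objects field by field.
func validateMachineUpdate(oldProviderSpec, newProviderSpec []byte) error {
	if !bytes.Equal(oldProviderSpec, newProviderSpec) {
		return errors.New("providerSpec is immutable; delete and recreate the Machine instead")
	}
	return nil
}

func main() {
	oldSpec := []byte(`{"rootVolume":{"availabilityZone":"cinder_AZ1"}}`)
	newSpec := []byte(`{"rootVolume":{"availabilityZone":"cinder_AZ0"}}`)
	fmt.Println(validateMachineUpdate(oldSpec, newSpec)) // rejected
	fmt.Println(validateMachineUpdate(oldSpec, oldSpec)) // allowed
}
```

With such a rule in place, the oc edit in the reproducer would have been refused at admission time rather than silently accepted.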
status:
  addresses:
  - address: 10.196.1.3
    type: InternalIP
  - address: ostest-mjg7d-worker-3
    type: Hostname
  - address: ostest-mjg7d-worker-3
    type: InternalDNS
  conditions:
  - lastTransitionTime: "2021-05-26T15:40:04Z"
    message: Instance not found on provider
    reason: InstanceMissing
    severity: Warning
    status: "False"
    type: InstanceExists
  errorMessage: Can't find created instance.
  lastUpdated: "2021-05-26T15:40:04Z"
  nodeRef:
    kind: Node
    name: ostest-mjg7d-worker-3
    uid: 4e214485-4511-4242-9997-5ecf21a58a24
  phase: Failed
(In reply to Matthew Booth from comment #13)
> Note: need to look into the must-gather in comment 12 to work out why we're
> suddenly getting SSL errors connecting to OpenStack after changing the
> Cinder AZ.

The SSL errors seem to correlate with etcd errors, and they quickly go away. I'm not inclined to go down that rabbit hole. I suspect some weird failure mode results in an incomplete cloud config.
I agree with Joel's assessment above: it seems that the way we got here was that actuator.Exists() returned false without an error. As no error was returned, the machine controller assumes the instance is actually gone, and doesn't make any further attempts to reconcile the machine. It's difficult to guess what's actually going on, but I'm currently suspicious of this code in instanceExists:

https://github.com/openshift/cluster-api-provider-openstack/blob/4f41902b58d95a97d303ff1ee509c0c864250cd5/pkg/cloud/openstack/machine/actuator.go#L650-L651

When we check if an instance exists we're listing instances and filtering by name, image, and flavor. Image and flavor seem superfluous, and I wonder if image specifically is causing a problem. I note that we're setting providerSpec.image to "" above, and setting the actual image on the root volume. I wonder if Nova is setting instance.image to the image of the root volume, and subsequently failing to match an empty image. This is pure speculation!
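To make the speculation concrete, the filtering described would behave roughly like this sketch. The types and names are hypothetical (the real code builds a Nova server-list query), and the follow-up testing below did not bear the hypothesis out:

```go
package main

import "fmt"

// server is a stand-in for what the Nova API reports about an instance.
type server struct {
	name, image, flavor string
}

// instanceMatches sketches the instanceExists filter discussed above:
// matching on name, image and flavor together means that if Nova reported
// a root-volume image while the providerSpec carries image "", the instance
// would never be found, and Exists() would return false with no error.
func instanceMatches(s server, name, image, flavor string) bool {
	return s.name == name && s.image == image && s.flavor == flavor
}

func main() {
	// providerSpec has image "" (boot-from-volume); suppose Nova reported
	// the root volume's image instead:
	s := server{name: "ostest-pwk7t-worker-2", image: "ostest-pwk7t-rhcos", flavor: "m4.xlarge"}
	fmt.Println(instanceMatches(s, "ostest-pwk7t-worker-2", "", "m4.xlarge"))
}
```

Filtering on name alone (or, better, on the stored provider ID) would avoid this entire class of false negative.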
(In reply to Matthew Booth from comment #16)
> When we check if an instance exists we're listing instances and filtering by
> name, image, and flavor. Image and flavor seem superfluous, and I wonder if
> image specifically is causing a problem. I note that we're setting
> providerspec.image to "" above, and setting the actual image on the root
> volume. I wonder if Nova is setting instance.image to the image of the root
> volume, and subsequently failing to match an empty image. This is pure
> speculation!

This isn't borne out in testing. At least on 16.1, instance.image is unset for BFV, and searching on image="" returns the instance.
Possible reproducer. My dev environment doesn't have multiple volume AZs, but I edited the *size* of a worker machine. The result was that the machine was marked Failed after the instance was deleted. The question now is: why was the instance deleted?
I0621 16:30:01.221279       1 controller.go:174] cluster-dsal-qmrmv-worker-0-gl7qf: reconciling Machine
I0621 16:30:03.863134       1 controller.go:297] cluster-dsal-qmrmv-worker-0-gl7qf: reconciling machine triggers idempotent update
I0621 16:30:05.290815       1 actuator.go:381] re-creating machine cluster-dsal-qmrmv-worker-0-gl7qf for update.
I0621 16:30:09.816911       1 actuator.go:150] Skipped creating a VM that already exists.
I0621 16:30:25.242183       1 actuator.go:406] Successfully updated machine cluster-dsal-qmrmv-worker-0-gl7qf
I0621 16:30:53.921183       1 controller.go:174] cluster-dsal-qmrmv-worker-0-gl7qf: reconciling Machine
I0621 16:30:55.696962       1 controller.go:482] cluster-dsal-qmrmv-worker-0-gl7qf: going into phase "Failed"
I0621 16:30:55.752839       1 controller.go:174] cluster-dsal-qmrmv-worker-0-gl7qf: reconciling Machine
I0621 16:30:56.675470       1 controller.go:482] cluster-dsal-qmrmv-worker-0-gl7qf: going into phase "Failed"
I0621 16:30:56.718122       1 controller.go:174] cluster-dsal-qmrmv-worker-0-gl7qf: reconciling Machine
W0621 16:30:56.718366       1 controller.go:275] cluster-dsal-qmrmv-worker-0-gl7qf: machine has gone "Failed" phase. It won't reconcile
@jspeed The instance is deleted because the machine controller called actuator.Update(). Given the earlier discussion about machines being immutable, what is actuator.Update() supposed to do?
Another problem in this code is here:

https://github.com/openshift/cluster-api-provider-openstack/blob/79fa6d04adae132752880c526cc6f002717054e8/pkg/cloud/openstack/machine/actuator.go#L382-L388

The Create() is skipped because:

I0621 16:30:09.816911       1 actuator.go:150] Skipped creating a VM that already exists.

and we then proceed to delete. Even if we really wanted to do this, this would still be weird.
Update() has a couple of purposes; it is effectively there to update the state of the Machine (MAPI Machine) to reflect the state of the VM (or OpenStack instance in this case). It should not be recreating the instance on the OpenStack side, nor should it be making any changes to the instance; instances are considered to be immutable.

Update is supposed to:
- Look up the instance on the cloud provider
- Reconcile any load balancer attachments if appropriate
- Reconcile tags if appropriate (e.g. the Kubernetes cluster tag that denotes the instance is owned by the cluster)
- Set the providerID on the spec of the Machine
- Ensure status information for the instance is up to date on the Machine status, e.g. things like the IP addresses

We consider LB attachment and tags not to be part of the specification of the instance itself, and therefore these are treated as mutable, but things like disk spec changes or image spec changes are definitely not supported.
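Under those rules, an Update implementation would look roughly like the skeleton below. The names and types are hypothetical; this is a sketch of the contract described above, not the actual CAPO code.

```go
package main

import "fmt"

// machineStatus is a hypothetical stand-in for the Machine's recorded state.
type machineStatus struct {
	providerID string
	addresses  []string
}

// reconcileUpdate sketches the Update() contract described above: it never
// recreates or mutates the instance itself; it only refreshes mutable
// bookkeeping (provider ID, addresses, and, in the real thing, tags and LB
// attachments) from what the cloud reports.
func reconcileUpdate(instanceID string, instanceAddrs []string, st *machineStatus) {
	// 1. The caller has looked the instance up on the cloud provider;
	//    instanceID/instanceAddrs stand in for that lookup's result.
	// 2. LB attachments and cluster tags would be reconciled here (mutable).
	// 3. Record the provider ID once; it must never change afterwards,
	//    since there is a strict 1:1 mapping between Machine and instance.
	if st.providerID == "" {
		st.providerID = instanceID
	}
	// 4. Refresh status information such as IP addresses.
	st.addresses = instanceAddrs
	// Deliberately absent: any delete/recreate path, any change to disks,
	// images, flavors, or availability zones.
}

func main() {
	st := &machineStatus{}
	reconcileUpdate("openstack:///abc-123", []string{"10.196.1.3"}, st)
	fmt.Println(st.providerID, st.addresses)
}
```

The bug in the OpenStack actuator was precisely that its Update() contained a delete/recreate path, which this contract forbids.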
Also from Joel via Slack: One of the stipulations of MAPI is that the providerID should never change, ie, there's a 1:1 mapping between Machine and instance, you're currently not doing that.
The expected behaviour is that CAPO should ignore (the unsupported action of) editing a Machine.
Removing the Triaged keyword because: * the QE automation assessment (flag qe_test_coverage) is missing
https://github.com/openshift/cluster-api-provider-openstack/pull/210 will prevent CAPO from attempting to make any changes to an existing server. Changes to an existing machine will be safely ignored.
The fix for this landed: https://github.com/openshift/cluster-api-provider-openstack/pull/210
Verified on 4.10.0-0.nightly-2021-12-14-083101 on top of OSP 16.1 (RHOS-16.1-RHEL-8-20210903.n.0).

On a running cluster, a new machine is created using a rootVolume in the 'cinderAZ0' AZ:

$ oc get nodes && oc get machines -n openshift-machine-api
NAME                          STATUS   ROLES    AGE   VERSION
ostest-kmzqk-master-0         Ready    master   2d    v1.22.1+6859754
ostest-kmzqk-master-1         Ready    master   2d    v1.22.1+6859754
ostest-kmzqk-master-2         Ready    master   2d    v1.22.1+6859754
ostest-kmzqk-new-worker       Ready    worker   22h   v1.22.1+6859754
ostest-kmzqk-worker-0-dh2rw   Ready    worker   47h   v1.22.1+6859754
ostest-kmzqk-worker-0-pqq7p   Ready    worker   47h   v1.22.1+6859754

NAME                          PHASE     TYPE        REGION      ZONE   AGE
ostest-kmzqk-master-0         Running                                  2d
ostest-kmzqk-master-1         Running                                  2d
ostest-kmzqk-master-2         Running                                  2d
ostest-kmzqk-new-worker       Running   m4.xlarge   regionOne   nova   22h
ostest-kmzqk-worker-0-dh2rw   Running   m4.xlarge   regionOne   nova   47h
ostest-kmzqk-worker-0-pqq7p   Running   m4.xlarge   regionOne   nova   47h

$ oc get -n openshift-machine-api machine/ostest-kmzqk-new-worker -o json | jq .spec.providerSpec.value.rootVolume
{
  "availabilityZone": "cinderAZ0",
  "deviceType": "",
  "diskSize": 25,
  "sourceType": "image",
  "sourceUUID": "ostest-kmzqk-rhcos",
  "volumeType": "tripleo"
}

With the fix, editing the spec no longer moves the machine to Failed:

$ oc patch -n openshift-machine-api machine/ostest-kmzqk-new-worker --type merge -p '{"spec":{"providerSpec":{"value":{"rootVolume":{"availabilityZone":"cinderAZ1"}}}}}'
machine.machine.openshift.io/ostest-kmzqk-new-worker patched

$ oc get -n openshift-machine-api machine/ostest-kmzqk-new-worker -o json | jq .spec.providerSpec.value.rootVolume
{
  "availabilityZone": "cinderAZ1",
  "deviceType": "",
  "diskSize": 25,
  "sourceType": "image",
  "sourceUUID": "ostest-kmzqk-rhcos",
  "volumeType": "tripleo"
}

$ oc logs -n openshift-machine-api -l k8s-app=controller -c machine-controller | tail
[...]
I1216 15:38:20.886539       1 controller.go:175] ostest-kmzqk-new-worker: reconciling Machine
I1216 15:38:23.511158       1 controller.go:298] ostest-kmzqk-new-worker: reconciling machine triggers idempotent update

The event is simply ignored.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056