Description of problem:
On OSP, updating a worker machine and setting an invalid flavor in its providerSpec causes the machine to become 'Failed'. According to https://github.com/openshift/enhancements/blob/master/enhancements/machine-api/machine-instance-lifecycle.md#failed, this is not expected to happen.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-04-02-130551

How reproducible:
Always

Steps to Reproduce:
1. Update a machine, setting an invalid value for the flavor in its providerSpec
2. oc get machines -n openshift-machine-api

Actual results:
Machine phase becomes 'Failed'. Machine controller log:
```
I0403 03:11:59.513759 1 controller.go:284] Reconciling machine "jhou-5zdd9-worker-54g97" triggers idempotent update
I0403 03:11:59.514153 1 actuator.go:373] re-creating machine jhou-5zdd9-worker-54g97 for update.
I0403 03:11:59.528837 1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0403 03:11:59.579865 1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0403 03:12:00.122810 1 actuator.go:146] Skipped creating a VM that already exists.
I0403 03:12:00.133465 1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0403 03:12:00.183206 1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0403 03:12:11.074821 1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0403 03:12:21.424184 1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0403 03:12:31.633977 1 actuator.go:398] Successfully updated machine jhou-5zdd9-worker-54g97
I0403 03:12:35.691117 1 controller.go:164] Reconciling Machine "jhou-5zdd9-worker-54g97"
I0403 03:12:35.691151 1 controller.go:376] Machine "jhou-5zdd9-worker-54g97" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0403 03:12:35.794177 1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0403 03:12:41.016430 1 controller.go:428] Machine "jhou-5zdd9-worker-54g97" going into phase "Failed"
I0403 03:12:41.038276 1 controller.go:164] Reconciling Machine "jhou-5zdd9-worker-54g97"
I0403 03:12:41.038315 1 controller.go:376] Machine "jhou-5zdd9-worker-54g97" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
W0403 03:12:41.038328 1 controller.go:273] Machine "jhou-5zdd9-worker-54g97" has gone "Failed" phase. It won't reconcile
```

Expected results:

Additional info:
I'm not sure I understand this bug report. What status do you suggest the instance should have if you pass an invalid flavor in the providerSpec? It seems to match the definition of "Failed": Create() returns an invalidConfigurationMachineError type, or Exists() is False and the machine has a providerID/address.
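To make the quoted definition concrete, here is a minimal sketch of the two conditions as a single predicate. It is an illustration only, not the machine controller's code: `MachineState`, `should_fail`, and the Python `InvalidConfigurationMachineError` class are hypothetical stand-ins for the types named in the comment.

```python
from dataclasses import dataclass
from typing import Optional


class InvalidConfigurationMachineError(Exception):
    """Hypothetical stand-in for the error type named in the definition."""


@dataclass
class MachineState:
    # Simplified, hypothetical view of a machine; not the real API type.
    provider_id: Optional[str] = None
    address: Optional[str] = None


def should_fail(machine, create_error, instance_exists):
    """Return True if the machine matches the quoted "Failed" definition."""
    # Condition 1: Create() returned an invalid-configuration error.
    if isinstance(create_error, InvalidConfigurationMachineError):
        return True
    # Condition 2: Exists() is False although the machine already has a
    # providerID or address, i.e. it was provisioned at some point.
    was_provisioned = machine.provider_id is not None or machine.address is not None
    return (not instance_exists) and was_provisioned
```

Under this reading, an update that goes through a create call with an invalid flavor would satisfy the first condition, which is the case this comment is asking about.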
What happens on OSP is that, when this machine was updated, there were 2 events (from oc describe machine): a create following a delete. Because we set an invalid value for the flavor, the creation failed, leaving this machine in the 'Failed' phase. On other cloud providers (aws, gcp, azure), there is only an update event (no delete/create) and the machine does not become 'Failed'. This was reported because the same operation on OSP has a different result.
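The difference between the two behaviours can be sketched with a toy model. This is an assumption-laden illustration of the flow described in this comment, not the OpenStack actuator's implementation: `KNOWN_FLAVORS`, `InvalidFlavorError`, `create_instance`, `update_by_recreate`, and `update_in_place` are all hypothetical names, and the machine is modelled as a plain dict.

```python
class InvalidFlavorError(Exception):
    """Hypothetical error raised when the requested flavor is unknown."""


# Hypothetical flavor catalogue; a real cloud would be queried instead.
KNOWN_FLAVORS = {"ci.m1.xlarge"}


def create_instance(flavor):
    """Create a toy instance, rejecting flavors that do not exist."""
    if flavor not in KNOWN_FLAVORS:
        raise InvalidFlavorError(f"unable to find flavor with name {flavor}")
    return {"flavor": flavor}


def update_by_recreate(machine, new_flavor):
    """Delete, then create: a failed create marks the machine Failed."""
    machine["instance"] = None  # delete event
    try:
        machine["instance"] = create_instance(new_flavor)  # create event
    except InvalidFlavorError:
        machine["phase"] = "Failed"
    return machine


def update_in_place(machine, new_flavor):
    """Validate first; on error the instance and phase stay untouched."""
    if new_flavor not in KNOWN_FLAVORS:
        raise InvalidFlavorError(f"unable to find flavor with name {new_flavor}")
    machine["instance"]["flavor"] = new_flavor
    return machine
```

In this model, `update_by_recreate` with a misspelled flavor ends with no instance and phase "Failed", while `update_in_place` raises an error and leaves the machine in its previous phase so a later reconcile can retry.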
Considering the priority assigned to this bug and our team capacity, we are deferring this bug to an upcoming sprint. Please let us know if there are reasons for us to reprioritize.
Verified on 4.6.0-0.nightly-2020-08-27-005538

Changed the machine's flavor to an invalid value; the log shows:

I0827 11:41:47.735711 1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
I0827 11:41:48.406889 1 controller.go:285] mrnd-13-46-gckws-worker-0-qks9l: reconciling machine triggers idempotent update
I0827 11:41:48.419032 1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
E0827 11:41:48.730625 1 actuator.go:550] Machine error mrnd-13-46-gckws-worker-0-qks9l: Can't find a flavor with name ci.m1.xlargee: Unable to find flavor with name ci.m1.xlargee
E0827 11:41:48.730663 1 controller.go:287] mrnd-13-46-gckws-worker-0-qks9l: error updating machine: Can't find a flavor with name ci.m1.xlargee: Unable to find flavor with name ci.m1.xlargee
I0827 11:41:49.731079 1 controller.go:172] mrnd-13-46-gckws-worker-0-qks9l: reconciling Machine
I0827 11:41:49.754563 1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
I0827 11:41:50.660861 1 controller.go:285] mrnd-13-46-gckws-worker-0-qks9l: reconciling machine triggers idempotent update
I0827 11:41:50.677234 1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
E0827 11:41:51.033382 1 actuator.go:550] Machine error mrnd-13-46-gckws-worker-0-qks9l: Can't find a flavor with name ci.m1.xlargee: Unable to find flavor with name ci.m1.xlargee
E0827 11:41:51.033564 1 controller.go:287] mrnd-13-46-gckws-worker-0-qks9l: error updating machine: Can't find a flavor with name ci.m1.xlargee: Unable to find flavor with name ci.m1.xlargee
I0827 11:41:52.033998 1 controller.go:172] mrnd-13-46-gckws-worker-0-qks9l: reconciling Machine
I0827 11:41:52.054118 1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
I0827 11:41:53.095999 1 controller.go:285] mrnd-13-46-gckws-worker-0-qks9l: reconciling machine triggers idempotent update

But the machine keeps its Running status:

[morenod@morenod-laptop ~]$ oc get nodes | grep mrnd-13-46-gckws-worker-0-qks9l
mrnd-13-46-gckws-worker-0-qks9l Ready worker 23m v1.19.0-rc.2+f71a7ab-dirty
[morenod@morenod-laptop ~]$ oc get machines -A | grep mrnd-13-46-gckws-worker-0-qks9l
openshift-machine-api mrnd-13-46-gckws-worker-0-qks9l Running ci.m1.xlarge regionOne nova 27m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days