Bug 1820421 - [OSP] Update machine with an invalid flavor, machine becomes 'Failed'
Summary: [OSP] Update machine with an invalid flavor, machine becomes 'Failed'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.6.0
Assignee: Mike Fedosin
QA Contact: David Sanz
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-04-03 03:24 UTC by Jianwei Hou
Modified: 2023-09-14 05:55 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Cluster API Provider OpenStack didn't validate flavors before updating machines. Consequence: Machines updated with an invalid flavor failed to boot. Fix: Validate that the flavor exists before updating a machine, and return an error immediately if it doesn't. Result: When the flavor is invalid, Cluster API Provider OpenStack returns an error to the user immediately and doesn't update the machine.
Clone Of:
Environment:
Last Closed: 2020-10-27 15:57:43 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-api-provider-openstack pull 112 0 None closed Bug 1820421: validate that flavor exists for machine 2020-10-14 14:04:38 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 15:58:12 UTC

Description Jianwei Hou 2020-04-03 03:24:49 UTC
Description of problem:
On OSP, updating a worker machine so that its providerSpec contains an invalid flavor puts the machine into the 'Failed' phase.
According to https://github.com/openshift/enhancements/blob/master/enhancements/machine-api/machine-instance-lifecycle.md#failed, this is not expected to happen.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-04-02-130551

How reproducible:
Always

Steps to Reproduce:
1. Update a machine, setting the flavor in its providerSpec to an invalid value
2. oc get machines -n openshift-machine-api


Actual results:
Machine phase becomes 'Failed'.

Machine controller log:
```
I0403 03:11:59.513759       1 controller.go:284] Reconciling machine "jhou-5zdd9-worker-54g97" triggers idempotent update
I0403 03:11:59.514153       1 actuator.go:373] re-creating machine jhou-5zdd9-worker-54g97 for update.
I0403 03:11:59.528837       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0403 03:11:59.579865       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0403 03:12:00.122810       1 actuator.go:146] Skipped creating a VM that already exists.
I0403 03:12:00.133465       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0403 03:12:00.183206       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0403 03:12:11.074821       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0403 03:12:21.424184       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0403 03:12:31.633977       1 actuator.go:398] Successfully updated machine jhou-5zdd9-worker-54g97
I0403 03:12:35.691117       1 controller.go:164] Reconciling Machine "jhou-5zdd9-worker-54g97"
I0403 03:12:35.691151       1 controller.go:376] Machine "jhou-5zdd9-worker-54g97" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0403 03:12:35.794177       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0403 03:12:41.016430       1 controller.go:428] Machine "jhou-5zdd9-worker-54g97" going into phase "Failed"
I0403 03:12:41.038276       1 controller.go:164] Reconciling Machine "jhou-5zdd9-worker-54g97"
I0403 03:12:41.038315       1 controller.go:376] Machine "jhou-5zdd9-worker-54g97" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
W0403 03:12:41.038328       1 controller.go:273] Machine "jhou-5zdd9-worker-54g97" has gone "Failed" phase. It won't reconcile
```

Expected results:


Additional info:

Comment 1 Martin André 2020-04-09 14:25:06 UTC
I'm not sure I understand this bug report. What do you suggest should be the status of the instance if you pass an invalid flavor in the providerSpec?
It seems to match the definition of "Failed": Create() returns an invalidConfigurationMachineError type, or Exists() is False and the machine has a providerID/address.
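
The definition quoted above can be read as a simple predicate. The following is a minimal Python sketch of that reading only. The class, function, and parameter names are hypothetical stand-ins, and this is not the machine controller's actual code.

```python
from typing import Optional


class InvalidConfigurationMachineError(Exception):
    """Hypothetical stand-in for the error type named in the comment above."""


def goes_failed(create_error: Optional[Exception],
                exists: bool,
                has_provider_id_or_address: bool) -> bool:
    """Return True when the quoted definition of 'Failed' applies."""
    # Condition 1: Create() returned an invalid-configuration error.
    if isinstance(create_error, InvalidConfigurationMachineError):
        return True
    # Condition 2: the instance does not exist although the machine
    # already has a providerID/address recorded.
    return (not exists) and has_provider_id_or_address
```

Under this reading, an invalid flavor that makes the create step fail with an invalid-configuration error satisfies the first condition.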

Comment 2 Jianwei Hou 2020-04-10 01:58:34 UTC
On OSP, updating this machine produced two events (from oc describe machine): a delete followed by a create. Because the flavor was set to an invalid value, the create failed, leaving the machine in the 'Failed' phase.

On other cloud providers (AWS, GCP, Azure), there is only an update event (no delete/create) and the machine does not become 'Failed'.

I reported this because the same operation gives a different result on OSP.
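
To illustrate why validating first avoids the 'Failed' phase in a delete-then-create update, here is a minimal Python sketch of the approach recorded in this bug's Doc Text (check that the flavor exists, and return an error before touching the machine). It uses in-memory stand-ins only: the dict-based machine, the available_flavors set, and all names are assumptions for illustration, not the provider's real API.

```python
from typing import Dict, Set


class FlavorNotFound(Exception):
    """Hypothetical error raised when the requested flavor does not exist."""


def update_machine(machine: Dict[str, str],
                   new_flavor: str,
                   available_flavors: Set[str]) -> Dict[str, str]:
    """Return an updated copy of the machine, or raise before changing anything."""
    # Validate up front: if the flavor is unknown, fail here so the
    # existing instance is never deleted.
    if new_flavor not in available_flavors:
        raise FlavorNotFound(f"Can't find a flavor with name {new_flavor}")
    # Only after validation does the (simulated) delete-and-recreate happen.
    updated = dict(machine)
    updated["flavor"] = new_flavor
    return updated


machine = {"name": "worker-0", "flavor": "ci.m1.xlarge", "phase": "Running"}
try:
    update_machine(machine, "ci.m1.xlargee", {"ci.m1.xlarge", "ci.m1.large"})
except FlavorNotFound as err:
    print(f"error updating machine: {err}")
```

Because the check runs before any destructive step, the original machine dict is left untouched when the flavor is invalid.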

Comment 3 Pierre Prinetti 2020-05-13 15:42:34 UTC
Considering the priority assigned to this bug and our team capacity, we are deferring this bug to an upcoming sprint. Please let us know if there are reasons for us to reprioritize.

Comment 10 David Sanz 2020-08-27 11:43:00 UTC
Verified on 4.6.0-0.nightly-2020-08-27-005538

Changed the machine's flavor to an invalid value; the log shows:

```
I0827 11:41:47.735711       1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
I0827 11:41:48.406889       1 controller.go:285] mrnd-13-46-gckws-worker-0-qks9l: reconciling machine triggers idempotent update
I0827 11:41:48.419032       1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
E0827 11:41:48.730625       1 actuator.go:550] Machine error mrnd-13-46-gckws-worker-0-qks9l: Can't find a flavor with name ci.m1.xlargee: Unable to find flavor with name ci.m1.xlargee
E0827 11:41:48.730663       1 controller.go:287] mrnd-13-46-gckws-worker-0-qks9l: error updating machine: Can't find a flavor with name ci.m1.xlargee: Unable to find flavor with name ci.m1.xlargee
I0827 11:41:49.731079       1 controller.go:172] mrnd-13-46-gckws-worker-0-qks9l: reconciling Machine
I0827 11:41:49.754563       1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
I0827 11:41:50.660861       1 controller.go:285] mrnd-13-46-gckws-worker-0-qks9l: reconciling machine triggers idempotent update
I0827 11:41:50.677234       1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
E0827 11:41:51.033382       1 actuator.go:550] Machine error mrnd-13-46-gckws-worker-0-qks9l: Can't find a flavor with name ci.m1.xlargee: Unable to find flavor with name ci.m1.xlargee
E0827 11:41:51.033564       1 controller.go:287] mrnd-13-46-gckws-worker-0-qks9l: error updating machine: Can't find a flavor with name ci.m1.xlargee: Unable to find flavor with name ci.m1.xlargee
I0827 11:41:52.033998       1 controller.go:172] mrnd-13-46-gckws-worker-0-qks9l: reconciling Machine
I0827 11:41:52.054118       1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
I0827 11:41:53.095999       1 controller.go:285] mrnd-13-46-gckws-worker-0-qks9l: reconciling machine triggers idempotent update
```


But the machine stays in the Running phase:

```
[morenod@morenod-laptop ~]$ oc get nodes | grep mrnd-13-46-gckws-worker-0-qks9l
mrnd-13-46-gckws-worker-0-qks9l   Ready    worker   23m   v1.19.0-rc.2+f71a7ab-dirty
[morenod@morenod-laptop ~]$ oc get machines -A | grep mrnd-13-46-gckws-worker-0-qks9l
openshift-machine-api   mrnd-13-46-gckws-worker-0-qks9l   Running   ci.m1.xlarge   regionOne   nova   27m
```
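
The same check could be scripted. The following is a minimal Python sketch that parses one row of the oc get machines -A output shown above and reads its phase. The seven-column layout (namespace, name, phase, flavor, region, zone, age) is an assumption inferred from that single row, and the MachineRow and parse_machine_row names are hypothetical.

```python
from typing import NamedTuple


class MachineRow(NamedTuple):
    """One whitespace-separated row; column meaning is assumed from the output above."""
    namespace: str
    name: str
    phase: str
    flavor: str
    region: str
    zone: str
    age: str


def parse_machine_row(line: str) -> MachineRow:
    """Split a row into its seven assumed columns, rejecting any other shape."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}: {line!r}")
    return MachineRow(*fields)


row = parse_machine_row(
    "openshift-machine-api   mrnd-13-46-gckws-worker-0-qks9l   "
    "Running   ci.m1.xlarge   regionOne   nova   27m"
)
# The machine kept its original flavor and did not leave the Running phase.
print(row.phase, row.flavor)
```

Rows with empty cells would not split into seven fields, so a real script would more likely request structured output from oc instead of parsing the table.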

Comment 12 errata-xmlrpc 2020-10-27 15:57:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 13 Red Hat Bugzilla 2023-09-14 05:55:02 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

