Bug 1805025 - [OSP] Machine status doesn't become "Failed" when creating a machine with invalid image
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.7.0
Assignee: Mike Fedosin
QA Contact: Milind Yadav
URL:
Whiteboard:
Duplicates: 1805023
Depends On:
Blocks:
Reported: 2020-02-20 06:15 UTC by sunzhaohua
Modified: 2021-02-24 15:11 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:10:53 UTC
Target Upstream Version:
Embargoed:




Links
System                  ID                                                 Private  Priority  Status  Summary                                                            Last Updated
Github                  openshift cluster-api-provider-openstack pull 121  0        None      closed  Bug 1805025: validate that image exists before creating a machine  2021-02-11 17:20:28 UTC
Github                  openshift cluster-api-provider-openstack pull 137  0        None      closed  Bug 1805025: return correct error if machine validation fails      2021-02-11 17:20:28 UTC
Red Hat Product Errata  RHSA-2020:5633                                     0        None      None    None                                                               2021-02-24 15:11:46 UTC

Description sunzhaohua 2020-02-20 06:15:45 UTC
Description of problem:
UPI on OSP, machine status doesn't become "Failed" when creating a machine with invalid image

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-02-17-211020

How reproducible:
Always

Steps to Reproduce:
1. Create a machine, setting an invalid image in its providerSpec
2. Check the machine status
3. Check the logs

Actual results:
The machine is stuck in the Provisioning phase and never becomes "Failed"

$ oc get machine
NAME                          PHASE          TYPE           REGION      ZONE   AGE
zhsun9-wxcfz-worker-aaaaa     Provisioning   ci.m1.xlarge   regionOne   nova   42m
zhsun9-wxcfz-worker-b-llxd8   Running        ci.m1.xlarge   regionOne   nova   16m
zhsun9-wxcfz-worker-bfwjk     Running        ci.m1.xlarge   regionOne   nova   19h
zhsun9-wxcfz-worker-c-6tjtg   Running        ci.m1.xlarge   regionOne   nova   18h

I0220 03:35:35.617474       1 controller.go:164] Reconciling Machine "zhsun9-wxcfz-worker-aaaaa"
I0220 03:35:35.617633       1 controller.go:376] Machine "zhsun9-wxcfz-worker-aaaaa" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0220 03:35:35.627688       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0220 03:35:37.209780       1 controller.go:319] Reconciling machine object zhsun9-wxcfz-worker-aaaaa triggers idempotent create.
I0220 03:35:37.222271       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0220 03:35:37.296965       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
E0220 03:35:44.641047       1 actuator.go:474] Machine error zhsun9-wxcfz-worker-aaaaa: error creating Openstack instance: Create new server err: no image with the name rhcos-44.81.202002071430-0000 could be found
W0220 03:35:44.641072       1 controller.go:321] Failed to create machine "zhsun9-wxcfz-worker-aaaaa": error creating Openstack instance: Create new server err: no image with the name rhcos-44.81.202002071430-0000 could be found
I0220 03:38:28.481553       1 controller.go:164] Reconciling Machine "zhsun9-wxcfz-worker-aaaaa"
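
The excerpt above shows the create attempt failing and the machine simply being reconciled again. A minimal sketch for counting such failed create attempts per machine in a saved controller log follows. It assumes the klog-style header seen above (severity letter, MMDD date, time, PID, file:line) and the 4.4 wording of the warning message; the function name and the `SAMPLE` constant are illustrative only, and later releases word the warning differently.

```python
import re
from collections import Counter

# klog-style header: severity letter, MMDD, time, PID, file:line], then the message.
KLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<pid>\d+) (?P<src>[^\s:]+:\d+)\] (?P<msg>.*)$"
)

# Message shape copied from the controller.go warning in the 4.4 excerpt.
CREATE_FAILED = re.compile(
    r'^Failed to create machine "(?P<name>[^"]+)": (?P<reason>.*)$'
)


def count_create_failures(log_text):
    """Return a Counter mapping machine name to number of failed create attempts."""
    failures = Counter()
    for line in log_text.splitlines():
        header = KLOG_LINE.match(line.strip())
        # Only warning-level lines carry the "Failed to create machine" message.
        if header is None or header.group("sev") != "W":
            continue
        failed = CREATE_FAILED.match(header.group("msg"))
        if failed:
            failures[failed.group("name")] += 1
    return failures


# Three lines taken from the log excerpt in this report.
SAMPLE = '''\
I0220 03:35:35.617474       1 controller.go:164] Reconciling Machine "zhsun9-wxcfz-worker-aaaaa"
E0220 03:35:44.641047       1 actuator.go:474] Machine error zhsun9-wxcfz-worker-aaaaa: error creating Openstack instance: Create new server err: no image with the name rhcos-44.81.202002071430-0000 could be found
W0220 03:35:44.641072       1 controller.go:321] Failed to create machine "zhsun9-wxcfz-worker-aaaaa": error creating Openstack instance: Create new server err: no image with the name rhcos-44.81.202002071430-0000 could be found
'''

if __name__ == "__main__":
    print(count_create_failures(SAMPLE))
```

A count that keeps growing for the same machine while its phase stays Provisioning matches the behaviour reported here.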

Expected results:
Machine status becomes "Failed"
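
The expected behaviour can be modelled as a distinction between terminal and retryable errors. The sketch below is only a model written for this report, not code from cluster-api-provider-openstack: the exception class, the function names and the `available_images` set are all hypothetical. It assumes that an unknown image is a configuration error that retrying cannot fix, while any other create error may be transient.

```python
class InvalidMachineConfiguration(Exception):
    """Hypothetical error type: the spec cannot succeed until it is edited."""


def validate_spec(spec, available_images):
    """Reject a spec whose image is unknown before any server is requested."""
    if spec["image"] not in available_images:
        raise InvalidMachineConfiguration(
            "no image with the name %s could be found" % spec["image"]
        )


def reconcile_create(spec, available_images, create_server):
    """Return the phase one reconcile pass would leave the machine in."""
    try:
        validate_spec(spec, available_images)
    except InvalidMachineConfiguration:
        # Terminal: retrying cannot help, so surface the failure to the user.
        return "Failed"
    try:
        create_server(spec)
    except Exception:
        # Possibly transient (for example an API outage): keep retrying.
        return "Provisioning"
    return "Provisioned"
```

In this model the reported behaviour corresponds to the invalid image being handled by the second branch: the error is logged and the machine stays in Provisioning indefinitely.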


Additional info:

Comment 1 Martin André 2020-02-20 14:59:39 UTC
*** Bug 1805023 has been marked as a duplicate of this bug. ***

Comment 2 Pierre Prinetti 2020-05-07 14:17:24 UTC
The team considers this bug as valid. Considering this bug priority and our capacity, we are deferring this bug to an upcoming sprint. If there are reasons for us to reprioritise, please let us know.

Comment 3 Pierre Prinetti 2020-05-14 14:13:02 UTC
Considering the priority assigned to this bug and our team capacity, we are deferring this bug to an upcoming sprint. Please let us know if there are reasons for us to reprioritize.

Comment 4 Pierre Prinetti 2020-06-04 14:30:46 UTC
Deferring to an upcoming sprint. Please let us know if there are reasons for us to reprioritize.

Comment 6 Pierre Prinetti 2020-06-22 15:53:27 UTC
This may have been fixed in 4.6 and 4.5 by updating cluster-api-provider-openstack's dependencies [1].

Can you still reproduce the issue on 4.5 and 4.6?

[1]: https://github.com/openshift/cluster-api-provider-openstack/pull/101

Comment 7 Pierre Prinetti 2020-06-22 15:54:25 UTC
(4.5 OR 4.6, sorry)

Comment 8 sunzhaohua 2020-07-02 08:18:02 UTC
@Pierre Prinetti yes, I could reproduce this in 4.5
$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-rc.5   True        False         64m     Cluster version is 4.5.0-rc.5


$ oc get machine
NAME                               PHASE          TYPE        REGION      ZONE   AGE
hongli-share-mtkvw-master-0        Running        m1.xlarge   regionOne   nova   5h11m
hongli-share-mtkvw-master-1        Running        m1.xlarge   regionOne   nova   5h11m
hongli-share-mtkvw-master-2        Running        m1.xlarge   regionOne   nova   5h11m
hongli-share-mtkvw-worker-lm57w    Running        m1.xlarge   regionOne   nova   4h54m
hongli-share-mtkvw-worker-q2bnm    Running        m1.xlarge   regionOne   nova   4h54m
hongli-share-mtkvw-worker-zzgwx    Running        m1.xlarge   regionOne   nova   4h54m
hongli-share-mtkvw-worker1-7zlg4   Provisioning                                  27m

I0702 08:16:29.247088       1 controller.go:313] hongli-share-mtkvw-worker1-7zlg4: reconciling machine triggers idempotent create
I0702 08:16:29.264637       1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
I0702 08:16:29.330715       1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
E0702 08:16:37.659537       1 actuator.go:538] Machine error hongli-share-mtkvw-worker1-7zlg4: error creating Openstack instance: Create new server err: no image with the name hongli-share-mtkvw-rhcos-invalid could be found
W0702 08:16:37.659572       1 controller.go:315] hongli-share-mtkvw-worker1-7zlg4: failed to create machine: error creating Openstack instance: Create new server err: no image with the name hongli-share-mtkvw-rhcos-invalid could be found

Comment 16 Milind Yadav 2020-09-21 07:31:12 UTC
Validated on the cluster version below (IPI):
[miyadav@miyadav ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-20-184226   True        False         77m     Cluster version is 4.6.0-0.nightly-2020-09-20-184226
[miyadav@miyadav ~]$ 

Steps:

Created a machineset with an invalid spec; the machine is stuck in the Provisioning state.
[miyadav@miyadav ~]$ oc get machines
NAME                                  PHASE          TYPE        REGION      ZONE   AGE
miyadav-2109-rgtw2-master-0           Running        m1.xlarge   regionOne   nova   109m
miyadav-2109-rgtw2-master-1           Running        m1.xlarge   regionOne   nova   109m
miyadav-2109-rgtw2-master-2           Running        m1.xlarge   regionOne   nova   109m
miyadav-2109-rgtw2-worker-0-9trlc     Running        m1.large    regionOne   nova   103m
miyadav-2109-rgtw2-worker-0-p7sb6     Running        m1.large    regionOne   nova   103m
miyadav-2109-rgtw2-worker-0-z9tlj     Running        m1.large    regionOne   nova   103m
miyadav-2109-rgtw2-worker-inv-44zff   Provisioning                                  41m
[miyadav@miyadav ~]$ oc get pods
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-67f5fbc644-k458r   2/2     Running   0          69m
machine-api-controllers-5445c7f675-pbhwp       7/7     Running   0          66m
machine-api-operator-7858f579db-tfg8f          2/2     Running   0          66m
[miyadav@miyadav ~]$ oc logs -f machine-api-controllers-5445c7f675-pbhwp -c machine-controller
.
.
.
E0921 06:36:40.088279       1 actuator.go:550] Machine error miyadav-2109-rgtw2-worker-inv-44zff: Unable to find flavor with name m1.invalid
W0921 06:36:40.088317       1 controller.go:315] miyadav-2109-rgtw2-worker-inv-44zff: failed to create machine: Unable to find flavor with name m1.invalid
I0921 06:36:41.088670       1 controller.go:169] miyadav-2109-rgtw2-worker-inv-44zff: reconciling Machine
I0921 06:36:41.112408       1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
I0921 06:36:41.556962       1 controller.go:313] miyadav-2109-rgtw2-worker-inv-44zff: reconciling machine triggers idempotent create
I0921 06:36:41.619936       1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
I0921 06:36:41.716050       1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
E0921 06:36:42.052502       1 actuator.go:550] Machine error miyadav-2109-rgtw2-worker-inv-44zff: Unable to find flavor with name m1.invalid
W0921 06:36:42.052536       1 controller.go:315] miyadav-2109-rgtw2-worker-inv-44zff: failed to create machine: Unable to find flavor with name m1.invalid
I0921 06:36:43.053162       1 controller.go:169] miyadav-2109-rgtw2-worker-inv-44zff: reconciling Machine
I0921 06:36:43.071554       1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
.
.
.

Expected: the machine should move to Failed status
Actual: the machine is stuck in the Provisioning state
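
A check like the one above can be scripted against saved `oc get machines` output. The sketch below is a hedged example: it assumes the whitespace-separated column layout shown in this report, AGE values built only from `d`, `h`, `m` and `s` units, and an arbitrary 30-minute threshold. The function names and `SAMPLE` are illustrative.

```python
import re

AGE_PART = re.compile(r"(\d+)([dhms])")
SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def age_to_seconds(age):
    """Convert an AGE column value such as '41m' or '5h11m' to seconds."""
    parts = AGE_PART.findall(age)
    # Reject anything that is not entirely made of <number><unit> pairs.
    if not parts or "".join(n + u for n, u in parts) != age:
        raise ValueError("unrecognized age: %r" % age)
    return sum(int(n) * SECONDS[u] for n, u in parts)


def stuck_machines(table_text, max_age_seconds=1800):
    """Return names of machines in Provisioning for longer than max_age_seconds."""
    stuck = []
    for line in table_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        # NAME and PHASE are the first two columns; AGE is always the last one,
        # even when TYPE, REGION and ZONE are empty.
        name, phase, age = fields[0], fields[1], fields[-1]
        if phase == "Provisioning" and age_to_seconds(age) > max_age_seconds:
            stuck.append(name)
    return stuck


# Rows taken from the output in this comment.
SAMPLE = '''\
NAME                                  PHASE          TYPE        REGION      ZONE   AGE
miyadav-2109-rgtw2-master-0           Running        m1.xlarge   regionOne   nova   109m
miyadav-2109-rgtw2-worker-inv-44zff   Provisioning                                  41m
'''

if __name__ == "__main__":
    print(stuck_machines(SAMPLE))
```

The threshold is a judgment call: a healthy machine also passes through Provisioning, so it should be set well above the normal provisioning time for the cloud in question.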

Comment 21 Milind Yadav 2020-11-10 04:05:45 UTC
Validated on:
oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-11-09-235738   True        False         16m     Cluster version is 4.7.0-0.nightly-2020-11-09-235738



Steps:
Created a machineset with an invalid spec (flavor: invalid)

[root@miyadav miyadav]# oc get machines -w
NAME                                PHASE     TYPE        REGION      ZONE   AGE
miyadav-b025-jzgt7-master-0         Running   m1.xlarge   regionOne   nova   52m
miyadav-b025-jzgt7-master-1         Running   m1.xlarge   regionOne   nova   52m
miyadav-b025-jzgt7-master-2         Running   m1.xlarge   regionOne   nova   52m
miyadav-b025-jzgt7-worker-0-6d4l5   Running   m1.large    regionOne   nova   50m
miyadav-b025-jzgt7-worker-0-lg4r4   Running   m1.large    regionOne   nova   50m
miyadav-b025-jzgt7-worker-0-xnjzz   Running   m1.large    regionOne   nova   50m
miyadav-b025-jzgt7-worker-i-47tw4   Failed                                   7s

Expected and actual: the machine is in Failed status

Additional Info:
Moved to VERIFIED

Comment 24 errata-xmlrpc 2021-02-24 15:10:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

