Bug 1826017 - [vsphere]Machine status should be "Failed" with an invalid configuration
Summary: [vsphere]Machine status should be "Failed" with an invalid configuration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Joel Speed
QA Contact: Milind Yadav
URL:
Whiteboard:
Depends On: 1824497 1833256
Blocks:
 
Reported: 2020-04-20 17:09 UTC by Joel Speed
Modified: 2020-07-13 17:29 UTC
6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Errors returned from the cloud-provider actuator no longer matched the expected type because they were wrapped using github.com/pkg/errors.
Consequence: The Machine controller could not determine that the Machine should be marked as failed.
Fix: Use error wrapping from the standard library to check the error types.
Result: The Machine controller can now determine when Machines should be marked Failed.
Clone Of: 1824497
Environment:
Last Closed: 2020-07-13 17:29:09 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-api-operator pull 564 0 None closed BUG 1826017: Switch to Go errors instead of github.com/pkg/errors 2021-02-10 15:03:30 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:29:23 UTC

Description Joel Speed 2020-04-20 17:09:37 UTC
+++ This bug was initially created as a clone of Bug #1824497 +++

Description of problem:
Machine status should be "Failed" when creating a spot instance with a maximum price lower than the current spot price
 
Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-04-15-223247

How reproducible:
Always

Steps to Reproduce:
1. Create a spot instance with a maxPrice lower than the current spot price:
  providerSpec:
    value:
      spotMarketOptions:
        maxPrice: "0.01"
2. Check machines and logs
  
Actual results:
Machine stuck in Provisioning status

$ oc get machine
NAME                                        PHASE          TYPE        REGION      ZONE         AGE
zhsun416aws-rg88g-master-0                  Running        m4.xlarge   us-east-2   us-east-2a   6h53m
zhsun416aws-rg88g-master-1                  Running        m4.xlarge   us-east-2   us-east-2b   6h53m
zhsun416aws-rg88g-master-2                  Running        m4.xlarge   us-east-2   us-east-2c   6h53m
zhsun416aws-rg88g-worker-us-east-2a-txx9k   Running        m4.large    us-east-2   us-east-2a   6h39m
zhsun416aws-rg88g-worker-us-east-2b-9r4rx   Running        m4.large    us-east-2   us-east-2b   6h39m
zhsun416aws-rg88g-worker-us-east-2c-fxxdm   Provisioning                                        4m57s

  lastUpdated: "2020-04-16T09:43:52Z"
  phase: Provisioning
  providerStatus:
    conditions:
    - lastProbeTime: "2020-04-16T09:44:10Z"
      lastTransitionTime: "2020-04-16T09:44:10Z"
      message: 'error launching instance: Your Spot request price of 0.01 is lower
        than the minimum required Spot request fulfillment price of 0.0257.'
      reason: MachineCreationFailed
      status: "False"
      type: MachineCreation

I0416 09:46:31.346461       1 actuator.go:74] zhsun416aws-rg88g-worker-us-east-2c-fxxdm: actuator creating machine
I0416 09:46:31.347178       1 reconciler.go:38] zhsun416aws-rg88g-worker-us-east-2c-fxxdm: creating machine
E0416 09:46:31.347197       1 reconciler.go:221] NodeRef not found in machine zhsun416aws-rg88g-worker-us-east-2c-fxxdm
I0416 09:46:31.372053       1 instances.go:47] No stopped instances found for machine zhsun416aws-rg88g-worker-us-east-2c-fxxdm
I0416 09:46:31.372096       1 instances.go:145] Using AMI ami-0e888b699fa6e37e7
I0416 09:46:31.372108       1 instances.go:77] Describing security groups based on filters
I0416 09:46:31.583386       1 instances.go:122] Describing subnets based on filters
I0416 09:46:32.438067       1 instances.go:331] Error launching instance: SpotMaxPriceTooLow: Your Spot request price of 0.01 is lower than the minimum required Spot request fulfillment price of 0.0257.
        status code: 400, request id: 3e5331d8-d1e9-4034-833c-15f10ce599f4
E0416 09:46:32.438171       1 reconciler.go:69] zhsun416aws-rg88g-worker-us-east-2c-fxxdm: error creating machine: error launching instance: Your Spot request price of 0.01 is lower than the minimum required Spot request fulfillment price of 0.0257.
I0416 09:46:32.438187       1 machine_scope.go:134] zhsun416aws-rg88g-worker-us-east-2c-fxxdm: Updating status
I0416 09:46:32.438195       1 machine_scope.go:155] zhsun416aws-rg88g-worker-us-east-2c-fxxdm: finished calculating AWS status
I0416 09:46:32.438215       1 machine_scope.go:80] zhsun416aws-rg88g-worker-us-east-2c-fxxdm: patching machine
E0416 09:46:32.453533       1 actuator.go:65] zhsun416aws-rg88g-worker-us-east-2c-fxxdm error: failed to launch instance: error launching instance: Your Spot request price of 0.01 is lower than the minimum required Spot request fulfillment price of 0.0257.
W0416 09:46:32.453594       1 controller.go:311] zhsun416aws-rg88g-worker-us-east-2c-fxxdm: failed to create machine: failed to launch instance: error launching instance: Your Spot request price of 0.01 is lower than the minimum required Spot request fulfillment price of 0.0257.
E0416 09:46:32.453654       1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="failed to launch instance: error launching instance: Your Spot request price of 0.01 is lower than the minimum required Spot request fulfillment price of 0.0257."  "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"zhsun416aws-rg88g-worker-us-east-2c-fxxdm"}
I0416 09:46:32.453784       1 recorder.go:52] controller-runtime/manager/events "msg"="Warning"  "message"="failed to launch instance: error launching instance: Your Spot request price of 0.01 is lower than the minimum required Spot request fulfillment price of 0.0257." "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"zhsun416aws-rg88g-worker-us-east-2c-fxxdm","uid":"6dedaf4b-12db-4741-8a26-5555ca8dd11e","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"134275"} "reason"="FailedCreate"


Expected results:
The machine phase is set to "Failed"

Additional info:

--- Additional comment from Joel Speed on 2020-04-16 16:37:10 UTC ---

I've tested this with the same build and have been unable to reproduce. Is there any more information you can provide?

--- Additional comment from Joel Speed on 2020-04-17 09:57:00 UTC ---

I believe this issue was introduced by a refactor of the Cluster-API-Provider-AWS in 

Machines will only go into the Failed phase when the returned error is an `InvalidMachineConfigurationError` (see: https://github.com/openshift/machine-api-operator/blob/b9b4aaea428abe021d84477bd62a99f806fb64f2/pkg/controller/machine/controller.go#L312-L317)

The error you are seeing here is returned with that type (https://github.com/openshift/cluster-api-provider-aws/blob/025ec74aa743c3834020f4f6a45ac19c1acb76d2/pkg/actuators/machine/instances.go#L261), however it is then wrapped (https://github.com/openshift/cluster-api-provider-aws/blob/025ec74aa743c3834020f4f6a45ac19c1acb76d2/pkg/actuators/machine/reconciler.go#L73) so that it no longer matches the expected type

The check for an `InvalidMachineConfigurationError` (implemented: https://github.com/openshift/machine-api-operator/blob/b9b4aaea428abe021d84477bd62a99f806fb64f2/pkg/controller/machine/controller.go#L312-L317) does not currently unwrap errors, so it will need to be updated to handle wrapped errors.

Comment 3 Milind Yadav 2020-05-12 09:54:00 UTC
Description of problem:
Machine status should be "Failed" when creating machineset with invalid configuration
 
Version-Release number of selected component (if applicable):
Cluster version is 4.5.0-0.nightly-2020-05-08-015855

How reproducible:
Always

Step 1: Create a machineset with invalid specs:
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    autoscaling.openshift.io/machineautoscaler: openshift-machine-api/machineautoscaler
    machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "3"
    machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "1"
  creationTimestamp: "2020-05-12T09:18:42Z"
  generation: 1
  labels:
    machine.openshift.io/cluster-api-cluster: miyadav-12ipi-vhcs5
  name: miyadav-12ipi-vhcs5-worker-invalid
  namespace: openshift-machine-api
  resourceVersion: "161152"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machinesets/miyadav-12ipi-vhcs5-worker-invalid
  uid: 0254a43b-637c-45ff-8b83-f910b1ebda9e
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: miyadav-12ipi-vhcs5
      machine.openshift.io/cluster-api-machineset: miyadav-12ipi-vhcs5-worker
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: miyadav-12ipi-vhcs5
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: miyadav-12ipi-vhcs5-worker
    spec:
      metadata: {}
      providerSpec:
        value:
          apiVersion: vsphereprovider.openshift.io/v1beta1
          credentialsSecret:
            name: vsphere-cloud-credentials
          diskGiB: 50
          kind: VSphereMachineProviderSpec
          memoryMiB: 8192
          metadata:
            creationTimestamp: null
          network:
            devices:
            - networkName: VM Network
          numCPUs: 4
          numCoresPerSocket: 1
          snapshot: ""
          spotMarketOptions:
            maxPrice: "0.01"
          template: miyadav-12ipi-vhcs5-rhcos
          userDataSecret:
            name: worker-user-data
          workspace:
            datacenter: dc1
            datastore: nvme-ds1
            folder: /dc1/vm/miyadav-12ipi-vhcs5
            server: vcsa-qe.vmware.devcluster.openshift.com

2. oc create -f invalid-machineset.yml

Actual: machineset created successfully

[miyadav@miyadav ManualRun]$ oc get machineset
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
miyadav-12ipi-vhcs5-worker           2         2         2       2           6h22m
miyadav-12ipi-vhcs5-worker-invalid   2         2                             5m22s

3. [miyadav@miyadav ManualRun]$ oc get machines
NAME                                       PHASE          TYPE   REGION   ZONE   AGE
miyadav-12ipi-vhcs5-master-0               Running                               6h32m
miyadav-12ipi-vhcs5-master-1               Running                               6h32m
miyadav-12ipi-vhcs5-master-2               Running                               6h32m
miyadav-12ipi-vhcs5-worker-g79rl           Running                               6h20m
miyadav-12ipi-vhcs5-worker-invalid-n29vv   Provisioning                          15m
miyadav-12ipi-vhcs5-worker-qfj77           Running                               6h20m

Actual: machines are not in Failed status, but stuck in Provisioning
Expected: Machines should be in Failed status

Additional info:

machine-controller logs :
.
.
.
ta:{}} DeviceName:VM Network UseAutoDetect:<nil>} Network:<nil> InPassthroughMode:<nil>}
I0512 09:47:05.144773       1 reconciler.go:523] miyadav-12ipi-vhcs5-worker-invalid-n29vv: running task: task-135147
I0512 09:47:05.144861       1 reconciler.go:620] miyadav-12ipi-vhcs5-worker-invalid-n29vv: Updating provider status
I0512 09:47:05.144907       1 machine_scope.go:99] miyadav-12ipi-vhcs5-worker-invalid-n29vv: patching machine
I0512 09:47:05.156276       1 controller.go:325] miyadav-12ipi-vhcs5-worker-invalid-n29vv: created instance, requeuing
I0512 09:47:05.156429       1 controller.go:169] miyadav-12ipi-vhcs5-worker-invalid-n29vv: reconciling Machine
I0512 09:47:05.156462       1 actuator.go:80] miyadav-12ipi-vhcs5-worker-invalid-n29vv: actuator checking if machine exists
I0512 09:47:05.163046       1 session.go:114] Find template by instance uuid: a051b036-fffd-4848-bf0f-7dcd5a0e2f7a
I0512 09:47:05.181298       1 reconciler.go:155] miyadav-12ipi-vhcs5-worker-invalid-n29vv: does not exist
I0512 09:47:05.181405       1 controller.go:313] miyadav-12ipi-vhcs5-worker-invalid-n29vv: reconciling machine triggers idempotent create
I0512 09:47:05.181430       1 actuator.go:59] miyadav-12ipi-vhcs5-worker-invalid-n29vv: actuator creating machine
I0512 09:47:05.190393       1 reconciler.go:604] task: task-135148, state: error, description-id: VirtualMachine.clone
I0512 09:47:05.190475       1 session.go:114] Find template by instance uuid: a051b036-fffd-4848-bf0f-7dcd5a0e2f7a
I0512 09:47:05.208697       1 reconciler.go:83] miyadav-12ipi-vhcs5-worker-invalid-n29vv: cloning
I0512 09:47:05.208738       1 session.go:111] Invalid UUID for VM "miyadav-12ipi-vhcs5-rhcos": , trying to find by name
I0512 09:47:05.227797       1 reconciler.go:399] miyadav-12ipi-vhcs5-worker-invalid-n29vv: no snapshot name provided, getting snapshot using template
I0512 09:47:05.249183       1 reconciler.go:478] Getting network devices
I0512 09:47:05.249283       1 reconciler.go:555] Adding device: VM Network
I0512 09:47:05.254066       1 reconciler.go:584] Adding device: eth card type: vmxnet3, network spec: &{NetworkName:VM Network}, device info: &{VirtualDeviceDeviceBackingInfo:{VirtualDeviceBackingInfo:{DynamicData:{}} DeviceName:VM Network UseAutoDetect:<nil>} Network:<nil> InPassthroughMode:<nil>}
I0512 09:47:05.257745       1 reconciler.go:523] miyadav-12ipi-vhcs5-worker-invalid-n29vv: running task: task-135148
I0512 09:47:05.257811       1 reconciler.go:620] miyadav-12ipi-vhcs5-worker-invalid-n29vv: Updating provider status
I0512 09:47:05.257844       1 machine_scope.go:99] miyadav-12ipi-vhcs5-worker-invalid-n29vv: patching machine
I0512 09:47:05.269717       1 controller.go:325] miyadav-12ipi-vhcs5-worker-invalid-n29vv: created instance, requeuing
I0512 09:47:05.269816       1 controller.go:169] miyadav-12ipi-vhcs5-worker-invalid-n29vv: reconciling Machine
I0512 09:47:05.269842       1 actuator.go:80] miyadav-12ipi-vhcs5-worker-invalid-n29vv: actuator checking if machine exists
I0512 09:47:05.277706       1 session.go:114] Find template by instance uuid: a051b036-fffd-4848-bf0f-7dcd5a0e2f7a
I0512 09:47:05.297275       1 reconciler.go:155] miyadav-12ipi-vhcs5-worker-invalid-n29vv: does not exist
.
.
.

Comment 4 Michael Gugino 2020-05-13 13:30:25 UTC
This bug does not seem to be directly related to the bug from which it is cloned.

Not everything that fails to provision is necessarily going to trigger a 'failure.'

Please describe what it is you're attempting to do in more detail.

Comment 5 Milind Yadav 2020-05-14 12:10:33 UTC
As per the description, I tried to supply an invalid configuration, but it seems invalid options are dropped gracefully and the machines are provisioned anyway. I checked with Joel and did the following:

Removing the machine.openshift.io/cluster-api-cluster from spec.template.metadata.labels

After that, it never created any machines, and there were no relevant logs:
[miyadav@miyadav bugvsphere]$ oc get machineset
NAME                              DESIRED   CURRENT   READY   AVAILABLE   AGE
miyadav-11-rdnr9-worker           2         2         2       2           134m
miyadav-11-rdnr9-worker-invalid   1                                       56s
miyadav-11-rdnr9-worker-rz        1         1         1       1           26m
.
.
Then I manually tried creating a machine without the label, rather than using a machineset.

Getting this :
.
.
.
E0513 08:47:29.127626       1 controller.go:173] miyadav-11-rdnr9-worker-sam: machine validation failed: spec.labels: Invalid value: map[string]string{"machine.openshift.io/cluster-api-machine-role":"worker", "machine.openshift.io/cluster-api-machine-type":"worker", "machine.openshift.io/cluster-api-machineset":"miyadav-11-rdnr9-worker", "machine.openshift.io/region":"", "machine.openshift.io/zone":""}: missing machine.openshift.io/cluster-api-cluster label.


So I would need more help to understand the reason for creating this bug.

Comment 6 Joel Speed 2020-05-14 14:28:16 UTC
This will be easier to validate once BZ#1833256 is merged. Perhaps we could add that as a dependency of this one and hold off verifying for now. If that BZ can be verified, then this one is also working.

Comment 7 Joel Speed 2020-05-18 10:14:12 UTC
Making this depend on BZ#1833256 as we need it before we can verify this

Comment 9 Milind Yadav 2020-05-26 09:25:11 UTC
[miyadav@miyadav bugvsphere]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-26-051016   True        False         37m     Cluster version is 4.5.0-0.nightly-2020-05-26-051016

Validated both scenarios as mentioned above; machines never got stuck in the Provisioning phase.

After checking with Joel: the invalid configuration is reproduced by the dependent bug mentioned in this one, https://bugzilla.redhat.com/show_bug.cgi?id=1833256

With that bug verified as Failed with a valid "invalid configuration", this bug is validated as well, since the invalid configurations described here are not something that would be created during installation or on a running cluster.

Comment 10 errata-xmlrpc 2020-07-13 17:29:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

