Bug 1840821 - [gcp]Machine status should be "Failed" with an invalid configuration
Summary: [gcp]Machine status should be "Failed" with an invalid configuration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On: 1840552
Blocks:
 
Reported: 2020-05-27 16:37 UTC by Joel Speed
Modified: 2020-07-13 17:42 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When updating the phase to Failed, an update to the Machine annotations overwrote changes to the status.
Consequence: Status updates were never persisted on a failed phase.
Fix: Update the annotations before starting to update the phase.
Result: Annotation and phase changes are persisted. (A minimal illustrative sketch of this ordering follows the header fields below.)
Clone Of: 1840552
Environment:
Last Closed: 2020-07-13 17:42:22 UTC
Target Upstream Version:
Embargoed:
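
A minimal Go sketch of the ordering described in the Doc Text above, assuming the controller-runtime client and the machine-api v1beta1 types; the function name and flow here are illustrative, not the actual machine-api-operator code:

package machine

import (
	"context"

	machinev1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// setPhaseFailed persists the annotation change first and the status
// change last, so the annotation update can no longer overwrite the phase.
func setPhaseFailed(ctx context.Context, c client.Client, m *machinev1.Machine, errMsg string) error {
	// 1. Metadata first: write the instance-state annotation.
	if m.Annotations == nil {
		m.Annotations = map[string]string{}
	}
	m.Annotations["machine.openshift.io/instance-state"] = "unknown"
	if err := c.Update(ctx, m); err != nil {
		return err
	}

	// 2. Status last: nothing written after this point can clobber the
	//    Failed phase or the error message.
	failed := "Failed"
	m.Status.Phase = &failed
	m.Status.ErrorMessage = &errMsg
	return c.Status().Update(ctx, m)
}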




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-api-provider-gcp pull 93 0 None closed Bug 1840821: Update machine-api-operator dependency 2020-06-24 03:27:40 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:42:37 UTC

Description Joel Speed 2020-05-27 16:37:07 UTC
+++ This bug was initially created as a clone of Bug #1840552 +++

Description of problem:
Machine is stuck in "Provisioning" when created with an invalid AMI; the machine controller continuously reconciles its status.
 
Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-05-27-000432

How reproducible:
Always

Steps to Reproduce:
1. Create a new MachineSet with an invalid AMI:
    spec:
      metadata: {}
      providerSpec:
        value:
          ami:
            id: ami-0254f6b4eaa3b9d0f-invalid

2. Check machines and logs
  
Actual results:
Machine is stuck in the Provisioning phase, and the machine controller continuously reconciles its status.

$ oc get machine
NAME                                          PHASE          TYPE        REGION      ZONE         AGE
zhsunaws527-cwprz-master-0                    Running        m4.xlarge   us-east-2   us-east-2a   4h54m
zhsunaws527-cwprz-master-1                    Running        m4.xlarge   us-east-2   us-east-2b   4h54m
zhsunaws527-cwprz-master-2                    Running        m4.xlarge   us-east-2   us-east-2c   4h54m
zhsunaws527-cwprz-worker-us-east-2a-xxxkq     Running        m4.large    us-east-2   us-east-2a   4h45m
zhsunaws527-cwprz-worker-us-east-2b-smbs5     Running        m4.large    us-east-2   us-east-2b   4h45m
zhsunaws527-cwprz-worker-us-east-2c-9mnfv     Running        m4.large    us-east-2   us-east-2c   4h45m
zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs   Provisioning                                        79m


status:
  errorMessage: "zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: reconciler failed to
    Create machine: failed to launch instance: error getting blockDeviceMappings:
    error describing AMI: InvalidAMIID.Malformed: Invalid id: \"ami-0254f6b4eaa3b9d0f-invalid\"\n\tstatus
    code: 400, request id: e80cef1f-3de7-4987-abf2-a81d2dc6ef1f"
  lastUpdated: "2020-05-27T06:02:40Z"
  phase: Provisioning
  providerStatus:
    conditions:
    - lastTransitionTime: "2020-05-27T06:02:41Z"
      message: "error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed:
        Invalid id: \"ami-0254f6b4eaa3b9d0f-invalid\"\n\tstatus code: 400, request
        id: 8e5f963e-e4e8-4036-a212-c25e75bdac33"
      reason: MachineCreationFailed
      status: "False"
      type: MachineCreation

E0527 07:25:34.584813       1 reconciler.go:68] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: error creating machine: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0254f6b4eaa3b9d0f-invalid"
        status code: 400, request id: 0ecd4a64-2e37-46e8-86e1-f8e7383aee26
I0527 07:25:34.584829       1 machine_scope.go:138] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: Updating status
I0527 07:25:34.584839       1 machine_scope.go:165] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: finished calculating AWS status
I0527 07:25:34.584854       1 machine_scope.go:84] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: patching machine
E0527 07:25:34.608643       1 actuator.go:66] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs error: zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: reconciler failed to Create machine: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0254f6b4eaa3b9d0f-invalid"
        status code: 400, request id: 0ecd4a64-2e37-46e8-86e1-f8e7383aee26
W0527 07:25:34.608699       1 controller.go:315] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: failed to create machine: zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: reconciler failed to Create machine: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0254f6b4eaa3b9d0f-invalid"
        status code: 400, request id: 0ecd4a64-2e37-46e8-86e1-f8e7383aee26
I0527 07:25:34.608717       1 controller.go:415] Actuator returned invalid configuration error: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: "ami-0254f6b4eaa3b9d0f-invalid"
        status code: 400, request id: 0ecd4a64-2e37-46e8-86e1-f8e7383aee26
I0527 07:25:34.608730       1 controller.go:424] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: going into phase "Failed"
I0527 07:25:34.610156       1 recorder.go:52] controller-runtime/manager/events "msg"="Warning"  "message"="zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: reconciler failed to Create machine: failed to launch instance: error getting blockDeviceMappings: error describing AMI: InvalidAMIID.Malformed: Invalid id: \"ami-0254f6b4eaa3b9d0f-invalid\"\n\tstatus code: 400, request id: 0ecd4a64-2e37-46e8-86e1-f8e7383aee26" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs","uid":"ea3b9ea5-b5a1-4b0f-ae23-1dac1ceca0b6","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"142183"} "reason"="FailedCreate"
I0527 07:25:34.629960       1 controller.go:282] controller-runtime/controller "msg"="Successfully Reconciled"  "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs"}
I0527 07:25:34.630026       1 controller.go:169] zhsunaws527-cwprz-worker-us-east-2ccc-fhrrs: reconciling Machine

Expected results:
The machine phase is set to "Failed", and the machine controller stops reconciling the status.

Additional info:

--- Additional comment from Joel Speed on 2020-05-27 08:25:21 UTC ---

@sunzhaohua Did this eventually go Failed, or was it stuck in Provisioning forever? It appears that the code is following the right/expected paths, but the status update seems to have failed.

You can see in the logs that we have reached controller.go:424, where it is going into phase Failed, so the error is being returned and recognised correctly, and for some reason setPhase is failing to actually update the API. I would have expected this to be a transient issue.
https://github.com/openshift/cluster-api-provider-aws/blob/5e266b553d8e7c5809d94653e2531167c865a762/vendor/github.com/openshift/machine-api-operator/pkg/controller/machine/controller.go#L422-L447
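
For illustration, a self-contained sketch of the branch described above; actuatorCreate, setPhase, and invalidConfigErr are hypothetical stand-ins for the real controller and actuator functions linked above:

package main

import (
	"errors"
	"fmt"
)

// invalidConfigErr marks errors that will never succeed on retry,
// e.g. a malformed AMI ID.
var invalidConfigErr = errors.New("invalid machine configuration")

// actuatorCreate stands in for the cloud actuator's Create call.
func actuatorCreate() error {
	return fmt.Errorf("failed to launch instance: %w", invalidConfigErr)
}

// setPhase stands in for the API update that this comment suspects is
// not being persisted.
func setPhase(phase string) error {
	fmt.Println("setting phase to", phase)
	return nil
}

func reconcile() (requeue bool, err error) {
	if err := actuatorCreate(); err != nil {
		if errors.Is(err, invalidConfigErr) {
			// Terminal: go to phase Failed and stop requeueing.
			return false, setPhase("Failed")
		}
		// Transient: requeue and retry.
		return true, err
	}
	return false, nil
}

func main() {
	requeue, err := reconcile()
	fmt.Println("requeue:", requeue, "err:", err)
}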

Could you also check the annotations on the Machine when this happened? There should also be "machine.openshift.io/instance-state": "unknown" in the annotations at this point

--- Additional comment from sunzhaohua on 2020-05-27 09:00:40 UTC ---

@Joel Speed It stayed stuck in Provisioning; I tried several times and it was stuck in Provisioning every time, the longest for 89m. The "machine.openshift.io/instance-state": "unknown" annotation can be seen on the machine. My cluster is here: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/95078/artifact/workdir/install-dir/auth/kubeconfig/*view*/ in case you need to debug; the machine is still in Provisioning.

Comment 3 sunzhaohua 2020-06-03 06:15:38 UTC
Verified
Clusterversion: 4.5.0-0.nightly-2020-06-02-174025
Created a new machine with an invalid image; the machine went to Failed.

$ oc get machine
NAME                                PHASE     TYPE            REGION        ZONE            AGE
hongli-pl025-vqdlx-master-0         Running   n1-standard-4   us-central1   us-central1-a   5h40m
hongli-pl025-vqdlx-master-1         Running   n1-standard-4   us-central1   us-central1-b   5h40m
hongli-pl025-vqdlx-master-2         Running   n1-standard-4   us-central1   us-central1-c   5h40m
hongli-pl025-vqdlx-worker-a-xnwrk   Running   n1-standard-4   us-central1   us-central1-a   5h35m
hongli-pl025-vqdlx-worker-b-zwz4s   Running   n1-standard-4   us-central1   us-central1-b   5h35m
hongli-pl025-vqdlx-worker-c-9srjr   Running   n1-standard-4   us-central1   us-central1-c   5h35m
hongli-pl025-vqdlx-worker-f-5rk7j   Failed                                                  10s

Comment 4 errata-xmlrpc 2020-07-13 17:42:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

