Description of problem:
Machine status should be "Failed" when creating a machine with invalid machine configuration

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-02-19-173908

How reproducible:
Always

Steps to Reproduce:
1. Creating a machineset with invalid configuration, such as invalid secret:

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: zhsung-nrjcj
  name: zhsung-nrjcj-w-bb
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: zhsung-nrjcj
      machine.openshift.io/cluster-api-machineset: zhsung-nrjcj-w-bb
  template:
    metadata:
      creationTimestamp: null
      labels:
        machine.openshift.io/cluster-api-cluster: zhsung-nrjcj
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: zhsung-nrjcj-w-bb
    spec:
      metadata:
        creationTimestamp: null
      providerSpec:
        value:
          apiVersion: gcpprovider.openshift.io/v1beta1
          canIPForward: false
          credentialsSecret:
            name: gcp-cloud-credentials-invalid
          deletionProtection: false
          disks:
          - autoDelete: true
            boot: true
            image: zhsung-nrjcj-rhcos-image-invalid
            labels: null
            sizeGb: 128
            type: pd-ssd
          kind: GCPMachineProviderSpec
          machineType: n1-standard-4
          metadata:
            creationTimestamp: null
          networkInterfaces:
          - network: zhsung-nrjcj-network
            subnetwork: zhsung-nrjcj-worker-subnet
          projectID: openshift-qe
          region: us-central1
          serviceAccounts:
          - email: zhsung-nrjcj-w.gserviceaccount.com
            scopes:
            - https://www.googleapis.com/auth/cloud-platform
          tags:
          - zhsung-nrjcj-worker
          userDataSecret:
            name: worker-user-data
          zone: us-central1-b

2. Check machines and logs

$ oc get machine
NAME                      PHASE     TYPE            REGION        ZONE            AGE
zhsung-nrjcj-m-0          Running   n1-standard-4   us-central1   us-central1-a   31h
zhsung-nrjcj-m-1          Running   n1-standard-4   us-central1   us-central1-b   31h
zhsung-nrjcj-m-2          Running   n1-standard-4   us-central1   us-central1-c   31h
zhsung-nrjcj-w-a-5hrpb    Running   n1-standard-4   us-central1   us-central1-a   6h11m
zhsung-nrjcj-w-b-vqgqj    Running   n1-standard-4   us-central1   us-central1-b   31h
zhsung-nrjcj-w-bb-qhnsn                                                           26m

3. Delete machine

Actual results:
Machine has no status and couldn't be deleted, stuck in Deleting status

I0221 08:38:37.315317 1 controller.go:163] zhsung-nrjcj-w-bb-qhnsn: reconciling Machine
I0221 08:38:37.315895 1 actuator.go:80] zhsung-nrjcj-w-bb-qhnsn: Checking if machine exists
E0221 08:38:37.316696 1 controller.go:255] zhsung-nrjcj-w-bb-qhnsn: failed to check if machine exists: zhsung-nrjcj-w-bb-qhnsn: failed to create scope for machine: error getting credentials secret "gcp-cloud-credentials-invalid" in namespace "openshift-machine-api": Secret "gcp-cloud-credentials-invalid" not found

$ oc get machine
NAME                      PHASE      TYPE            REGION        ZONE            AGE
zhsung-nrjcj-m-0          Running    n1-standard-4   us-central1   us-central1-a   32h
zhsung-nrjcj-m-1          Running    n1-standard-4   us-central1   us-central1-b   32h
zhsung-nrjcj-m-2          Running    n1-standard-4   us-central1   us-central1-c   32h
zhsung-nrjcj-w-a-5hrpb    Running    n1-standard-4   us-central1   us-central1-a   6h17m
zhsung-nrjcj-w-b-vqgqj    Running    n1-standard-4   us-central1   us-central1-b   31h
zhsung-nrjcj-w-bb-qhnsn   Deleting                                                 32m

Expected results:
The machine phase is set "Failed"

Additional info:
This is a very particular scenario for invalidConfig, as it makes the reconcile loop fail before it is able to check whether the instance exists. The same failure could occur when the spec is modified after creation for an existing instance. We don't want to fail machines in that scenario, since failing a machine is unrecoverable. We only fail them on creation or when the backing instance is deleted out of band. As we are working on defaulting/validation for machine providerSpecs, we will explore ways to mitigate this.
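For readers following the reasoning above, here is a minimal sketch of that decision rule in Go. It is not the machine controller's code: the `observation` fields, the `decision` values, and the `decide` function are hypothetical names introduced only for illustration. It shows why a missing credentials secret ends in a retry rather than a "Failed" phase: the existence check itself errors out, so the controller cannot tell a new machine from an existing one whose spec was edited.

```go
package main

import (
	"errors"
	"fmt"
)

// decision is a hypothetical summary of what a reconcile pass does next.
type decision string

const (
	requeue    decision = "requeue"     // retry later, phase left untouched
	markFailed decision = "mark-failed" // set the terminal "Failed" phase
	proceed    decision = "proceed"     // continue with create/update as usual
)

// observation holds illustrative inputs; these are assumptions, not real fields.
type observation struct {
	existsErr               error // non-nil when the existence check could not run
	exists                  bool  // result of the existence check, if it ran
	everProvisioned         bool  // the machine previously had a backing instance
	createRejectedAsInvalid bool  // creation was rejected as invalid configuration
}

// errSecretNotFound mirrors the situation in this report: the scope cannot be
// built, so existence is unknown.
var errSecretNotFound = errors.New(`Secret "gcp-cloud-credentials-invalid" not found`)

// decide applies the rule from the comment above: fail only on creation or
// when the backing instance was deleted out of band.
func decide(o observation) decision {
	if o.existsErr != nil {
		// Existence unknown: the instance may well be running, so do not fail.
		return requeue
	}
	if o.exists {
		return proceed
	}
	if o.everProvisioned {
		// Instance is gone but was there before: deleted out of band.
		return markFailed
	}
	if o.createRejectedAsInvalid {
		// Nothing was ever created and the configuration is invalid.
		return markFailed
	}
	return proceed
}

func main() {
	fmt.Println(decide(observation{existsErr: errSecretNotFound}))     // requeue
	fmt.Println(decide(observation{everProvisioned: true}))            // mark-failed
	fmt.Println(decide(observation{createRejectedAsInvalid: true}))    // mark-failed
	fmt.Println(decide(observation{exists: true, everProvisioned: true})) // proceed
}
```

Under these assumptions the machine in this report always lands in the first branch, which matches the empty phase seen in the `oc get machine` output.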
We haven't prioritised investigating this, but we would still like to keep it open for now. Tagging upcomingSprint.
This will be mitigated by https://github.com/openshift/machine-api-operator/pull/615. Still won't fix it as per https://bugzilla.redhat.com/show_bug.cgi?id=1805639#c1
https://github.com/openshift/machine-api-operator/pull/660#issue-457817017 Adding upcomingSprint label as this will need more testing and PR reviewing before merging
The proposed fix is still under discussion; we will hopefully make some progress on this next sprint.
Tagging to try to reprioritise and either close/fix this during next sprint
Bumping this to 4.7
We didn't manage to reach an agreement on how to solve this issue, will review next sprint
*** Bug 1881865 has been marked as a duplicate of this bug. ***
We discussed this recently. We are going to send a warning when the secret doesn't exist, rather than reject the machine; otherwise we could end up with a race during cluster bootstrap. Moving back to assigned to remind myself to update the PR.
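As a rough illustration of the warn-instead-of-reject approach described above, the sketch below keeps warnings and errors in separate lists so that a missing secret never blocks admission. It is not the code from the PR: `secretExists`, `validateCredentialsSecret`, and the empty-name rule are assumptions made for the example. Only the warning text is modelled on the message quoted in the verification comment.

```go
package main

import "fmt"

// secretExists is a hypothetical lookup; real code would query the API server.
type secretExists func(namespace, name string) bool

// validateCredentialsSecret returns warnings and errors separately.
// A missing secret is only a warning, because the secret may be created
// later (for example while the cluster is still bootstrapping).
func validateCredentialsSecret(exists secretExists, namespace, name string) (warnings, errs []string) {
	if name == "" {
		// Illustrative hard failure, not taken from this bug: an empty
		// name can never resolve, so it is treated as an error here.
		errs = append(errs, "providerSpec.credentialsSecret: name must not be empty")
		return warnings, errs
	}
	if !exists(namespace, name) {
		warnings = append(warnings, fmt.Sprintf(
			"providerSpec.credentialsSecret: Invalid value: %q: not found. Expected CredentialsSecret to exist",
			name))
	}
	return warnings, errs
}

func main() {
	// Pretend no secrets exist in the cluster.
	noSecrets := func(namespace, name string) bool { return false }

	warnings, errs := validateCredentialsSecret(noSecrets, "openshift-machine-api", "aws-cloud-credentials-invalid")
	for _, w := range warnings {
		fmt.Println("W", w)
	}
	fmt.Println("rejected:", len(errs) > 0)
}
```

With this split, the caller can surface the warning to the user while still admitting the object, which avoids rejecting machines whose secret simply has not been created yet.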
I've made the adjustments on the PR and that's now up for review, moving this back to post
There has been some back-and-forth discussion on the PR for this. It is still under review but should be resolved in the next sprint.
Verified, warn users when a credentials secret does not exist.
clusterversion: 4.7.0-0.nightly-2020-12-14-165231

$ oc get machine
NAME                                       PHASE     TYPE        REGION      ZONE         AGE
zhsunaws16-pqhbg-master-0                  Running   m5.xlarge   us-east-2   us-east-2a   7h41m
zhsunaws16-pqhbg-master-1                  Running   m5.xlarge   us-east-2   us-east-2b   7h41m
zhsunaws16-pqhbg-master-2                  Running   m5.xlarge   us-east-2   us-east-2c   7h41m
zhsunaws16-pqhbg-worker-us-east-2a-98bpq   Running   m5.large    us-east-2   us-east-2a   7h31m
zhsunaws16-pqhbg-worker-us-east-2b-mgmdl   Running   m5.large    us-east-2   us-east-2b   7h31m
zhsunaws16-pqhbg-worker-us-east-2c-cdlg5                                                  84s
zhsunaws16-pqhbg-worker-us-east-2c-szt75   Running   m5.large    us-east-2   us-east-2c   7h31m

I1216 08:57:41.713358 1 controller.go:171] zhsunaws16-pqhbg-worker-us-east-2c-cdlg5: reconciling Machine
I1216 08:57:41.713379 1 actuator.go:100] zhsunaws16-pqhbg-worker-us-east-2c-cdlg5: actuator checking if machine exists
E1216 08:57:41.713839 1 controller.go:274] zhsunaws16-pqhbg-worker-us-east-2c-cdlg5: failed to check if machine exists: zhsunaws16-pqhbg-worker-us-east-2c-cdlg5: failed to create scope for machine: failed to create aws client: aws credentials secret openshift-machine-api/aws-cloud-credentials-invalid: Secret "aws-cloud-credentials-invalid" not found not found
E1216 08:57:41.713913 1 controller.go:237] controller "msg"="Reconciler error" "error"="zhsunaws16-pqhbg-worker-us-east-2c-cdlg5: failed to create scope for machine: failed to create aws client: aws credentials secret openshift-machine-api/aws-cloud-credentials-invalid: Secret \"aws-cloud-credentials-invalid\" not found not found" "controller"="machine_controller" "name"="zhsunaws16-pqhbg-worker-us-east-2c-cdlg5" "namespace"="openshift-machine-api"
I1216 08:58:18.868090 1 controller.go:58] controllers/MachineSet "msg"="Reconciling" "machineset"="zhsunaws16-pqhbg-worker-us-east-2c" "namespace"="openshift-machine-api"
W1216 08:58:18.893532 1 warnings.go:67] providerSpec.credentialsSecret: Invalid value: "aws-cloud-credentials-invalid": not found. Expected CredentialsSecret to exist
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633