1805639 – Machine status should be "Failed" when creating a machine with invalid machine configuration

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Joel Speed
QA Contact:	sunzhaohua
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1881865
Depends On:
Blocks:

Reported:	2020-02-21 09:19 UTC by sunzhaohua
Modified:	2021-02-24 15:11 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The Machine API provided no feedback to users when credentials secrets were invalid. Consequence: It was hard to diagnose issues with the cloud provider credentials. Fix: Provide a warning if the credentials secret does not exist or is in the wrong format. Result: Users are now warned when creating or updating MachineSets that there may be an issue with their credentials.
Clone Of:
Environment:
Last Closed:	2021-02-24 15:10:53 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-api-operator pull 673	0	None	closed	Bug 1805639: Validate credentialsSecret	2021-02-15 19:12:49 UTC
Red Hat Product Errata	RHSA-2020:5633	0	None	None	None	2021-02-24 15:11:46 UTC

Description sunzhaohua 2020-02-21 09:19:52 UTC

Description of problem:
Machine status should be "Failed" when creating a machine with invalid machine configuration
 
Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-02-19-173908

How reproducible:
Always

Steps to Reproduce:
1. Create a MachineSet with an invalid configuration, such as a credentials secret that does not exist:
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: zhsung-nrjcj
  name: zhsung-nrjcj-w-bb
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: zhsung-nrjcj
      machine.openshift.io/cluster-api-machineset: zhsung-nrjcj-w-bb
  template:
    metadata:
      creationTimestamp: null
      labels:
        machine.openshift.io/cluster-api-cluster: zhsung-nrjcj
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: zhsung-nrjcj-w-bb
    spec:
      metadata:
        creationTimestamp: null
      providerSpec:
        value:
          apiVersion: gcpprovider.openshift.io/v1beta1
          canIPForward: false
          credentialsSecret:
            name: gcp-cloud-credentials-invalid
          deletionProtection: false
          disks:
          - autoDelete: true
            boot: true
            image: zhsung-nrjcj-rhcos-image-invalid
            labels: null
            sizeGb: 128
            type: pd-ssd
          kind: GCPMachineProviderSpec
          machineType: n1-standard-4
          metadata:
            creationTimestamp: null
          networkInterfaces:
          - network: zhsung-nrjcj-network
            subnetwork: zhsung-nrjcj-worker-subnet
          projectID: openshift-qe
          region: us-central1
          serviceAccounts:
          - email: zhsung-nrjcj-w.gserviceaccount.com
            scopes:
            - https://www.googleapis.com/auth/cloud-platform
          tags:
          - zhsung-nrjcj-worker
          userDataSecret:
            name: worker-user-data
          zone: us-central1-b
2. Check machines and logs
$ oc get machine
NAME                                     PHASE      TYPE            REGION        ZONE            AGE
zhsung-nrjcj-m-0                         Running    n1-standard-4   us-central1   us-central1-a   31h
zhsung-nrjcj-m-1                         Running    n1-standard-4   us-central1   us-central1-b   31h
zhsung-nrjcj-m-2                         Running    n1-standard-4   us-central1   us-central1-c   31h
zhsung-nrjcj-w-a-5hrpb                   Running    n1-standard-4   us-central1   us-central1-a   6h11m
zhsung-nrjcj-w-b-vqgqj                   Running    n1-standard-4   us-central1   us-central1-b   31h
zhsung-nrjcj-w-bb-qhnsn                                                                           26m

3. Delete the machine

Actual results:
The machine has no status and cannot be deleted; it is stuck in the Deleting phase.

I0221 08:38:37.315317       1 controller.go:163] zhsung-nrjcj-w-bb-qhnsn: reconciling Machine
I0221 08:38:37.315895       1 actuator.go:80] zhsung-nrjcj-w-bb-qhnsn: Checking if machine exists
E0221 08:38:37.316696       1 controller.go:255] zhsung-nrjcj-w-bb-qhnsn: failed to check if machine exists: zhsung-nrjcj-w-bb-qhnsn: failed to create scope for machine: error getting credentials secret "gcp-cloud-credentials-invalid" in namespace "openshift-machine-api": Secret "gcp-cloud-credentials-invalid" not found

$ oc get machine
NAME                                     PHASE      TYPE            REGION        ZONE            AGE
zhsung-nrjcj-m-0                         Running    n1-standard-4   us-central1   us-central1-a   32h
zhsung-nrjcj-m-1                         Running    n1-standard-4   us-central1   us-central1-b   32h
zhsung-nrjcj-m-2                         Running    n1-standard-4   us-central1   us-central1-c   32h
zhsung-nrjcj-w-a-5hrpb                   Running    n1-standard-4   us-central1   us-central1-a   6h17m
zhsung-nrjcj-w-b-vqgqj                   Running    n1-standard-4   us-central1   us-central1-b   31h
zhsung-nrjcj-w-bb-qhnsn                  Deleting                                                 32m


Expected results:
The machine phase is set to "Failed".

Additional info:

Comment 1 Alberto 2020-05-29 10:52:32 UTC

This is a very particular scenario for invalidConfig, as it makes the reconcile loop fail before it is able to check whether the instance exists.
It could be the case that the spec was modified after creation for an existing instance. We don't want to fail machines in such a scenario, as that's unrecoverable. We only fail them on creation or when the backing instance is deleted out of band.

As we are working on defaulting/validation for machine providerSpecs, we will explore ways to mitigate this.
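
The rule described above can be sketched roughly as follows. This is only an illustration of the reasoning in this comment, not the machine controller's code: the function `decide`, the `Outcome` names, and the `was_provisioned` flag are all hypothetical and chosen for this sketch.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    REQUEUE = "requeue"  # return an error and retry; the phase is left untouched
    UPDATE = "update"    # instance exists; never mark it Failed
    CREATE = "create"    # first provisioning attempt with a usable config
    FAILED = "Failed"    # terminal phase


def decide(instance_exists: Optional[bool], was_provisioned: bool, config_valid: bool) -> Outcome:
    """Hypothetical decision helper; instance_exists=None means the check itself failed."""
    if instance_exists is None:
        # The case in this bug: the scope could not be built (missing secret),
        # so creation cannot be told apart from a later spec edit.
        return Outcome.REQUEUE
    if instance_exists:
        # A spec modified after creation must not fail a live instance.
        return Outcome.UPDATE
    if was_provisioned:
        # The backing instance was deleted out of band.
        return Outcome.FAILED
    # Creation time: an invalid configuration fails the machine.
    return Outcome.CREATE if config_valid else Outcome.FAILED
```

Under these assumptions, `decide(None, False, False)` yields `Outcome.REQUEUE`, which matches the behaviour reported here: the machine keeps being requeued and never reaches "Failed".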

Comment 2 Alberto 2020-06-19 08:16:38 UTC

We haven't prioritised investigating this, but would still like to keep it open for now. Tagging upcomingSprint.

Comment 3 Alberto 2020-07-01 09:40:43 UTC

This will be mitigated by https://github.com/openshift/machine-api-operator/pull/615. That still won't fix it, as per https://bugzilla.redhat.com/show_bug.cgi?id=1805639#c1

Comment 4 Alberto 2020-07-28 13:43:33 UTC

https://github.com/openshift/machine-api-operator/pull/660#issue-457817017
Adding the upcomingSprint label, as this will need more testing and PR review before merging.

Comment 5 Joel Speed 2020-08-20 10:53:23 UTC

The proposed fix is still under discussion; we will hopefully make some progress on this next sprint.

Comment 6 Alberto 2020-09-10 08:11:51 UTC

Tagging to try to reprioritise and either close or fix this during the next sprint.

Comment 7 Alberto 2020-09-16 07:45:47 UTC

Bumping this to 4.7

Comment 8 Joel Speed 2020-10-01 16:29:36 UTC

We didn't manage to reach an agreement on how to solve this issue; we will review it next sprint.

Comment 9 Danil Grigorev 2020-11-02 16:14:02 UTC

*** Bug 1881865 has been marked as a duplicate of this bug. ***

Comment 10 Joel Speed 2020-11-13 15:49:07 UTC

We discussed this recently. We are going to send only a warning when the secret doesn't exist, as otherwise we could end up with a race during cluster bootstrap. Moving back to ASSIGNED to remind myself to update the PR.
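
A minimal sketch of the warning-instead-of-error idea, for illustration only. The real change lives in the machine-api-operator PR linked on this bug; the names `ValidationResult` and `validate_credentials_secret` are hypothetical, the set of existing secrets stands in for an API lookup, and treating a missing secret name as a hard error is an assumption of this sketch. The warning wording mirrors the log line seen during verification.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ValidationResult:
    errors: List[str] = field(default_factory=list)    # would reject the object
    warnings: List[str] = field(default_factory=list)  # reported, object still admitted

    @property
    def allowed(self) -> bool:
        return not self.errors


def validate_credentials_secret(provider_spec: dict, existing_secrets: Set[str]) -> ValidationResult:
    """Hypothetical check: warn, rather than reject, when the secret is absent."""
    result = ValidationResult()
    name = (provider_spec.get("credentialsSecret") or {}).get("name")
    if not name:
        # Assumption of this sketch: no name at all is treated as an error.
        result.errors.append("providerSpec.credentialsSecret: Required value")
        return result
    if name not in existing_secrets:
        # The secret may simply not have been created yet (for example during
        # cluster bootstrap), so the object is admitted with a warning.
        result.warnings.append(
            f'providerSpec.credentialsSecret: Invalid value: "{name}": '
            "not found. Expected CredentialsSecret to exist"
        )
    return result
```

With this shape, a MachineSet that references a missing secret is still admitted (`allowed` is true) but carries one warning, so a secret that appears later does not require the object to be resubmitted.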

Comment 11 Joel Speed 2020-12-01 11:35:08 UTC

I've made the adjustments on the PR and it's now up for review; moving this back to POST.

Comment 12 Michael McCune 2020-12-04 21:18:27 UTC

There has been some back-and-forth discussion on the PR for this. It is still under review but should be resolved in the next sprint.

Comment 14 sunzhaohua 2020-12-16 09:20:24 UTC

Verified: users are warned when a credentials secret does not exist.

clusterversion: 4.7.0-0.nightly-2020-12-14-165231
$ oc get machine
NAME                                       PHASE     TYPE        REGION      ZONE         AGE
zhsunaws16-pqhbg-master-0                  Running   m5.xlarge   us-east-2   us-east-2a   7h41m
zhsunaws16-pqhbg-master-1                  Running   m5.xlarge   us-east-2   us-east-2b   7h41m
zhsunaws16-pqhbg-master-2                  Running   m5.xlarge   us-east-2   us-east-2c   7h41m
zhsunaws16-pqhbg-worker-us-east-2a-98bpq   Running   m5.large    us-east-2   us-east-2a   7h31m
zhsunaws16-pqhbg-worker-us-east-2b-mgmdl   Running   m5.large    us-east-2   us-east-2b   7h31m
zhsunaws16-pqhbg-worker-us-east-2c-cdlg5                                                  84s
zhsunaws16-pqhbg-worker-us-east-2c-szt75   Running   m5.large    us-east-2   us-east-2c   7h31m


I1216 08:57:41.713358       1 controller.go:171] zhsunaws16-pqhbg-worker-us-east-2c-cdlg5: reconciling Machine
I1216 08:57:41.713379       1 actuator.go:100] zhsunaws16-pqhbg-worker-us-east-2c-cdlg5: actuator checking if machine exists
E1216 08:57:41.713839       1 controller.go:274] zhsunaws16-pqhbg-worker-us-east-2c-cdlg5: failed to check if machine exists: zhsunaws16-pqhbg-worker-us-east-2c-cdlg5: failed to create scope for machine: failed to create aws client: aws credentials secret openshift-machine-api/aws-cloud-credentials-invalid: Secret "aws-cloud-credentials-invalid" not found not found
E1216 08:57:41.713913       1 controller.go:237] controller "msg"="Reconciler error" "error"="zhsunaws16-pqhbg-worker-us-east-2c-cdlg5: failed to create scope for machine: failed to create aws client: aws credentials secret openshift-machine-api/aws-cloud-credentials-invalid: Secret \"aws-cloud-credentials-invalid\" not found not found" "controller"="machine_controller" "name"="zhsunaws16-pqhbg-worker-us-east-2c-cdlg5" "namespace"="openshift-machine-api" 
I1216 08:58:18.868090       1 controller.go:58] controllers/MachineSet "msg"="Reconciling" "machineset"="zhsunaws16-pqhbg-worker-us-east-2c" "namespace"="openshift-machine-api" 
W1216 08:58:18.893532       1 warnings.go:67] providerSpec.credentialsSecret: Invalid value: "aws-cloud-credentials-invalid": not found. Expected CredentialsSecret to exist
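
When checking controller logs like the ones above, the new warning can be picked out by its severity letter. The following is a small sketch that assumes the standard klog line header (severity letter, mmdd date, time, thread id, file:line, then the message); the helper names `parse_klog` and `warnings_only` are made up for this example.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# klog header: Lmmdd hh:mm:ss.uuuuuu threadid file:line] message
KLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] "
    r"(?P<message>.*)$"
)


class LogEntry(NamedTuple):
    severity: str  # I, W, E or F
    source: str    # file:line that emitted the entry
    message: str


def parse_klog(line: str) -> Optional[LogEntry]:
    """Return a LogEntry, or None for lines that are not klog output."""
    match = KLOG_LINE.match(line.rstrip())
    if match is None:
        return None
    return LogEntry(match["severity"], f'{match["file"]}:{match["line"]}', match["message"])


def warnings_only(lines: Iterable[str]) -> List[LogEntry]:
    """Keep only warning-level entries, skipping unparseable lines."""
    return [e for e in map(parse_klog, lines) if e is not None and e.severity == "W"]
```

Feeding it the four log lines above would leave only the `warnings.go:67` entry about `providerSpec.credentialsSecret`.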

Comment 17 errata-xmlrpc 2021-02-24 15:10:53 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633