Bug 1906742 - [gcp]Machine should be "Failed" when creating a machine with invalid zone
Summary: [gcp]Machine should be "Failed" when creating a machine with invalid zone
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Samuel Stuchly
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-12-11 10:49 UTC by sunzhaohua
Modified: 2021-11-26 11:51 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-26 11:51:05 UTC
Target Upstream Version:
Embargoed:



Description sunzhaohua 2020-12-11 10:49:05 UTC
Description of problem:
Machine should be "Failed" when creating a machine with invalid zone

How reproducible:
always

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-12-09-112139

Steps to Reproduce:
1. Create a machine with an invalid zone
    spec:
      metadata: {}
      providerSpec:
        value:
          apiVersion: gcpprovider.openshift.io/v1beta1
          canIPForward: false
          credentialsSecret:
            name: gcp-cloud-credentials
          deletionProtection: false
          disks:
          - autoDelete: true
            boot: true
            image: projects/rhcos-cloud/global/images/rhcos-47-83-202012030221-0-gcp-x86-64
            labels: null
            sizeGb: 128
            type: pd-ssd
          kind: GCPMachineProviderSpec
          machineType: n1-standard-4
          metadata:
            creationTimestamp: null
          networkInterfaces:
          - network: zhsungcp11-bjhl5-network
            subnetwork: zhsungcp11-bjhl5-worker-subnet
          projectID: openshift-qe
          region: us-central1
          serviceAccounts:
          - email: zhsungcp11-bjhl5-w.gserviceaccount.com
            scopes:
            - https://www.googleapis.com/auth/cloud-platform
          tags:
          - zhsungcp11-bjhl5-worker
          userDataSecret:
            name: worker-user-data
          zone: us-central1-f-invalid

2. Check machines and logs

Actual results:
Machine has no status.
 
$ oc get machine
NAME                              PHASE     TYPE            REGION        ZONE            AGE
zhsungcp11-bjhl5-master-0         Running   n1-standard-4   us-central1   us-central1-a   7h42m
zhsungcp11-bjhl5-master-1         Running   n1-standard-4   us-central1   us-central1-b   7h42m
zhsungcp11-bjhl5-master-2         Running   n1-standard-4   us-central1   us-central1-c   7h42m
zhsungcp11-bjhl5-worker-a-5k5cd   Running   n1-standard-4   us-central1   us-central1-a   7h34m
zhsungcp11-bjhl5-worker-b-vwv2r   Running   n1-standard-4   us-central1   us-central1-b   7h34m
zhsungcp11-bjhl5-worker-c-6mzhr   Running   n1-standard-4   us-central1   us-central1-c   3h45m
zhsungcp11-bjhl5-worker-f-kxjq6                                                           52m
 
I1211 08:45:40.094107       1 controller.go:171] zhsungcp11-bjhl5-worker-f-kxjq6: reconciling Machine
I1211 08:45:40.094118       1 actuator.go:84] zhsungcp11-bjhl5-worker-f-kxjq6: Checking if machine exists

E1211 08:45:40.173919       1 controller.go:104] controllers/MachineSet "msg"="Failed to reconcile MachineSet" "error"="error fetching machine type \"n1-standard-4\": error fetching machine type \"n1-standard-4\" in zone \"us-central1-f-invalid\": googleapi: Error 400: Invalid value for field 'zone': 'us-central1-f-invalid'. Unknown zone., invalid" "machineset"="zhsungcp11-bjhl5-worker-f" "namespace"="openshift-machine-api" 
I1211 08:45:40.174123       1 recorder.go:52] controller-runtime/manager/events "msg"="Warning"  "message"="error fetching machine type \"n1-standard-4\": error fetching machine type \"n1-standard-4\" in zone \"us-central1-f-invalid\": googleapi: Error 400: Invalid value for field 'zone': 'us-central1-f-invalid'. Unknown zone., invalid" "object"={"kind":"MachineSet","namespace":"openshift-machine-api","name":"zhsungcp11-bjhl5-worker-f","uid":"fb2170b0-d1ec-411d-bd40-5dd3c1b9c843","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"152130"} "reason"="ReconcileError"
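
For step 2, machines stuck without a phase can also be picked out programmatically. The following is a minimal Python sketch, assuming the input is the parsed output of `oc get machine -n openshift-machine-api -o json` and that the phase is reported under `status.phase`. The function name `machines_without_phase` is hypothetical and not part of any OpenShift tooling.

```python
def machines_without_phase(machine_list):
    """Return the names of Machines whose status has no phase set.

    machine_list is the dict obtained by parsing the JSON list output
    (assumed shape: {"items": [{"metadata": {...}, "status": {...}}, ...]}).
    """
    names = []
    for item in machine_list.get("items", []):
        # status may be missing entirely for a machine that never progressed
        phase = (item.get("status") or {}).get("phase")
        if not phase:
            names.append(item["metadata"]["name"])
    return names
```

Applied to the listing above, this would be expected to report only the worker-f machine, since every other machine shows `Running`.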

Expected results:
The machine phase is "Failed"

Additional info:

Comment 1 Joel Speed 2020-12-15 15:46:52 UTC
I believe the logs pasted in this example are actually from the MachineSet controller rather than the Machine controller. We will need to try to reproduce this and grab logs from the Machine controller instead.

Comment 2 sunzhaohua 2020-12-16 03:56:39 UTC
$ oc logs -f machine-api-controllers-bdbc54576-dzjds -c machine-controller

I1216 03:53:11.900999       1 controller.go:81] controllers/MachineSet "msg"="Reconciling" "machineset"="zhsungcp16-dczjm-worker-c" "namespace"="openshift-machine-api" 
I1216 03:53:11.970222       1 controller.go:171] zhsungcp16-dczjm-worker-c-r88kz: reconciling Machine
I1216 03:53:11.993600       1 controller.go:261] controller "msg"="Successfully Reconciled" "controller"="machine_controller" "name"="zhsungcp16-dczjm-worker-c-r88kz" "namespace"="openshift-machine-api" 
I1216 03:53:11.993677       1 controller.go:171] zhsungcp16-dczjm-worker-c-r88kz: reconciling Machine
I1216 03:53:11.993690       1 actuator.go:84] zhsungcp16-dczjm-worker-c-r88kz: Checking if machine exists
E1216 03:53:12.017657       1 controller.go:104] controllers/MachineSet "msg"="Failed to reconcile MachineSet" "error"="error fetching machine type \"n1-standard-4\": error fetching machine type \"n1-standard-4\" in zone \"us-central1-c-invalid\": googleapi: Error 400: Invalid value for field 'zone': 'us-central1-c-invalid'. Unknown zone., invalid" "machineset"="zhsungcp16-dczjm-worker-c" "namespace"="openshift-machine-api" 
I1216 03:53:12.019731       1 recorder.go:52] controller-runtime/manager/events "msg"="Warning"  "message"="error fetching machine type \"n1-standard-4\": error fetching machine type \"n1-standard-4\" in zone \"us-central1-c-invalid\": googleapi: Error 400: Invalid value for field 'zone': 'us-central1-c-invalid'. Unknown zone., invalid" "object"={"kind":"MachineSet","namespace":"openshift-machine-api","name":"zhsungcp16-dczjm-worker-c","uid":"b599fc79-bdc4-4629-8e00-c0d9bfe9836f","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"66956"} "reason"="ReconcileError"
I1216 03:53:12.039892       1 controller.go:261] controller "msg"="Successfully Reconciled" "controller"="machineset" "name"="zhsungcp16-dczjm-worker-c" "namespace"="openshift-machine-api" "reconcilerGroup"="machine.openshift.io" "reconcilerKind"="MachineSet" 
I1216 03:53:12.040124       1 controller.go:81] controllers/MachineSet "msg"="Reconciling" "machineset"="zhsungcp16-dczjm-worker-c" "namespace"="openshift-machine-api" 
E1216 03:53:12.098246       1 controller.go:104] controllers/MachineSet "msg"="Failed to reconcile MachineSet" "error"="error fetching machine type \"n1-standard-4\": error fetching machine type \"n1-standard-4\" in zone \"us-central1-c-invalid\": googleapi: Error 400: Invalid value for field 'zone': 'us-central1-c-invalid'. Unknown zone., invalid" "machineset"="zhsungcp16-dczjm-worker-c" "namespace"="openshift-machine-api" 
I1216 03:53:12.098513       1 recorder.go:52] controller-runtime/manager/events "msg"="Warning"  "message"="error fetching machine type \"n1-standard-4\": error fetching machine type \"n1-standard-4\" in zone \"us-central1-c-invalid\": googleapi: Error 400: Invalid value for field 'zone': 'us-central1-c-invalid'. Unknown zone., invalid" "object"={"kind":"MachineSet","namespace":"openshift-machine-api","name":"zhsungcp16-dczjm-worker-c","uid":"b599fc79-bdc4-4629-8e00-c0d9bfe9836f","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"66960"} "reason"="ReconcileError"
E1216 03:53:12.117744       1 controller.go:274] zhsungcp16-dczjm-worker-c-r88kz: failed to check if machine exists: zhsungcp16-dczjm-worker-c-r88kz: Machine does not exist
E1216 03:53:12.117836       1 controller.go:237] controller "msg"="Reconciler error" "error"="zhsungcp16-dczjm-worker-c-r88kz: Machine does not exist" "controller"="machine_controller" "name"="zhsungcp16-dczjm-worker-c-r88kz" "namespace"="openshift-machine-api" 
I1216 03:53:12.118632       1 controller.go:261] controller "msg"="Successfully Reconciled" "controller"="machineset" "name"="zhsungcp16-dczjm-worker-c" "namespace"="openshift-machine-api" "reconcilerGroup"="machine.openshift.io" "reconcilerKind"="MachineSet" 
I1216 03:53:13.118223       1 controller.go:171] zhsungcp16-dczjm-worker-c-r88kz: reconciling Machine
I1216 03:53:13.118262       1 actuator.go:84] zhsungcp16-dczjm-worker-c-r88kz: Checking if machine exists
E1216 03:53:13.309834       1 controller.go:274] zhsungcp16-dczjm-worker-c-r88kz: failed to check if machine exists: zhsungcp16-dczjm-worker-c-r88kz: Machine does not exist
E1216 03:53:13.310026       1 controller.go:237] controller "msg"="Reconciler error" "error"="zhsungcp16-dczjm-worker-c-r88kz: Machine does not exist" "controller"="machine_controller" "name"="zhsungcp16-dczjm-worker-c-r88kz" "namespace"="openshift-machine-api" 
I1216 03:53:14.310447       1 controller.go:171] zhsungcp16-dczjm-worker-c-r88kz: reconciling Machine

Comment 3 Joel Speed 2021-01-05 17:19:47 UTC
We should be able to detect a broken zone based on the 400 response from the exists call.

Let's aim to fix this during the next release. Setting the target to --- until the 4.8 target is created.
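
One way to read the detection idea in this comment, sketched in Python for brevity (the actual controller is written in Go). `ApiError` is a hypothetical stand-in for whatever error type the GCP client returns, and the message pattern is copied from the logs in this report rather than from any documented API contract, so matching on it is an assumption.

```python
import re


class ApiError(Exception):
    """Hypothetical stand-in for an HTTP error returned by a GCP API client."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code
        self.message = message


# Pattern taken from the error text seen in the logs of this report.
_INVALID_ZONE = re.compile(r"Invalid value for field 'zone': '([^']+)'")


def invalid_zone_from_error(err):
    """Return the rejected zone name if err is a 400 caused by an unknown zone, else None."""
    if not isinstance(err, ApiError) or err.code != 400:
        return None
    match = _INVALID_ZONE.search(err.message)
    return match.group(1) if match else None
```

Any other error, including a 400 about a different field, returns `None` here so that it is not mistaken for a bad zone.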

Comment 4 Joel Speed 2021-02-08 10:19:26 UTC
Master is now open for 4.8 fixes, so we can start looking into this now.

Comment 5 Joel Speed 2021-05-19 13:54:29 UTC
We need some way to identify that the zone does not exist and to mark the machine as failed, but only when the machine has not yet been created.
The originally proposed solution was too broad and would mark the machine as failed whenever the exists call failed. Perhaps instead we can make sure that the create call fails correctly when there is an invalid zone. I believe this would be safer.
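
The narrower approach described in this comment could look roughly like the Python sketch below. It is an illustration only, not the controller's code (which is Go): `reconcile_once`, `Outcome`, and `InvalidConfigurationError` are hypothetical names, and the sketch assumes the exists check reports the instance as absent so that the create call is actually reached.

```python
from enum import Enum


class Outcome(Enum):
    REQUEUE = "requeue"  # transient or unknown problem, try again later
    EXISTS = "exists"    # instance already present, nothing to create
    CREATED = "created"  # create call succeeded
    FAILED = "failed"    # terminal: this configuration can never succeed


class InvalidConfigurationError(Exception):
    """Hypothetical error raised by create() for a spec the cloud rejects outright."""


def reconcile_once(exists, create):
    """Run one reconcile pass; exists() returns a bool, create() returns nothing."""
    try:
        found = exists()
    except Exception:
        # Never terminal: the instance may already be running, so marking the
        # machine Failed here could leak it.
        return Outcome.REQUEUE
    if found:
        return Outcome.EXISTS
    try:
        create()
    except InvalidConfigurationError:
        # Only a rejected create, for a machine known not to exist, is terminal.
        return Outcome.FAILED
    except Exception:
        return Outcome.REQUEUE
    return Outcome.CREATED
```

The design point is that `FAILED` is reachable only through the create path, which is what makes this safer than failing the machine on any exists error.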

Comment 8 Joel Speed 2021-11-26 11:51:05 UTC
Sam explored the various ways that we could potentially fix this issue, but each one carried a risk of leaking instances, which we cannot accept.
Unfortunately, the safest route for now is to leave this bug as is. Users will be warned when they have made a mistake and should be able to fix it.

The only potential fix here would be to make the zone immutable once the machine is created, but that is not guaranteed to be enforced, as it has to be done via a webhook.
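
The immutability check mentioned here might be sketched as follows. This is a hypothetical Python illustration of the comparison a validating webhook could perform on update, assuming it receives the old and new providerSpec values as dicts. `validate_zone_update` is an invented name, and nothing like it was implemented for this bug.

```python
def validate_zone_update(old_spec, new_spec):
    """Return (allowed, reason) for an update to a machine's providerSpec value.

    Rejects any update that changes the zone, which is the only field this
    sketch inspects.
    """
    old_zone = old_spec.get("zone")
    new_zone = new_spec.get("zone")
    if old_zone != new_zone:
        return False, f"zone is immutable: cannot change {old_zone!r} to {new_zone!r}"
    return True, ""
```

As the comment notes, a check like this only holds while the webhook is in place, so it cannot be relied on as a guarantee.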

