Description of problem:
Machine should be "Failed" when creating a machine with invalid zone

How reproducible:
always

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-12-09-112139

Steps to Reproduce:
1. Creating a machine with invalid zone
    spec:
      metadata: {}
      providerSpec:
        value:
          apiVersion: gcpprovider.openshift.io/v1beta1
          canIPForward: false
          credentialsSecret:
            name: gcp-cloud-credentials
          deletionProtection: false
          disks:
          - autoDelete: true
            boot: true
            image: projects/rhcos-cloud/global/images/rhcos-47-83-202012030221-0-gcp-x86-64
            labels: null
            sizeGb: 128
            type: pd-ssd
          kind: GCPMachineProviderSpec
          machineType: n1-standard-4
          metadata:
            creationTimestamp: null
          networkInterfaces:
          - network: zhsungcp11-bjhl5-network
            subnetwork: zhsungcp11-bjhl5-worker-subnet
          projectID: openshift-qe
          region: us-central1
          serviceAccounts:
          - email: zhsungcp11-bjhl5-w.gserviceaccount.com
            scopes:
            - https://www.googleapis.com/auth/cloud-platform
          tags:
          - zhsungcp11-bjhl5-worker
          userDataSecret:
            name: worker-user-data
          zone: us-central1-f-invalid
2. Check machines and logs
3.

Actual results:
Machine has no status.
$ oc get machine
NAME                              PHASE     TYPE            REGION        ZONE            AGE
zhsungcp11-bjhl5-master-0         Running   n1-standard-4   us-central1   us-central1-a   7h42m
zhsungcp11-bjhl5-master-1         Running   n1-standard-4   us-central1   us-central1-b   7h42m
zhsungcp11-bjhl5-master-2         Running   n1-standard-4   us-central1   us-central1-c   7h42m
zhsungcp11-bjhl5-worker-a-5k5cd   Running   n1-standard-4   us-central1   us-central1-a   7h34m
zhsungcp11-bjhl5-worker-b-vwv2r   Running   n1-standard-4   us-central1   us-central1-b   7h34m
zhsungcp11-bjhl5-worker-c-6mzhr   Running   n1-standard-4   us-central1   us-central1-c   3h45m
zhsungcp11-bjhl5-worker-f-kxjq6                                                           52m

I1211 08:45:40.094107 1 controller.go:171] zhsungcp11-bjhl5-worker-f-kxjq6: reconciling Machine
I1211 08:45:40.094118 1 actuator.go:84] zhsungcp11-bjhl5-worker-f-kxjq6: Checking if machine exists
E1211 08:45:40.173919 1 controller.go:104] controllers/MachineSet "msg"="Failed to reconcile MachineSet" "error"="error fetching machine type \"n1-standard-4\": error fetching machine type \"n1-standard-4\" in zone \"us-central1-f-invalid\": googleapi: Error 400: Invalid value for field 'zone': 'us-central1-f-invalid'. Unknown zone., invalid" "machineset"="zhsungcp11-bjhl5-worker-f" "namespace"="openshift-machine-api"
I1211 08:45:40.174123 1 recorder.go:52] controller-runtime/manager/events "msg"="Warning" "message"="error fetching machine type \"n1-standard-4\": error fetching machine type \"n1-standard-4\" in zone \"us-central1-f-invalid\": googleapi: Error 400: Invalid value for field 'zone': 'us-central1-f-invalid'. Unknown zone., invalid" "object"={"kind":"MachineSet","namespace":"openshift-machine-api","name":"zhsungcp11-bjhl5-worker-f","uid":"fb2170b0-d1ec-411d-bd40-5dd3c1b9c843","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"152130"} "reason"="ReconcileError"

Expected results:
The machine phase is "Failed"

Additional info:
I believe the logs pasted in this example are actually from the MachineSet controller rather than the Machine controller. We will need to reproduce this and grab logs from the Machine controller instead.
$ oc logs -f machine-api-controllers-bdbc54576-dzjds -c machine-controller
I1216 03:53:11.900999 1 controller.go:81] controllers/MachineSet "msg"="Reconciling" "machineset"="zhsungcp16-dczjm-worker-c" "namespace"="openshift-machine-api"
I1216 03:53:11.970222 1 controller.go:171] zhsungcp16-dczjm-worker-c-r88kz: reconciling Machine
I1216 03:53:11.993600 1 controller.go:261] controller "msg"="Successfully Reconciled" "controller"="machine_controller" "name"="zhsungcp16-dczjm-worker-c-r88kz" "namespace"="openshift-machine-api"
I1216 03:53:11.993677 1 controller.go:171] zhsungcp16-dczjm-worker-c-r88kz: reconciling Machine
I1216 03:53:11.993690 1 actuator.go:84] zhsungcp16-dczjm-worker-c-r88kz: Checking if machine exists
E1216 03:53:12.017657 1 controller.go:104] controllers/MachineSet "msg"="Failed to reconcile MachineSet" "error"="error fetching machine type \"n1-standard-4\": error fetching machine type \"n1-standard-4\" in zone \"us-central1-c-invalid\": googleapi: Error 400: Invalid value for field 'zone': 'us-central1-c-invalid'. Unknown zone., invalid" "machineset"="zhsungcp16-dczjm-worker-c" "namespace"="openshift-machine-api"
I1216 03:53:12.019731 1 recorder.go:52] controller-runtime/manager/events "msg"="Warning" "message"="error fetching machine type \"n1-standard-4\": error fetching machine type \"n1-standard-4\" in zone \"us-central1-c-invalid\": googleapi: Error 400: Invalid value for field 'zone': 'us-central1-c-invalid'. Unknown zone., invalid" "object"={"kind":"MachineSet","namespace":"openshift-machine-api","name":"zhsungcp16-dczjm-worker-c","uid":"b599fc79-bdc4-4629-8e00-c0d9bfe9836f","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"66956"} "reason"="ReconcileError"
I1216 03:53:12.039892 1 controller.go:261] controller "msg"="Successfully Reconciled" "controller"="machineset" "name"="zhsungcp16-dczjm-worker-c" "namespace"="openshift-machine-api" "reconcilerGroup"="machine.openshift.io" "reconcilerKind"="MachineSet"
I1216 03:53:12.040124 1 controller.go:81] controllers/MachineSet "msg"="Reconciling" "machineset"="zhsungcp16-dczjm-worker-c" "namespace"="openshift-machine-api"
E1216 03:53:12.098246 1 controller.go:104] controllers/MachineSet "msg"="Failed to reconcile MachineSet" "error"="error fetching machine type \"n1-standard-4\": error fetching machine type \"n1-standard-4\" in zone \"us-central1-c-invalid\": googleapi: Error 400: Invalid value for field 'zone': 'us-central1-c-invalid'. Unknown zone., invalid" "machineset"="zhsungcp16-dczjm-worker-c" "namespace"="openshift-machine-api"
I1216 03:53:12.098513 1 recorder.go:52] controller-runtime/manager/events "msg"="Warning" "message"="error fetching machine type \"n1-standard-4\": error fetching machine type \"n1-standard-4\" in zone \"us-central1-c-invalid\": googleapi: Error 400: Invalid value for field 'zone': 'us-central1-c-invalid'. Unknown zone., invalid" "object"={"kind":"MachineSet","namespace":"openshift-machine-api","name":"zhsungcp16-dczjm-worker-c","uid":"b599fc79-bdc4-4629-8e00-c0d9bfe9836f","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"66960"} "reason"="ReconcileError"
E1216 03:53:12.117744 1 controller.go:274] zhsungcp16-dczjm-worker-c-r88kz: failed to check if machine exists: zhsungcp16-dczjm-worker-c-r88kz: Machine does not exist
E1216 03:53:12.117836 1 controller.go:237] controller "msg"="Reconciler error" "error"="zhsungcp16-dczjm-worker-c-r88kz: Machine does not exist" "controller"="machine_controller" "name"="zhsungcp16-dczjm-worker-c-r88kz" "namespace"="openshift-machine-api"
I1216 03:53:12.118632 1 controller.go:261] controller "msg"="Successfully Reconciled" "controller"="machineset" "name"="zhsungcp16-dczjm-worker-c" "namespace"="openshift-machine-api" "reconcilerGroup"="machine.openshift.io" "reconcilerKind"="MachineSet"
I1216 03:53:13.118223 1 controller.go:171] zhsungcp16-dczjm-worker-c-r88kz: reconciling Machine
I1216 03:53:13.118262 1 actuator.go:84] zhsungcp16-dczjm-worker-c-r88kz: Checking if machine exists
E1216 03:53:13.309834 1 controller.go:274] zhsungcp16-dczjm-worker-c-r88kz: failed to check if machine exists: zhsungcp16-dczjm-worker-c-r88kz: Machine does not exist
E1216 03:53:13.310026 1 controller.go:237] controller "msg"="Reconciler error" "error"="zhsungcp16-dczjm-worker-c-r88kz: Machine does not exist" "controller"="machine_controller" "name"="zhsungcp16-dczjm-worker-c-r88kz" "namespace"="openshift-machine-api"
I1216 03:53:14.310447 1 controller.go:171] zhsungcp16-dczjm-worker-c-r88kz: reconciling Machine
We should be able to detect a broken zone based on the 400 response from the exists call. Let's aim to fix this during the next release. Setting the target to --- until the 4.8 target is created.
Master is now open for 4.8 fixes, so we can start looking into this now.
We need some way to identify that the zone does not exist and mark the machine failed, but only when the machine has not yet been created. The original proposed solution was too broad and would mark the machine as failed if the exists call ever failed. Perhaps instead we can make sure that the create call fails correctly when there is an invalid zone. I believe this would be safer.
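As a rough sketch of the narrower rule described here, the decision could be expressed as below. The `machine` struct, the `outcome` values, and the use of an empty provider ID as the signal that nothing has been created yet are all assumptions made for illustration. If that signal is ever wrong, marking the machine failed could orphan a real instance, so this is not a vetted fix.

```go
package main

import "fmt"

// outcome is what the reconciler does with an error (hypothetical type).
type outcome int

const (
	requeue    outcome = iota // transient or unknown: try again later
	markFailed                // terminal: set the machine phase to Failed
)

// machine is a hypothetical minimal view of the Machine object.
type machine struct {
	// ProviderID is assumed to stay empty until an instance has been created.
	ProviderID string
}

// decide marks the machine failed only when the zone is invalid AND there is
// no sign that an instance was ever created. Every other case requeues, so an
// exists error on a running machine never flips it to Failed.
func decide(m machine, invalidZone bool) outcome {
	if invalidZone && m.ProviderID == "" {
		return markFailed
	}
	return requeue
}

func main() {
	fmt.Println(decide(machine{}, true) == markFailed)
	fmt.Println(decide(machine{ProviderID: "example-id"}, true) == requeue)
}
```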
Sam explored the various ways that we could potentially fix this issue, but each one carried a risk that we might leak instances, which we cannot accept. Unfortunately, the safest route for now is to leave this bug as is: users will be warned when they have made a mistake and should be able to fix it. The only potential fix here would be to make the zone immutable once the machine is created, but that is not guaranteed to be enforced, as it has to be done via a webhook.
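For reference, the immutability check itself would be small. The sketch below shows only the comparison an update validator would run; `validateZoneUpdate` is a hypothetical name and this is not the actual webhook code. It also does nothing about the enforcement gap mentioned above, since it only helps when the webhook is actually invoked.

```go
package main

import "fmt"

// validateZoneUpdate rejects any change to the zone of an existing machine.
// oldZone is the value on the stored object, newZone the value in the update.
func validateZoneUpdate(oldZone, newZone string) error {
	if oldZone != newZone {
		return fmt.Errorf("zone is immutable: cannot change %q to %q", oldZone, newZone)
	}
	return nil
}

func main() {
	// An unchanged zone is accepted.
	fmt.Println(validateZoneUpdate("us-central1-a", "us-central1-a"))
	// A changed zone is rejected with a descriptive error.
	fmt.Println(validateZoneUpdate("us-central1-a", "us-central1-f-invalid"))
}
```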