Description of problem:
A machine with an invalid zone cannot be deleted. Even after adding the annotation "machine.openshift.io/exclude-node-draining=", the machine could not be deleted.

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-14-211610

How reproducible:
Always

Steps to Reproduce:
1. Create a machine with an invalid zone

apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: zhsun3-8vcmx
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
  name: zhsun3-8vcmx-w-a-1
  namespace: openshift-machine-api
spec:
  metadata:
    creationTimestamp: null
  providerSpec:
    value:
      apiVersion: gcpprovider.openshift.io/v1beta1
      canIPForward: false
      credentialsSecret:
        name: gcp-cloud-credentials
      deletionProtection: false
      disks:
      - autoDelete: true
        boot: true
        image: zhsun3-8vcmx-rhcos-image
        labels: null
        sizeGb: 128
        type: pd-ssd
      kind: GCPMachineProviderSpec
      machineType: n1-standard-4
      metadata:
        creationTimestamp: null
      networkInterfaces:
      - network: zhsun3-8vcmx-network
        subnetwork: zhsun3-8vcmx-worker-subnet
      projectID: openshift-gce-devel
      region: us-central1
      serviceAccounts:
      - email: zhsun3-8vcmx-w.gserviceaccount.com
        scopes:
        - https://www.googleapis.com/auth/cloud-platform
      tags:
      - zhsun3-8vcmx-worker
      userDataSecret:
        name: worker-user-data
      zone: us-central1-a-invalid

2. Delete the machine

Actual results:
The machine could not be deleted.
$ oc delete machine zhsun3-8vcmx-w-b
machine.machine.openshift.io "zhsun3-8vcmx-w-b" deleted
^C

I0816 05:39:30.652303 1 controller.go:141] Reconciling Machine "zhsun3-8vcmx-w-b"
I0816 05:39:30.652472 1 controller.go:310] Machine "zhsun3-8vcmx-w-b" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0816 05:39:30.652590 1 controller.go:205] Reconciling machine "zhsun3-8vcmx-w-b" triggers delete
I0816 05:39:30.652609 1 actuator.go:116] zhsun3-8vcmx-w-b: Deleting machine
E0816 05:39:30.991773 1 controller.go:220] Failed to delete machine "zhsun3-8vcmx-w-b": unable to verify project/zone exists: openshift-gce-devel/us-central1-b-invalid; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-b-invalid' was not found, notFound
I0816 05:39:57.097079 1 controller.go:141] Reconciling Machine "zhsun3-8vcmx-w-b"
I0816 05:39:57.097203 1 controller.go:310] Machine "zhsun3-8vcmx-w-b" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0816 05:39:57.097219 1 controller.go:205] Reconciling machine "zhsun3-8vcmx-w-b" triggers delete
I0816 05:39:57.097227 1 actuator.go:116] zhsun3-8vcmx-w-b: Deleting machine
E0816 05:39:57.351594 1 controller.go:220] Failed to delete machine "zhsun3-8vcmx-w-b": unable to verify project/zone exists: openshift-gce-devel/us-central1-b-invalid; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-b-invalid' was not found, notFound

Expected results:
Machine could be deleted.

Additional info:
This is tricky: if a zone is invalid, the actuator cannot tell whether that is because the zone does not exist or because its name is malformed. Since it cannot find the corresponding instance in GCE, it cannot delete it, so it backs off until the zone is available. Can you share the entire log from the machine controller? Do you see a similar error message when an instance is being created?
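The behaviour described above can be sketched as follows. This is a minimal illustration in Python, not the provider's Go code: `ZoneNotFound`, `FakeGCE`, and `delete_machine` are hypothetical names, and the `tolerate_missing_zone` flag is an assumption used only to contrast the reported outcome with one possible alternative.

```python
class ZoneNotFound(Exception):
    """Hypothetical stand-in for the 404 returned when a zone lookup fails."""


class FakeGCE:
    """Hypothetical in-memory stand-in for the GCE API."""

    def __init__(self, zones):
        # zones maps a zone name to the set of instance names in that zone.
        self.zones = zones

    def instance_exists(self, zone, name):
        # The zone is checked first, mirroring the
        # "unable to verify project/zone exists" error in the logs.
        if zone not in self.zones:
            raise ZoneNotFound(zone)
        return name in self.zones[zone]

    def delete_instance(self, zone, name):
        self.zones[zone].discard(name)


def delete_machine(gce, zone, name, tolerate_missing_zone=False):
    """Return "deleted" when the Machine object can be removed, else "requeue"."""
    try:
        exists = gce.instance_exists(zone, name)
    except ZoneNotFound:
        if not tolerate_missing_zone:
            # Reported behaviour: the error propagates and the controller retries.
            return "requeue"
        # Assumed alternative: a zone that does not exist cannot hold an instance.
        exists = False
    if exists:
        gce.delete_instance(zone, name)
    return "deleted"
```

With the flag left at its default, a machine whose zone is unknown is requeued on every reconcile, which matches the repeating "Failed to delete machine" entries in the logs.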
Here are the machine controller logs from creating and then deleting a machine. I also tested this on AWS, where the machine could be deleted.

gcp machine-controller logs:
I0819 09:05:35.582748 1 controller.go:141] Reconciling Machine "zhsun2-4g7pw-w-a-invalid"
I0819 09:05:35.584190 1 controller.go:310] Machine "zhsun2-4g7pw-w-a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:05:35.596044 1 controller.go:141] Reconciling Machine "zhsun2-4g7pw-w-a-invalid"
I0819 09:05:35.596083 1 controller.go:310] Machine "zhsun2-4g7pw-w-a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:05:35.596100 1 actuator.go:80] zhsun2-4g7pw-w-a-invalid: Checking if machine exists
E0819 09:05:35.967996 1 controller.go:245] Failed to check if machine "zhsun2-4g7pw-w-a-invalid" exists: unable to verify project/zone exists: openshift-gce-devel/us-central1-a-invalid; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-a-invalid' was not found, notFound
I0819 09:05:36.968378 1 controller.go:141] Reconciling Machine "zhsun2-4g7pw-w-a-invalid"
I0819 09:05:36.968416 1 controller.go:310] Machine "zhsun2-4g7pw-w-a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:05:36.968432 1 actuator.go:80] zhsun2-4g7pw-w-a-invalid: Checking if machine exists
E0819 09:05:37.277957 1 controller.go:245] Failed to check if machine "zhsun2-4g7pw-w-a-invalid" exists: unable to verify project/zone exists: openshift-gce-devel/us-central1-a-invalid; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-a-invalid' was not found, notFound
I0819 09:05:38.278384 1 controller.go:141] Reconciling Machine "zhsun2-4g7pw-w-a-invalid"
I0819 09:05:38.278433 1 controller.go:310] Machine "zhsun2-4g7pw-w-a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:05:38.278463 1 actuator.go:80] zhsun2-4g7pw-w-a-invalid: Checking if machine exists
E0819 09:05:38.595997 1 controller.go:245] Failed to check if machine "zhsun2-4g7pw-w-a-invalid" exists: unable to verify project/zone exists: openshift-gce-devel/us-central1-a-invalid; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-a-invalid' was not found, notFound
I0819 09:06:24.426631 1 controller.go:141] Reconciling Machine "zhsun2-4g7pw-w-a-invalid"
I0819 09:06:24.426676 1 controller.go:310] Machine "zhsun2-4g7pw-w-a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:06:24.426716 1 controller.go:205] Reconciling machine "zhsun2-4g7pw-w-a-invalid" triggers delete
I0819 09:06:24.426725 1 actuator.go:116] zhsun2-4g7pw-w-a-invalid: Deleting machine
E0819 09:06:24.600772 1 controller.go:220] Failed to delete machine "zhsun2-4g7pw-w-a-invalid": unable to verify project/zone exists: openshift-gce-devel/us-central1-a-invalid; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-a-invalid' was not found, notFound
I0819 09:06:26.764445 1 controller.go:141] Reconciling Machine "zhsun2-4g7pw-w-a-invalid"
I0819 09:06:26.764503 1 controller.go:310] Machine "zhsun2-4g7pw-w-a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:06:26.764518 1 controller.go:205] Reconciling machine "zhsun2-4g7pw-w-a-invalid" triggers delete
I0819 09:06:26.764527 1 actuator.go:116] zhsun2-4g7pw-w-a-invalid: Deleting machine
E0819 09:06:27.089704 1 controller.go:220] Failed to delete machine "zhsun2-4g7pw-w-a-invalid": unable to verify project/zone exists: openshift-gce-devel/us-central1-a-invalid; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-a-invalid' was not found, notFound

aws machine-controller logs:
I0819 09:08:55.425914 1 controller.go:141] Reconciling Machine "zhsun-7558q-worker-us-east-2a-invalid"
I0819 09:08:55.425950 1 controller.go:310] Machine "zhsun-7558q-worker-us-east-2a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:08:55.435053 1 controller.go:141] Reconciling Machine "zhsun-7558q-worker-us-east-2a-invalid"
I0819 09:08:55.435077 1 controller.go:310] Machine "zhsun-7558q-worker-us-east-2a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:08:55.435107 1 actuator.go:481] zhsun-7558q-worker-us-east-2a-invalid: Checking if machine exists
I0819 09:08:55.546206 1 actuator.go:489] zhsun-7558q-worker-us-east-2a-invalid: Instance does not exist
I0819 09:08:55.546232 1 controller.go:259] Reconciling machine object zhsun-7558q-worker-us-east-2a-invalid triggers idempotent create.
I0819 09:08:55.546241 1 actuator.go:113] zhsun-7558q-worker-us-east-2a-invalid: creating machine
E0819 09:08:55.546461 1 utils.go:191] NodeRef not found in machine zhsun-7558q-worker-us-east-2a-invalid
I0819 09:08:55.588524 1 instances.go:44] No stopped instances found for machine zhsun-7558q-worker-us-east-2a-invalid
I0819 09:08:55.588582 1 instances.go:142] Using AMI ami-06c85f9d106577272
I0819 09:08:55.588595 1 instances.go:74] Describing security groups based on filters
E0819 09:08:55.744085 1 instances.go:113] error describing availability zones: InvalidParameterValue: Invalid availability zone: [us-east-2a-invalid] status code: 400, request id: d05c2f9b-b785-45ed-b49a-930d2b355d31
E0819 09:08:55.744161 1 actuator.go:107] zhsun-7558q-worker-us-east-2a-invalid: Machine error: error launching instance: error getting subnet IDs: error describing availability zones: InvalidParameterValue: Invalid availability zone: [us-east-2a-invalid] status code: 400, request id: d05c2f9b-b785-45ed-b49a-930d2b355d31,
E0819 09:08:55.744171 1 actuator.go:116] zhsun-7558q-worker-us-east-2a-invalid: error creating machine: error launching instance: error getting subnet IDs: error describing availability zones: InvalidParameterValue: Invalid availability zone: [us-east-2a-invalid] status code: 400, request id: d05c2f9b-b785-45ed-b49a-930d2b355d31,
I0819 09:09:17.172675 1 controller.go:141] Reconciling Machine "zhsun-7558q-worker-us-east-2a-invalid"
I0819 09:09:17.172710 1 controller.go:310] Machine "zhsun-7558q-worker-us-east-2a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:09:17.172739 1 controller.go:205] Reconciling machine "zhsun-7558q-worker-us-east-2a-invalid" triggers delete
I0819 09:09:17.172753 1 actuator.go:333] zhsun-7558q-worker-us-east-2a-invalid: deleting machine
W0819 09:09:17.227440 1 actuator.go:376] zhsun-7558q-worker-us-east-2a-invalid: no instances found to delete for machine
I0819 09:09:17.237614 1 controller.go:239] Machine "zhsun-7558q-worker-us-east-2a-invalid" deletion successful
Thanks for the logs. As expected, a machine with an invalid zone does not result in an instance being created in GCE. So the issue at this point is that a machine created with an invalid zone cannot be deleted afterwards.

AWS works a bit differently. In the AWS case, knowing the machine name is sufficient to find the instance in AWS and delete it. In the GCP case, the zone must be known as well; without it, the actuator cannot even check whether an instance exists in GCE. That is why GCP fails to delete a machine object with an invalid zone and AWS does not.

I am not sure we can do anything about it at this point unless we use webhooks to validate the GCP provider config. The short-term fix is to change the zone to one that already exists. That might be troublesome with a machineset that has created tens of machines with an invalid zone. The long-term solution is webhook validation that talks to GCE and checks whether the provided zone exists. This assumes the webhook will always be able to talk to GCE; if it cannot, the machine(s) will be refused.
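The webhook idea could look roughly like the following sketch. It is written in Python for brevity and does not use any real admission API: `validate_zone`, `lookup_zone`, and both exception classes are hypothetical names. It illustrates the trade-off mentioned above, namely that a machine is refused both when the zone is unknown and when GCE cannot be reached.

```python
class ZoneNotFound(Exception):
    """Hypothetical: GCE answered, and the zone does not exist."""


class CloudUnavailable(Exception):
    """Hypothetical: GCE could not be reached at all."""


def validate_zone(project, zone, lookup_zone):
    """Return (allowed, reason) for a machine's provider config.

    lookup_zone is a caller-supplied function (project, zone) -> None that
    raises ZoneNotFound or CloudUnavailable; it stands in for a GCE API call.
    """
    try:
        lookup_zone(project, zone)
    except ZoneNotFound:
        return False, f"zone {zone!r} not found in project {project!r}"
    except CloudUnavailable:
        # Failing closed: without an answer from GCE the machine is refused.
        return False, "unable to reach GCE to verify the zone"
    return True, ""
```

Rejecting the object at admission time means a machine with an invalid zone never reaches the actuator, so the delete path no longer has to handle it.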
https://github.com/openshift/cluster-api/pull/67
https://github.com/openshift/cluster-api-provider-gcp/pull/51
Verified in 4.2.0-0.nightly-2019-09-02-172410. A machine with an invalid zone can now be deleted.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922