Bug 1741765 - [gcp] Failed to delete machine that has invalid zone
Summary: [gcp] Failed to delete machine that has invalid zone
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Michael Gugino
QA Contact: sunzhaohua
URL:
Whiteboard: gcp
Depends On:
Blocks:
Reported: 2019-08-16 06:02 UTC by sunzhaohua
Modified: 2019-10-16 06:36 UTC (History)
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:36:15 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-api-provider-gcp pull 51 | 0 | None | closed | bug 1741765: Allow invalid project/zone machines to be deleted | 2020-10-19 16:46:23 UTC
Red Hat Product Errata | RHBA-2019:2922 | 0 | None | None | None | 2019-10-16 06:36:26 UTC

Description sunzhaohua 2019-08-16 06:02:18 UTC
Description of problem:
A machine that has an invalid zone cannot be deleted. Even after adding the annotation "machine.openshift.io/exclude-node-draining=", the machine still cannot be deleted.

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-14-211610

How reproducible:
Always

Steps to Reproduce:
1. Create a machine with an invalid zone:
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: zhsun3-8vcmx
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
  name: zhsun3-8vcmx-w-a-1
  namespace: openshift-machine-api
spec:
  metadata:
    creationTimestamp: null
  providerSpec:
    value:
      apiVersion: gcpprovider.openshift.io/v1beta1
      canIPForward: false
      credentialsSecret:
        name: gcp-cloud-credentials
      deletionProtection: false
      disks:
      - autoDelete: true
        boot: true
        image: zhsun3-8vcmx-rhcos-image
        labels: null
        sizeGb: 128
        type: pd-ssd
      kind: GCPMachineProviderSpec
      machineType: n1-standard-4
      metadata:
        creationTimestamp: null
      networkInterfaces:
      - network: zhsun3-8vcmx-network
        subnetwork: zhsun3-8vcmx-worker-subnet
      projectID: openshift-gce-devel
      region: us-central1
      serviceAccounts:
      - email: zhsun3-8vcmx-w.gserviceaccount.com
        scopes:
        - https://www.googleapis.com/auth/cloud-platform
      tags:
      - zhsun3-8vcmx-worker
      userDataSecret:
        name: worker-user-data
      zone: us-central1-a-invalid

2. Delete the machine.

Actual results:
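The machine could not be deleted. The oc delete command below hung until it was interrupted, and the machine controller logged the same failure on every reconcile: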
$ oc delete machine zhsun3-8vcmx-w-b
machine.machine.openshift.io "zhsun3-8vcmx-w-b" deleted
^C


I0816 05:39:30.652303       1 controller.go:141] Reconciling Machine "zhsun3-8vcmx-w-b"
I0816 05:39:30.652472       1 controller.go:310] Machine "zhsun3-8vcmx-w-b" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0816 05:39:30.652590       1 controller.go:205] Reconciling machine "zhsun3-8vcmx-w-b" triggers delete
I0816 05:39:30.652609       1 actuator.go:116] zhsun3-8vcmx-w-b: Deleting machine
E0816 05:39:30.991773       1 controller.go:220] Failed to delete machine "zhsun3-8vcmx-w-b": unable to verify project/zone exists: openshift-gce-devel/us-central1-b-invalid; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-b-invalid' was not found, notFound
I0816 05:39:57.097079       1 controller.go:141] Reconciling Machine "zhsun3-8vcmx-w-b"
I0816 05:39:57.097203       1 controller.go:310] Machine "zhsun3-8vcmx-w-b" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0816 05:39:57.097219       1 controller.go:205] Reconciling machine "zhsun3-8vcmx-w-b" triggers delete
I0816 05:39:57.097227       1 actuator.go:116] zhsun3-8vcmx-w-b: Deleting machine
E0816 05:39:57.351594       1 controller.go:220] Failed to delete machine "zhsun3-8vcmx-w-b": unable to verify project/zone exists: openshift-gce-devel/us-central1-b-invalid; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-b-invalid' was not found, notFound


Expected results:
The machine can be deleted.

Additional info:

Comment 1 Jan Chaloupka 2019-08-16 10:26:02 UTC
This is tricky: if a zone is invalid, the actuator cannot decide whether it is invalid because it does not exist or because the name was malformed. Since it cannot find the corresponding instance in GCE, it cannot delete it. So it backs off until the zone is available.

Can you share the entire log from the machine controller? Do you see a similar error message when an instance is being created?
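
The distinction raised above (zone does not exist versus zone name is malformed) can be partly explored with a local, purely syntactic check. The following is a minimal sketch, not code from the provider: zoneLooksWellFormed is a hypothetical helper, and it assumes zone names have the shape "<region>-<single letter>", which matches the names in this report. A passing check does not prove the zone exists; it only flags names such as "us-central1-a-invalid" without calling GCE.

```go
package main

import (
	"fmt"
	"strings"
)

// zoneLooksWellFormed is a hypothetical helper. It reports whether zone has
// the shape "<region>-<letter>", e.g. region "us-central1" and zone
// "us-central1-a". This is a heuristic based on an assumed naming pattern,
// not an existence check against GCE.
func zoneLooksWellFormed(region, zone string) bool {
	prefix := region + "-"
	if region == "" || !strings.HasPrefix(zone, prefix) {
		return false
	}
	suffix := strings.TrimPrefix(zone, prefix)
	// Exactly one lowercase ASCII letter must follow the region prefix.
	return len(suffix) == 1 && suffix[0] >= 'a' && suffix[0] <= 'z'
}

func main() {
	// Values taken from the providerSpec in this report.
	fmt.Println(zoneLooksWellFormed("us-central1", "us-central1-a"))
	fmt.Println(zoneLooksWellFormed("us-central1", "us-central1-a-invalid"))
}
```

A check like this could only narrow the question; a well-formed name may still refer to a zone that does not exist.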

Comment 2 sunzhaohua 2019-08-19 09:16:04 UTC
Below are the machine controller logs from creating a machine and then deleting it. I also tested this on AWS, where the machine could be deleted.

gcp machine-controller logs:

I0819 09:05:35.582748       1 controller.go:141] Reconciling Machine "zhsun2-4g7pw-w-a-invalid"
I0819 09:05:35.584190       1 controller.go:310] Machine "zhsun2-4g7pw-w-a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:05:35.596044       1 controller.go:141] Reconciling Machine "zhsun2-4g7pw-w-a-invalid"
I0819 09:05:35.596083       1 controller.go:310] Machine "zhsun2-4g7pw-w-a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:05:35.596100       1 actuator.go:80] zhsun2-4g7pw-w-a-invalid: Checking if machine exists
E0819 09:05:35.967996       1 controller.go:245] Failed to check if machine "zhsun2-4g7pw-w-a-invalid" exists: unable to verify project/zone exists: openshift-gce-devel/us-central1-a-invalid; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-a-invalid' was not found, notFound
I0819 09:05:36.968378       1 controller.go:141] Reconciling Machine "zhsun2-4g7pw-w-a-invalid"
I0819 09:05:36.968416       1 controller.go:310] Machine "zhsun2-4g7pw-w-a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:05:36.968432       1 actuator.go:80] zhsun2-4g7pw-w-a-invalid: Checking if machine exists
E0819 09:05:37.277957       1 controller.go:245] Failed to check if machine "zhsun2-4g7pw-w-a-invalid" exists: unable to verify project/zone exists: openshift-gce-devel/us-central1-a-invalid; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-a-invalid' was not found, notFound
I0819 09:05:38.278384       1 controller.go:141] Reconciling Machine "zhsun2-4g7pw-w-a-invalid"
I0819 09:05:38.278433       1 controller.go:310] Machine "zhsun2-4g7pw-w-a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:05:38.278463       1 actuator.go:80] zhsun2-4g7pw-w-a-invalid: Checking if machine exists
E0819 09:05:38.595997       1 controller.go:245] Failed to check if machine "zhsun2-4g7pw-w-a-invalid" exists: unable to verify project/zone exists: openshift-gce-devel/us-central1-a-invalid; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-a-invalid' was not found, notFound



I0819 09:06:24.426631       1 controller.go:141] Reconciling Machine "zhsun2-4g7pw-w-a-invalid"
I0819 09:06:24.426676       1 controller.go:310] Machine "zhsun2-4g7pw-w-a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:06:24.426716       1 controller.go:205] Reconciling machine "zhsun2-4g7pw-w-a-invalid" triggers delete
I0819 09:06:24.426725       1 actuator.go:116] zhsun2-4g7pw-w-a-invalid: Deleting machine
E0819 09:06:24.600772       1 controller.go:220] Failed to delete machine "zhsun2-4g7pw-w-a-invalid": unable to verify project/zone exists: openshift-gce-devel/us-central1-a-invalid; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-a-invalid' was not found, notFound
I0819 09:06:26.764445       1 controller.go:141] Reconciling Machine "zhsun2-4g7pw-w-a-invalid"
I0819 09:06:26.764503       1 controller.go:310] Machine "zhsun2-4g7pw-w-a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:06:26.764518       1 controller.go:205] Reconciling machine "zhsun2-4g7pw-w-a-invalid" triggers delete
I0819 09:06:26.764527       1 actuator.go:116] zhsun2-4g7pw-w-a-invalid: Deleting machine
E0819 09:06:27.089704       1 controller.go:220] Failed to delete machine "zhsun2-4g7pw-w-a-invalid": unable to verify project/zone exists: openshift-gce-devel/us-central1-a-invalid; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-a-invalid' was not found, notFound



aws machine-controller logs:
I0819 09:08:55.425914       1 controller.go:141] Reconciling Machine "zhsun-7558q-worker-us-east-2a-invalid"
I0819 09:08:55.425950       1 controller.go:310] Machine "zhsun-7558q-worker-us-east-2a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:08:55.435053       1 controller.go:141] Reconciling Machine "zhsun-7558q-worker-us-east-2a-invalid"
I0819 09:08:55.435077       1 controller.go:310] Machine "zhsun-7558q-worker-us-east-2a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:08:55.435107       1 actuator.go:481] zhsun-7558q-worker-us-east-2a-invalid: Checking if machine exists
I0819 09:08:55.546206       1 actuator.go:489] zhsun-7558q-worker-us-east-2a-invalid: Instance does not exist
I0819 09:08:55.546232       1 controller.go:259] Reconciling machine object zhsun-7558q-worker-us-east-2a-invalid triggers idempotent create.
I0819 09:08:55.546241       1 actuator.go:113] zhsun-7558q-worker-us-east-2a-invalid: creating machine
E0819 09:08:55.546461       1 utils.go:191] NodeRef not found in machine zhsun-7558q-worker-us-east-2a-invalid
I0819 09:08:55.588524       1 instances.go:44] No stopped instances found for machine zhsun-7558q-worker-us-east-2a-invalid
I0819 09:08:55.588582       1 instances.go:142] Using AMI ami-06c85f9d106577272
I0819 09:08:55.588595       1 instances.go:74] Describing security groups based on filters
E0819 09:08:55.744085       1 instances.go:113] error describing availability zones: InvalidParameterValue: Invalid availability zone: [us-east-2a-invalid]
	status code: 400, request id: d05c2f9b-b785-45ed-b49a-930d2b355d31
E0819 09:08:55.744161       1 actuator.go:107] zhsun-7558q-worker-us-east-2a-invalid: Machine error: error launching instance: error getting subnet IDs: error describing availability zones: InvalidParameterValue: Invalid availability zone: [us-east-2a-invalid]
	status code: 400, request id: d05c2f9b-b785-45ed-b49a-930d2b355d31,
E0819 09:08:55.744171       1 actuator.go:116] zhsun-7558q-worker-us-east-2a-invalid: error creating machine: error launching instance: error getting subnet IDs: error describing availability zones: InvalidParameterValue: Invalid availability zone: [us-east-2a-invalid]
	status code: 400, request id: d05c2f9b-b785-45ed-b49a-930d2b355d31,


I0819 09:09:17.172675       1 controller.go:141] Reconciling Machine "zhsun-7558q-worker-us-east-2a-invalid"
I0819 09:09:17.172710       1 controller.go:310] Machine "zhsun-7558q-worker-us-east-2a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:09:17.172739       1 controller.go:205] Reconciling machine "zhsun-7558q-worker-us-east-2a-invalid" triggers delete
I0819 09:09:17.172753       1 actuator.go:333] zhsun-7558q-worker-us-east-2a-invalid: deleting machine
W0819 09:09:17.227440       1 actuator.go:376] zhsun-7558q-worker-us-east-2a-invalid: no instances found to delete for machine
I0819 09:09:17.237614       1 controller.go:239] Machine "zhsun-7558q-worker-us-east-2a-invalid" deletion successful

Comment 3 Jan Chaloupka 2019-08-19 10:43:04 UTC
Thanks for the logs.

So, as expected, a machine with an invalid zone does not result in an instance being created in GCE. The issue at this point is that a machine created with an invalid zone cannot be deleted afterwards.

AWS works a bit differently. In the AWS case, knowing the machine name is sufficient to find an instance in AWS and delete it. In the GCP case, the zone must be known as well. Without it, the actuator cannot even check whether an instance exists in GCE. This is why GCP fails to delete a machine object with an invalid zone and AWS does not.
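
One way to make the delete path tolerate this case can be sketched as follows. This is an illustration under stated assumptions, not the provider's actual code: apiError stands in for the GCE client's error type (which also carries the HTTP status in a Code field), and zoneChecker, instanceDeleter, and deleteMachine are hypothetical names. The idea is that a 404 on the project/zone lookup means no instance can exist there, so the delete can be treated as already done, while any other error is still returned for retry.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// apiError is a stand-in for the error type returned by the GCE client.
// It is defined here only to keep the sketch self-contained.
type apiError struct {
	Code    int
	Message string
}

func (e *apiError) Error() string {
	return fmt.Sprintf("googleapi: Error %d: %s", e.Code, e.Message)
}

// isNotFound reports whether err, or an error it wraps, is an API 404.
func isNotFound(err error) bool {
	var apiErr *apiError
	return errors.As(err, &apiErr) && apiErr.Code == http.StatusNotFound
}

// zoneChecker is a hypothetical hook that verifies a project/zone pair exists.
type zoneChecker func(project, zone string) error

// instanceDeleter is a hypothetical hook that deletes the named instance.
type instanceDeleter func(project, zone, name string) error

// deleteMachine returns nil when the instance was deleted or cannot exist.
func deleteMachine(check zoneChecker, del instanceDeleter, project, zone, name string) error {
	if err := check(project, zone); err != nil {
		if isNotFound(err) {
			// No such project/zone, so no instance can live there:
			// there is nothing to delete.
			return nil
		}
		// Other failures (network, auth, 5xx) are returned so the
		// controller retries later.
		return fmt.Errorf("unable to verify project/zone exists: %s/%s; err: %w", project, zone, err)
	}
	return del(project, zone, name)
}

func main() {
	missingZone := func(project, zone string) error {
		return &apiError{Code: http.StatusNotFound, Message: "zone not found"}
	}
	notReached := func(project, zone, name string) error {
		return errors.New("delete should not be reached for a missing zone")
	}
	err := deleteMachine(missingZone, notReached, "openshift-gce-devel", "us-central1-b-invalid", "zhsun3-8vcmx-w-b")
	fmt.Println("delete error:", err) // prints "delete error: <nil>"
}
```

The trade-off in this sketch is that a 404 is trusted as proof of absence; transient errors must never be mapped to "not found", or a real instance could be orphaned.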

I am not sure we can do anything about it at this point unless we use webhooks to validate the GCP provider config. The short-term fix is to change the zone to one that already exists. That might be troublesome with a machineset that creates tens of machines with an invalid zone. The long-term solution is webhook validation that talks to GCE and checks whether the provided zone exists. This assumes the webhook will always be able to talk to GCE; if it cannot, the machine(s) will be refused.
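
The webhook idea can be sketched as a small validation function. This is a hedged illustration only: zoneLister, validateZone, and errUnreachable are hypothetical names, and the lister stands in for whatever call would fetch the zones visible to a project. It shows the fail-closed behavior described above, where a machine is refused both when the zone is unknown and when GCE cannot be reached.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnreachable is a hypothetical sentinel used to simulate GCE being down.
var errUnreachable = errors.New("GCE unreachable")

// zoneLister is a hypothetical hook returning the zone names of a project.
type zoneLister func(project string) ([]string, error)

// validateZone sketches an admission-time check for a machine's zone.
func validateZone(list zoneLister, project, zone string) error {
	zones, err := list(project)
	if err != nil {
		// Fail closed: without an answer from GCE the machine is refused.
		return fmt.Errorf("cannot validate zone %q: listing zones for project %q failed: %w", zone, project, err)
	}
	for _, z := range zones {
		if z == zone {
			return nil
		}
	}
	return fmt.Errorf("zone %q does not exist in project %q", zone, project)
}

func main() {
	known := func(project string) ([]string, error) {
		return []string{"us-central1-a", "us-central1-b"}, nil
	}
	down := func(project string) ([]string, error) {
		return nil, errUnreachable
	}
	fmt.Println(validateZone(known, "openshift-gce-devel", "us-central1-a"))
	fmt.Println(validateZone(known, "openshift-gce-devel", "us-central1-a-invalid"))
	fmt.Println(validateZone(down, "openshift-gce-devel", "us-central1-a"))
}
```

Failing closed keeps invalid machines out, at the cost of refusing valid ones during a GCE outage, which is the caveat noted in the comment.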

Comment 4 Michael Gugino 2019-08-27 12:32:09 UTC
https://github.com/openshift/cluster-api/pull/67

Comment 7 Jianwei Hou 2019-09-03 06:07:47 UTC
Verified in 4.2.0-0.nightly-2019-09-02-172410.

A machine with an invalid zone can now be deleted.

Comment 8 errata-xmlrpc 2019-10-16 06:36:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

