Bug 1741765

Summary: [gcp] Failed to delete machine that has invalid zone
Product: OpenShift Container Platform Reporter: sunzhaohua <zhsun>
Component: Cloud Compute Assignee: Michael Gugino <mgugino>
Status: CLOSED ERRATA QA Contact: sunzhaohua <zhsun>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.2.0 CC: agarcial, brad.ison, jchaloup, jhou, mpatel
Target Milestone: ---   
Target Release: 4.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: gcp
Last Closed: 2019-10-16 06:36:15 UTC Type: Bug

Description sunzhaohua 2019-08-16 06:02:18 UTC
Description of problem:
A machine that has an invalid zone cannot be deleted. Even after adding the annotation "machine.openshift.io/exclude-node-draining=", the machine still could not be deleted.

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-14-211610

How reproducible:
Always

Steps to Reproduce:
1. Create a machine with an invalid zone:
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: zhsun3-8vcmx
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
  name: zhsun3-8vcmx-w-a-1
  namespace: openshift-machine-api
spec:
  metadata:
    creationTimestamp: null
  providerSpec:
    value:
      apiVersion: gcpprovider.openshift.io/v1beta1
      canIPForward: false
      credentialsSecret:
        name: gcp-cloud-credentials
      deletionProtection: false
      disks:
      - autoDelete: true
        boot: true
        image: zhsun3-8vcmx-rhcos-image
        labels: null
        sizeGb: 128
        type: pd-ssd
      kind: GCPMachineProviderSpec
      machineType: n1-standard-4
      metadata:
        creationTimestamp: null
      networkInterfaces:
      - network: zhsun3-8vcmx-network
        subnetwork: zhsun3-8vcmx-worker-subnet
      projectID: openshift-gce-devel
      region: us-central1
      serviceAccounts:
      - email: zhsun3-8vcmx-w.gserviceaccount.com
        scopes:
        - https://www.googleapis.com/auth/cloud-platform
      tags:
      - zhsun3-8vcmx-worker
      userDataSecret:
        name: worker-user-data
      zone: us-central1-a-invalid

2. Delete the machine.
      
Actual results:
The machine could not be deleted. The oc delete command hangs and has to be interrupted:
$ oc delete machine zhsun3-8vcmx-w-b
machine.machine.openshift.io "zhsun3-8vcmx-w-b" deleted
^C


I0816 05:39:30.652303       1 controller.go:141] Reconciling Machine "zhsun3-8vcmx-w-b"
I0816 05:39:30.652472       1 controller.go:310] Machine "zhsun3-8vcmx-w-b" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0816 05:39:30.652590       1 controller.go:205] Reconciling machine "zhsun3-8vcmx-w-b" triggers delete
I0816 05:39:30.652609       1 actuator.go:116] zhsun3-8vcmx-w-b: Deleting machine
E0816 05:39:30.991773       1 controller.go:220] Failed to delete machine "zhsun3-8vcmx-w-b": unable to verify project/zone exists: openshift-gce-devel/us-central1-b-invalid; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-b-invalid' was not found, notFound
I0816 05:39:57.097079       1 controller.go:141] Reconciling Machine "zhsun3-8vcmx-w-b"
I0816 05:39:57.097203       1 controller.go:310] Machine "zhsun3-8vcmx-w-b" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0816 05:39:57.097219       1 controller.go:205] Reconciling machine "zhsun3-8vcmx-w-b" triggers delete
I0816 05:39:57.097227       1 actuator.go:116] zhsun3-8vcmx-w-b: Deleting machine
E0816 05:39:57.351594       1 controller.go:220] Failed to delete machine "zhsun3-8vcmx-w-b": unable to verify project/zone exists: openshift-gce-devel/us-central1-b-invalid; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-b-invalid' was not found, notFound


Expected results:
The machine can be deleted.

Additional info:

Comment 1 Jan Chaloupka 2019-08-16 10:26:02 UTC
This is tricky: if a zone is invalid, the actuator cannot tell whether that is because the zone does not exist or because the name was malformed. Since it cannot find the corresponding instance in GCE, it cannot delete it. So it backs off and retries until the zone is available.

Can you share the entire log from the machine controller? Do you see a similar error message when an instance is being created?

Comment 2 sunzhaohua 2019-08-19 09:16:04 UTC
Here are the machine controller logs from creating a machine and then deleting it. I also tested this on AWS, where the machine could be deleted.

gcp machine-controller logs:

I0819 09:05:35.582748       1 controller.go:141] Reconciling Machine "zhsun2-4g7pw-w-a-invalid"
I0819 09:05:35.584190       1 controller.go:310] Machine "zhsun2-4g7pw-w-a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:05:35.596044       1 controller.go:141] Reconciling Machine "zhsun2-4g7pw-w-a-invalid"
I0819 09:05:35.596083       1 controller.go:310] Machine "zhsun2-4g7pw-w-a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:05:35.596100       1 actuator.go:80] zhsun2-4g7pw-w-a-invalid: Checking if machine exists
E0819 09:05:35.967996       1 controller.go:245] Failed to check if machine "zhsun2-4g7pw-w-a-invalid" exists: unable to verify project/zone exists: openshift-gce-devel/us-central1-a-invalid; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-a-invalid' was not found, notFound
I0819 09:05:36.968378       1 controller.go:141] Reconciling Machine "zhsun2-4g7pw-w-a-invalid"
I0819 09:05:36.968416       1 controller.go:310] Machine "zhsun2-4g7pw-w-a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:05:36.968432       1 actuator.go:80] zhsun2-4g7pw-w-a-invalid: Checking if machine exists
E0819 09:05:37.277957       1 controller.go:245] Failed to check if machine "zhsun2-4g7pw-w-a-invalid" exists: unable to verify project/zone exists: openshift-gce-devel/us-central1-a-invalid; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-a-invalid' was not found, notFound
I0819 09:05:38.278384       1 controller.go:141] Reconciling Machine "zhsun2-4g7pw-w-a-invalid"
I0819 09:05:38.278433       1 controller.go:310] Machine "zhsun2-4g7pw-w-a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:05:38.278463       1 actuator.go:80] zhsun2-4g7pw-w-a-invalid: Checking if machine exists
E0819 09:05:38.595997       1 controller.go:245] Failed to check if machine "zhsun2-4g7pw-w-a-invalid" exists: unable to verify project/zone exists: openshift-gce-devel/us-central1-a-invalid; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-a-invalid' was not found, notFound



I0819 09:06:24.426631       1 controller.go:141] Reconciling Machine "zhsun2-4g7pw-w-a-invalid"
I0819 09:06:24.426676       1 controller.go:310] Machine "zhsun2-4g7pw-w-a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:06:24.426716       1 controller.go:205] Reconciling machine "zhsun2-4g7pw-w-a-invalid" triggers delete
I0819 09:06:24.426725       1 actuator.go:116] zhsun2-4g7pw-w-a-invalid: Deleting machine
E0819 09:06:24.600772       1 controller.go:220] Failed to delete machine "zhsun2-4g7pw-w-a-invalid": unable to verify project/zone exists: openshift-gce-devel/us-central1-a-invalid; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-a-invalid' was not found, notFound
I0819 09:06:26.764445       1 controller.go:141] Reconciling Machine "zhsun2-4g7pw-w-a-invalid"
I0819 09:06:26.764503       1 controller.go:310] Machine "zhsun2-4g7pw-w-a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:06:26.764518       1 controller.go:205] Reconciling machine "zhsun2-4g7pw-w-a-invalid" triggers delete
I0819 09:06:26.764527       1 actuator.go:116] zhsun2-4g7pw-w-a-invalid: Deleting machine
E0819 09:06:27.089704       1 controller.go:220] Failed to delete machine "zhsun2-4g7pw-w-a-invalid": unable to verify project/zone exists: openshift-gce-devel/us-central1-a-invalid; err: googleapi: Error 404: The resource 'projects/openshift-gce-devel/zones/us-central1-a-invalid' was not found, notFound



aws machine-controller logs:
I0819 09:08:55.425914       1 controller.go:141] Reconciling Machine "zhsun-7558q-worker-us-east-2a-invalid"
I0819 09:08:55.425950       1 controller.go:310] Machine "zhsun-7558q-worker-us-east-2a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:08:55.435053       1 controller.go:141] Reconciling Machine "zhsun-7558q-worker-us-east-2a-invalid"
I0819 09:08:55.435077       1 controller.go:310] Machine "zhsun-7558q-worker-us-east-2a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:08:55.435107       1 actuator.go:481] zhsun-7558q-worker-us-east-2a-invalid: Checking if machine exists
I0819 09:08:55.546206       1 actuator.go:489] zhsun-7558q-worker-us-east-2a-invalid: Instance does not exist
I0819 09:08:55.546232       1 controller.go:259] Reconciling machine object zhsun-7558q-worker-us-east-2a-invalid triggers idempotent create.
I0819 09:08:55.546241       1 actuator.go:113] zhsun-7558q-worker-us-east-2a-invalid: creating machine
E0819 09:08:55.546461       1 utils.go:191] NodeRef not found in machine zhsun-7558q-worker-us-east-2a-invalid
I0819 09:08:55.588524       1 instances.go:44] No stopped instances found for machine zhsun-7558q-worker-us-east-2a-invalid
I0819 09:08:55.588582       1 instances.go:142] Using AMI ami-06c85f9d106577272
I0819 09:08:55.588595       1 instances.go:74] Describing security groups based on filters
E0819 09:08:55.744085       1 instances.go:113] error describing availability zones: InvalidParameterValue: Invalid availability zone: [us-east-2a-invalid]
	status code: 400, request id: d05c2f9b-b785-45ed-b49a-930d2b355d31
E0819 09:08:55.744161       1 actuator.go:107] zhsun-7558q-worker-us-east-2a-invalid: Machine error: error launching instance: error getting subnet IDs: error describing availability zones: InvalidParameterValue: Invalid availability zone: [us-east-2a-invalid]
	status code: 400, request id: d05c2f9b-b785-45ed-b49a-930d2b355d31,
E0819 09:08:55.744171       1 actuator.go:116] zhsun-7558q-worker-us-east-2a-invalid: error creating machine: error launching instance: error getting subnet IDs: error describing availability zones: InvalidParameterValue: Invalid availability zone: [us-east-2a-invalid]
	status code: 400, request id: d05c2f9b-b785-45ed-b49a-930d2b355d31,


I0819 09:09:17.172675       1 controller.go:141] Reconciling Machine "zhsun-7558q-worker-us-east-2a-invalid"
I0819 09:09:17.172710       1 controller.go:310] Machine "zhsun-7558q-worker-us-east-2a-invalid" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0819 09:09:17.172739       1 controller.go:205] Reconciling machine "zhsun-7558q-worker-us-east-2a-invalid" triggers delete
I0819 09:09:17.172753       1 actuator.go:333] zhsun-7558q-worker-us-east-2a-invalid: deleting machine
W0819 09:09:17.227440       1 actuator.go:376] zhsun-7558q-worker-us-east-2a-invalid: no instances found to delete for machine
I0819 09:09:17.237614       1 controller.go:239] Machine "zhsun-7558q-worker-us-east-2a-invalid" deletion successful

Comment 3 Jan Chaloupka 2019-08-19 10:43:04 UTC
Thanks for the logs.

So, as expected, a machine with an invalid zone does not result in an instance being created in GCE. The issue at this point is that a machine created with an invalid zone cannot be deleted afterwards.

AWS works a bit differently. In the AWS case, knowing the machine name is sufficient to find an instance and delete it. In the GCP case, the zone must be known as well. Without it, the actuator cannot even check whether an instance exists in GCE. That is why GCP fails to delete a machine object with an invalid zone and AWS does not.
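
The difference can be modelled with a small sketch. This is only an illustration of the control flow described above, not the actuator code: the in-memory ZONES and INSTANCES sets, the function names, and the returned status strings are all hypothetical.

```python
class ZoneNotFound(Exception):
    """Stands in for the 404 returned when a zone does not exist."""


# Hypothetical in-memory cloud: known zones and (zone, name) instances.
ZONES = {"us-central1-a", "us-central1-b"}
INSTANCES = {("us-central1-a", "worker-1")}


def exists_by_name(name):
    """Name-only lookup, as described for AWS."""
    return any(inst == name for _zone, inst in INSTANCES)


def exists_by_zone_and_name(zone, name):
    """Zone-scoped lookup, as described for GCP: the zone is checked first."""
    if zone not in ZONES:
        raise ZoneNotFound(zone)
    return (zone, name) in INSTANCES


def reconcile_delete(exists):
    """Return 'deleted' when no instance remains, 'requeue' when the zone
    cannot be verified, and 'instance-delete-requested' otherwise."""
    try:
        if exists():
            return "instance-delete-requested"
        return "deleted"
    except ZoneNotFound:
        return "requeue"


name, bad_zone = "worker-invalid", "us-central1-a-invalid"
print(reconcile_delete(lambda: exists_by_name(name)))                     # deleted
print(reconcile_delete(lambda: exists_by_zone_and_name(bad_zone, name)))  # requeue
```

With the name-only lookup the missing instance is simply reported as absent and the machine object can go away. With the zone-scoped lookup the error is raised before the existence check can answer, so every reconcile ends in a requeue, which matches the repeated "Failed to delete machine" lines in the GCP log.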

Not sure we can do anything about it at this point unless we use webhooks to validate the GCP provider config. The short-term fix is to change the zone to one that already exists, which might be troublesome with a machineset creating tens of machines with an invalid zone. The long-term solution is webhook validation that talks to GCE and checks whether the provided zone exists. That assumes the webhook will always be able to talk to GCE. If it cannot, the machine(s) will be refused.
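
A minimal sketch of what such a validation could look like, assuming a fail-closed policy as described above. The validate_zone function, the list_zones callable, and the CloudUnreachable exception are hypothetical names used for illustration. They are not part of any existing webhook or of the GCE client.

```python
class CloudUnreachable(Exception):
    """Stands in for any failure to reach the cloud API."""


def validate_zone(zone, list_zones):
    """Return (allowed, reason) for a machine that requests `zone`.

    `list_zones` is a hypothetical callable returning the project's zones.
    """
    try:
        zones = list_zones()
    except CloudUnreachable:
        # Fail closed: without an answer from the cloud, the machine is refused.
        return False, "cannot reach cloud API to verify zone"
    if zone not in zones:
        return False, f"zone {zone!r} does not exist"
    return True, ""


def fake_zones():
    """Pretend zone listing for the example."""
    return {"us-central1-a", "us-central1-b"}


def unreachable():
    """Pretend the cloud API cannot be contacted."""
    raise CloudUnreachable()


print(validate_zone("us-central1-a", fake_zones))
print(validate_zone("us-central1-a-invalid", fake_zones))
print(validate_zone("us-central1-a", unreachable))
```

The third call shows the trade-off mentioned above: when the validation cannot talk to the cloud, even a machine with a valid zone is rejected.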

Comment 4 Michael Gugino 2019-08-27 12:32:09 UTC
https://github.com/openshift/cluster-api/pull/67

Comment 7 Jianwei Hou 2019-09-03 06:07:47 UTC
Verified in 4.2.0-0.nightly-2019-09-02-172410.

A machine with an invalid zone can now be deleted.

Comment 8 errata-xmlrpc 2019-10-16 06:36:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922