Bug 1742227

Summary: [gcp] Machines not cleaned up without proper naming
Product: OpenShift Container Platform Reporter: Michael Gugino <mgugino>
Component: Installer    Assignee: Jeremiah Stuever <jstuever>
Installer sub component: openshift-installer QA Contact: Yang Yang <yanyang>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: low CC: agarcial, jchaloup, jstuever, wking, zhsun
Version: 4.2.0    Keywords: Reopened
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1741759 Environment:
Last Closed: 2020-05-04 11:13:08 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1741759    
Bug Blocks: 1788708, 1795436    

Description Michael Gugino 2019-08-16 17:56:30 UTC
+++ This bug was initially created as a clone of Bug #1741759 +++

Description of problem:
Create a machine with the invalid label "machine.openshift.io/cluster-api-cluster: zhsun3-8vcmx-invalid". The machine is created successfully and the instance is still able to join the cluster.

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-14-211610

How reproducible:
Always

Steps to Reproduce:
1.  Create a machine with invalid label
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: zhsun3-8vcmx-invalid
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
  name: zhsun3-8vcmx-w-a-1
  namespace: openshift-machine-api

spec:
  metadata:
    creationTimestamp: null
  providerSpec:
    value:
      apiVersion: gcpprovider.openshift.io/v1beta1
      canIPForward: false
      credentialsSecret:
        name: gcp-cloud-credentials
      deletionProtection: false
      disks:
      - autoDelete: true
        boot: true
        image: zhsun3-8vcmx-rhcos-image
        labels: null
        sizeGb: 128
        type: pd-ssd
      kind: GCPMachineProviderSpec
      machineType: n1-standard-4
      metadata:
        creationTimestamp: null
      networkInterfaces:
      - network: zhsun3-8vcmx-network
        subnetwork: zhsun3-8vcmx-worker-subnet
      projectID: openshift-gce-devel
      region: us-central1
      serviceAccounts:
      - email: zhsun3-8vcmx-w.gserviceaccount.com
        scopes:
        - https://www.googleapis.com/auth/cloud-platform
      tags:
      - zhsun3-8vcmx-worker
      userDataSecret:
        name: worker-user-data
      zone: us-central1-a
      
2. Check machine, node and machine-controller logs
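
For reference, the machine, node, and controller state in step 2 can be inspected with commands along these lines (the deployment and container names assume the usual openshift-machine-api layout and may differ):

$ oc get machine -n openshift-machine-api
$ oc get nodes
$ oc logs -n openshift-machine-api deployment/machine-api-controllers -c machine-controller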


Actual results:
The machine is created successfully and the instance joins the cluster.

$ oc get machine
NAME                     STATE   TYPE   REGION   ZONE   AGE
zhsun3-8vcmx-m-0                                        6h57m
zhsun3-8vcmx-m-1                                        6h57m
zhsun3-8vcmx-m-2                                        6h57m
zhsun3-8vcmx-w-a-1                                      27m
zhsun3-8vcmx-w-a-tdkpr                                  6h56m
zhsun3-8vcmx-w-b-8dh2m                                  4h34m
zhsun3-8vcmx-w-c-k5fvd                                  4h37m

$ oc get node
NAME                                                    STATUS   ROLES    AGE     VERSION
zhsun3-8vcmx-m-0.c.openshift-gce-devel.internal         Ready    master   6h39m   v1.14.0+24b552f85
zhsun3-8vcmx-m-1.c.openshift-gce-devel.internal         Ready    master   6h39m   v1.14.0+24b552f85
zhsun3-8vcmx-m-2.c.openshift-gce-devel.internal         Ready    master   6h40m   v1.14.0+24b552f85
zhsun3-8vcmx-w-a-1.c.openshift-gce-devel.internal       Ready    worker   3m6s    v1.14.0+24b552f85
zhsun3-8vcmx-w-a-tdkpr.c.openshift-gce-devel.internal   Ready    worker   6h31m   v1.14.0+24b552f85
zhsun3-8vcmx-w-b-8dh2m.c.openshift-gce-devel.internal   Ready    worker   4h10m   v1.14.0+24b552f85
zhsun3-8vcmx-w-c-k5fvd.c.openshift-gce-devel.internal   Ready    worker   4h13m   v1.14.0+24b552f85

I0815 07:41:00.579531       1 controller.go:141] Reconciling Machine "zhsun3-8vcmx-w-a-1"
I0815 07:41:00.579583       1 controller.go:310] Machine "zhsun3-8vcmx-w-a-1" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0815 07:41:00.579596       1 actuator.go:80] zhsun3-8vcmx-w-a-1: Checking if machine exists
I0815 07:41:01.154451       1 reconciler.go:258] zhsun3-8vcmx-w-a-1: Machine does not exist
I0815 07:41:01.154502       1 controller.go:259] Reconciling machine object zhsun3-8vcmx-w-a-1 triggers idempotent create.
I0815 07:41:01.154509       1 actuator.go:62] zhsun3-8vcmx-w-a-1: Creating machine
I0815 07:41:02.522064       1 reconciler.go:151] zhsun3-8vcmx-w-a-1: Reconciling machine object with cloud state
I0815 07:41:02.683244       1 reconciler.go:197] zhsun3-8vcmx-w-a-1: machine status is "PROVISIONING", requeuing...
W0815 07:41:02.683333       1 controller.go:261] Failed to create machine "zhsun3-8vcmx-w-a-1": requeue in: 20s
I0815 07:41:02.683346       1 controller.go:364] Actuator returned requeue-after error: requeue in: 20s
I0815 07:41:22.683673       1 controller.go:141] Reconciling Machine "zhsun3-8vcmx-w-a-1"
I0815 07:41:22.683711       1 controller.go:310] Machine "zhsun3-8vcmx-w-a-1" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0815 07:41:22.683722       1 actuator.go:80] zhsun3-8vcmx-w-a-1: Checking if machine exists
I0815 07:41:23.020515       1 reconciler.go:253] Machine "zhsun3-8vcmx-w-a-1" already exists
I0815 07:41:23.020666       1 controller.go:250] Reconciling machine "zhsun3-8vcmx-w-a-1" triggers idempotent update
I0815 07:41:23.020674       1 actuator.go:98] zhsun3-8vcmx-w-a-1: Updating machine
I0815 07:41:23.021227       1 reconciler.go:151] zhsun3-8vcmx-w-a-1: Reconciling machine object with cloud state
I0815 07:41:23.249948       1 controller.go:141] Reconciling Machine "zhsun3-8vcmx-w-a-1"
I0815 07:41:23.251611       1 controller.go:310] Machine "zhsun3-8vcmx-w-a-1" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0815 07:41:23.251756       1 actuator.go:80] zhsun3-8vcmx-w-a-1: Checking if machine exists
I0815 07:41:23.627532       1 reconciler.go:253] Machine "zhsun3-8vcmx-w-a-1" already exists
I0815 07:41:23.627710       1 controller.go:250] Reconciling machine "zhsun3-8vcmx-w-a-1" triggers idempotent update
I0815 07:41:23.627753       1 actuator.go:98] zhsun3-8vcmx-w-a-1: Updating machine
I0815 07:41:23.628281       1 reconciler.go:151] zhsun3-8vcmx-w-a-1: Reconciling machine object with cloud state
E0815 07:41:23.786336       1 controller.go:252] Error updating machine "openshift-machine-api/zhsun3-8vcmx-w-a-1": [machinescope] failed to update machine "zhsun3-8vcmx-w-a-1" in namespace "openshift-machine-api": Operation cannot be fulfilled on machines.machine.openshift.io "zhsun3-8vcmx-w-a-1": the object has been modified; please apply your changes to the latest version and try again
I0815 07:41:24.786765       1 controller.go:141] Reconciling Machine "zhsun3-8vcmx-w-a-1"
I0815 07:41:24.786822       1 controller.go:310] Machine "zhsun3-8vcmx-w-a-1" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0815 07:41:24.786839       1 actuator.go:80] zhsun3-8vcmx-w-a-1: Checking if machine exists
I0815 07:41:25.106501       1 reconciler.go:253] Machine "zhsun3-8vcmx-w-a-1" already exists
I0815 07:41:25.106542       1 controller.go:250] Reconciling machine "zhsun3-8vcmx-w-a-1" triggers idempotent update
I0815 07:41:25.106597       1 actuator.go:98] zhsun3-8vcmx-w-a-1: Updating machine
I0815 07:41:25.107248       1 reconciler.go:151] zhsun3-8vcmx-w-a-1: Reconciling machine object with cloud state

Expected results:
The machine-controller logs should report that the label is not correct.

Additional info:

--- Additional comment from Jan Chaloupka on 2019-08-16 10:43:51 UTC ---

The machine.openshift.io/cluster-api-cluster label is completely ignored by the GCP machine controller. In the case of AWS, the label is used to tag instances to indicate cluster ownership. In the case of GCP, it seems we don't do this yet.

We should do the same for GCP: https://cloud.google.com/compute/docs/labeling-resources
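
As a rough sketch (using the ownership label format that shows up in the verification later in this bug; the exact key is up to the controller), labeling an instance on GCP would look something like:

$ gcloud compute instances add-labels zhsun3-8vcmx-w-a-1 \
    --zone us-central1-a \
    --labels=kubernetes-io-cluster-zhsun3-8vcmx=owned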

--- Additional comment from Michael Gugino on 2019-08-16 16:51:13 UTC ---

Another thing we need to address for this: openshift-install destroy cluster only deletes instances that have the right prefix. We need to coordinate with the installer to ensure we clean up all instances with the appropriate tags, and those tags should come from a cluster-level resource rather than a machine-level resource.
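
A sketch of what a label-based lookup (instead of a name-prefix match) could look like from the CLI, again using the ownership label format seen in the verification below:

$ gcloud compute instances list \
    --filter="labels.kubernetes-io-cluster-zhsun3-8vcmx=owned"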

Comment 1 Michael Gugino 2019-08-16 18:00:22 UTC
So, just filed this bug to follow up on my comment in the cloned bug.

Currently, users can create arbitrary machinesets (and machines).  Looking at the installer's destroy code, there's no way to delete machines that users give a non-standard name to.  We should ensure we look at tags/labels instead of just hostname-regex to list/destroy appropriate machines.  If this is not practical, we should call this out specifically in the docs.

Comment 2 Abhinav Dahiya 2019-08-16 18:07:30 UTC
(In reply to Michael Gugino from comment #1)
> So, just filed this bug to follow up on my comment in the cloned bug.
> 
> Currently, users can create arbitrary machinesets (and machines).  Looking
> at the installer's destroy code, there's no way to delete machines that
> users give a non-standard name to.  We should ensure we look at tags/labels
> instead of just hostname-regex to list/destroy appropriate machines.  If
> this is not practical, we should call this out specifically in the docs.

Any non-standard machines on AWS are also not deleted. On GCP we clean up based on prefix only. We don't document the AWS destroy code either. So I think this is more of an RFE.

Comment 3 Michael Gugino 2019-08-16 19:13:13 UTC
I disagree that this is not a bug.  It should be covered by either docs or behavior.  It's inevitable that someone's going to create a machine or machineset with a one-off name, and it's going to be orphaned by the installer in case of destroy-cluster operation.

Comment 4 Abhinav Dahiya 2019-08-16 22:01:59 UTC
(In reply to Michael Gugino from comment #3)
> I disagree that this is not a bug.  It should be covered by either docs or
> behavior.  It's inevitable that someone's going to create a machine or
> machineset with a one-off name, and it's going to be orphaned by the
> installer in case of destroy-cluster operation.

It's not the job of the installer to destroy all the resources that the user might create. Its job is to destroy things it created, things it created with the help of the cluster, things the openshift operators created, and things created by the kubernetes-controller-manager's cloud-controller.

Resources that users create are their responsibility.

Comment 5 Michael Gugino 2019-08-16 23:02:49 UTC
(In reply to Abhinav Dahiya from comment #4)
> (In reply to Michael Gugino from comment #3)
> > I disagree that this is not a bug.  It should be covered by either docs or
> > behavior.  It's inevitable that someone's going to create a machine or
> > machineset with a one-off name, and it's going to be orphaned by the
> > installer in case of destroy-cluster operation.
> 
> It's not the job of installer to destroy all the resources that the user
> might create. It job is to destroy things it created, things it created with
> the help of the cluster, things the openshift operators created and things
> created by the kubernetes-controller-manager's cloud-controller.
> 
> Resources that user's create are their responsibility.

What things do we destroy that are created by the k8s-controller-manager's cloud-controller?  Could we utilize similar logic for instances created via machine-controller?

As I see it, if it's valid for a user to create a new machineset to do whatever they want with, and it's valid to later use destroy-cluster, then we're leaking instances, and the user probably has the expectation that we're cleaning those up.

I think this bug priority can be low, and we might decide to 'fix' it with just docs, but I also don't think adding the label/tag query would be that big of a change (and, IMO, more correct way to do it).

Comment 6 Abhinav Dahiya 2019-08-16 23:40:51 UTC
(In reply to Michael Gugino from comment #5)
> (In reply to Abhinav Dahiya from comment #4)
> > (In reply to Michael Gugino from comment #3)
> > > I disagree that this is not a bug.  It should be covered by either docs or
> > > behavior.  It's inevitable that someone's going to create a machine or
> > > machineset with a one-off name, and it's going to be orphaned by the
> > > installer in case of destroy-cluster operation.
> > 
> > It's not the job of installer to destroy all the resources that the user
> > might create. It job is to destroy things it created, things it created with
> > the help of the cluster, things the openshift operators created and things
> > created by the kubernetes-controller-manager's cloud-controller.
> > 
> > Resources that user's create are their responsibility.
> 
> What things do we destroy that are created by the k8s-controller-manager's
> cloud-controller?  Could we utilize similar logic for instances created via
> machine-controller?
> 
> As I see it, if it's valid for a user to create a new machineset to do
> whatever they want with, and it's valid to later use destroy-cluster, then
> we're leaking instances, and the user probably has the expectation that
> we're cleaning those up. 
> I think this bug priority can be low, and we might decide to 'fix' it with
> just docs, but I also don't think adding the label/tag query would be that
> big of a change (and, IMO, more correct way to do it).


GCP doesn't allow labeling MOST resources. A prefix is the only sane way to do it on GCP. Special-casing and keeping information about the instances is not worth the effort. When GCP has more widespread labeling support, we might consider it.

Comment 7 Abhinav Dahiya 2019-08-16 23:44:12 UTC
(In reply to Michael Gugino from comment #5)
> (In reply to Abhinav Dahiya from comment #4)
> > (In reply to Michael Gugino from comment #3)
> > > I disagree that this is not a bug.  It should be covered by either docs or
> > > behavior.  It's inevitable that someone's going to create a machine or
> > > machineset with a one-off name, and it's going to be orphaned by the
> > > installer in case of destroy-cluster operation.
> > 
> > It's not the job of installer to destroy all the resources that the user
> > might create. It job is to destroy things it created, things it created with
> > the help of the cluster, things the openshift operators created and things
> > created by the kubernetes-controller-manager's cloud-controller.
> > 
> > Resources that user's create are their responsibility.
> 
> What things do we destroy that are created by the k8s-controller-manager's
> cloud-controller?  Could we utilize similar logic for instances created via
> machine-controller?
> 
> As I see it, if it's valid for a user to create a new machineset to do
> whatever they want with, and it's valid to later use destroy-cluster, then
> we're leaking instances, and the user probably has the expectation that
> we're cleaning those up.

All openshift operators make sure that the resources they create are collectable by the installer. The process might differ, but all of them are keyed off infrastructure.config.openshift.io .status.infrastructureName.

I would rather see the machine-api operator making sure that the machine-controllers are configured so that machines are at least set up correctly such that destroy can clean them up. Asking the user to

a) tag instances in AWS correctly, or
b) prefix instances in GCP, etc.

seems like bad UX on the machine-api side. If the machine-controllers could enforce it, that would also solve the problem.

> I think this bug priority can be low, and we might decide to 'fix' it with
> just docs, but I also don't think adding the label/tag query would be that
> big of a change (and, IMO, more correct way to do it).

Comment 8 Michael Gugino 2019-08-17 00:33:17 UTC
(In reply to Abhinav Dahiya from comment #7)
> 
> All openshift operators make sure that the resources created by them are
> collectable by the installer. The process might differ.. but all of them are
> around infrastructure.config.openshift.io .status.infrastructureName.

Do you mean that the installer needs to talk to the cluster in order to discover which resources to clean up?  That's probably a good stop-gap solution, but we might want to destroy a cluster that is not responsive.

> I would rather see machine-api operator making sure that the
> machine-controllers are configured to make sure the machines are atleast
> setup correctly such that destroy can clean them up, asking the user to 
> 
> a) tag instances in AWS correctlty
> b) prefix instances in GCP etc.. seems like bad UX on the machine-api side.
> If the machine-controllers could enforce it, that would also solve the
> problem.

The machine-api-operator wouldn't have much input here.  Is there some place we can derive what the conforming strings/tags should be at runtime, such as a configmap or what have you?  We could certainly inject this via env or read it from the cluster directly in the machine-controller.
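
For reference, the cluster-level identifier mentioned in comment 7 can be read directly from the cluster, e.g.:

$ oc get infrastructure cluster -o jsonpath='{.status.infrastructureName}'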

Comment 9 Jan Chaloupka 2019-08-30 14:40:45 UTC
https://github.com/openshift/cluster-api-provider-gcp/pull/57 allows GCE instances to be labeled with the cluster ID. We might use that eventually.

Comment 10 W. Trevor King 2020-01-21 15:12:05 UTC
Can we delete all resources (like machines) that belong to cluster-owned resources (like subnets)?  For example, the AWS destroyer tears down a number of resources if they belong to an owned VPC [1] (although since we've added support for bring-your-own-subnets, at least some of those should move to *BySubnet).

[1]: https://github.com/openshift/installer/blame/ae27357bcdf1b1d1e0179f8687e11951d9da6303/pkg/destroy/aws/aws.go#L1211-L1216

Comment 12 Yang Yang 2020-02-11 09:53:59 UTC
Verified with 4.4.0-0.nightly-2020-02-11-060435

Steps for verification are as follows:
1. Create an IPI cluster on GCP
2. Add a machine to the cluster such that the new instance names do not contain the infra_id, e.g.
[root@preserve-yangyangmerrn-1 build]# oc get machine -n openshift-machine-api
NAME                    PHASE     TYPE            REGION        ZONE            AGE
jhou-machine            Running   n1-standard-4   us-central1   us-central1-a   92m
yybz3-qg2qj-m-0         Running   n1-standard-4   us-central1   us-central1-a   28h
yybz3-qg2qj-m-1         Running   n1-standard-4   us-central1   us-central1-b   28h
yybz3-qg2qj-m-2         Running   n1-standard-4   us-central1   us-central1-c   28h
yybz3-qg2qj-w-a-z6n69   Running   n1-standard-4   us-central1   us-central1-a   28h
yybz3-qg2qj-w-b-zfzg9   Running   n1-standard-4   us-central1   us-central1-b   28h
yybz3-qg2qj-w-c-rjgzb   Running   n1-standard-4   us-central1   us-central1-c   28h

3. Check the machine labels:
# gcloud compute instances list --format='table(name,status,labels.list())' | grep jhou
jhou-machine                         RUNNING   kubernetes-io-cluster-yybz3-qg2qj=owned

[root@preserve-yangyangmerrn-1 build]# gcloud compute instances list --format='table(name,status,labels.list())' | grep yybz3
jhou-machine                         RUNNING  kubernetes-io-cluster-yybz3-qg2qj=owned
yybz3-qg2qj-m-0                      RUNNING
yybz3-qg2qj-w-a-z6n69                RUNNING  kubernetes-io-cluster-yybz3-qg2qj=owned
yybz3-qg2qj-m-1                      RUNNING
yybz3-qg2qj-w-b-zfzg9                RUNNING  kubernetes-io-cluster-yybz3-qg2qj=owned
yybz3-qg2qj-m-2                      RUNNING
yybz3-qg2qj-w-c-rjgzb                RUNNING  kubernetes-io-cluster-yybz3-qg2qj=owned

4. Destroy the cluster
All the instances are destroyed, hence moving this to the verified state.

Comment 14 errata-xmlrpc 2020-05-04 11:13:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581