Bug 1742227
Summary: | [gcp] Machines not cleaned up without proper naming | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Michael Gugino <mgugino> |
Component: | Installer | Assignee: | Jeremiah Stuever <jstuever> |
Installer sub component: | openshift-installer | QA Contact: | Yang Yang <yanyang> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | medium | ||
Priority: | low | CC: | agarcial, jchaloup, jstuever, wking, zhsun |
Version: | 4.2.0 | Keywords: | Reopened |
Target Milestone: | --- | ||
Target Release: | 4.4.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | 1741759 | Environment: | |
Last Closed: | 2020-05-04 11:13:08 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1741759 | ||
Bug Blocks: | 1788708, 1795436 |
Description
Michael Gugino
2019-08-16 17:56:30 UTC
So, just filed this bug to follow up on my comment in the cloned bug. Currently, users can create arbitrary machinesets (and machines). Looking at the installer's destroy code, there's no way to delete machines that users give a non-standard name to. We should ensure we look at tags/labels instead of just a hostname regex to list/destroy the appropriate machines. If this is not practical, we should call this out specifically in the docs.

(In reply to Michael Gugino from comment #1)
> So, just filed this bug to follow up on my comment in the cloned bug.
>
> Currently, users can create arbitrary machinesets (and machines). Looking
> at the installer's destroy code, there's no way to delete machines that
> users give a non-standard name to. We should ensure we look at tags/labels
> instead of just hostname-regex to list/destroy appropriate machines. If
> this is not practical, we should call this out specifically in the docs.

Any non-standard machines on AWS are also not deleted. On GCP we clean up based on prefix only. We don't document the AWS destroy code either, so I think this is more of an RFE.

I disagree that this is not a bug. It should be covered by either docs or behavior. It's inevitable that someone's going to create a machine or machineset with a one-off name, and it's going to be orphaned by the installer in a destroy-cluster operation.

(In reply to Michael Gugino from comment #3)
> I disagree that this is not a bug. It should be covered by either docs or
> behavior. It's inevitable that someone's going to create a machine or
> machineset with a one-off name, and it's going to be orphaned by the
> installer in case of destroy-cluster operation.

It's not the job of the installer to destroy all the resources that the user might create. Its job is to destroy things it created, things it created with the help of the cluster, things the openshift operators created, and things created by the kubernetes-controller-manager's cloud-controller.
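The prefix-matching concern above can be sketched in a few lines of shell. This is a minimal, offline illustration; the infra-ID prefix and instance names below are hypothetical, not from a real cluster:

```shell
# Hypothetical instance names: three follow the cluster's infra-ID prefix,
# one was hand-named by a user via a custom machineset.
instances='demo-abc12-master-0
demo-abc12-worker-a-xyz
demo-abc12-worker-b-pqr
my-oneoff-machine'

# Prefix-based matching, as a hostname-regex destroy pass would do:
printf '%s\n' "$instances" | grep '^demo-abc12-'

# my-oneoff-machine never matches the prefix, so a prefix-only
# destroy pass would leave it behind (orphan it).
```

A label/tag query, by contrast, would match on cluster ownership rather than on what the instance happens to be named.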
Resources that users create are their responsibility.

(In reply to Michael Gugino from comment #5)
> (In reply to Abhinav Dahiya from comment #4)
> > (In reply to Michael Gugino from comment #3)
> > > I disagree that this is not a bug. It should be covered by either docs or
> > > behavior. It's inevitable that someone's going to create a machine or
> > > machineset with a one-off name, and it's going to be orphaned by the
> > > installer in case of destroy-cluster operation.
> >
> > It's not the job of the installer to destroy all the resources that the user
> > might create.
> > Its job is to destroy things it created, things it created with
> > the help of the cluster, things the openshift operators created and things
> > created by the kubernetes-controller-manager's cloud-controller.
> >
> > Resources that users create are their responsibility.
>
> What things do we destroy that are created by the k8s-controller-manager's
> cloud-controller? Could we utilize similar logic for instances created via
> machine-controller?
>
> As I see it, if it's valid for a user to create a new machineset to do
> whatever they want with, and it's valid to later use destroy-cluster, then
> we're leaking instances, and the user probably has the expectation that
> we're cleaning those up.
>
> I think this bug priority can be low, and we might decide to 'fix' it with
> just docs, but I also don't think adding the label/tag query would be that
> big of a change (and, IMO, the more correct way to do it).

GCP doesn't allow labeling most resources; a name prefix is the only sane way to do it on GCP. Special-casing and keeping information about the instances is not worth the effort. When GCP has more widespread support, we might consider it.

(In reply to Michael Gugino from comment #5)
> (In reply to Abhinav Dahiya from comment #4)
> > (In reply to Michael Gugino from comment #3)
> > > I disagree that this is not a bug. It should be covered by either docs or
> > > behavior. It's inevitable that someone's going to create a machine or
> > > machineset with a one-off name, and it's going to be orphaned by the
> > > installer in case of destroy-cluster operation.
> >
> > It's not the job of the installer to destroy all the resources that the user
> > might create.
>
> What things do we destroy that are created by the k8s-controller-manager's
> cloud-controller? Could we utilize similar logic for instances created via
> machine-controller?
>
> As I see it, if it's valid for a user to create a new machineset to do
> whatever they want with, and it's valid to later use destroy-cluster, then
> we're leaking instances, and the user probably has the expectation that
> we're cleaning those up.

All openshift operators make sure that the resources created by them are collectable by the installer. The process might differ, but all of them are keyed off infrastructure.config.openshift.io .status.infrastructureName.

I would rather see the machine-api operator making sure that the machine-controllers are configured so that the machines are at least set up correctly such that destroy can clean them up. Asking the user to a) tag instances in AWS correctly, or b) prefix instances in GCP, etc., seems like bad UX on the machine-api side. If the machine-controllers could enforce it, that would also solve the problem.

> I think this bug priority can be low, and we might decide to 'fix' it with
> just docs, but I also don't think adding the label/tag query would be that
> big of a change (and, IMO, the more correct way to do it).

(In reply to Abhinav Dahiya from comment #7)
> All openshift operators make sure that the resources created by them are
> collectable by the installer. The process might differ, but all of them are
> keyed off infrastructure.config.openshift.io .status.infrastructureName.

Do you mean that the installer needs to talk to the cluster in order to discover which resources to clean up? That's probably a good stop-gap solution, but we might want to destroy a cluster that is not responsive.
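For reference, the infrastructure name being discussed lives in the cluster-scoped Infrastructure config object; on a live cluster it can be read with `oc get infrastructure cluster -o jsonpath='{.status.infrastructureName}'`. A minimal offline sketch of that lookup, with a hard-coded JSON snippet standing in for the real object (the infra ID is the one from this bug's verification run) and `sed` standing in for jsonpath:

```shell
# Stand-in for the Infrastructure object a live cluster would return;
# the value matches the infra ID seen in the verification output below.
infra_json='{"status":{"infrastructureName":"yybz3-qg2qj"}}'

# Extract .status.infrastructureName (sed in place of jsonpath/jq):
printf '%s' "$infra_json" | sed -n 's/.*"infrastructureName":"\([^"]*\)".*/\1/p'
```

As noted, this requires a reachable cluster, which is why it only works as a stop-gap when the cluster is responsive.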
> I would rather see the machine-api operator making sure that the
> machine-controllers are configured so that the machines are at least set up
> correctly such that destroy can clean them up. Asking the user to
> a) tag instances in AWS correctly, or b) prefix instances in GCP, etc.,
> seems like bad UX on the machine-api side. If the machine-controllers could
> enforce it, that would also solve the problem.

The machine-api-operator wouldn't have much input here. Is there some place we can derive what the conforming strings/tags should be at runtime, such as a configmap or what have you? We could certainly inject this via env or read it from the cluster directly in the machine-controller.

https://github.com/openshift/cluster-api-provider-gcp/pull/57 allows GCE instances to be labeled with the cluster ID. We might use that eventually.

Can we delete all resources (like machines) that belong to cluster-owned resources (like subnets)? For example, the AWS destroyer tears down a number of resources if they belong to an owned VPC [1] (although since we've added support for bring-your-own subnets, at least some of those should move to *BySubnet).

[1]: https://github.com/openshift/installer/blame/ae27357bcdf1b1d1e0179f8687e11951d9da6303/pkg/destroy/aws/aws.go#L1211-L1216

Verified with 4.4.0-0.nightly-2020-02-11-060435. Steps for verification are as below:

1. Create an IPI cluster on GCP.
2. Add a machine to the cluster such that the new instance name does not contain the infra_id, e.g.
```
[root@preserve-yangyangmerrn-1 build]# oc get machine -n openshift-machine-api
NAME                    PHASE     TYPE            REGION        ZONE            AGE
jhou-machine            Running   n1-standard-4   us-central1   us-central1-a   92m
yybz3-qg2qj-m-0         Running   n1-standard-4   us-central1   us-central1-a   28h
yybz3-qg2qj-m-1         Running   n1-standard-4   us-central1   us-central1-b   28h
yybz3-qg2qj-m-2         Running   n1-standard-4   us-central1   us-central1-c   28h
yybz3-qg2qj-w-a-z6n69   Running   n1-standard-4   us-central1   us-central1-a   28h
yybz3-qg2qj-w-b-zfzg9   Running   n1-standard-4   us-central1   us-central1-b   28h
yybz3-qg2qj-w-c-rjgzb   Running   n1-standard-4   us-central1   us-central1-c   28h
```

3. Check the machine labels:

```
[root@preserve-yangyangmerrn-1 build]# gcloud compute instances list --format='table(name,status,labels.list())' | grep jhou
jhou-machine            RUNNING   kubernetes-io-cluster-yybz3-qg2qj=owned
[root@preserve-yangyangmerrn-1 build]# gcloud compute instances list --format='table(name,status,labels.list())' | grep yybz3
jhou-machine            RUNNING   kubernetes-io-cluster-yybz3-qg2qj=owned
yybz3-qg2qj-m-0         RUNNING
yybz3-qg2qj-w-a-z6n69   RUNNING   kubernetes-io-cluster-yybz3-qg2qj=owned
yybz3-qg2qj-m-1         RUNNING
yybz3-qg2qj-w-b-zfzg9   RUNNING   kubernetes-io-cluster-yybz3-qg2qj=owned
yybz3-qg2qj-m-2         RUNNING
yybz3-qg2qj-w-c-rjgzb   RUNNING   kubernetes-io-cluster-yybz3-qg2qj=owned
```

4. Destroy the cluster.

All the instances are destroyed, hence moving this to the verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581