+++ This bug was initially created as a clone of Bug #1741759 +++

Description of problem:
Create a machine with the invalid label "machine.openshift.io/cluster-api-cluster: zhsun3-8vcmx-invalid". The machine is created successfully and the instance can still join the cluster.

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-14-211610

How reproducible:
Always

Steps to Reproduce:
1. Create a machine with an invalid machine.openshift.io/cluster-api-cluster label:

apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: zhsun3-8vcmx-invalid
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
  name: zhsun3-8vcmx-w-a-1
  namespace: openshift-machine-api
spec:
  metadata:
    creationTimestamp: null
  providerSpec:
    value:
      apiVersion: gcpprovider.openshift.io/v1beta1
      canIPForward: false
      credentialsSecret:
        name: gcp-cloud-credentials
      deletionProtection: false
      disks:
      - autoDelete: true
        boot: true
        image: zhsun3-8vcmx-rhcos-image
        labels: null
        sizeGb: 128
        type: pd-ssd
      kind: GCPMachineProviderSpec
      machineType: n1-standard-4
      metadata:
        creationTimestamp: null
      networkInterfaces:
      - network: zhsun3-8vcmx-network
        subnetwork: zhsun3-8vcmx-worker-subnet
      projectID: openshift-gce-devel
      region: us-central1
      serviceAccounts:
      - email: zhsun3-8vcmx-w.gserviceaccount.com
        scopes:
        - https://www.googleapis.com/auth/cloud-platform
      tags:
      - zhsun3-8vcmx-worker
      userDataSecret:
        name: worker-user-data
      zone: us-central1-a

2. Check the machine, node and machine-controller logs

Actual results:
The machine is created successfully and the instance joins the cluster.

$ oc get machine
NAME                     STATE   TYPE   REGION   ZONE   AGE
zhsun3-8vcmx-m-0                                        6h57m
zhsun3-8vcmx-m-1                                        6h57m
zhsun3-8vcmx-m-2                                        6h57m
zhsun3-8vcmx-w-a-1                                      27m
zhsun3-8vcmx-w-a-tdkpr                                  6h56m
zhsun3-8vcmx-w-b-8dh2m                                  4h34m
zhsun3-8vcmx-w-c-k5fvd                                  4h37m

$ oc get node
NAME                                                    STATUS   ROLES    AGE     VERSION
zhsun3-8vcmx-m-0.c.openshift-gce-devel.internal         Ready    master   6h39m   v1.14.0+24b552f85
zhsun3-8vcmx-m-1.c.openshift-gce-devel.internal         Ready    master   6h39m   v1.14.0+24b552f85
zhsun3-8vcmx-m-2.c.openshift-gce-devel.internal         Ready    master   6h40m   v1.14.0+24b552f85
zhsun3-8vcmx-w-a-1.c.openshift-gce-devel.internal       Ready    worker   3m6s    v1.14.0+24b552f85
zhsun3-8vcmx-w-a-tdkpr.c.openshift-gce-devel.internal   Ready    worker   6h31m   v1.14.0+24b552f85
zhsun3-8vcmx-w-b-8dh2m.c.openshift-gce-devel.internal   Ready    worker   4h10m   v1.14.0+24b552f85
zhsun3-8vcmx-w-c-k5fvd.c.openshift-gce-devel.internal   Ready    worker   4h13m   v1.14.0+24b552f85

I0815 07:41:00.579531 1 controller.go:141] Reconciling Machine "zhsun3-8vcmx-w-a-1"
I0815 07:41:00.579583 1 controller.go:310] Machine "zhsun3-8vcmx-w-a-1" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0815 07:41:00.579596 1 actuator.go:80] zhsun3-8vcmx-w-a-1: Checking if machine exists
I0815 07:41:01.154451 1 reconciler.go:258] zhsun3-8vcmx-w-a-1: Machine does not exist
I0815 07:41:01.154502 1 controller.go:259] Reconciling machine object zhsun3-8vcmx-w-a-1 triggers idempotent create.
I0815 07:41:01.154509 1 actuator.go:62] zhsun3-8vcmx-w-a-1: Creating machine
I0815 07:41:02.522064 1 reconciler.go:151] zhsun3-8vcmx-w-a-1: Reconciling machine object with cloud state
I0815 07:41:02.683244 1 reconciler.go:197] zhsun3-8vcmx-w-a-1: machine status is "PROVISIONING", requeuing...
W0815 07:41:02.683333 1 controller.go:261] Failed to create machine "zhsun3-8vcmx-w-a-1": requeue in: 20s
I0815 07:41:02.683346 1 controller.go:364] Actuator returned requeue-after error: requeue in: 20s
I0815 07:41:22.683673 1 controller.go:141] Reconciling Machine "zhsun3-8vcmx-w-a-1"
I0815 07:41:22.683711 1 controller.go:310] Machine "zhsun3-8vcmx-w-a-1" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0815 07:41:22.683722 1 actuator.go:80] zhsun3-8vcmx-w-a-1: Checking if machine exists
I0815 07:41:23.020515 1 reconciler.go:253] Machine "zhsun3-8vcmx-w-a-1" already exists
I0815 07:41:23.020666 1 controller.go:250] Reconciling machine "zhsun3-8vcmx-w-a-1" triggers idempotent update
I0815 07:41:23.020674 1 actuator.go:98] zhsun3-8vcmx-w-a-1: Updating machine
I0815 07:41:23.021227 1 reconciler.go:151] zhsun3-8vcmx-w-a-1: Reconciling machine object with cloud state
I0815 07:41:23.249948 1 controller.go:141] Reconciling Machine "zhsun3-8vcmx-w-a-1"
I0815 07:41:23.251611 1 controller.go:310] Machine "zhsun3-8vcmx-w-a-1" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0815 07:41:23.251756 1 actuator.go:80] zhsun3-8vcmx-w-a-1: Checking if machine exists
I0815 07:41:23.627532 1 reconciler.go:253] Machine "zhsun3-8vcmx-w-a-1" already exists
I0815 07:41:23.627710 1 controller.go:250] Reconciling machine "zhsun3-8vcmx-w-a-1" triggers idempotent update
I0815 07:41:23.627753 1 actuator.go:98] zhsun3-8vcmx-w-a-1: Updating machine
I0815 07:41:23.628281 1 reconciler.go:151] zhsun3-8vcmx-w-a-1: Reconciling machine object with cloud state
E0815 07:41:23.786336 1 controller.go:252] Error updating machine "openshift-machine-api/zhsun3-8vcmx-w-a-1": [machinescope] failed to update machine "zhsun3-8vcmx-w-a-1" in namespace "openshift-machine-api": Operation cannot be fulfilled on machines.machine.openshift.io "zhsun3-8vcmx-w-a-1": the object has been modified; please apply your changes to the latest version and try again
I0815 07:41:24.786765 1 controller.go:141] Reconciling Machine "zhsun3-8vcmx-w-a-1"
I0815 07:41:24.786822 1 controller.go:310] Machine "zhsun3-8vcmx-w-a-1" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0815 07:41:24.786839 1 actuator.go:80] zhsun3-8vcmx-w-a-1: Checking if machine exists
I0815 07:41:25.106501 1 reconciler.go:253] Machine "zhsun3-8vcmx-w-a-1" already exists
I0815 07:41:25.106542 1 controller.go:250] Reconciling machine "zhsun3-8vcmx-w-a-1" triggers idempotent update
I0815 07:41:25.106597 1 actuator.go:98] zhsun3-8vcmx-w-a-1: Updating machine
I0815 07:41:25.107248 1 reconciler.go:151] zhsun3-8vcmx-w-a-1: Reconciling machine object with cloud state

Expected results:
The machine-controller should log (or otherwise report) that the cluster label is not correct.

Additional info:

--- Additional comment from Jan Chaloupka on 2019-08-16 10:43:51 UTC ---

The machine.openshift.io/cluster-api-cluster label is completely ignored by the GCP machine controller. In the case of AWS, the label is used to tag instances to state cluster ownership. In the case of GCP it seems we don't do this yet. We should do the same for GCP: https://cloud.google.com/compute/docs/labeling-resources

--- Additional comment from Michael Gugino on 2019-08-16 16:51:13 UTC ---

Another thing we need to address for this: openshift-install destroy cluster only deletes instances that have the right prefix.
We need to coordinate with the installer to ensure we clean up all instances with the appropriate tags, and those tags should come from a cluster-level resource rather than a machine-level resource.
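For reference, attaching a cluster-ownership label to a GCE instance would look roughly like the following with the gcloud CLI. The kubernetes-io-cluster-<infra-id>=owned key/value follows the convention OpenShift uses elsewhere; the GCP machine controller does not set it at the time of this report, so this is a sketch of the proposal rather than current behavior:

$ gcloud compute instances add-labels zhsun3-8vcmx-w-a-1 \
    --zone us-central1-a \
    --labels kubernetes-io-cluster-zhsun3-8vcmx=owned

A destroyer could then select instances by that label instead of by name prefix.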
So, just filed this bug to follow up on my comment in the cloned bug.

Currently, users can create arbitrary machinesets (and machines). Looking at the installer's destroy code, there's no way to delete machines that users give a non-standard name to. We should ensure we look at tags/labels instead of just hostname-regex to list/destroy appropriate machines. If this is not practical, we should call this out specifically in the docs.
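To make the distinction concrete, here are the two lookup styles side by side. The label key assumes the kubernetes-io-cluster-<infra-id>=owned convention mentioned above; both filter expressions are illustrative sketches, not what the destroy code actually runs today:

# name/prefix-based selection: misses machines that were given one-off names
$ gcloud compute instances list --filter='name~^zhsun3-8vcmx-'

# label-based selection: matches any instance carrying the cluster-ownership label
$ gcloud compute instances list --filter='labels.kubernetes-io-cluster-zhsun3-8vcmx=owned'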
(In reply to Michael Gugino from comment #1)
> So, just filed this bug to follow up on my comment in the cloned bug.
> 
> Currently, users can create arbitrary machinesets (and machines). Looking
> at the installer's destroy code, there's no way to delete machines that
> users give a non-standard name to. We should ensure we look at tags/labels
> instead of just hostname-regex to list/destroy appropriate machines. If
> this is not practical, we should call this out specifically in the docs.

Any non-standard machines on AWS are also not deleted. On GCP we clean up based on prefix only. We don't document the AWS destroy code either, so I think this is more of an RFE.
I disagree that this is not a bug. It should be covered by either docs or behavior. It's inevitable that someone's going to create a machine or machineset with a one-off name, and it's going to be orphaned by the installer in case of destroy-cluster operation.
(In reply to Michael Gugino from comment #3)
> I disagree that this is not a bug. It should be covered by either docs or
> behavior. It's inevitable that someone's going to create a machine or
> machineset with a one-off name, and it's going to be orphaned by the
> installer in case of destroy-cluster operation.

It's not the job of the installer to destroy all the resources that the user might create. Its job is to destroy things it created, things it created with the help of the cluster, things the openshift operators created and things created by the kubernetes-controller-manager's cloud-controller.

Resources that users create are their responsibility.
(In reply to Abhinav Dahiya from comment #4)
> (In reply to Michael Gugino from comment #3)
> > I disagree that this is not a bug. It should be covered by either docs or
> > behavior. It's inevitable that someone's going to create a machine or
> > machineset with a one-off name, and it's going to be orphaned by the
> > installer in case of destroy-cluster operation.
> 
> It's not the job of the installer to destroy all the resources that the user
> might create. Its job is to destroy things it created, things it created
> with the help of the cluster, things the openshift operators created and
> things created by the kubernetes-controller-manager's cloud-controller.
> 
> Resources that users create are their responsibility.

What things do we destroy that are created by the k8s-controller-manager's cloud-controller? Could we utilize similar logic for instances created via the machine-controller?

As I see it, if it's valid for a user to create a new machineset to do whatever they want with, and it's valid to later use destroy-cluster, then we're leaking instances, and the user probably has the expectation that we're cleaning those up.

I think this bug's priority can be low, and we might decide to 'fix' it with just docs, but I also don't think adding the label/tag query would be that big of a change (and, IMO, it's the more correct way to do it).
(In reply to Michael Gugino from comment #5)
> (In reply to Abhinav Dahiya from comment #4)
> > (In reply to Michael Gugino from comment #3)
> > > I disagree that this is not a bug. It should be covered by either docs or
> > > behavior. It's inevitable that someone's going to create a machine or
> > > machineset with a one-off name, and it's going to be orphaned by the
> > > installer in case of destroy-cluster operation.
> > 
> > It's not the job of the installer to destroy all the resources that the user
> > might create. Its job is to destroy things it created, things it created
> > with the help of the cluster, things the openshift operators created and
> > things created by the kubernetes-controller-manager's cloud-controller.
> > 
> > Resources that users create are their responsibility.
> 
> What things do we destroy that are created by the k8s-controller-manager's
> cloud-controller? Could we utilize similar logic for instances created via
> the machine-controller?
> 
> As I see it, if it's valid for a user to create a new machineset to do
> whatever they want with, and it's valid to later use destroy-cluster, then
> we're leaking instances, and the user probably has the expectation that
> we're cleaning those up.
> 
> I think this bug's priority can be low, and we might decide to 'fix' it with
> just docs, but I also don't think adding the label/tag query would be that
> big of a change (and, IMO, it's the more correct way to do it).

GCP doesn't allow labeling most resources. Prefix is the only sane way to do it on GCP; special-casing and keeping information about the instances is not worth the effort. When GCP has more widespread support for labeling, we might consider it.
(In reply to Michael Gugino from comment #5)
> (In reply to Abhinav Dahiya from comment #4)
> > (In reply to Michael Gugino from comment #3)
> > > I disagree that this is not a bug. It should be covered by either docs or
> > > behavior. It's inevitable that someone's going to create a machine or
> > > machineset with a one-off name, and it's going to be orphaned by the
> > > installer in case of destroy-cluster operation.
> > 
> > It's not the job of the installer to destroy all the resources that the user
> > might create. Its job is to destroy things it created, things it created
> > with the help of the cluster, things the openshift operators created and
> > things created by the kubernetes-controller-manager's cloud-controller.
> > 
> > Resources that users create are their responsibility.
> 
> What things do we destroy that are created by the k8s-controller-manager's
> cloud-controller? Could we utilize similar logic for instances created via
> the machine-controller?
> 
> As I see it, if it's valid for a user to create a new machineset to do
> whatever they want with, and it's valid to later use destroy-cluster, then
> we're leaking instances, and the user probably has the expectation that
> we're cleaning those up.

All openshift operators make sure that the resources they create are collectable by the installer. The process might differ, but all of them are keyed off infrastructure.config.openshift.io .status.infrastructureName.

I would rather see the machine-api operator making sure that the machine-controllers are configured such that the machines are at least set up correctly so destroy can clean them up. Asking the user to

a) tag instances in AWS correctly
b) prefix instances in GCP

etc. seems like bad UX on the machine-api side. If the machine-controllers could enforce it, that would also solve the problem.

> I think this bug's priority can be low, and we might decide to 'fix' it with
> just docs, but I also don't think adding the label/tag query would be that
> big of a change (and, IMO, it's the more correct way to do it).
(In reply to Abhinav Dahiya from comment #7)
> All openshift operators make sure that the resources they create are
> collectable by the installer. The process might differ, but all of them are
> keyed off infrastructure.config.openshift.io .status.infrastructureName.

Do you mean that the installer needs to talk to the cluster in order to discover which resources to clean up? That's probably a good stop-gap solution, but we might want to destroy a cluster that is not responsive.

> I would rather see the machine-api operator making sure that the
> machine-controllers are configured such that the machines are at least
> set up correctly so destroy can clean them up. Asking the user to
> 
> a) tag instances in AWS correctly
> b) prefix instances in GCP
> 
> etc. seems like bad UX on the machine-api side. If the machine-controllers
> could enforce it, that would also solve the problem.

The machine-api-operator wouldn't have much input here. Is there some place we can derive what the conforming strings/tags should be at runtime, such as a configmap or what have you? We could certainly inject this via env or read it from the cluster directly in the machine-controller.
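For what it's worth, the conforming value is already discoverable from the cluster itself; something along these lines returns the infrastructure name (zhsun3-8vcmx for the cluster in this report) that the tags/labels/prefixes are derived from:

$ oc get infrastructure cluster -o jsonpath='{.status.infrastructureName}'

A controller could read that same field through the API, or have it injected via env, rather than relying on the user to get it right.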
https://github.com/openshift/cluster-api-provider-gcp/pull/57 allows GCE instances to be labeled with the cluster ID. We might use that eventually.
Can we delete all resources (like machines) that belong to cluster-owned resources (like subnets)? For example, the AWS destroyer tears down a number of resources if they belong to an owned VPC [1] (although since we've added support for bring-your-own-subnets, at least some of those should move to *BySubnet).

[1]: https://github.com/openshift/installer/blame/ae27357bcdf1b1d1e0179f8687e11951d9da6303/pkg/destroy/aws/aws.go#L1211-L1216
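As a rough sketch of what that could look like on GCP, instances attached to the cluster's worker subnet could be enumerated with a filter such as the one below (filter expression written from memory against the subnet name used in this report, so the exact syntax may need adjusting):

$ gcloud compute instances list --filter='networkInterfaces.subnetwork ~ zhsun3-8vcmx-worker-subnet'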
Verified with 4.4.0-0.nightly-2020-02-11-060435

Steps for verification are as below:

1. Create an IPI cluster on GCP

2. Add a machine to the cluster such that the new instance name does not contain the infra_id, e.g.

[root@preserve-yangyangmerrn-1 build]# oc get machine -n openshift-machine-api
NAME                    PHASE     TYPE            REGION        ZONE            AGE
jhou-machine            Running   n1-standard-4   us-central1   us-central1-a   92m
yybz3-qg2qj-m-0         Running   n1-standard-4   us-central1   us-central1-a   28h
yybz3-qg2qj-m-1         Running   n1-standard-4   us-central1   us-central1-b   28h
yybz3-qg2qj-m-2         Running   n1-standard-4   us-central1   us-central1-c   28h
yybz3-qg2qj-w-a-z6n69   Running   n1-standard-4   us-central1   us-central1-a   28h
yybz3-qg2qj-w-b-zfzg9   Running   n1-standard-4   us-central1   us-central1-b   28h
yybz3-qg2qj-w-c-rjgzb   Running   n1-standard-4   us-central1   us-central1-c   28h

3. Check the machine labels:

# gcloud compute instances list --format='table(name,status,labels.list())' | grep jhou
jhou-machine            RUNNING   kubernetes-io-cluster-yybz3-qg2qj=owned

[root@preserve-yangyangmerrn-1 build]# gcloud compute instances list --format='table(name,status,labels.list())' | grep yybz3
jhou-machine            RUNNING   kubernetes-io-cluster-yybz3-qg2qj=owned
yybz3-qg2qj-m-0         RUNNING
yybz3-qg2qj-w-a-z6n69   RUNNING   kubernetes-io-cluster-yybz3-qg2qj=owned
yybz3-qg2qj-m-1         RUNNING
yybz3-qg2qj-w-b-zfzg9   RUNNING   kubernetes-io-cluster-yybz3-qg2qj=owned
yybz3-qg2qj-m-2         RUNNING
yybz3-qg2qj-w-c-rjgzb   RUNNING   kubernetes-io-cluster-yybz3-qg2qj=owned

4. Destroy the cluster

All the instances are destroyed, hence moving this to verified state.
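For completeness, step 4 corresponds to the standard destroy invocation, roughly:

# openshift-install destroy cluster --dir <install-dir> --log-level info

where <install-dir> is a placeholder for the directory holding the cluster's install assets (including metadata.json).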
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581