Bug 1806471 (OCPRHV-137-4.6) - OCPRHV-137: Bootstrap node is left in the engine after destroying failed cluster
Summary: OCPRHV-137: Bootstrap node is left in the engine after destroying failed cluster
Keywords:
Status: CLOSED ERRATA
Alias: OCPRHV-137-4.6
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.6.0
Assignee: Douglas Schilling Landgraf
QA Contact: Guilherme Santos
URL:
Whiteboard:
Duplicates: 1836342 1855861
Depends On:
Blocks:
Reported: 2020-02-24 09:48 UTC by Jan Zmeskal
Modified: 2020-10-27 15:56 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 15:55:19 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 3868 0 None closed ovirt: tag and remove tmp/bootstrap machines 2020-10-22 11:23:04 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 15:56:00 UTC

Description Jan Zmeskal 2020-02-24 09:48:06 UTC
Description of problem:
When an OCP4 installation fails and you run the `openshift-install destroy cluster` command, the bootstrap VM is left in the engine.

Version-Release number of the following components:
./openshift-install v4.4.0
built from commit 4010a2e42a95fc8eace70d629653b6f60a27b021
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:5c50516bd5669faec3729fa4e4705d073f9c720c769df4c77afe05dc20533963


How reproducible:
100 %

Steps to Reproduce:
1. Run an OCP4 installation that fails but manages to create master nodes and a bootstrap node
2. Run openshift-install destroy cluster command against your cluster

Actual results:
./openshift-install destroy cluster --dir=test-cluster --log-level=debug
DEBUG OpenShift Installer v4.4.0                   
DEBUG Built from commit 4010a2e42a95fc8eace70d629653b6f60a27b021 
INFO searching VMs by tag=purple-zdcjt            
INFO Found %!s(int=5) VMs                         
INFO Stopping VM purple-zdcjt-master-2 : errors: %s%!(EXTRA <nil>) 
INFO Stopping VM purple-zdcjt-worker-0-9fhtt : errors: %s%!(EXTRA <nil>) 
INFO Stopping VM purple-zdcjt-master-1 : errors: %s%!(EXTRA <nil>) 
INFO Stopping VM purple-zdcjt-master-0 : errors: %s%!(EXTRA <nil>) 
INFO Stopping VM purple-zdcjt-worker-0-m8clh : errors: %s%!(EXTRA <nil>) 
INFO Removing VM purple-zdcjt-master-0 : errors: %s%!(EXTRA <nil>) 
INFO Removing VM purple-zdcjt-master-1 : errors: %s%!(EXTRA <nil>) 
INFO Removing VM purple-zdcjt-worker-0-9fhtt : errors: %s%!(EXTRA <nil>) 
INFO Removing VM purple-zdcjt-master-2 : errors: %s%!(EXTRA <nil>) 
INFO Removing VM purple-zdcjt-worker-0-m8clh : errors: %s%!(EXTRA <nil>) 
ERROR Removing VMs - error: %!s(<nil>)             
INFO Removing tag purple-zdcjt : errors: %s%!(EXTRA <nil>) 
ERROR Removing Tag - error: %!s(<nil>)             
ERROR Removing Template - error: %!s(<nil>)        
DEBUG Purging asset "Terraform Variables" from disk 
DEBUG Purging asset "Kubeconfig Admin Client" from disk 
DEBUG Purging asset "Kubeadmin Password" from disk 
DEBUG Purging asset "Certificate (journal-gatewayd)" from disk 
DEBUG Purging asset "Metadata" from disk           
DEBUG Purging asset "Cluster" from disk    

Expected results:
Bootstrap VM should be removed as well

Comment 1 Greg Sheremeta 2020-03-30 17:27:29 UTC
*** Bug 1818529 has been marked as a duplicate of this bug. ***

Comment 2 Greg Sheremeta 2020-03-30 17:28:44 UTC
Also those "errors: %s%!(EXTRA <nil>)" prints should be cleaned up.

Comment 3 W. Trevor King 2020-04-15 02:48:34 UTC
> Also those "errors: %s%!(EXTRA <nil>)" prints should be cleaned up.

This was handled for 4.5 via [1].  But if you want that backported to earlier 4.y, you should create a separate bug series, because the logging fix has nothing to do with the bootstrap cleanup issue this bug is about.

[1]: https://github.com/openshift/installer/pull/3445

Comment 4 Greg Sheremeta 2020-04-15 15:22:58 UTC
(In reply to W. Trevor King from comment #3)
> > Also those "errors: %s%!(EXTRA <nil>)" prints should be cleaned up.
> 
> This was handled for 4.5 via [1].  But if you want that backported to
> earlier 4.y, you should create a separate bug series, because the logging
> fix has nothing to do with the bootstrap cleanup issue this bug is about.
> 
> [1]: https://github.com/openshift/installer/pull/3445

Well, I had opened Bug 1818529 but marked it as a dupe. You have a point -- it's probably not a dupe.

Comment 7 Douglas Schilling Landgraf 2020-06-01 15:18:07 UTC
*** Bug 1836342 has been marked as a duplicate of this bug. ***

Comment 8 Douglas Schilling Landgraf 2020-06-03 11:01:54 UTC
I don't see any problem with the code: installer/pkg/destroy/ovirt/destroyer.go
What I have found: the bootstrap VM is not tagged, and neither is the tmp VM.
Working on a possible patch.

Comment 9 Douglas Schilling Landgraf 2020-06-03 18:26:35 UTC
Here is a PR with a Go example of how to detect whether the VMs are tagged:
https://github.com/oVirt/ovirt-engine-sdk-go/pull/205/commits/0da9b64a27fedc3b41ec0057d42c80223b559dc9

Comment 10 Douglas Schilling Landgraf 2020-06-04 12:29:47 UTC
(In reply to Greg Sheremeta from comment #2)
> Also those "errors: %s%!(EXTRA <nil>)" prints should be cleaned up.

+1

Comment 11 Greg Sheremeta 2020-06-04 13:13:00 UTC
(In reply to Douglas Schilling Landgraf from comment #10)
> (In reply to Greg Sheremeta from comment #2)
> > Also those "errors: %s%!(EXTRA <nil>)" prints should be cleaned up.
> 
> +1

per Comment 3, already done

Comment 12 Douglas Schilling Landgraf 2020-06-04 14:50:57 UTC
(In reply to Greg Sheremeta from comment #11)
> (In reply to Douglas Schilling Landgraf from comment #10)
> > (In reply to Greg Sheremeta from comment #2)
> > > Also those "errors: %s%!(EXTRA <nil>)" prints should be cleaned up.
> > 
> > +1
> 
> per Comment 3, already done

Yep, I have just tried that in my local env. I can't see it anymore.

Comment 13 Douglas Schilling Landgraf 2020-06-05 11:15:35 UTC
Moving back to ASSIGNED to keep this on my radar. We will need a second patch.

Comment 14 Roy Golan 2020-06-07 14:43:08 UTC
I think the temp VM should have a different tag name than $cluster_id, because with that tag it is
only going to be removed when destroying the cluster, so it keeps wasting resources during the lifetime
of the cluster.

Instead we should destroy it on destroy bootstrap, where it more logically belongs. To do that
all we need is to tag it with $cluster_id-bootstrap, and then in the installer code remove all VMs
by this tag. The bootstrap VM should also have the same tag, so both get removed.

To overcome the problem where you can't define the tag twice and have it updated, what we need is
to declare the tag once, but have it assigned a list of VM IDs, just like the masters:

resource "ovirt_tag" "cluster_tag" {
  name   = var.cluster_id
  vm_ids = [for instance in ovirt_vm.master.* : instance.id]
}

But instead of ovirt_vm.master.* we should use ovirt_vm.bootstrap.* and concat ovirt_vm.tmp_import_vm, so:

resource "ovirt_tag" "cluster_bootstrap_tag" {
  name   = "${var.cluster_id}-bootstrap"
  vm_ids = concat([ovirt_vm.bootstrap.id], [var.tmp_import_vm_id])
}

you would need to pass the tmp_import_vm_id to the bootstrap module.

Comment 15 Douglas Schilling Landgraf 2020-06-07 15:46:46 UTC
(In reply to Roy Golan from comment #14)
> I think the temp VM should have a different tag name than $cluster_id,
> because with that tag it is only going to be removed when destroying the
> cluster, so it keeps wasting resources during the lifetime of the cluster.
> 
> Instead we should destroy it on destroy bootstrap, where it more logically
> belongs. To do that
> all we need is to tag it with $cluster_id-bootstrap, and then in the
> installer code remove all VMs
> by this tag. The bootstrap VM should also have the same tag, so both get
> removed.
> 
> To overcome the problem where you can't define the tag twice and have it
> updated, what we need is
> to declare the tag once, but have it assigned a list of VM IDs, just
> like the masters:
> 
> resource "ovirt_tag" "cluster_tag" {
>   name   = var.cluster_id
>   vm_ids = [for instance in ovirt_vm.master.* : instance.id]
> }
> 
> But instead of ovirt_vm.master.* we should use
> ovirt_vm.bootstrap.* and concat ovirt_vm.tmp_import_vm, so:
> 
> resource "ovirt_tag" "cluster_bootstrap_tag" {
>   name   = "${var.cluster_id}-bootstrap"
>   vm_ids = concat([ovirt_vm.bootstrap.id], [var.tmp_import_vm_id])
> }
> 
> you would need to pass the tmp_import_vm_id to the bootstrap module.

Replied on GitHub.

Comment 16 Gal Zaidman 2020-06-17 16:22:11 UTC
due to capacity constraints we will be revisiting this bug in the upcoming sprint

Comment 18 Douglas Schilling Landgraf 2020-07-09 12:01:52 UTC
due to capacity constraints we will be revisiting this bug in the upcoming sprint

Comment 20 Douglas Schilling Landgraf 2020-07-11 13:37:59 UTC
*** Bug 1855861 has been marked as a duplicate of this bug. ***

Comment 24 Guilherme Santos 2020-07-22 12:43:09 UTC
Verified on:
4.6.0-0.nightly-2020-07-22-074636

Steps:
1. # openshift-install create cluster --log-level=debug --dir=resources
2. Cancel the installation by any means once it passes the "Creating infrastructure resources" step (CTRL+C, kill the process, turn off the DNS...)
3. # openshift-install destroy cluster --dir=resources
4. Check in the engine UI whether the bootstrap VM is still there (Compute -> Virtual Machines)

Results:
The bootstrap VM was deleted.

Comment 26 errata-xmlrpc 2020-10-27 15:55:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

