Bug 1877933 - kata-operator deploy does not clean up after itself when install fails
Summary: kata-operator deploy does not clean up after itself when install fails
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: sandboxed-containers
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Jens Freimann
QA Contact: Cameron Meadors
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-10 19:24 UTC by Cameron Meadors
Modified: 2021-04-12 09:08 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-04-12 09:08:49 UTC
Target Upstream Version:
Embargoed:



Description Cameron Meadors 2020-09-10 19:24:44 UTC
Description of problem:

If the kata-operator install fails, you cannot install it again.

Version-Release number of selected component (if applicable):

openshift version 4.6.0-0.nightly-2020-09-10-121352

kata-operator version https://github.com/openshift/kata-operator/commit/798a74f79104bc65502f6d2060d93205e294b8fb

How reproducible:

Every time

Steps to Reproduce:
1. run 'deploy/deploy.sh'
2. run 'oc apply -f deploy/crds/kataconfiguration.openshift.io_v1alpha1_kataconfig_cr.yaml'
3. See failure in pod logs
4. run 'oc delete -f deploy/crds/kataconfiguration.openshift.io_v1alpha1_kataconfig_cr.yaml'
5. run 'oc delete ns kata-operator'
6. make changes to deploy scripts
7. run 'deploy/deploy.sh'
8. run 'oc apply -f deploy/crds/kataconfiguration.openshift.io_v1alpha1_kataconfig_cr.yaml'

Actual results:

W0910 18:55:44.896965  217511 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
/usr/bin/mkdir -p /host/opt/kata-install
error creating Runtime bundle layout in /usr/local/kata


Expected results:

Deploy works, or it reports more specific errors about why it fails.

Additional info:

Removing /opt/kata-install and /usr/local/kata resets everything.
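
For example, something along these lines clears the leftovers on each worker (untested sketch; it assumes the leftover directories live on the worker node hosts and that 'oc debug' access to the nodes is available):

  # Remove leftover kata install directories from every worker node.
  for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
    oc debug "$node" -- chroot /host rm -rf /opt/kata-install /usr/local/kata
  done

After that, running 'deploy/deploy.sh' and applying the kataconfig CR again works.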

Comment 2 Jens Freimann 2020-09-11 09:00:52 UTC
I created a PR [1] to clean up the temporary directories automatically when the pods are stopped.

[1] https://github.com/openshift/kata-operator/pull/8
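
The idea is just to remove the install scratch directories when the daemon pod is terminated. Roughly (illustration only; the actual change lives in the operator/daemon code, so treat the paths and the shell wrapper as placeholders):

  # Illustration only: remove install scratch space when the pod is stopped.
  cleanup() {
    rm -rf /host/opt/kata-install /usr/local/kata
  }
  trap cleanup TERM EXIT
  # ...install steps run here; cleanup fires when the pod is stopped...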

Comment 3 Cameron Meadors 2020-09-21 19:55:13 UTC
My way of reproducing a failed install seems to have been completely resolved by https://bugzilla.redhat.com/show_bug.cgi?id=1877934.  Install works with either v1.0 or 4.6 as the version.  This is expected and good, but it makes this bug hard to verify.  Any suggestions on how to get a failed install?

Comment 4 Jens Freimann 2020-09-23 13:00:38 UTC
I could provide you a private build of the commit before PR #8 was merged. Would that help?

Comment 5 Cameron Meadors 2020-09-23 13:28:07 UTC
If that is no trouble, yes.  Then I could positively verify this and close it.

Comment 6 Cameron Meadors 2020-10-05 18:31:21 UTC
Just hit this again: the install failed, I uninstalled and reinstalled, and hit the error from the original description.

Comment 7 Cameron Meadors 2020-10-06 19:39:10 UTC
I suspect a build issue is giving me an older version where this is not fixed.  I am changing this back to Verified until I know otherwise.

Comment 8 Cameron Meadors 2020-10-07 14:11:49 UTC
Forgot that I VERIFIED based on some assumptions, like that we had a good install.  Since we still see operator install failures, I am going to put this in ON_QA until I can actually confirm it is fixed.

Comment 10 Cameron Meadors 2020-10-21 19:22:18 UTC
fix in https://github.com/openshift/kata-operator/pull/16

Comment 11 Cameron Meadors 2020-10-21 19:26:55 UTC
and kata-operator-daemon: https://github.com/harche/kata-operator-daemon/pull/18

Comment 12 Cameron Meadors 2020-10-23 13:52:09 UTC
I have retested with our latest kata-operator build (commit 723d90e) on 4.6 rc4.  I installed and interrupted the install at various points by deleting the kataconfig, and subsequent installs still work.  Developer testing was more specific, but it was done in a dev environment.  With that, I am saying this is tested well enough for 4.6.
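
For reference, the rough flow for each interruption point was (sketch, using the same CR file as in the original steps):

  # Start an install, interrupt it part-way, then confirm a re-install still works.
  oc apply -f deploy/crds/kataconfiguration.openshift.io_v1alpha1_kataconfig_cr.yaml
  # ...wait until the daemon has started installing, then interrupt:
  oc delete -f deploy/crds/kataconfiguration.openshift.io_v1alpha1_kataconfig_cr.yaml
  oc apply -f deploy/crds/kataconfiguration.openshift.io_v1alpha1_kataconfig_cr.yaml
  # Expected: no "error creating Runtime bundle layout" error on the re-apply.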

