1879958 – [CNV-2.5][Uninstall] It is not possible to uninstall CNV from OCP-4.6-fc.5

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Container Native Virtualization (CNV)
Classification:	Red Hat
Component:	Installation
Sub Component:
Version:	2.5.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	2.5.0
Assignee:	Simone Tiraboschi
QA Contact:	Inbar Rose
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-09-17 12:46 UTC by Lukas Bednar
Modified:	2020-11-17 13:24 UTC
CC List:	5 users
Fixed In Version:	hco-bundle-registry:v2.5.0-210
Last Closed:	2020-11-17 13:24:24 UTC

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	kubevirt hyperconverged-cluster-operator pull 807	0	None	closed	Set failurePolicy=Fail for HCO webhook	2020-11-16 09:11:31 UTC
Red Hat Product Errata	RHEA-2020:5127	0	None	None	None	2020-11-17 13:24:39 UTC

Description Lukas Bednar 2020-09-17 12:46:03 UTC

Description of problem:

This issue is caused by two unrelated issues that come together.
Issues involved: OLM bug, CDI doesn't start.

One issue is that HCO is being killed by OLM, which makes the HCO validation webhook unavailable.
It is a side effect of https://github.com/operator-framework/operator-lifecycle-manager/pull/1761

The second issue is that CDI failed to start and doesn't respond to HCO requests to delete DataVolumes.

There is a pending fix: https://github.com/kubevirt/hyperconverged-cluster-operator/pull/807

Here is additional info provided by Simone explaining how it comes together.

It's still a side effect of the OLM bug, where OLM is killing the HCO pod continuously.
So the HCO webhook is not available, and because we have a short timeout there, validation of delete requests can be skipped, with strange side effects.
In this specific case, we should validate the presence of DataVolumes
by triggering the CDI webhook in dry-run mode, but the CDI webhook is not working because cdi-apiserver is not starting.

If the validation on the HCO side is skipped because HCO is not up, the delete request is admitted without the dry-run test, but then the actual deletion fails because cdi-apiserver is never going to be reachable.
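
The effect described above can be illustrated with a small model of how a webhook failure policy decides the outcome when the webhook backend cannot be reached. This is only a sketch, not the API server's actual code: the names `admit`, `WebhookUnavailable`, and `hco_webhook_down` are hypothetical and exist purely for illustration. It assumes the standard Kubernetes semantics, where `Ignore` lets the request continue unvalidated and `Fail` rejects it.

```python
from enum import Enum
from typing import Callable


class FailurePolicy(Enum):
    IGNORE = "Ignore"
    FAIL = "Fail"


class WebhookUnavailable(Exception):
    """Raised when the webhook backend is down or does not answer in time."""


def admit(call_webhook: Callable[[], bool], policy: FailurePolicy) -> bool:
    """Return True if the request is admitted.

    call_webhook returns the webhook's verdict, or raises WebhookUnavailable
    when the backend cannot be reached.
    """
    try:
        return call_webhook()
    except WebhookUnavailable:
        # Ignore: the request goes through without validation.
        # Fail: the request is rejected.
        return policy is FailurePolicy.IGNORE


def hco_webhook_down() -> bool:
    # Hypothetical stand-in for an HCO pod that is being restarted.
    raise WebhookUnavailable("hco-operator is not running")


print(admit(hco_webhook_down, FailurePolicy.IGNORE))  # True
print(admit(hco_webhook_down, FailurePolicy.FAIL))    # False
```

With `IGNORE`, the delete request is admitted even though the dry-run check never ran, which matches the stuck uninstall described in this report.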

Version-Release number of selected component (if applicable):
OCP-4.6-fc.5
HCO-v2.5.0-186

How reproducible: 100%

Steps to Reproduce:
1. Deploy CNV on OCP
2. Uninstall CNV

Actual results: CNV is stuck in the environment.

Expected results: CNV is gone

Additional info:

Comment 1 Simone Tiraboschi 2020-09-18 09:45:54 UTC

It's really a corner case that is not going to happen if:
- the HCO delete request can reach the CDI webhook
- OLM is not killing HCO continuously, which would cause it to miss some validation requests

I think it's still worth fixing by simply setting failurePolicy=Fail, so that all delete requests that missed the webhook validation are implicitly refused, as the virt operator and the CDI operator already do.

The root cause here is that we currently have one policy on the CDI operator and a different one on the HCO side, and in corner cases the user can get stuck in the middle.

Comment 2 Lukas Bednar 2020-09-23 13:03:48 UTC

The CDI operator is not failing in HCO-v2.5.0-210, so I cannot really verify the fix.
So at least here is the proof that failurePolicy on the validation webhook is set to Fail.

oc get csv -o yaml  -n openshift-cnv kubevirt-hyperconverged-operator.v2.5.0
.... TRIMMED ...
  - admissionReviewVersions:
    - v1beta1
    - v1
    containerPort: 4343
    deploymentName: hco-operator
    failurePolicy: Fail                      <<<< HERE
    generateName: validate-hco.kubevirt.io
    rules:
    - apiGroups:
      - hco.kubevirt.io
      apiVersions:
      - v1alpha1
      - v1beta1
      operations:
      - CREATE
      - DELETE
      resources:
      - hyperconvergeds
    sideEffects: None
    timeoutSeconds: 30
    type: ValidatingAdmissionWebhook
    webhookPath: /validate-hco-kubevirt-io-v1beta1-hyperconverged
.... TRIMMED ...

Verified on hco-v2.5.0-222 .
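
The same check can be scripted instead of read by eye. The sketch below assumes the CSV is fetched as JSON (for example with `-o json` instead of `-o yaml`) and that the webhook entries live under `spec.webhookdefinitions`; the helper name `webhook_failure_policies` is hypothetical, and the sample document only reuses the values shown in the output above.

```python
import json
from typing import Dict


def webhook_failure_policies(csv_json: str) -> Dict[str, str]:
    """Map each webhook's generateName to its failurePolicy."""
    csv = json.loads(csv_json)
    # Assumption: webhook entries are listed under spec.webhookdefinitions.
    definitions = csv.get("spec", {}).get("webhookdefinitions", [])
    return {
        d.get("generateName", "<unnamed>"): d.get("failurePolicy", "<unset>")
        for d in definitions
    }


# Minimal sample built from the fields in the trimmed output above.
sample = json.dumps({
    "spec": {
        "webhookdefinitions": [
            {
                "generateName": "validate-hco.kubevirt.io",
                "failurePolicy": "Fail",
                "type": "ValidatingAdmissionWebhook",
            }
        ]
    }
})

print(webhook_failure_policies(sample))  # {'validate-hco.kubevirt.io': 'Fail'}
```

Any entry reported as something other than `Fail` would indicate a build without the fix.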

Comment 5 errata-xmlrpc 2020-11-17 13:24:24 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Virtualization 2.5.0 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:5127
