Bug 1879958

Summary: [CNV-2.5][Uninstall] It is not possible to uninstall CNV from OCP-4.6-fc.5
Product: Container Native Virtualization (CNV)
Reporter: Lukas Bednar <lbednar>
Component: Installation
Assignee: Simone Tiraboschi <stirabos>
Status: CLOSED ERRATA
QA Contact: Inbar Rose <irose>
Severity: high
Docs Contact:
Priority: urgent
Version: 2.5.0
CC: cnv-qe-bugs, fdeutsch, lbednar, ncredi, stirabos
Target Milestone: ---
Target Release: 2.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: hco-bundle-registry:v2.5.0-210
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-11-17 13:24:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Lukas Bednar 2020-09-17 12:46:03 UTC
Description of problem:

This failure is caused by two unrelated issues that occur together.
Issues involved: an OLM bug, and CDI not starting.

One issue is that OLM keeps killing HCO, which makes the HCO validation webhook unavailable.
It is a side effect of https://github.com/operator-framework/operator-lifecycle-manager/pull/1761

The second issue is that CDI fails to start and does not respond to HCO requests to delete DataVolumes.

There is a pending fix: https://github.com/kubevirt/hyperconverged-cluster-operator/pull/807

Here is additional info provided by Simone explaining how the two issues combine.

It's still a side effect of the OLM bug, where OLM is killing the HCO pod continuously.
So the HCO webhook is not available, and because we have a short timeout there, validation of delete requests can be skipped, with strange side effects.
In this specific case, we should validate the presence of DataVolumes
by triggering the CDI webhook in dry-run mode, but the CDI webhook is not working because cdi-apiserver is not starting.

If validation is skipped on the HCO side because HCO is not up, the delete request is admitted because the dry-run test was skipped, but it then fails when actually executed because cdi-apiserver is never going to be reachable.

Version-Release number of selected component (if applicable):
OCP-4.6-fc.5
HCO-v2.5.0-186


How reproducible: 100%


Steps to Reproduce:
1. Deploy CNV on OCP
2. Uninstall CNV
3.

Actual results: CNV is stuck in the environment.


Expected results: CNV is gone


Additional info:

Comment 1 Simone Tiraboschi 2020-09-18 09:45:54 UTC
It's really a corner case that is not going to happen if:
- HCO delete request can reach CDI webhook
- OLM is not killing HCO continuously, which would cause it to miss some validation requests

I think it's still worth fixing by simply setting failurePolicy=Fail to implicitly refuse all delete requests that missed the webhook validation, as virt operator and CDI operator are doing.

The root cause here is that we currently have one policy on the CDI operator and a different one on the HCO side, and in corner cases the user can get stuck in the middle.
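
To make the interaction concrete, here is a minimal Python sketch of the scenario described in this bug. It is a toy model, not HCO or Kubernetes code: the function names and the "rejected" / "stuck" / "deleted" outcome labels are invented for illustration, and it assumes the HCO webhook previously behaved as with failurePolicy=Ignore (the report only says validation could be skipped).

```python
from typing import Optional


def hco_webhook_verdict(hco_up: bool, cdi_up: bool) -> Optional[bool]:
    """Return the webhook's answer, or None if the webhook could not be called.

    Hypothetical model: when HCO is up, its webhook dry-runs the delete
    against CDI, which only succeeds if the CDI API server is reachable.
    """
    if not hco_up:
        return None  # webhook unavailable (e.g. the pod keeps being killed)
    return cdi_up


def delete_hyperconverged(hco_up: bool, cdi_up: bool, failure_policy: str) -> str:
    """Simulate a DELETE request under a given webhook failurePolicy."""
    verdict = hco_webhook_verdict(hco_up, cdi_up)
    if verdict is None:
        # Kubernetes semantics: "Ignore" admits the request when the webhook
        # cannot be reached, "Fail" rejects it.
        admitted = failure_policy == "Ignore"
    else:
        admitted = verdict
    if not admitted:
        return "rejected"
    # The request was admitted; the real deletion still needs CDI.
    return "deleted" if cdi_up else "stuck"


if __name__ == "__main__":
    for policy in ("Ignore", "Fail"):
        outcome = delete_hyperconverged(hco_up=False, cdi_up=False,
                                        failure_policy=policy)
        print(policy, "->", outcome)
```

With both HCO and CDI unavailable, the model admits the request under Ignore and then cannot complete it, while under Fail the request is refused up front and can simply be retried later.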

Comment 2 Lukas Bednar 2020-09-23 13:03:48 UTC
The CDI operator is not failing in HCO-v2.5.0-210, so I cannot really verify that fix.
So at least here is the proof that failurePolicy on the validation webhook is set to Fail.

oc get csv -o yaml  -n openshift-cnv kubevirt-hyperconverged-operator.v2.5.0
.... TRIMMED ...
  - admissionReviewVersions:
    - v1beta1
    - v1
    containerPort: 4343
    deploymentName: hco-operator
    failurePolicy: Fail                      <<<< HERE
    generateName: validate-hco.kubevirt.io
    rules:
    - apiGroups:
      - hco.kubevirt.io
      apiVersions:
      - v1alpha1
      - v1beta1
      operations:
      - CREATE
      - DELETE
      resources:
      - hyperconvergeds
    sideEffects: None
    timeoutSeconds: 30
    type: ValidatingAdmissionWebhook
    webhookPath: /validate-hco-kubevirt-io-v1beta1-hyperconverged
.... TRIMMED ...

Verified on hco-v2.5.0-222 .
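
The same check could be scripted rather than read by eye. The following is a rough sketch, assuming the CSV has been exported as JSON (for example with "-o json" instead of "-o yaml") and that the webhook entries live under spec.webhookdefinitions; the helper names and the sample data are hypothetical, with the sample limited to fields visible in the trimmed output above.

```python
from typing import Dict, List, Optional


def webhook_failure_policies(csv: dict) -> Dict[str, Optional[str]]:
    """Map each webhook's generateName to its failurePolicy.

    Assumes webhook entries are listed under spec.webhookdefinitions.
    """
    definitions = csv.get("spec", {}).get("webhookdefinitions", [])
    return {d.get("generateName"): d.get("failurePolicy") for d in definitions}


def non_failing_webhooks(csv: dict) -> List[str]:
    """Return the webhooks whose failurePolicy is anything other than Fail."""
    return [name for name, policy in webhook_failure_policies(csv).items()
            if policy != "Fail"]


# Sample input mirroring only the fields shown in the trimmed CSV output.
SAMPLE_CSV = {
    "spec": {
        "webhookdefinitions": [
            {
                "generateName": "validate-hco.kubevirt.io",
                "deploymentName": "hco-operator",
                "failurePolicy": "Fail",
                "timeoutSeconds": 30,
                "type": "ValidatingAdmissionWebhook",
            }
        ]
    }
}

if __name__ == "__main__":
    print(webhook_failure_policies(SAMPLE_CSV))
    print("webhooks not set to Fail:", non_failing_webhooks(SAMPLE_CSV))
```

In practice the dictionary would come from json.load on the exported CSV rather than from the inline sample.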

Comment 5 errata-xmlrpc 2020-11-17 13:24:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Virtualization 2.5.0 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:5127