Description of problem:
This bug is caused by two unrelated issues that come together: an OLM bug, and CDI failing to start.

The first issue is that OLM keeps killing HCO, which makes the HCO validation webhook unavailable. This is a side effect of https://github.com/operator-framework/operator-lifecycle-manager/pull/1761

The second issue is that CDI fails to start and does not respond to HCO requests to delete DataVolumes. There is a pending fix: https://github.com/kubevirt/hyperconverged-cluster-operator/pull/807

Here is additional info provided by Simone explaining how the two come together:

It's still a side effect of the OLM bug, where OLM is killing the HCO pod continuously. The HCO webhook is therefore not available, and since we have a short timeout there, validation of delete requests can be skipped, with strange side effects. In this specific case, we should validate the presence of DataVolumes by triggering the CDI webhook in dry-run mode, but the CDI webhook is not working because cdi-apiserver is not starting. If validation is skipped on the HCO side because HCO is not up, the delete request is admitted without the dry-run test, but it then fails when actually executed because cdi-apiserver is never going to be reachable.

Version-Release number of selected component (if applicable):
OCP-4.6-fc.5
HCO-v2.5.0-186

How reproducible:
100%

Steps to Reproduce:
1. Deploy CNV on OCP
2. Uninstall CNV

Actual results:
CNV is stuck in the environment.

Expected results:
CNV is gone.

Additional info:
It's really a corner case that is not going to happen if:
- the HCO delete request can reach the CDI webhook
- OLM does not kill HCO continuously, causing it to miss some validation requests

I think it's still worth fixing by simply setting failurePolicy=Fail, so that all delete requests that missed the webhook validation are implicitly refused, as the virt operator and the CDI operator already do. The root cause here is that we currently have one policy on the CDI operator and a different one on the HCO side, and in corner cases the user can get stuck in the middle.
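To make the difference between the two policies concrete, here is a minimal sketch that models the uninstall flow described in this report. It is not HCO or Kubernetes code: `FailurePolicy`, `admit_delete`, and `uninstall` are hypothetical names, and the only behavior assumed is the standard admission webhook rule that an unreachable webhook admits the request under `Ignore` and rejects it under `Fail`.

```python
from enum import Enum


class FailurePolicy(Enum):
    IGNORE = "Ignore"
    FAIL = "Fail"


def admit_delete(webhook_reachable, webhook_allows, failure_policy):
    """Return True if the API server would admit the DELETE request."""
    if webhook_reachable:
        # The webhook answered, so its verdict is used as-is.
        return webhook_allows
    # The webhook call failed or timed out: the failure policy decides.
    return failure_policy is FailurePolicy.IGNORE


def uninstall(hco_webhook_up, cdi_apiserver_up, failure_policy):
    """Model one uninstall attempt; returns 'rejected', 'removed' or 'stuck'."""
    # Simplification: when the HCO webhook is up, its dry-run check against
    # CDI succeeds only if cdi-apiserver is reachable.
    webhook_allows = cdi_apiserver_up
    if not admit_delete(hco_webhook_up, webhook_allows, failure_policy):
        return "rejected"
    # The request was admitted; the real deletion still needs cdi-apiserver.
    return "removed" if cdi_apiserver_up else "stuck"


if __name__ == "__main__":
    # The situation in this bug: HCO webhook down, cdi-apiserver down.
    print(uninstall(False, False, FailurePolicy.IGNORE))
    print(uninstall(False, False, FailurePolicy.FAIL))
```

In this model, the combination reported here (HCO webhook down, cdi-apiserver down) ends up stuck only when the delete is admitted without validation; with `Fail`, the request is refused up front and can be retried once the webhook is reachable again.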
The CDI operator is not failing in HCO-v2.5.0-210, so I cannot really verify that fix. At least here is the proof that failurePolicy on the validation webhook is set to Fail:

oc get csv -o yaml -n openshift-cnv kubevirt-hyperconverged-operator.v2.5.0
.... TRIMMED ...
  - admissionReviewVersions:
    - v1beta1
    - v1
    containerPort: 4343
    deploymentName: hco-operator
    failurePolicy: Fail    <<<< HERE
    generateName: validate-hco.kubevirt.io
    rules:
    - apiGroups:
      - hco.kubevirt.io
      apiVersions:
      - v1alpha1
      - v1beta1
      operations:
      - CREATE
      - DELETE
      resources:
      - hyperconvergeds
    sideEffects: None
    timeoutSeconds: 30
    type: ValidatingAdmissionWebhook
    webhookPath: /validate-hco-kubevirt-io-v1beta1-hyperconverged
.... TRIMMED ...

Verified on hco-v2.5.0-222.
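The same check could be scripted instead of read by eye. The sketch below assumes the CSV has been fetched as JSON (for example with `-o json` instead of `-o yaml`) and that the webhook entries live under `spec.webhookdefinitions`. That field path and the helper name `webhooks_not_failing_closed` are assumptions, not something shown in the trimmed output above. The sample data only repeats values from that output.

```python
import json


def webhooks_not_failing_closed(csv):
    """Return the generateName of every webhook whose failurePolicy is not Fail."""
    # Assumed location of the webhook entries inside the CSV object.
    definitions = csv.get("spec", {}).get("webhookdefinitions", [])
    return [
        d.get("generateName", "<unnamed>")
        for d in definitions
        if d.get("failurePolicy") != "Fail"
    ]


# Trimmed sample built from the values shown in the output above.
sample = json.loads("""
{
  "spec": {
    "webhookdefinitions": [
      {
        "deploymentName": "hco-operator",
        "failurePolicy": "Fail",
        "generateName": "validate-hco.kubevirt.io",
        "timeoutSeconds": 30,
        "type": "ValidatingAdmissionWebhook"
      }
    ]
  }
}
""")

if __name__ == "__main__":
    offenders = webhooks_not_failing_closed(sample)
    print("all webhooks fail closed" if not offenders else offenders)
```

An empty result means every listed webhook refuses requests when it cannot be reached, which is the behavior this fix is meant to guarantee.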
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Virtualization 2.5.0 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:5127