Description of problem: On uninstall, cdi-operator removes the finalizer that protect its CR before really completing its deleting tasks. The CR for cdi-operator is protected by a finalizer, $ oc get cdis cdi-kubevirt-hyperconverged -o json | jq .metadata.finalizers [ "operator.cdi.kubevirt.io" ] The idea is that the operator detects a delete request, perform its deletion tasks on the resources it's managing and ONLY AS THE LAST ACTION, removes the finalizer that protects its CR. Currently cdi-operator removes its finalizer too early, see: $ oc get APIService v1beta1.cdi.kubevirt.io && oc get cdis cdi-kubevirt-hyperconverged && oc delete cdis cdi-kubevirt-hyperconverged && oc get APIService v1beta1.cdi.kubevirt.io NAME SERVICE AVAILABLE AGE v1beta1.cdi.kubevirt.io Local True 107m NAME AGE PHASE cdi-kubevirt-hyperconverged 4m31s Deployed cdi.cdi.kubevirt.io "cdi-kubevirt-hyperconverged" deleted NAME SERVICE AVAILABLE AGE v1beta1.cdi.kubevirt.io Local True 107m so it can happen that the CR for cdi operator is fully removed while the operator has still to remove some of its managed resources (v1beta1.cdi.kubevirt.io in this example but it's the same for others). This causes further problems if the delete request for CDI CR is part of a larger process such as the installation of the whole product as a side effect of the deletion of the namespace that serves it. In that case for instance an external actor, like the OLM, can allow the removal of the subscription while cdi-operator is still working. This will cause the serviceaccount object used by cdi-operator to be deleted as well while the operator is still working and so the operator cannot complete its deleting tasks. So at the end cdi-operator is not able to complete its job, and as side effect the deletion of the namespace is going to bu stuck due to: { "lastTransitionTime": "2020-11-25T12:43:26Z", "message": "Discovery failed for some groups, 3 failing: unable to retrieve the complete list of server APIs: subresources.kubevirt.io/v1alpha3: the server is currently unable to handle the request, upload.cdi.kubevirt.io/v1alpha1: the server is currently unable to handle the request, upload.cdi.kubevirt.io/v1beta1: the server is currently unable to handle the request", "reason": "DiscoveryFailed", "status": "True", "type": "NamespaceDeletionDiscoveryFailure" }, Version-Release number of selected component (if applicable): 2.5.1 How reproducible: 100% Steps to Reproduce: 1. try to delete the CR for cdi-operator and quickly check APIService v1beta1.cdi.kubevirt.io (oc get APIService v1beta1.cdi.kubevirt.io && oc get cdis cdi-kubevirt-hyperconverged && oc delete cdis cdi-kubevirt-hyperconverged && oc get APIService v1beta1.cdi.kubevirt.io) 2. 3. Actual results: the finalizer on CDI CR is deleted while APIService v1beta1.cdi.kubevirt.io is still there Expected results: during the cleanup process, removing the finalizer on cdis CR should be the last action performed by the operator Additional info: Maybe the issue should be fixed on controller-lifecycle-operator-sdk code and not really on CDI side
Can you please take a look? maybe the issue is in controller-lifecycle-operator-sdk
@jdzon do you remember whether this is part of controller-lifecycle-operator-sdk?
The CR finalizer management code is in the controller-lifecycle-operator-sdk: - https://github.com/kubevirt/controller-lifecycle-operator-sdk/blob/d6d15f4dd2d402973c5021cfb52259b33a2500f0/pkg/sdk/reconciler/reconciler.go#L137 - the actual removal: https://github.com/kubevirt/controller-lifecycle-operator-sdk/blob/d6d15f4dd2d402973c5021cfb52259b33a2500f0/pkg/sdk/reconciler/reconciler.go#L722 The finalizer removal seems to be the very last action perfomed during the reconciliation so I suspect that the problem might be somewhere else. What comes to my mind: the code in controller-lifecycle-operator-sdk that removes the dependant resources returns before the removal is complete, the propagation policy for delete operation is wrong, there is missing callback on the lib consumer side, the callback for the leftover resource has a bug.
(In reply to Jakub Dzon from comment #3) > the code in controller-lifecycle-operator-sdk that removes > the dependant resources returns before the removal is complete, kubectl/oc delete command by defaults uses: --wait=true: If true, wait for resources to be gone before returning. This waits for finalizers. on the other side the client from https://github.com/kubernetes-sigs/controller-runtime is not waiting/blocking so maybe the issue is just there.
Simone, does this happen only with CDI or also with other components such as kubevirt?
the fix for https://bugzilla.redhat.com/1881874 is enough to prevent/workaround this *** This bug has been marked as a duplicate of bug 1881874 ***