Bug 1901642 - on uninstall, cdi-operator removes the finalizer that protects its CR before really completing its deletion tasks
Summary: on uninstall, cdi-operator removes the finalizer that protects its CR before r...
Keywords:
Status: CLOSED DUPLICATE of bug 1881874
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 2.5.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 2.6.0
Assignee: Adam Litke
QA Contact: dalia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-25 17:34 UTC by Simone Tiraboschi
Modified: 2021-01-12 14:29 UTC (History)
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-12 14:29:10 UTC
Target Upstream Version:
Embargoed:



Description Simone Tiraboschi 2020-11-25 17:34:09 UTC
Description of problem:

On uninstall, cdi-operator removes the finalizer that protects its CR before it has actually completed its deletion tasks.

The CR for cdi-operator is protected by a finalizer:

$ oc get cdis cdi-kubevirt-hyperconverged -o json | jq .metadata.finalizers
[
  "operator.cdi.kubevirt.io"
]

The idea is that the operator detects a delete request, performs its deletion tasks on the resources it manages and, ONLY AS THE LAST ACTION, removes the finalizer that protects its CR.
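
For illustration, a minimal sketch of that ordering with a controller-runtime style reconciler; the names reconcileDelete, cleanupManagedResources and managedResourcesGone are hypothetical stand-ins, not the actual CDI or SDK code:

package sketch

import (
	"context"
	"fmt"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const cdiFinalizer = "operator.cdi.kubevirt.io"

// reconcileDelete sketches the intended ordering: clean up first, verify the
// cleanup finished, and only then drop the finalizer so the CR can go away.
func reconcileDelete(ctx context.Context, c client.Client, cr client.Object) error {
	if !controllerutil.ContainsFinalizer(cr, cdiFinalizer) {
		return nil // finalizer already removed, nothing left to protect
	}

	// 1. Delete everything the operator manages (APIServices, deployments, ...).
	if err := cleanupManagedResources(ctx, c); err != nil {
		return err
	}

	// 2. Confirm the managed resources are really gone; otherwise requeue.
	gone, err := managedResourcesGone(ctx, c)
	if err != nil {
		return err
	}
	if !gone {
		return fmt.Errorf("managed resources still terminating, requeue")
	}

	// 3. ONLY AS THE LAST ACTION remove the finalizer and persist the change.
	controllerutil.RemoveFinalizer(cr, cdiFinalizer)
	return c.Update(ctx, cr)
}

// Hypothetical helpers standing in for the operator's real cleanup logic.
func cleanupManagedResources(ctx context.Context, c client.Client) error { return nil }

func managedResourcesGone(ctx context.Context, c client.Client) (bool, error) { return true, nil }

With that ordering, the CR (and the oc delete below) would only go away after the APIService and the other managed objects are gone.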

Currently cdi-operator removes its finalizer too early; see:

$ oc get APIService v1beta1.cdi.kubevirt.io && oc get cdis cdi-kubevirt-hyperconverged && oc delete cdis cdi-kubevirt-hyperconverged && oc get APIService v1beta1.cdi.kubevirt.io
NAME                      SERVICE   AVAILABLE   AGE
v1beta1.cdi.kubevirt.io   Local     True        107m
NAME                          AGE     PHASE
cdi-kubevirt-hyperconverged   4m31s   Deployed
cdi.cdi.kubevirt.io "cdi-kubevirt-hyperconverged" deleted
NAME                      SERVICE   AVAILABLE   AGE
v1beta1.cdi.kubevirt.io   Local     True        107m

So it can happen that the CR for cdi-operator is fully removed while the operator still has to remove some of its managed resources (v1beta1.cdi.kubevirt.io in this example, but the same applies to others).

This causes further problems when the delete request for the CDI CR is part of a larger process, such as the uninstallation of the whole product triggered as a side effect of deleting the namespace that serves it.

In that case, for instance, an external actor like OLM can allow the removal of the Subscription while cdi-operator is still working.
This causes the ServiceAccount used by cdi-operator to be deleted as well while the operator is still working, so the operator cannot complete its deletion tasks.

So in the end cdi-operator is not able to complete its job,
and as a side effect the deletion of the namespace gets stuck due to:

                    {
                        "lastTransitionTime": "2020-11-25T12:43:26Z",
                        "message": "Discovery failed for some groups, 3 failing: unable to retrieve the complete list of server APIs: subresources.kubevirt.io/v1alpha3: the server is currently unable to handle the request, upload.cdi.kubevirt.io/v1alpha1: the server is currently unable to handle the request, upload.cdi.kubevirt.io/v1beta1: the server is currently unable to handle the request",
                        "reason": "DiscoveryFailed",
                        "status": "True",
                        "type": "NamespaceDeletionDiscoveryFailure"
                    },

Version-Release number of selected component (if applicable):
2.5.1

How reproducible:
100%

Steps to Reproduce:
1. try to delete the CR for cdi-operator and quickly check APIService v1beta1.cdi.kubevirt.io (oc get APIService v1beta1.cdi.kubevirt.io && oc get cdis cdi-kubevirt-hyperconverged && oc delete cdis cdi-kubevirt-hyperconverged && oc get APIService v1beta1.cdi.kubevirt.io)

Actual results:
The finalizer on the CDI CR is removed while the APIService v1beta1.cdi.kubevirt.io is still present.

Expected results:
During the cleanup process, removing the finalizer on the CDI CR should be the last action performed by the operator.

Additional info:
The issue may need to be fixed in the controller-lifecycle-operator-sdk code rather than on the CDI side.

Comment 1 Simone Tiraboschi 2020-11-25 17:35:55 UTC
Can you please take a look?
Maybe the issue is in controller-lifecycle-operator-sdk.

Comment 2 Piotr Kliczewski 2020-11-25 21:55:09 UTC
@jdzon do you remember whether this is part of controller-lifecycle-operator-sdk?

Comment 3 Jakub Dzon 2020-11-26 07:33:24 UTC
The CR finalizer management code is in the controller-lifecycle-operator-sdk:
- https://github.com/kubevirt/controller-lifecycle-operator-sdk/blob/d6d15f4dd2d402973c5021cfb52259b33a2500f0/pkg/sdk/reconciler/reconciler.go#L137
- the actual removal: https://github.com/kubevirt/controller-lifecycle-operator-sdk/blob/d6d15f4dd2d402973c5021cfb52259b33a2500f0/pkg/sdk/reconciler/reconciler.go#L722

The finalizer removal seems to be the very last action performed during reconciliation, so I suspect that the problem might be somewhere else. What comes to my mind:
- the code in controller-lifecycle-operator-sdk that removes the dependent resources returns before the removal is complete,
- the propagation policy for the delete operation is wrong,
- there is a missing callback on the lib consumer side,
- the callback for the leftover resource has a bug.
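
To illustrate the first two suspects, a rough sketch of a foreground delete followed by an existence check with the controller-runtime client; deleteDependent is a hypothetical helper, not the SDK's actual code:

package sketch

import (
	"context"

	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteDependent issues a foreground delete for one managed object and
// reports whether it is actually gone, so the caller can keep the CR
// finalizer (and requeue) until the dependent has disappeared.
func deleteDependent(ctx context.Context, c client.Client, obj client.Object) (bool, error) {
	// Foreground propagation keeps the object until its own dependents are
	// removed; with the default (background) policy the delete returns much
	// earlier, which matches the "wrong propagation policy" suspicion.
	policy := client.PropagationPolicy(metav1.DeletePropagationForeground)
	if err := c.Delete(ctx, obj, policy); err != nil && !errors.IsNotFound(err) {
		return false, err
	}

	// The Delete call itself does not wait, so re-read the object to see
	// whether it is still terminating.
	key := client.ObjectKey{Namespace: obj.GetNamespace(), Name: obj.GetName()}
	if err := c.Get(ctx, key, obj); err != nil {
		if errors.IsNotFound(err) {
			return true, nil // really gone
		}
		return false, err
	}
	return false, nil // still present; do not remove the CR finalizer yet
}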

Comment 4 Simone Tiraboschi 2020-11-26 16:42:33 UTC
(In reply to Jakub Dzon from comment #3)
> the code in controller-lifecycle-operator-sdk that removes
> the dependant resources returns before the removal is complete,

The kubectl/oc delete command by default uses:
      --wait=true: If true, wait for resources to be gone before returning. This waits for finalizers.

On the other hand, the client from https://github.com/kubernetes-sigs/controller-runtime does not wait/block, so maybe the issue is just there.
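
To make that difference concrete, a minimal sketch assuming the plain controller-runtime client is used; deleteAndWait is a hypothetical helper, not something the SDK provides. c.Delete returns as soon as the API server accepts the request, so confirming the deletion needs an explicit poll:

package sketch

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteAndWait mimics "oc delete --wait=true" on top of the non-blocking
// controller-runtime client: issue the delete, then poll until Get reports
// NotFound, i.e. finalizers have run and the object is really gone.
func deleteAndWait(ctx context.Context, c client.Client, obj client.Object) error {
	if err := c.Delete(ctx, obj); err != nil && !errors.IsNotFound(err) {
		return err
	}

	key := client.ObjectKey{Namespace: obj.GetNamespace(), Name: obj.GetName()}
	return wait.PollImmediate(time.Second, 2*time.Minute, func() (bool, error) {
		err := c.Get(ctx, key, obj)
		if errors.IsNotFound(err) {
			return true, nil // deletion has completed
		}
		return false, err // still there (err == nil) or a real error
	})
}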

Comment 5 Adam Litke 2021-01-11 19:16:10 UTC
Simone, does this happen only with CDI or also with other components such as kubevirt?

Comment 7 Simone Tiraboschi 2021-01-12 14:29:10 UTC
The fix for https://bugzilla.redhat.com/1881874 is enough to prevent/work around this.

*** This bug has been marked as a duplicate of bug 1881874 ***

