Bug 2049075

Summary: openshift-storage namespace is stuck in terminating state during uninstall due to remaining csi-addons resources
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Rachael <rgeorge>
Component: rook Assignee: Rakshith <rar>
Status: CLOSED ERRATA QA Contact: Amrita Mahapatra <ammahapa>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.10 CC: madam, mmuench, mrajanna, muagarwa, ocs-bugs, odf-bz-bot, rar, uchapaga
Target Milestone: --- Keywords: Triaged
Target Release: ODF 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.10.0-158 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-04-13 18:52:41 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Rachael 2022-02-01 14:01:08 UTC
Description of problem (please be detailed as possible and provide log
snippets):

During the uninstall process, after the openshift-storage namespace was deleted, the project remained stuck in the Terminating state. Further debugging indicated that this was caused by leftover csi-addons resources that had not been deleted, as the output below shows:

$ oc get project openshift-storage
NAME                DISPLAY NAME   STATUS
openshift-storage                  Terminating

$ oc get project openshift-storage -o yaml
[...]

  - lastTransitionTime: "2022-02-01T12:13:12Z"
    message: 'Some resources are remaining: csiaddonsnodes.csiaddons.openshift.io
      has 5 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2022-02-01T12:13:12Z"
    message: 'Some content in the namespace has finalizers remaining: csiaddons.openshift.io/csiaddonsnode
      in 5 resource instances'
    reason: SomeFinalizersRemain
    status: "True"
    type: NamespaceFinalizersRemaining
  phase: Terminating


$ oc get csiaddonsnodes.csiaddons.openshift.io -n openshift-storage
NAME                                         NAMESPACE           AGE    DRIVERNAME                           ENDPOINT            NODEID
csi-rbdplugin-6nb2j                          openshift-storage   118m   openshift-storage.rbd.csi.ceph.com   10.0.212.41:9070    ip-10-0-212-41.us-east-2.compute.internal
csi-rbdplugin-gfv5q                          openshift-storage   118m   openshift-storage.rbd.csi.ceph.com   10.0.183.27:9070    ip-10-0-183-27.us-east-2.compute.internal
csi-rbdplugin-kbjnq                          openshift-storage   118m   openshift-storage.rbd.csi.ceph.com   10.0.148.212:9070   ip-10-0-148-212.us-east-2.compute.internal
csi-rbdplugin-provisioner-84bd488586-gfmlx   openshift-storage   118m   openshift-storage.rbd.csi.ceph.com   10.131.0.24:9070    ip-10-0-148-212.us-east-2.compute.internal
csi-rbdplugin-provisioner-84bd488586-q65qz   openshift-storage   118m   openshift-storage.rbd.csi.ceph.com   10.128.2.18:9070    ip-10-0-183-27.us-east-2.compute.internal
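
When triaging a namespace stuck like this, it can help to pull out only the conditions that block termination. The following is a minimal sketch, assuming the object has been fetched as JSON (for example with `-o json` instead of `-o yaml`) and loaded into a dict. The helper name `blocking_conditions` is hypothetical, and the sample data is trimmed from the output above.

```python
# Condition types seen in the output above that keep the namespace Terminating.
BLOCKING_TYPES = {"NamespaceContentRemaining", "NamespaceFinalizersRemaining"}


def blocking_conditions(namespace_obj):
    """Return (type, message) pairs for active conditions that block deletion."""
    conditions = namespace_obj.get("status", {}).get("conditions", [])
    return [
        # Collapse whitespace, since long messages may be wrapped across lines.
        (c["type"], " ".join(c.get("message", "").split()))
        for c in conditions
        if c.get("type") in BLOCKING_TYPES and c.get("status") == "True"
    ]


# Sample trimmed from the report; only the fields used here are kept.
sample = {
    "status": {
        "phase": "Terminating",
        "conditions": [
            {
                "type": "NamespaceContentRemaining",
                "status": "True",
                "reason": "SomeResourcesRemain",
                "message": "Some resources are remaining: "
                "csiaddonsnodes.csiaddons.openshift.io has 5 resource instances",
            },
            {
                "type": "NamespaceFinalizersRemaining",
                "status": "True",
                "reason": "SomeFinalizersRemain",
                "message": "Some content in the namespace has finalizers remaining: "
                "csiaddons.openshift.io/csiaddonsnode in 5 resource instances",
            },
        ],
    }
}

for cond_type, message in blocking_conditions(sample):
    print(f"{cond_type}: {message}")
```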


Version of all relevant components (if applicable):
---------------------------------------------------
OCP: 4.10.0-0.nightly-2022-01-31-012936
ODF: odf-operator.v4.10.0        full_version=4.10.0-122


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
No

Steps to Reproduce:
1. Install ODF 4.10
2. Uninstall ODF 4.10 by following the steps mentioned here: https://access.redhat.com/articles/6525111


Actual results:
---------------
The openshift-storage namespace is stuck in the Terminating state.


Expected results:
-----------------
Namespace deletion should be successful.

Additional info:
----------------

Manually deleting the resources did not work either; the deletion of those resources got stuck as well.

Comment 3 Niels de Vos 2022-02-01 14:22:43 UTC
The csi-addons sidecar for the Ceph-CSI provisioner and node-plugin (RBD only, not CephFS at the moment) creates a CSIAddonsNode CR. This CR is expected to be deleted when the Ceph-CSI pods are removed.

The CSIAddonsNode CR has a finalizer so that the csi-addons-controller can handle the deletion of the Pod (and clean up internal state, such as connections and pending operations).

In the case of deleting the `openshift-storage` namespace, it seems that the csi-addons-controller is deleted before the CSIAddonsNode CRs of the Ceph-CSI pods are removed, so nothing is left to handle their finalizers.

The deletion of the components should happen in the following order (items marked [like this] are automatic):

1. delete the Ceph-CSI deployment/daemonset
2. [CSIAddonsNode for the Ceph-CSI pods will get deleted]
3. [csi-addons-controller handles the CSIAddonsNode finalizer]
4. delete the csi-addons-controller deployment

We'll need to see how the order of the deletion in the namespace can be influenced.

Comment 4 umanga 2022-02-01 15:42:43 UTC
If the csi-addons-controller handles CSIAddonsNode CR deletion, it needs to set a controller ownerRef on the CR.
This makes it possible to block deletion of the controller until its dependent CRs (CSIAddonsNode in this case) are deleted.

Also, using `oc delete --cascade=foreground` should allow dependent resources to be deleted before their owner resources.
By default, Kubernetes uses `--cascade=background`, which allows the owner to be deleted before its dependents and might cause this termination issue.
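
For clients that talk to the API directly, the same choice is expressed through the `propagationPolicy` field of the `DeleteOptions` body sent with a DELETE request. The sketch below only builds that body; the helper name `delete_options` is hypothetical and nothing here contacts a cluster.

```python
import json

# Values accepted by the Kubernetes API for DeleteOptions.propagationPolicy.
VALID_POLICIES = ("Foreground", "Background", "Orphan")


def delete_options(propagation_policy="Background"):
    """Build a DeleteOptions body with the requested propagation policy."""
    if propagation_policy not in VALID_POLICIES:
        raise ValueError(f"unknown propagation policy: {propagation_policy!r}")
    return {
        "apiVersion": "v1",
        "kind": "DeleteOptions",
        "propagationPolicy": propagation_policy,
    }


# Counterpart of `--cascade=foreground`: dependents are removed before the owner.
print(json.dumps(delete_options("Foreground"), indent=2))
```

Note that foreground deletion only orders objects that are linked through ownerReferences, so on its own it would not order the controller deployment relative to CRs it does not own.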

This bug needs both a code fix and a doc fix, IMO.

Comment 6 Niels de Vos 2022-02-03 14:45:55 UTC
(In reply to umanga from comment #4)
> If csi-addons-controller handles CSIAddonsNode CR deletion, it needs to set
> a controller ownerRef on the CR.
> This allows to block controller deletion until dependent CRs (CSIAddonsNode
> in this case) are deleted.

The CSIAddonsNode objects get created by a Ceph-CSI sidecar. There is an ownerRef on the Pod that created the CSIAddonsNode CR.

Adding an ownerRef that points to the csi-addons-controller (deployment) would make the components very intertwined. In theory it is also possible to have the csi-addons-controller running in a different namespace than the CSI drivers (although with ODF everything will be in the same namespace).

I do not think adding another ownerRef to the CSIAddonsNode CR is a workable way to solve this issue.

https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/#owner-references-in-object-specifications
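
As a rough illustration of the namespace concern, the sketch below builds an `ownerReferences` entry and checks the rule from the linked documentation that a namespaced owner must live in the same namespace as its dependent. The helper names, the UIDs, and the `csi-addons-system` namespace are hypothetical and used only for the example.

```python
def owner_reference(owner, controller=False, block_owner_deletion=True):
    """Build an ownerReferences entry that points at `owner`."""
    return {
        "apiVersion": owner["apiVersion"],
        "kind": owner["kind"],
        "name": owner["metadata"]["name"],
        "uid": owner["metadata"]["uid"],
        "controller": controller,
        "blockOwnerDeletion": block_owner_deletion,
    }


def can_own(owner, dependent):
    """Cross-namespace owner references are not allowed."""
    owner_ns = owner["metadata"].get("namespace")
    if owner_ns is None:
        # Cluster-scoped owners may own objects in any namespace.
        return True
    return owner_ns == dependent["metadata"].get("namespace")


pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "csi-rbdplugin-6nb2j", "namespace": "openshift-storage",
                 "uid": "hypothetical-pod-uid"},
}
controller = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "csi-addons-controller", "namespace": "csi-addons-system",
                 "uid": "hypothetical-deployment-uid"},
}
csiaddonsnode = {"metadata": {"name": "csi-rbdplugin-6nb2j",
                              "namespace": "openshift-storage"}}

print(can_own(pod, csiaddonsnode))         # Pod in the same namespace
print(can_own(controller, csiaddonsnode))  # controller in another namespace
print(owner_reference(pod))
```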

Comment 18 errata-xmlrpc 2022-04-13 18:52:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1372