Bug 2049075 - openshift-storage namespace is stuck in terminating state during uninstall due to remaining csi-addons resources
Summary: openshift-storage namespace is stuck in terminating state during uninstall due to remaining csi-addons resources
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.10.0
Assignee: Rakshith
QA Contact: Amrita Mahapatra
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-02-01 14:01 UTC by Rachael
Modified: 2023-08-09 17:03 UTC (History)
8 users

Fixed In Version: 4.10.0-158
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-13 18:52:41 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage rook pull 346 0 None open Bug 2049075: csi: cleanup csi driver resources when zero cephclusters exist 2022-02-15 06:07:20 UTC
Github rook rook issues 9697 0 None open CSI: CSI driver deployment and daemon set must be cleaned up when no CephCluster CR exists 2022-02-03 14:45:55 UTC
Github rook rook pull 9713 0 None open csi: cleanup csi driver resources when zero cephclusters exist 2022-02-09 12:02:09 UTC
Red Hat Product Errata RHSA-2022:1372 0 None None None 2022-04-13 18:52:59 UTC

Description Rachael 2022-02-01 14:01:08 UTC
Description of problem (please be detailed as possible and provide log
snippets):

During the uninstall process, when the openshift-storage namespace was deleted, it was observed that the project was stuck in the Terminating state. On further debugging, this appeared to be caused by remaining csi-addons resources that had not been deleted, as seen in the output below:

$ oc get project openshift-storage
NAME                DISPLAY NAME   STATUS
openshift-storage                  Terminating

$ oc get project openshift-storage -o yaml
[...]

  - lastTransitionTime: "2022-02-01T12:13:12Z"
    message: 'Some resources are remaining: csiaddonsnodes.csiaddons.openshift.io
      has 5 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2022-02-01T12:13:12Z"
    message: 'Some content in the namespace has finalizers remaining: csiaddons.openshift.io/csiaddonsnode
      in 5 resource instances'
    reason: SomeFinalizersRemain
    status: "True"
    type: NamespaceFinalizersRemaining
  phase: Terminating


$ oc get csiaddonsnodes.csiaddons.openshift.io -n openshift-storage
NAME                                         NAMESPACE           AGE    DRIVERNAME                           ENDPOINT            NODEID
csi-rbdplugin-6nb2j                          openshift-storage   118m   openshift-storage.rbd.csi.ceph.com   10.0.212.41:9070    ip-10-0-212-41.us-east-2.compute.internal
csi-rbdplugin-gfv5q                          openshift-storage   118m   openshift-storage.rbd.csi.ceph.com   10.0.183.27:9070    ip-10-0-183-27.us-east-2.compute.internal
csi-rbdplugin-kbjnq                          openshift-storage   118m   openshift-storage.rbd.csi.ceph.com   10.0.148.212:9070   ip-10-0-148-212.us-east-2.compute.internal
csi-rbdplugin-provisioner-84bd488586-gfmlx   openshift-storage   118m   openshift-storage.rbd.csi.ceph.com   10.131.0.24:9070    ip-10-0-148-212.us-east-2.compute.internal
csi-rbdplugin-provisioner-84bd488586-q65qz   openshift-storage   118m   openshift-storage.rbd.csi.ceph.com   10.128.2.18:9070    ip-10-0-183-27.us-east-2.compute.internal
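A generic way to find what is still holding a Terminating namespace open is to walk every namespaced resource type. A minimal sketch (the namespace name matches this bug; the exact leftovers will vary per cluster, and the loop is guarded so it is a no-op when `oc` is not available):

```shell
#!/bin/sh
# Sketch: list every namespaced resource type and print any instances
# that remain in the stuck namespace.
NS=openshift-storage

list_leftovers() {
  oc api-resources --verbs=list --namespaced -o name |
    while read -r res; do
      # --ignore-not-found keeps the output limited to real leftovers
      oc get "$res" -n "$NS" --ignore-not-found --no-headers 2>/dev/null
    done
}

# Only run against a cluster when the CLI is present.
if command -v oc >/dev/null 2>&1; then
  list_leftovers
fi
```

In this bug, the only survivors such a sweep would report are the five CSIAddonsNode objects shown above.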


Version of all relevant components (if applicable):
---------------------------------------------------
OCP: 4.10.0-0.nightly-2022-01-31-012936
ODF: odf-operator.v4.10.0        full_version=4.10.0-122


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
No

Steps to Reproduce:
1. Install ODF 4.10
2. Uninstall ODF 4.10 by following the steps mentioned here: https://access.redhat.com/articles/6525111


Actual results:
---------------
openshift-storage namespace is stuck in terminating state


Expected results:
-----------------
Namespace deletion should be successful.

Additional info:
----------------

Manually deleting the resources did not work either; those deletions were stuck as well.
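One way out of the stuck state (not a supported fix, just an illustrative sketch) is to clear the finalizers on the stuck CSIAddonsNode objects so the namespace can finish terminating. This bypasses the controller's cleanup logic, so it is only reasonable once the csi-addons-controller is already gone:

```shell
#!/bin/sh
# Hypothetical workaround sketch: strip the finalizers from the stuck
# CSIAddonsNode objects so they can be garbage-collected.
NS=openshift-storage
PATCH='{"metadata":{"finalizers":null}}'

clear_finalizers() {
  for node in $(oc get csiaddonsnodes.csiaddons.openshift.io -n "$NS" -o name); do
    oc patch "$node" -n "$NS" --type=merge -p "$PATCH"
  done
}

# Guarded so the sketch is a no-op when no cluster is reachable.
if command -v oc >/dev/null 2>&1; then
  clear_finalizers
fi
```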

Comment 3 Niels de Vos 2022-02-01 14:22:43 UTC
The csi-addons sidecar for the Ceph-CSI provisioner and node-plugin (RBD only, not CephFS at the moment) creates a CSIAddonsNode CR. This CR is expected to be deleted when the Ceph-CSI pods are removed.

The CSIAddonsNode CR has a finalizer so that the csi-addons-controller can handle the deletion of the Pod (and cleanup internal states, like connections and pending operations).

In the case of deleting the `openshift-storage` namespace, it seems that the csi-addons-controller is deleted before the CSIAddonsNode CRs of the Ceph-CSI pods are removed.

The deletion of the components should happen in the following order (items marked [like this] are automatic):

1. delete the Ceph-CSI deployment/daemonset
2. [CSIAddonsNode for the Ceph-CSI pods will get deleted]
3. [csi-addons-controller handles the CSIAddonsNode finalizer]
4. delete the csi-addons-controller deployment

We'll need to see how the order of deletion within the namespace can be influenced.
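The four steps above can be sketched as an ordered teardown. The deployment and daemonset names are assumptions based on the pod names earlier in this bug (`csi-addons-controller-manager` in particular is hypothetical); steps 2 and 3 are observed by waiting for the CSIAddonsNode CRs to disappear while the controller is still running:

```shell
#!/bin/sh
# Sketch of the intended deletion order; resource names are assumptions.
NS=openshift-storage

teardown() {
  # 1. delete the Ceph-CSI deployment/daemonset
  oc delete deployment csi-rbdplugin-provisioner -n "$NS"
  oc delete daemonset  csi-rbdplugin             -n "$NS"

  # 2./3. the CSIAddonsNode CRs get deleted and the still-running
  #       csi-addons-controller handles their finalizers
  oc wait --for=delete csiaddonsnodes.csiaddons.openshift.io \
     --all -n "$NS" --timeout=120s

  # 4. only now delete the csi-addons-controller deployment
  oc delete deployment csi-addons-controller-manager -n "$NS"
}

# Guarded so the sketch is a no-op when no cluster is reachable.
if command -v oc >/dev/null 2>&1; then
  teardown
fi
```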

Comment 4 umanga 2022-02-01 15:42:43 UTC
If the csi-addons-controller handles CSIAddonsNode CR deletion, it needs to set a controller ownerRef on the CR.
This makes it possible to block deletion of the controller until its dependent CRs (CSIAddonsNode in this case) are deleted.

Also, using `oc delete --cascade=foreground` should ensure dependent resources are deleted before their owners.
By default, Kubernetes uses `--cascade=background`, which allows the owner to be deleted before its dependents and can cause this termination issue.
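For example, a foreground cascading delete on the owner would block until all dependents carrying a matching ownerReference are gone (sketch; the deployment name is an assumption):

```shell
#!/bin/sh
# Sketch: foreground cascading deletion waits for dependents first.
NS=openshift-storage
OWNER=deployment/csi-addons-controller-manager   # assumed name

# Guarded so the sketch is a no-op when no cluster is reachable.
if command -v oc >/dev/null 2>&1; then
  oc delete "$OWNER" -n "$NS" --cascade=foreground
fi
```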

This bug needs both a code fix and a doc fix, IMO.

Comment 6 Niels de Vos 2022-02-03 14:45:55 UTC
(In reply to umanga from comment #4)
> If csi-addons-controller handles CSIAddonsNode CR deletion, it needs to set
> a controller ownerRef on the CR.
> This allows to block controller deletion until dependent CRs (CSIAddonsNode
> in this case) are deleted.

The CSIAddonsNode objects get created by a Ceph-CSI sidecar. There is an ownerRef on the Pod that created the CSIAddonsNode CR.

Adding an ownerRef that points at the csi-addons-controller (deployment) would make the components very intertwined. In theory, the csi-addons-controller can also run in a different namespace than the CSI drivers (although with ODF everything is in the same namespace).

I do not think an additional ownerRef on the CSIAddonsNode CR is a viable way to solve this issue.

https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/#owner-references-in-object-specifications
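The existing ownership chain can be checked directly. A sketch that prints each CSIAddonsNode together with its first owner (per the comment above, this should be the Ceph-CSI Pod that created it):

```shell
#!/bin/sh
# Sketch: show which object each CSIAddonsNode CR is owned by.
NS=openshift-storage
JP='{range .items[*]}{.metadata.name}{"\t"}{.metadata.ownerReferences[0].kind}{"/"}{.metadata.ownerReferences[0].name}{"\n"}{end}'

# Guarded so the sketch is a no-op when no cluster is reachable.
if command -v oc >/dev/null 2>&1; then
  oc get csiaddonsnodes.csiaddons.openshift.io -n "$NS" -o jsonpath="$JP"
fi
```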

Comment 18 errata-xmlrpc 2022-04-13 18:52:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1372

