Bug 1937117 - Deletion of StorageCluster doesn't remove ceph toolbox pod
Summary: Deletion of StorageCluster doesn't remove ceph toolbox pod
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: ODF 4.11.0
Assignee: Malay Kumar parida
QA Contact: Amrita Mahapatra
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-09 21:21 UTC by Martin Bukatovic
Modified: 2023-08-09 17:00 UTC
CC: 11 users

Fixed In Version: 4.11.0-78
Doc Type: Known Issue
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-24 13:48:17 UTC
Embargoed:




Links
- GitHub red-hat-storage/ocs-operator pull 1602 (Merged): Add the ceph toolbox enable/disable & reconcile to storagecluster (last updated 2022-05-23 09:15:26 UTC)
- GitHub red-hat-storage/ocs-operator pull 1691 (open): Bug 1937117: [release-4.11] Add the ceph toolbox enable/disable & reconcile to storagecluster (last updated 2022-05-23 09:15:45 UTC)
- Red Hat Product Errata RHSA-2022:6156 (last updated 2022-08-24 13:48:51 UTC)

Description Martin Bukatovic 2021-03-09 21:21:21 UTC
Description of problem
======================

When the Ceph Toolbox feature is enabled and deployed (by setting the
enableCephTools value to true) and the StorageCluster is subsequently removed,
the ceph toolbox pod is not removed along with the rest of the OCS cluster
components, even though the toolbox has no purpose without the other ceph
components.

This creates confusion when another StorageCluster is deployed: one ends up
with the old ceph toolbox pod running next to the new ceph cluster, and this
combination does not work because the toolbox is still using cephx keys valid
for the old cluster rather than for the one currently running.
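
For illustration, a quick way to see this leftover after the StorageCluster is
deleted is to query the toolbox objects directly. This is only a sketch and
assumes the default toolbox deployment name and label (rook-ceph-tools /
app=rook-ceph-tools), which may differ in other setups:

```
# assumes default toolbox naming (rook-ceph-tools / app=rook-ceph-tools)
$ oc get deployment rook-ceph-tools -n openshift-storage
$ oc get pods -n openshift-storage -l app=rook-ceph-tools
```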

Version-Release number of selected component
============================================

OCP 4.7.0-0.nightly-2021-03-06-183610
OCS 4.7.0-284.ci

How reproducible
================

1/1

Steps to Reproduce
==================

1. Install an OCP cluster.

2. Via the OCP Console, install the OCS operator and create a StorageCluster.

3. Deploy the Ceph toolbox pod via the enableCephTools knob:

```
$ oc patch ocsinitialization ocsinit -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
```
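
Optionally, one way to wait for the toolbox to come up before continuing,
assuming the pod carries the usual app=rook-ceph-tools label:

```
# assumes the toolbox pod is labeled app=rook-ceph-tools
$ oc wait --for=condition=Ready pod -l app=rook-ceph-tools -n openshift-storage --timeout=120s
```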

4. Check that the ceph toolbox pod works (try running `ceph osd tree` in it).

```
$ oc get pods -n openshift-storage | grep ceph-tools
$ oc rsh -n openshift-storage rook-ceph-tools-foo-bar bash
[root@compute-0 /]# ceph osd tree
```
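
A non-interactive variant of the same check, which avoids guessing the
generated pod name (assuming the toolbox runs as a Deployment named
rook-ceph-tools):

```
# assumes a Deployment named rook-ceph-tools
$ oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph osd tree
```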

5. Via the OCP Console, delete the StorageCluster and wait for the removal to finish.
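
If you prefer the CLI over the Console for this step, a rough equivalent,
assuming the default StorageCluster name ocs-storagecluster, would be:

```
# assumes the default StorageCluster name ocs-storagecluster
$ oc delete storagecluster ocs-storagecluster -n openshift-storage
$ oc get storagecluster -n openshift-storage
```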

6. Check pods running in openshift-storage namespace.

```
$ oc get pods -n openshift-storage
```

7. Create StorageCluster via OCP Console again.

8. Try to use ceph toolbox pod.

```
$ oc get pods -n openshift-storage | grep ceph-tools
$ oc rsh -n openshift-storage rook-ceph-tools-foo-bar bash
[root@compute-0 /]# ceph osd tree
```

Actual results
==============

During step #6, after removal of the StorageCluster, I see the following pods
running in the openshift-storage namespace:

```
$ oc get pods -n openshift-storage
NAME                                           READY   STATUS      RESTARTS   AGE
cluster-cleanup-job-compute-0-gjlb2            0/1     Completed   0          28s
cluster-cleanup-job-compute-1-l5lnf            0/1     Completed   0          28s
cluster-cleanup-job-compute-2-5szfn            0/1     Completed   0          28s
cluster-cleanup-job-compute-3-xpx7r            0/1     Completed   0          28s
cluster-cleanup-job-compute-4-dwrh7            0/1     Completed   0          28s
cluster-cleanup-job-control-plane-2-8mj7m      0/1     Completed   0          27s
csi-cephfsplugin-2mdqf                         3/3     Running     0          70m
csi-cephfsplugin-45twp                         3/3     Running     0          70m
csi-cephfsplugin-4kz5k                         3/3     Running     0          70m
csi-cephfsplugin-74stm                         3/3     Running     0          70m
csi-cephfsplugin-bkzqn                         3/3     Running     0          70m
csi-cephfsplugin-provisioner-849d54494-2hpbh   6/6     Running     0          70m
csi-cephfsplugin-provisioner-849d54494-sfbxq   6/6     Running     0          70m
csi-cephfsplugin-t8sm9                         3/3     Running     0          70m
csi-rbdplugin-k5hb2                            3/3     Running     0          70m
csi-rbdplugin-lbxw7                            3/3     Running     0          70m
csi-rbdplugin-mth4g                            3/3     Running     0          70m
csi-rbdplugin-nsxbn                            3/3     Running     0          70m
csi-rbdplugin-provisioner-86df955ff9-97tjx     6/6     Running     0          70m
csi-rbdplugin-provisioner-86df955ff9-ls8hj     6/6     Running     0          70m
csi-rbdplugin-tbpm7                            3/3     Running     0          70m
csi-rbdplugin-xwjwq                            3/3     Running     0          70m
noobaa-operator-b7bcf8694-pz44h                1/1     Running     1          103m
ocs-metrics-exporter-7678848477-dh5xq          1/1     Running     0          103m
ocs-operator-7b54b9c84d-mf8ps                  1/1     Running     0          103m
rook-ceph-operator-7b898c76c-84tlh             1/1     Running     0          103m
rook-ceph-tools-69f66f5b4f-mts88               1/1     Running     0          20m
```

You can see that the rook-ceph-tools pod is still up and running, while the
rest of the rook-ceph components are gone.

During step #8, after the new StorageCluster was created, I see that the old
ceph toolbox pod can't connect to the currently running ceph cluster:

```
$ oc rsh -n openshift-storage rook-ceph-tools-69f66f5b4f-mts88 bash
[root@compute-0 /]# ceph osd tree
[errno 1] error connecting to the cluster
```

Obviously, the new cluster uses a new set of cephx keys, so this doesn't work.

Expected results
================

The ceph toolbox pod is removed along with the rest of the ceph OCS components.

When a new StorageCluster is created, one has to enable the ceph toolbox again.

It is not possible to end up with an old toolbox pod running next to a new
ceph cluster.
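
In other words, once the StorageCluster deletion finishes, checks like the
following (again assuming the default toolbox name and label) should come back
empty or NotFound:

```
# assumes default toolbox naming (rook-ceph-tools / app=rook-ceph-tools)
$ oc get deployment rook-ceph-tools -n openshift-storage
$ oc get pods -n openshift-storage -l app=rook-ceph-tools
```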

Additional info
===============

When you end up with an old ceph toolbox pod and a new ceph cluster, the
obvious workaround is to redeploy the ceph toolbox by disabling and
re-enabling it:

```
$ oc patch ocsinitialization ocsinit -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": false }]'
$ oc patch ocsinitialization ocsinit -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
```
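
After toggling the flag, it may be worth confirming that the toolbox pod was
actually recreated (a fresh AGE) before retrying any ceph commands; this
sketch assumes the usual label and deployment name:

```
# assumes default toolbox naming (rook-ceph-tools / app=rook-ceph-tools)
$ oc get pods -n openshift-storage -l app=rook-ceph-tools
$ oc exec -n openshift-storage deploy/rook-ceph-tools -- ceph osd tree
```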

Comment 2 Mudit Agarwal 2021-03-10 05:27:11 UTC
I don't think this is a regression, so moving it out. Please move it back if it is.

I don't think the toolbox is part of our uninstall strategy, given that it is something extra on top of the usual workflow, but I will let Talur/Jose decide that.

Comment 3 Mudit Agarwal 2021-03-10 05:37:49 UTC
Neha, do we need to document this?

Comment 4 Mudit Agarwal 2021-06-04 17:00:47 UTC
Moving this to documentation based on my last comment; please reassign if someone thinks otherwise.

Comment 5 Martin Bukatovic 2021-06-04 17:29:56 UTC
I strongly disagree with a plan to fix this via documentation.

There is no point in having a toolbox pod around when the cluster is gone. If /spec/enableCephTools is true, the toolbox should be removed like any other ceph pod.

Comment 6 Mudit Agarwal 2021-06-04 17:40:22 UTC
Our automated uninstall mostly takes care of removing things that were created as part of the installation.
We don't guarantee that everything is removed by this feature; it was developed to support the customer with the obvious things.

The toolbox is something extra on top of the usual workflow: normally it is created manually and should be deleted manually.

Talur, please correct me if I am wrong.

Comment 7 Jose A. Rivera 2021-06-08 15:04:57 UTC
I agree that from a technical perspective this is a problem we should be taking care of. We created the Pod, so we should remove it. Thinking about this a bit, the fix should be fairly easy, so I'm marking it with devel_ack+. I'll leave it up to QE whether they want to add this to their verifications for OCS 4.8.

Comment 8 Elad 2021-06-08 15:06:17 UTC
Notes for QE:
- In the ocs-ci uninstall tests, we can add a step to check the removal of the ceph toolbox (CT) pod (see the check sketched below).
- We should make sure that the uninstall still works as expected if no CT pod was deployed.
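
For the first note, a minimal manual check that ocs-ci could mirror once the
uninstall completes, assuming the default toolbox name and label:

```
# assumes default toolbox naming (rook-ceph-tools / app=rook-ceph-tools)
$ oc get pods -n openshift-storage -l app=rook-ceph-tools
$ oc get deployment rook-ceph-tools -n openshift-storage
```

Both commands should return nothing (or NotFound) after a successful
uninstall, and the same uninstall flow should still succeed when the toolbox
was never enabled.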

Comment 11 Mudit Agarwal 2021-06-09 18:11:17 UTC
This can't be fixed before dev freeze (we don't have a PR yet) and it is not a blocker/regression. Moving it out; we will fix it in master as soon as possible.

Comment 19 Mudit Agarwal 2021-09-30 11:50:19 UTC
I am not sure whether the toolbox was also covered as part of those changes; Blaine can confirm.

Comment 20 Blaine Gardner 2021-09-30 16:08:25 UTC
The only BZ I have a record of involving Yati is this one: https://bugzilla.redhat.com/show_bug.cgi?id=1968510
It is not related to this PR. 

I seem to recall communicating with someone who was adjusting ocs-operator's uninstall ordering, which might be somewhat related, but I can no longer find a message/email/BZ reference for that.

IMO, this bug is not directly related to either of the issues I mentioned. It isn't strictly the same bug.

Comment 21 Mudit Agarwal 2021-10-11 13:49:44 UTC
Discussed with Talur; we will take this up in 4.10.

Comment 24 Mudit Agarwal 2022-02-22 14:49:00 UTC
Nitin, this is a good first-time issue. Let's fix it in main as soon as possible.

Comment 27 Nitin Goyal 2022-03-09 07:33:54 UTC
Clearing the needinfo, as the bug has been assigned to Malay and he will work on it.

Comment 33 errata-xmlrpc 2022-08-24 13:48:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6156

