Description of problem
======================

When the Ceph Toolbox feature is enabled and deployed (by setting the enableCephTools value to true) and the StorageCluster is then removed, the ceph toolbox pod is not removed with the rest of the OCS cluster components, even though the toolbox itself has no purpose without the other ceph components.

This creates confusion when another StorageCluster is deployed, because one ends up with the old ceph toolbox pod running next to the new ceph cluster, and such a configuration won't work together, as the toolbox is using cephx keys valid for the old cluster, not the new one which is currently running.

Version-Release number of selected component
============================================

OCP 4.7.0-0.nightly-2021-03-06-183610
OCS 4.7.0-284.ci

How reproducible
================

1/1

Steps to Reproduce
==================

1. Install OCP cluster.
2. Via OCP Console, install the OCS operator and create a StorageCluster.
3. Deploy the ceph toolbox pod via the enableCephTools knob:

```
$ oc patch ocsinitialization ocsinit -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
```

4. Check that the ceph toolbox pod works (try to run `ceph osd tree` there).

```
$ oc get pods -n openshift-storage | grep ceph-tools
$ oc rsh -n openshift-storage rook-ceph-tools-foo-bar bash
[root@compute-0 /]# ceph osd tree
```

5. Via OCP Console, delete the StorageCluster and wait for the removal to finish.
6. Check the pods running in the openshift-storage namespace.

```
$ oc get pods -n openshift-storage
```

7. Create a StorageCluster via OCP Console again.
8. Try to use the ceph toolbox pod.

```
$ oc get pods -n openshift-storage | grep ceph-tools
$ oc rsh -n openshift-storage rook-ceph-tools-foo-bar bash
[root@compute-0 /]# ceph osd tree
```

Actual results
==============

During step #6, after removal of the StorageCluster, I see the following pods running in the openshift-storage namespace:

```
$ oc get pods -n openshift-storage
NAME                                           READY   STATUS      RESTARTS   AGE
cluster-cleanup-job-compute-0-gjlb2            0/1     Completed   0          28s
cluster-cleanup-job-compute-1-l5lnf            0/1     Completed   0          28s
cluster-cleanup-job-compute-2-5szfn            0/1     Completed   0          28s
cluster-cleanup-job-compute-3-xpx7r            0/1     Completed   0          28s
cluster-cleanup-job-compute-4-dwrh7            0/1     Completed   0          28s
cluster-cleanup-job-control-plane-2-8mj7m      0/1     Completed   0          27s
csi-cephfsplugin-2mdqf                         3/3     Running     0          70m
csi-cephfsplugin-45twp                         3/3     Running     0          70m
csi-cephfsplugin-4kz5k                         3/3     Running     0          70m
csi-cephfsplugin-74stm                         3/3     Running     0          70m
csi-cephfsplugin-bkzqn                         3/3     Running     0          70m
csi-cephfsplugin-provisioner-849d54494-2hpbh   6/6     Running     0          70m
csi-cephfsplugin-provisioner-849d54494-sfbxq   6/6     Running     0          70m
csi-cephfsplugin-t8sm9                         3/3     Running     0          70m
csi-rbdplugin-k5hb2                            3/3     Running     0          70m
csi-rbdplugin-lbxw7                            3/3     Running     0          70m
csi-rbdplugin-mth4g                            3/3     Running     0          70m
csi-rbdplugin-nsxbn                            3/3     Running     0          70m
csi-rbdplugin-provisioner-86df955ff9-97tjx     6/6     Running     0          70m
csi-rbdplugin-provisioner-86df955ff9-ls8hj     6/6     Running     0          70m
csi-rbdplugin-tbpm7                            3/3     Running     0          70m
csi-rbdplugin-xwjwq                            3/3     Running     0          70m
noobaa-operator-b7bcf8694-pz44h                1/1     Running     1          103m
ocs-metrics-exporter-7678848477-dh5xq          1/1     Running     0          103m
ocs-operator-7b54b9c84d-mf8ps                  1/1     Running     0          103m
rook-ceph-operator-7b898c76c-84tlh             1/1     Running     0          103m
rook-ceph-tools-69f66f5b4f-mts88               1/1     Running     0          20m
```

You can see that the rook-ceph-tools pod is still up and running, while the rest of the rook-ceph components are gone.
During step #8, after the new StorageCluster was created, I see that the old ceph toolbox pod can't connect to the currently running ceph cluster:

```
$ oc rsh -n openshift-storage rook-ceph-tools-69f66f5b4f-mts88 bash
[root@compute-0 /]# ceph osd tree
[errno 1] error connecting to the cluster
```

Obviously, the new cluster uses a new set of cephx keys, so this doesn't work.

Expected results
================

The ceph toolbox pod is removed along with the rest of the ceph OCS components. When a new StorageCluster is created, one has to enable the ceph toolbox again. It should not be possible to end up with the old toolbox pod running next to a new ceph cluster.

Additional info
===============

When you end up with the old ceph toolbox pod and a new ceph cluster, the obvious workaround is to redeploy the ceph toolbox by disabling and re-enabling it:

```
$ oc patch ocsinitialization ocsinit -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": false }]'
$ oc patch ocsinitialization ocsinit -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'
```
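After the toggle, a quick sanity check along these lines can confirm that the redeployed toolbox uses the new cluster's keys (the pod name below is a placeholder; use the name of the freshly created rook-ceph-tools pod):

```
# Find the re-created toolbox pod, then verify it can reach the new cluster.
$ oc get pods -n openshift-storage | grep ceph-tools
$ oc rsh -n openshift-storage rook-ceph-tools-foo-bar bash
[root@compute-0 /]# ceph -s
```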
I don't think this is a regression, so I'm moving it out. Please move it back if it is. I don't think the toolbox is a part of our uninstall strategy, given that it is something extra beyond the usual workflow, but I will let Talur/Jose decide that.
Neha, do we need to document this?
Moving this to documentation based on my last comment; please reassign if someone thinks otherwise.
I strongly disagree with a plan to fix this via documentation. There is no point in having a toolbox pod around when the cluster is gone. If /spec/enableCephTools is true, the toolbox should be removed like any other ceph pod.
Our automated uninstall mostly takes care of removing things that were created as part of the installation. We don't guarantee that everything is removed with this feature; it was developed to support customers with the obvious things. The toolbox is something extra beyond the usual workflow: normally it is created manually and should be deleted manually. Talur, please correct me if I am wrong.
I agree that from a technical perspective this is a problem that we should be taking care of. We created the Pod, so we should remove it. Thinking about this a bit, the fix should be fairly easy, so I'm marking it with devel_ack+. I'll leave it up to QE whether they want to add this to their verifications for OCS 4.8.
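For reference, a minimal sketch of what the automated cleanup would amount to, expressed as the manual command an admin could run today during StorageCluster removal (the deployment name assumes the default rook-ceph toolbox naming seen in this report):

```
# Manual equivalent of the proposed fix: remove the toolbox deployment when
# the StorageCluster is deleted, so no stale toolbox pod survives.
$ oc delete deployment rook-ceph-tools -n openshift-storage --ignore-not-found
```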
Notes for QE:
- In the ocs-ci uninstall, we can add a step to check the removal of the CT pod (see the sketch below).
- We should make sure that the uninstall still works as expected if no CT pod was deployed.
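A minimal sketch of such a check, assuming the standard rook-ceph-tools naming and the openshift-storage namespace (a real ocs-ci check would use its own helpers rather than raw oc calls):

```
# Illustrative post-uninstall check: report whether the toolbox deployment or
# any toolbox pod is still present after the StorageCluster has been removed.
$ oc get deployment rook-ceph-tools -n openshift-storage 2>/dev/null \
    && echo "FAIL: toolbox deployment still present" \
    || echo "OK: toolbox deployment removed"
$ oc get pods -n openshift-storage | grep ceph-tools \
    && echo "FAIL: toolbox pod still present" \
    || echo "OK: no toolbox pod"
```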
This can't be fixed before dev freeze (we don't have a PR yet) and it is not a blocker/regression. Moving it out; we will fix it in master asap.
I am not sure whether the toolbox was also covered as part of those changes; Blaine can confirm.
The only BZ I have a record of involving Yati is this one: https://bugzilla.redhat.com/show_bug.cgi?id=1968510 It is not related to this PR. I seem to recall communicating with someone who was adjusting ocs-operator's uninstall ordering, which might be somewhat related, but I can no longer find a message/email/BZ reference to that. IMO, this bug is not directly related to either of the issues I mentioned; it isn't strictly the same bug.
Discussed with Talur; we will take it up in 4.10.
Nitin, this is a good first-time issue. Let's fix it in main asap.
Clearing the needinfo, as the bug has been assigned to Malay and he will work on it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:6156