Description of problem (please be as detailed as possible and provide log snippets):

An RBD ReclaimSpaceJob fails with the error "Failed to make node request: node Client not found" on the IBM Cloud managed platform (ROKS).

YAML output of the ReclaimSpaceJob:

apiVersion: csiaddons.openshift.io/v1alpha1
kind: ReclaimSpaceJob
metadata:
  creationTimestamp: "2023-04-05T07:43:42Z"
  generation: 1
  name: reclaimspacejob-pvc-test-fa1b6242a1b5495e9c4dce355af2909-7baf483276ca4626a38f17136cf96124
  namespace: namespace-test-10ffca73549e41719cfae55f9
  resourceVersion: "433127"
  uid: 29de712a-464e-45d7-a427-a37a16c2ed3d
spec:
  backOffLimit: 10
  retryDeadlineSeconds: 900
  target:
    persistentVolumeClaim: pvc-test-fa1b6242a1b5495e9c4dce355af2909
status:
  completionTime: "2023-04-05T07:43:47Z"
  conditions:
  - lastTransitionTime: "2023-04-05T07:43:47Z"
    message: 'Failed to make node request: node Client not found'
    observedGeneration: 1
    reason: failed
    status: "True"
    type: Failed
  message: Maximum retry limit reached
  result: Failed
  retries: 10
  startTime: "2023-04-05T07:43:42Z"

must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-090icm1r3-t1/j-090icm1r3-t1_20230404T212616/logs/failed_testcase_ocs_logs_1680647973/test_rbd_space_reclaim_ocs_logs/ocs_must_gather/

Version of all relevant components (if applicable):
ODF 4.10.11-1
OCP 4.10.53
The issue is also observed in previous ODF 4.10.z versions.

===============================================================================

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, the RBD space reclaim feature is not working.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Create an RBD PVC and attach it to a pod (on the IBM Cloud managed platform).
2. Write a 2 GB file to the volume and then delete it.
3. Create a ReclaimSpaceJob for the PVC (see the shell sketch at the end of this description).
OR
Run the ocs-ci test case tests/manage/pv_services/space_reclaim/test_rbd_space_reclaim.py::TestRbdSpaceReclaim::test_rbd_space_reclaim

Actual results:
The ReclaimSpaceJob does not succeed.

Expected results:
The ReclaimSpaceJob should succeed.

Additional info:
Until very recently, only ODF 4.10 was supported on ROKS, so only ODF versions up to 4.10 have been tested on this platform.
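For reference, a minimal shell sketch of the reproduction steps. The names pvc-test, pod-test, test-ns and the storage class ocs-storagecluster-ceph-rbd are placeholders for this sketch, not values from the failing run:

# Create an RBD-backed PVC (storage class name is an assumption):
$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test
  namespace: test-ns
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
  storageClassName: ocs-storagecluster-ceph-rbd
EOF

# Attach the PVC to a pod, then write ~2 GB into the mounted volume and delete it:
$ oc -n test-ns exec pod-test -- dd if=/dev/zero of=/mnt/testfile bs=1M count=2048
$ oc -n test-ns exec pod-test -- rm /mnt/testfile

# Create the ReclaimSpaceJob targeting the PVC and check its status:
$ cat <<EOF | oc apply -f -
apiVersion: csiaddons.openshift.io/v1alpha1
kind: ReclaimSpaceJob
metadata:
  name: reclaimspacejob-pvc-test
  namespace: test-ns
spec:
  target:
    persistentVolumeClaim: pvc-test
EOF

$ oc -n test-ns get reclaimspacejob reclaimspacejob-pvc-test -o yaml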
The PR https://github.com/red-hat-storage/ocs-operator/pull/1979, which moves the defaults from the configmap to the CSV, was merged for 4.10.12, while this issue is being reported on 4.10.11. And as you mentioned, in this specific case with IBM clusters the configmap is created before the ODF deployment, and the ocs-operator is configured such that if it sees the rook-ceph-operator configmap already exists, it will not touch it (ref: https://github.com/red-hat-storage/ocs-operator/blob/release-4.10/controllers/ocsinitialization/ocsinitialization_controller.go#L354). So settings like CSI_LOG_LEVEL and CSI_ENABLE_CSIADDONS never get added to the configmap here, and hence the issue is hit.
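This can be confirmed on an affected cluster with a quick check (a sketch, assuming the default openshift-storage namespace and the rook-ceph-operator-config configmap name):

# On an affected IBM/ROKS cluster these keys are expected to be absent from the configmap:
$ oc -n openshift-storage get cm rook-ceph-operator-config \
    -o jsonpath='{.data.CSI_ENABLE_CSIADDONS}{"\n"}{.data.CSI_LOG_LEVEL}{"\n"}'

Empty output for CSI_ENABLE_CSIADDONS would match the behaviour described above, since CSI Addons (and with it the node client that ReclaimSpaceJob needs) never gets enabled.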
As I mentioned above, with 4.10.12 I think this issue will no longer be hit, since we have moved default settings like CSI_LOG_LEVEL and CSI_ENABLE_CSIADDONS to the CSV, from which they are set as env variables in the rook operator pod. So can you try the same scenario with the ODF 4.10.12 builds and confirm what happens there?
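Once on 4.10.12, whether the defaults actually reached the rook operator pod can be verified from its deployment env (a sketch; rook-ceph-operator and openshift-storage are the usual ODF defaults, assumed here):

# The CSI settings should now show up as env variables on the operator deployment:
$ oc -n openshift-storage set env deployment/rook-ceph-operator --list \
    | grep -E 'CSI_ENABLE_CSIADDONS|CSI_LOG_LEVEL'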