Bug 2188034 - ReclaimSpaceJob failing in IBM Cloud managed platform [NEEDINFO]
Summary: ReclaimSpaceJob failing in IBM Cloud managed platform
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Malay Kumar parida
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-04-19 14:30 UTC by Jilju Joy
Modified: 2023-08-09 17:00 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-04-24 08:37:28 UTC
Embargoed:
mparida: needinfo? (jijoy)
mrajanna: needinfo? (jijoy)



Description Jilju Joy 2023-04-19 14:30:34 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

An RBD ReclaimSpaceJob is failing with the error "Failed to make node request: node Client not found" on the IBM Cloud managed platform.

YAML output of the ReclaimSpaceJob:

apiVersion: csiaddons.openshift.io/v1alpha1
kind: ReclaimSpaceJob
metadata:
  creationTimestamp: "2023-04-05T07:43:42Z"
  generation: 1
  name: reclaimspacejob-pvc-test-fa1b6242a1b5495e9c4dce355af2909-7baf483276ca4626a38f17136cf96124
  namespace: namespace-test-10ffca73549e41719cfae55f9
  resourceVersion: "433127"
  uid: 29de712a-464e-45d7-a427-a37a16c2ed3d
spec:
  backOffLimit: 10
  retryDeadlineSeconds: 900
  target:
    persistentVolumeClaim: pvc-test-fa1b6242a1b5495e9c4dce355af2909
status:
  completionTime: "2023-04-05T07:43:47Z"
  conditions:
  - lastTransitionTime: "2023-04-05T07:43:47Z"
    message: "Failed to make node request: node Client not found"
    observedGeneration: 1
    reason: failed
    status: "True"
    type: Failed
  message: Maximum retry limit reached
  result: Failed
  retries: 10
  startTime: "2023-04-05T07:43:42Z"

must-gather logs: 
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-090icm1r3-t1/j-090icm1r3-t1_20230404T212616/logs/failed_testcase_ocs_logs_1680647973/test_rbd_space_reclaim_ocs_logs/ocs_must_gather/

Version of all relevant components (if applicable):
ODF 4.10.11-1 
OCP 4.10.53

The issue is observed in previous versions of ODF 4.10.z as well.

===============================================================================
Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Yes, the RBD space reclaim feature is not working.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. On the IBM Cloud managed platform, create an RBD PVC and attach it to a pod.
2. Write a 2 GB file to the volume and delete it.
3. Create a ReclaimSpaceJob for the PVC (see the sketch after these steps).
OR
Run the ocs-ci test case
tests/manage/pv_services/space_reclaim/test_rbd_space_reclaim.py::TestRbdSpaceReclaim::test_rbd_space_reclaim
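
For reference, a minimal ReclaimSpaceJob manifest for step 3 might look like the sketch below. The job and namespace names and the PVC name are placeholders; the spec fields mirror the failing job shown in the description:

cat <<EOF | oc apply -f -
apiVersion: csiaddons.openshift.io/v1alpha1
kind: ReclaimSpaceJob
metadata:
  name: sample-reclaimspacejob          # hypothetical job name
  namespace: my-namespace               # hypothetical namespace of the PVC
spec:
  target:
    persistentVolumeClaim: my-rbd-pvc   # hypothetical RBD PVC name
  backOffLimit: 10
  retryDeadlineSeconds: 900
EOF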


Actual results:
The ReclaimSpaceJob does not succeed.

Expected results:
The ReclaimSpaceJob should succeed.

Additional info:
Until very recently, only ODF 4.10 was supported on ROKS, so only ODF versions up to 4.10 have been tested on this platform.

Comment 4 Malay Kumar parida 2023-04-20 06:04:36 UTC
The PR https://github.com/red-hat-storage/ocs-operator/pull/1979, which moves the defaults from the ConfigMap to the CSV, was merged for 4.10.12, while this issue is being reported on 4.10.11. As you mentioned, in this specific case with IBM clusters the ConfigMap is created before the ODF deployment, and the ocs-operator is configured so that if the rook-ceph-operator ConfigMap already exists it will not touch it (ref: https://github.com/red-hat-storage/ocs-operator/blob/release-4.10/controllers/ocsinitialization/ocsinitialization_controller.go#L354). As a result, settings like CSI_LOG_LEVEL and CSI_ENABLE_CSIADDONS are not added to the ConfigMap, which is why the issue is hit.
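
To confirm this on an affected cluster, one could check whether those keys are present in the ConfigMap (assuming the ConfigMap is named rook-ceph-operator-config and lives in the openshift-storage namespace):

# List the CSI settings in the rook-ceph-operator ConfigMap; on an
# affected IBM Cloud cluster these keys are expected to be absent.
oc -n openshift-storage get configmap rook-ceph-operator-config -o yaml \
  | grep -E 'CSI_LOG_LEVEL|CSI_ENABLE_CSIADDONS'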

Comment 5 Malay Kumar parida 2023-04-20 06:05:00 UTC
As I mentioned above, I think this issue will no longer be hit with 4.10.12, as we have moved default settings like CSI_LOG_LEVEL and CSI_ENABLE_CSIADDONS to the CSV, from which they are set on the rook operator pod as env variables. So can you try the same scenario with ODF 4.10.12 builds and confirm what happens there?
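
On a 4.10.12 build, one way to verify that the defaults are indeed injected as env variables is to list the environment of the rook-ceph-operator deployment (assuming it runs in the openshift-storage namespace):

# Print the env variables set on the rook-ceph-operator deployment and
# filter for the CSI settings that were moved from the ConfigMap to the CSV.
oc -n openshift-storage set env deployment/rook-ceph-operator --list \
  | grep -E 'CSI_LOG_LEVEL|CSI_ENABLE_CSIADDONS'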

