Description of problem (please be as detailed as possible and provide log snippets):

MDS pods were stuck in CLBO state. Due to this we are unable to verify alerts for MDS cache and CPU.

rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-66c44f86lvhxl   1/2   CrashLoopBackOff   9 (100s ago)   37m   10.128.2.73    compute-0   <none>   <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-598589f8fqmtp   1/2   CrashLoopBackOff   9 (61s ago)    37m   10.129.2.235   compute-1   <none>   <none>

Version of all relevant components (if applicable):
OCP: 4.15.0-0.nightly-2023-12-25-100326
ODF: 4.15.0-96

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy an ODF 4.15 cluster on the vSphere platform.
2. Follow the workarounds to get the ceph-exporter pods up and running [BZ-2255328].
3. Run a pod that creates a large number of files to stress the MDS (see the sketch below).
4. Observe that the MDS pods go into CLBO (liveness probe failed).

Actual results:
MDS pods went into CLBO state.

Expected results:
MDS pods should be in Running state without any failures.

Additional info:
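For step 3, a minimal sketch of such a stress pod, assuming a CephFS-backed RWX PVC on the default ODF storage class (PVC name, image, and file counts are illustrative, not the exact reproducer used here):

# Hypothetical reproducer: many small files/directories generate metadata
# operations, which is what loads the MDS. StorageClass name assumed to be
# the default ODF CephFS class.
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mds-stress-pvc
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 10Gi
  storageClassName: ocs-storagecluster-cephfs
---
apiVersion: v1
kind: Pod
metadata:
  name: mds-stress
spec:
  restartPolicy: Never
  containers:
  - name: file-creator
    image: registry.access.redhat.com/ubi9/ubi-minimal
    command: ["sh", "-c"]
    args:
    - |
      for d in $(seq 1 100); do
        mkdir -p /mnt/stress/dir$d
        for f in $(seq 1 5000); do
          : > /mnt/stress/dir$d/file$f
        done
      done
    volumeMounts:
    - name: data
      mountPath: /mnt/stress
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: mds-stress-pvc
EOF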
https://bugzilla.redhat.com/show_bug.cgi?id=2255328 is now fixed in `4.15.0-102`, so there is no need for any workaround using the custom image.
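For completeness, a quick way to confirm the running build actually contains the fix (assuming the default openshift-storage namespace):

# The VERSION column for the ODF operator CSV should show 4.15.0-102 or later.
oc -n openshift-storage get csv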
Nagendra, can you please share the setup details? Is it a vSphere setup, bare metal, or AWS? What are the memory and CPU sizes? How many cores, etc.? Please understand that the feature only added an alert; if your system is crashing, it has nothing to do with the feature. Something is wrong with the setup itself.
(In reply to Mudit Agarwal from comment #9)
> Nagendra, can you please share the setup details?
>
> Is it a vSphere setup, bare metal, or AWS? What are the memory and CPU
> sizes? How many cores, etc.?
>
> Please understand that the feature only added an alert; if your system
> is crashing, it has nothing to do with the feature.
> Something is wrong with the setup itself.

It is a vSphere setup. Please find the node-level resources below.

ENV_DATA:
  platform: 'vsphere'
  deployment_type: 'upi'
  worker_replicas: 3
  master_replicas: 3
  worker_num_cpus: '16'
  master_num_cpus: '4'
  master_memory: '16384'
  compute_memory: '65536'
  fio_storageutilization_min_mbps: 10.0

--> I observed that in 4.14 there are 3 CPU and 8Gi memory assigned to the MDS pod, while in 4.15 it is 2 CPU and 6Gi.
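For reference, the MDS requests/limits can be read straight from the pod spec; a sketch (the label selector and container index are assumptions about the rook-ceph MDS pods):

# Print each MDS pod name followed by its first container's resources.
oc -n openshift-storage get pods -l app=rook-ceph-mds \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{.spec.containers[0].resources}{"\n"}{end}'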
Continuing from my previous comment #10: in 4.15, I can see a reduction in the resources of the MDS pod (2 CPU and 6Gi memory). Is this expected?
(In reply to Nagendra Reddy from comment #10)
> (In reply to Mudit Agarwal from comment #9)
> --> I observed that in 4.14 there are 3 CPU and 8Gi memory assigned to the
> MDS pod, while in 4.15 it is 2 CPU and 6Gi.

This could be due to the different resource profiles available in 4.15, namely Lean, Balanced, and Performance (https://issues.redhat.com/browse/RHSTOR-4547).
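If this build already exposes the profile on the StorageCluster CR, it should be visible there; a sketch, assuming the default CR name and the spec.resourceProfile field implied by RHSTOR-4547:

# An empty result would mean no explicit profile is set (the default applies).
oc -n openshift-storage get storagecluster ocs-storagecluster \
  -o jsonpath='{.spec.resourceProfile}{"\n"}'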
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383