Bug 2256161 - MDS pod in CrashLoopBackOff due to Liveness probe failure
Summary: MDS pod in CrashLoopBackOff due to Liveness probe failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Santosh Pillai
QA Contact: Nagendra Reddy
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-12-29 06:44 UTC by Nagendra Reddy
Modified: 2024-03-19 15:30 UTC
CC List: 5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-03-19 15:30:00 UTC
Embargoed:




Links
Red Hat Product Errata RHSA-2024:1383 - 2024-03-19 15:30:02 UTC

Description Nagendra Reddy 2023-12-29 06:44:31 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

MDS pods were stuck in the CrashLoopBackOff (CLBO) state. Because of this, we are unable to verify the alerts for MDS cache and CPU.

rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-66c44f86lvhxl   1/2     CrashLoopBackOff   9 (100s ago)   37m     10.128.2.73    compute-0   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-598589f8fqmtp   1/2     CrashLoopBackOff   9 (61s ago)    37m     10.129.2.235   compute-1   <none>   
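
For reference, a minimal sketch of the commands used to inspect this state (assuming the default openshift-storage namespace and that the MDS daemon container is named "mds"):

# List the MDS pods with node and IP details.
oc get pods -n openshift-storage -o wide | grep rook-ceph-mds

# Show recent events for one MDS pod, including liveness probe failures.
oc describe pod -n openshift-storage rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-66c44f86lvhxl | tail -n 20

# Check the previous (crashed) run of the mds container for errors.
oc logs -n openshift-storage --previous -c mds rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-66c44f86lvhxl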

Version of all relevant components (if applicable):
OCP: 4.15.0-0.nightly-2023-12-25-100326
ODF: 4.15.0-96
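
For reference, a sketch of how these versions are typically collected (standard oc queries; the openshift-storage namespace is assumed):

# OCP cluster version.
oc get clusterversion version -o jsonpath='{.status.desired.version}'

# ODF operator version from the installed CSVs.
oc get csv -n openshift-storage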


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

Yes
Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1
Is this issue reproducible?
Yes

Can this issue reproduce from the UI?
No

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an ODF 4.15 cluster on the vSphere platform.
2. Apply the workaround to get the ceph-exporter pods up and running [BZ-2255328].
3. Run a pod that creates a large number of files to stress the MDS (a minimal sketch follows after these steps).
4. Observe that the MDS pods go into CLBO (liveness probe failed).
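
A minimal sketch of the stressor pod from step 3 (the PVC name "cephfs-stress-pvc" is hypothetical; any CephFS-backed PVC with enough capacity works):

# Hypothetical stressor: create many small files on a CephFS-backed PVC
# so that MDS metadata load, and therefore its cache and CPU use, grows.
oc apply -n openshift-storage -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: mds-stress
spec:
  restartPolicy: Never
  containers:
  - name: file-creator
    image: registry.access.redhat.com/ubi9/ubi-minimal
    command: ["/bin/sh", "-c"]
    args:
    - for i in $(seq 1 1000000); do echo x > /mnt/cephfs/file_$i; done
    volumeMounts:
    - name: data
      mountPath: /mnt/cephfs
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: cephfs-stress-pvc
EOF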


Actual results:

MDS pods went into the CLBO state.
Expected results:
MDS pods should be in the Running state without any failures.

Additional info:

Comment 5 Santosh Pillai 2024-01-03 10:17:47 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=2255328 is now fixed in `4.15.0-102`, so there is no need for any workaround using the custom image.

Comment 9 Mudit Agarwal 2024-01-05 07:22:43 UTC
Nagendra, can you please share the setup details?

Is it a vSphere setup, bare metal, or AWS? What are the memory and CPU sizes? How many cores, etc.?

Please understand that the feature only added an alert; if your system is crashing, it has nothing to do with the feature.
Something must be wrong with the setup itself.

Comment 10 Nagendra Reddy 2024-01-05 10:33:47 UTC
(In reply to Mudit Agarwal from comment #9)
> Nagendra, can you please share the setup details?
> 
> Is it a vSphere setup, bare metal, or AWS? What are the memory and CPU
> sizes? How many cores, etc.?
> 
> Please understand that the feature only added an alert; if your system
> is crashing, it has nothing to do with the feature.
> Something must be wrong with the setup itself.

It is a vSphere setup. Please find the node-level resources below.

ENV_DATA:
  platform: 'vsphere'
  deployment_type: 'upi'
  worker_replicas: 3
  master_replicas: 3
  worker_num_cpus: '16'
  master_num_cpus: '4'
  master_memory: '16384'
  compute_memory: '65536'
  fio_storageutilization_min_mbps: 10.0



--> I observed that in 4.14 there are 3 CPU and 8Gi memory assigned to the MDS pod,
whereas in 4.15 it is [2 CPU, 6Gi].
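
A sketch of how the assigned MDS resources can be checked (pod name taken from the listing above; assumes the daemon container is named "mds"):

# Print the resource requests/limits of the mds container.
oc get pod -n openshift-storage rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-66c44f86lvhxl -o jsonpath='{.spec.containers[?(@.name=="mds")].resources}'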

Comment 11 Nagendra Reddy 2024-01-05 10:37:26 UTC
Continuing from comment #10:

In 4.15, I can see a reduction in the resources [2 CPU and 6Gi memory] of the MDS pod. Is this expected?

Comment 12 Santosh Pillai 2024-01-05 11:06:26 UTC
(In reply to Nagendra Reddy from comment #10)
> (In reply to Mudit Agarwal from comment #9)

> --> I observed that in 4.14 there are 3 CPU and 8Gi memory assigned to the MDS pod,
> whereas in 4.15 it is [2 CPU, 6Gi].

This could be due to the different resource profiles available in 4.15, namely Lean, Balanced, and Performance (https://issues.redhat.com/browse/RHSTOR-4547).
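
A sketch of how the profile can be inspected or changed, assuming the 4.15 StorageCluster CR exposes it as spec.resourceProfile per RHSTOR-4547 (verify the field name against the installed CRD):

# Inspect the currently selected resource profile (field name assumed).
oc get storagecluster ocs-storagecluster -n openshift-storage -o jsonpath='{.spec.resourceProfile}'

# Switch to the performance profile, which allots larger MDS resources.
oc patch storagecluster ocs-storagecluster -n openshift-storage --type merge -p '{"spec":{"resourceProfile":"performance"}}'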

Comment 18 errata-xmlrpc 2024-03-19 15:30:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

