Bug 2256161 - MDS pod in CrashLoopBackOff due to Liveness probe failure
Summary: MDS pod in CrashLoopBackOff due to Liveness probe failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Santosh Pillai
QA Contact: Nagendra Reddy
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-12-29 06:44 UTC by Nagendra Reddy
Modified: 2024-03-19 15:30 UTC
CC List: 5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-03-19 15:30:00 UTC
Embargoed:




Links
Red Hat Product Errata RHSA-2024:1383 - 2024-03-19 15:30:02 UTC

Description Nagendra Reddy 2023-12-29 06:44:31 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

MDS pods were stuck in the CrashLoopBackOff (CLBO) state. Because of this, we are unable to verify the alerts for MDS cache and CPU.

rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-66c44f86lvhxl   1/2     CrashLoopBackOff   9 (100s ago)   37m     10.128.2.73    compute-0   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-598589f8fqmtp   1/2     CrashLoopBackOff   9 (61s ago)    37m     10.129.2.235   compute-1   <none>   
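
For reference, a minimal sketch of the commands used to inspect this state (assuming the default openshift-storage namespace and that the MDS daemon container is named "mds"):

# List the MDS pods with node and IP details.
oc get pods -n openshift-storage -o wide | grep rook-ceph-mds

# Show recent events for one MDS pod, including liveness probe failures.
oc describe pod -n openshift-storage rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-66c44f86lvhxl | tail -n 20

# Check the previous (crashed) run of the mds container for errors.
oc logs -n openshift-storage --previous -c mds rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-66c44f86lvhxl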

Version of all relevant components (if applicable):
OCP: 4.15.0-0.nightly-2023-12-25-100326
ODF: 4.15.0-96
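
For reference, a sketch of how these versions are typically collected (standard oc queries; the openshift-storage namespace is assumed):

# OCP cluster version.
oc get clusterversion version -o jsonpath='{.status.desired.version}'

# ODF operator version from the installed CSVs.
oc get csv -n openshift-storage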


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

Yes
Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1
Is this issue reproducible?
Yes

Can this issue reproduce from the UI?
No

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an ODF 4.15 cluster on the vSphere platform.
2. Apply the workaround to get the ceph-exporter pods up and running [BZ-2255328].
3. Run a pod that creates a large number of files to stress the MDS (a minimal sketch follows after these steps).
4. Observe that the MDS pods go into CLBO (liveness probe failed).
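
A minimal sketch of the stressor pod from step 3 (the PVC name "cephfs-stress-pvc" is hypothetical; any CephFS-backed PVC with enough capacity works):

# Hypothetical stressor: create many small files on a CephFS-backed PVC
# so that MDS metadata load, and therefore its cache and CPU use, grows.
oc apply -n openshift-storage -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: mds-stress
spec:
  restartPolicy: Never
  containers:
  - name: file-creator
    image: registry.access.redhat.com/ubi9/ubi-minimal
    command: ["/bin/sh", "-c"]
    args:
    - for i in $(seq 1 1000000); do echo x > /mnt/cephfs/file_$i; done
    volumeMounts:
    - name: data
      mountPath: /mnt/cephfs
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: cephfs-stress-pvc
EOF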


Actual results:

MDS pods went into the CLBO state.
Expected results:
MDS pods should be in the Running state without any failures.

Additional info:

Comment 5 Santosh Pillai 2024-01-03 10:17:47 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=2255328 is now fixed in `4.15.0-102`, so there is no need for any workaround using the custom image.

Comment 9 Mudit Agarwal 2024-01-05 07:22:43 UTC
Nagendra, can you please share the setup details?

Is it a vSphere setup, bare metal, or AWS? What are the memory and CPU sizes? How many cores, etc.?

Please understand that the feature only added an alert; if your system is crashing, it has nothing to do with the feature.
Something must be wrong with the setup itself.

Comment 10 Nagendra Reddy 2024-01-05 10:33:47 UTC
(In reply to Mudit Agarwal from comment #9)
> Nagendra, can you please share the setup details?
> 
> Is it a vSphere setup, bare metal, or AWS? What are the memory and CPU
> sizes? How many cores, etc.?
> 
> Please understand that the feature only added an alert; if your system
> is crashing, it has nothing to do with the feature.
> Something must be wrong with the setup itself.

It is a vSphere setup. Please find the node-level resources below.

ENV_DATA:
  platform: 'vsphere'
  deployment_type: 'upi'
  worker_replicas: 3
  master_replicas: 3
  worker_num_cpus: '16'
  master_num_cpus: '4'
  master_memory: '16384'
  compute_memory: '65536'
  fio_storageutilization_min_mbps: 10.0



--> I observed that in 4.14 there are 3 CPU and 8Gi memory assigned to the MDS pod,
whereas in 4.15 it is [2 CPU, 6Gi].
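
A sketch of how the assigned MDS resources can be checked (pod name taken from the listing above; assumes the daemon container is named "mds"):

# Print the resource requests/limits of the mds container.
oc get pod -n openshift-storage rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-66c44f86lvhxl -o jsonpath='{.spec.containers[?(@.name=="mds")].resources}'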

Comment 11 Nagendra Reddy 2024-01-05 10:37:26 UTC
Continuing from comment #10:

In 4.15, I can see a reduction in the resources [2 CPU and 6Gi memory] of the MDS pod. Is this expected?

Comment 12 Santosh Pillai 2024-01-05 11:06:26 UTC
(In reply to Nagendra Reddy from comment #10)
> (In reply to Mudit Agarwal from comment #9)

> --> I observed that in 4.14 there are 3 CPU and 8Gi memory assigned to the MDS pod,
> whereas in 4.15 it is [2 CPU, 6Gi].

This could be due to the different resource profiles available in 4.15, namely Lean, Balanced, and Performance (https://issues.redhat.com/browse/RHSTOR-4547).
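
A sketch of how the profile can be inspected or changed, assuming the 4.15 StorageCluster CR exposes it as spec.resourceProfile per RHSTOR-4547 (verify the field name against the installed CRD):

# Inspect the currently selected resource profile (field name assumed).
oc get storagecluster ocs-storagecluster -n openshift-storage -o jsonpath='{.spec.resourceProfile}'

# Switch to the performance profile, which allots larger MDS resources.
oc patch storagecluster ocs-storagecluster -n openshift-storage --type merge -p '{"spec":{"resourceProfile":"performance"}}'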

Comment 18 errata-xmlrpc 2024-03-19 15:30:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

