Bug 2091951 - [GSS] OCS pods are restarting due to liveness probe failure
Summary: [GSS] OCS pods are restarting due to liveness probe failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.11.0
Assignee: Subham Rai
QA Contact: avdhoot
URL:
Whiteboard:
Depends On:
Blocks: 2094357
 
Reported: 2022-05-31 11:58 UTC by Sonal
Modified: 2023-08-09 17:03 UTC
CC: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
.OpenShift Data Foundation pods are restarting due to liveness probe failure
Previously, the liveness probe timing out caused Ceph pods to restart. This release increases the default timeout for the liveness probe, so the pods now get more time to respond before being restarted on nodes that have higher load and fewer CPU/memory resources.
Clone Of:
Environment:
Last Closed: 2022-08-24 13:54:12 UTC
Embargoed:




Links
Github red-hat-storage/rook pull 390 (open): Bug 2091951: core: increase liveness probe timeout to 2s (last updated 2022-06-20 03:59:57 UTC)
Github rook/rook pull 10460 (open): core: increase liveness probe timeout to 2s (last updated 2022-06-16 12:54:21 UTC)
Red Hat Product Errata RHSA-2022:6156 (last updated 2022-08-24 13:54:26 UTC)

Description Sonal 2022-05-31 11:58:53 UTC
Description of problem:

The liveness probe is failing for the osd, mds, and mon pods, causing the pods to restart very frequently; often the pods end up in CrashLoopBackOff (CLBO) state.

Version-Release number of selected component (if applicable):
OCS 4.8.9
Environment: OCP is running on vSphere

How reproducible:
In the customer's environment

Actual results:
Pods are restarting due to liveness probe failure

Expected results:
The liveness probe should not fail

Additional info:
In the next private comment
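
For reference, the restarts and the liveness-related events can usually be confirmed with something like the following (the openshift-storage namespace is assumed here, as the usual OCS/ODF default):

    $ oc -n openshift-storage get pods | grep -E 'rook-ceph-(mon|mds|osd)'
    $ oc -n openshift-storage get events --field-selector reason=Unhealthy | grep -i liveness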

Comment 4 Subham Rai 2022-05-31 15:28:13 UTC
Liveness probe failures have been reported in the past with older versions as well, and we have multiple fixes for this. We need to check whether this version, OCS 4.8.9, contains those fixes.

Comment 9 Subham Rai 2022-06-02 08:21:03 UTC
The most common reason for liveness probe failure is a lack of CPU/memory or a slow node. Did you try increasing the timeout for the liveness probe to see if that helps?
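
If it helps, a minimal sketch of such an override would be a patch like the one below, assuming this Rook version honours healthCheck.livenessProbe overrides in the CephCluster CR, that the CR carries the usual default name ocs-storagecluster-cephcluster, and that the operator does not reconcile the change away; the value of 5 seconds is only an example:

    $ oc -n openshift-storage patch cephcluster ocs-storagecluster-cephcluster --type merge \
        -p '{"spec":{"healthCheck":{"livenessProbe":{"mon":{"probe":{"timeoutSeconds":5}},"mgr":{"probe":{"timeoutSeconds":5}},"osd":{"probe":{"timeoutSeconds":5}}}}}}'

The same mon/mgr/osd keys should also accept disabled: true if the probe needs to be switched off entirely while debugging.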

Also, please share a must-gather to debug further. Thanks.

Comment 23 Subham Rai 2022-08-04 08:14:01 UTC
To verify this BZ, describe any of the Ceph pods (osd, mgr, mon) and check the `timeoutSeconds` value in the `livenessProbe` section.
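
For example, something along these lines should print the configured timeout per daemon (the app labels and the openshift-storage namespace are assumed to be the usual Rook/ODF defaults):

    $ for app in rook-ceph-mon rook-ceph-mgr rook-ceph-osd; do
          oc -n openshift-storage get pod -l app=$app \
            -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].livenessProbe.timeoutSeconds}{"\n"}{end}'
      done

After the fix, each daemon should report timeoutSeconds: 2, the new default from the linked PRs.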

Comment 24 avdhoot 2022-08-04 12:50:34 UTC
Verified the bug by checking the `timeoutSeconds` value in the `livenessProbe` section of the osd, mgr, and mon pods.


cluster details-

[auth]$ oc get clusterversion
    NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.11.0-0.nightly-2022-07-19-104004   True        False         15d     Cluster version is 4.11.0-0.nightly-2022-07-19-104004
     
     
[auth]$ oc get csv
NAME                              DISPLAY                       VERSION   REPLACES   PHASE
mcg-operator.v4.11.0              NooBaa Operator               4.11.0               Succeeded
ocs-operator.v4.11.0              OpenShift Container Storage   4.11.0               Succeeded
odf-csi-addons-operator.v4.11.0   CSI Addons                    4.11.0               Succeeded
odf-operator.v4.11.0              OpenShift Data Foundation     4.11.0               Succeeded

Pasting the output of oc describe for the Ceph pods below.

    Mgr pod -
     
        image: quay.io/rhceph-dev/rhceph@sha256:5adfc1fde0d2d7d63e41934186c6190fd3c55c3d23bffc70f9c6abff38c16101
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - env
            - -i
            - sh
            - -c
            - ceph --admin-daemon /run/ceph/ceph-mgr.a.asok status
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 2
     
    Mon Pods-
     
        livenessProbe:
          exec:
            command:
            - env
            - -i
            - sh
            - -c
            - ceph --admin-daemon /run/ceph/ceph-mon.b.asok mon_status
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 2
        name: mon
       
    OSD Pods-
        livenessProbe:
          exec:
            command:
            - env
            - -i
            - sh
            - -c
            - ceph --admin-daemon /run/ceph/ceph-osd.0.asok status
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 2
        name: osd

Hence, marking this bug as verified.

Comment 25 avdhoot 2022-08-04 12:53:54 UTC
Also, we did not see any issues in our regression tests with respect to this change.

Comment 27 errata-xmlrpc 2022-08-24 13:54:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6156

