Bug 2091951 - [GSS] OCS pods are restarting due to liveness probe failure
Summary: [GSS] OCS pods are restarting due to liveness probe failure
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.11.0
Assignee: Subham Rai
QA Contact: avdhoot
URL:
Whiteboard:
Depends On:
Blocks: 2094357
 
Reported: 2022-05-31 11:58 UTC by Sonal
Modified: 2023-08-09 17:03 UTC
CC: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
.OpenShift Data Foundation pods are restarting due to liveness probe failure
Previously, the liveness probe timing out caused Ceph pods to restart. This release increases the default timeout for the liveness probe, so the pods now get more time to respond before being restarted on nodes that have higher load and fewer CPU/memory resources.
Clone Of:
Environment:
Last Closed: 2022-08-24 13:54:12 UTC
Embargoed:




Links
Github red-hat-storage/rook pull 390 (open): Bug 2091951: core: increase liveness probe timeout to 2s (last updated 2022-06-20 03:59:57 UTC)
Github rook/rook pull 10460 (open): core: increase liveness probe timeout to 2s (last updated 2022-06-16 12:54:21 UTC)
Red Hat Product Errata RHSA-2022:6156 (last updated 2022-08-24 13:54:26 UTC)

Description Sonal 2022-05-31 11:58:53 UTC
Description of problem:

The liveness probe is failing for the osd, mds, and mon pods, causing the pods to restart very frequently; often the pods end up in CrashLoopBackOff (CLBO) state.

Version-Release number of selected component (if applicable):
OCS 4.8.9
Environment: OCP is running on vSphere

How reproducible:
In the customer's environment

Actual results:
Pods are restarting due to liveness probe failure

Expected results:
The liveness probe should not fail

Additional info:
In the next private comment
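
For reference, the restarts and the liveness-related events can usually be confirmed with something like the following (the openshift-storage namespace is assumed here, as the usual OCS/ODF default):

    $ oc -n openshift-storage get pods | grep -E 'rook-ceph-(mon|mds|osd)'
    $ oc -n openshift-storage get events --field-selector reason=Unhealthy | grep -i liveness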

Comment 4 Subham Rai 2022-05-31 15:28:13 UTC
Liveness probe failures have been reported in the past with older versions as well, and we have multiple fixes for this. We need to check whether this version, OCS 4.8.9, contains those fixes.

Comment 9 Subham Rai 2022-06-02 08:21:03 UTC
The most common reason for liveness probe failure is a lack of CPU/memory or a slow node. Did you try increasing the timeout for the liveness probe to see if that helps?
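
If it helps, a minimal sketch of such an override would be a patch like the one below, assuming this Rook version honours healthCheck.livenessProbe overrides in the CephCluster CR, that the CR carries the usual default name ocs-storagecluster-cephcluster, and that the operator does not reconcile the change away; the value of 5 seconds is only an example:

    $ oc -n openshift-storage patch cephcluster ocs-storagecluster-cephcluster --type merge \
        -p '{"spec":{"healthCheck":{"livenessProbe":{"mon":{"probe":{"timeoutSeconds":5}},"mgr":{"probe":{"timeoutSeconds":5}},"osd":{"probe":{"timeoutSeconds":5}}}}}}'

The same mon/mgr/osd keys should also accept disabled: true if the probe needs to be switched off entirely while debugging.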

Also, please share a must-gather to debug further. Thanks.

Comment 23 Subham Rai 2022-08-04 08:14:01 UTC
To verify this BZ, describe any of the Ceph pods (osd, mgr, mon) and check the `timeoutSeconds` value in the `livenessProbe` section.
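
For example, something along these lines should print the configured timeout per daemon (the app labels and the openshift-storage namespace are assumed to be the usual Rook/ODF defaults):

    $ for app in rook-ceph-mon rook-ceph-mgr rook-ceph-osd; do
          oc -n openshift-storage get pod -l app=$app \
            -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].livenessProbe.timeoutSeconds}{"\n"}{end}'
      done

After the fix, each daemon should report timeoutSeconds: 2, the new default from the linked PRs.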

Comment 24 avdhoot 2022-08-04 12:50:34 UTC
Verified the bug by checking the `timeoutSeconds` value in the `livenessProbe` section of the osd, mgr, and mon pods.


cluster details-

[auth]$ oc get clusterversion
    NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.11.0-0.nightly-2022-07-19-104004   True        False         15d     Cluster version is 4.11.0-0.nightly-2022-07-19-104004
     
     
[auth]$ oc get csv
NAME                              DISPLAY                       VERSION   REPLACES   PHASE
mcg-operator.v4.11.0              NooBaa Operator               4.11.0               Succeeded
ocs-operator.v4.11.0              OpenShift Container Storage   4.11.0               Succeeded
odf-csi-addons-operator.v4.11.0   CSI Addons                    4.11.0               Succeeded
odf-operator.v4.11.0              OpenShift Data Foundation     4.11.0               Succeeded

Pasting the output of oc describe for the Ceph pods below.

    Mgr pod -
     
        image: quay.io/rhceph-dev/rhceph@sha256:5adfc1fde0d2d7d63e41934186c6190fd3c55c3d23bffc70f9c6abff38c16101
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - env
            - -i
            - sh
            - -c
            - ceph --admin-daemon /run/ceph/ceph-mgr.a.asok status
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 2
     
    Mon Pods-
     
        livenessProbe:
          exec:
            command:
            - env
            - -i
            - sh
            - -c
            - ceph --admin-daemon /run/ceph/ceph-mon.b.asok mon_status
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 2
        name: mon
       
    OSD Pods-
        livenessProbe:
          exec:
            command:
            - env
            - -i
            - sh
            - -c
            - ceph --admin-daemon /run/ceph/ceph-osd.0.asok status
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 2
        name: osd

Hence, marking this bug as verified.

Comment 25 avdhoot 2022-08-04 12:53:54 UTC
Also, we did not see any issues in our regression tests with respect to this change.

Comment 27 errata-xmlrpc 2022-08-24 13:54:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6156

