Bug 2091951

Summary: [GSS] OCS pods are restarting due to liveness probe failure
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Sonal <sarora>
Component: rook Assignee: Subham Rai <srai>
Status: CLOSED ERRATA QA Contact: avdhoot <asagare>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.8 CC: madam, mmanjuna, muagarwa, ocs-bugs, odf-bz-bot, olakra, srai, tdesala, tnielsen
Target Milestone: ---   
Target Release: ODF 4.11.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
.OpenShift Data Foundation pods are restarting due to liveness probe failure
Previously, liveness probe failures caused Ceph pods to restart. This release increases the default timeout for the liveness probe, so pods on nodes that are heavily loaded or have limited CPU and memory resources get more time before they are restarted for a failed liveness check.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-24 13:54:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2094357    

Description Sonal 2022-05-31 11:58:53 UTC
Description of problem:

The liveness probe is failing for the osd, mds, and mon pods, causing them to restart very frequently; the pods are often in CrashLoopBackOff (CLBO) state.

Version-Release number of selected component (if applicable):
OCS 4.8.9
Environment: OCP is running on vSphere

How reproducible:
In customer's environment

Actual results:
Pods are restarting due to liveness probe failure

Expected results:
Liveness probe should not fail

Additional info:
In the next private comment

Comment 4 Subham Rai 2022-05-31 15:28:13 UTC
Liveness probe failures have also been reported in the past with older versions, and we have multiple fixes for this. We need to check whether this version, OCS 4.8.9, has those fixes.

Comment 9 Subham Rai 2022-06-02 08:21:03 UTC
The most common reason for liveness probe failure is a lack of CPU/memory or a slow node. Did you try increasing the time for the liveness probe to see if that helps?

Also, please share must-gather to debug further. Thanks

Comment 23 Subham Rai 2022-08-04 08:14:01 UTC
To verify this BZ, describe any of the Ceph pods (osd, mgr, mon) and check `TimeoutSeconds` inside the `Probe` section.
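
The check described above could also be scripted. The following is only a minimal sketch, not part of the fix or of any Rook tooling: it assumes the pod manifest has already been loaded into a Python dict (for example from `oc get pod <name> -o json`), and the function name `liveness_timeouts` and the sample data are hypothetical.

```python
def liveness_timeouts(pod: dict) -> dict:
    """Map each container name to its livenessProbe timeoutSeconds.

    Containers without a liveness probe map to None.
    """
    result = {}
    for container in pod.get("spec", {}).get("containers", []):
        probe = container.get("livenessProbe")
        if probe is None:
            result[container["name"]] = None
        else:
            # Kubernetes applies a default of 1 second when
            # timeoutSeconds is omitted from the probe.
            result[container["name"]] = probe.get("timeoutSeconds", 1)
    return result


# Hypothetical sample, shaped like the pod spec returned by the API.
sample_pod = {
    "spec": {
        "containers": [
            {
                "name": "osd",
                "livenessProbe": {
                    "failureThreshold": 3,
                    "periodSeconds": 10,
                    "timeoutSeconds": 2,
                },
            },
            {"name": "sidecar"},  # no liveness probe defined
        ]
    }
}

if __name__ == "__main__":
    for name, timeout in liveness_timeouts(sample_pod).items():
        print(f"{name}: timeoutSeconds={timeout}")
```

Running the same function over the osd, mgr, and mon pods would give one value per daemon container to compare against the expected timeout.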

Comment 24 avdhoot 2022-08-04 12:50:34 UTC
Verified the bug by checking `TimeoutSeconds` inside the `Probe` section of the osd, mgr, and mon pods.


cluster details-

[auth]$ oc get clusterversion
    NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.11.0-0.nightly-2022-07-19-104004   True        False         15d     Cluster version is 4.11.0-0.nightly-2022-07-19-104004
     
     
[auth]$ oc get csv
NAME                              DISPLAY                       VERSION   REPLACES   PHASE
mcg-operator.v4.11.0              NooBaa Operator               4.11.0               Succeeded
ocs-operator.v4.11.0              OpenShift Container Storage   4.11.0               Succeeded
odf-csi-addons-operator.v4.11.0   CSI Addons                    4.11.0               Succeeded
odf-operator.v4.11.0              OpenShift Data Foundation     4.11.0               Succeeded

Pasting the output of oc describe for the Ceph pods here.

    Mgr pod -
     
        image: quay.io/rhceph-dev/rhceph@sha256:5adfc1fde0d2d7d63e41934186c6190fd3c55c3d23bffc70f9c6abff38c16101
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - env
            - -i
            - sh
            - -c
            - ceph --admin-daemon /run/ceph/ceph-mgr.a.asok status
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 2
     
    Mon Pods-
     
        livenessProbe:
          exec:
            command:
            - env
            - -i
            - sh
            - -c
            - ceph --admin-daemon /run/ceph/ceph-mon.b.asok mon_status
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 2
        name: mon
       
    OSD Pods-
        livenessProbe:
          exec:
            command:
            - env
            - -i
            - sh
            - -c
            - ceph --admin-daemon /run/ceph/ceph-osd.0.asok status
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 2
        name: osd

hence marking it as verified.

Comment 25 avdhoot 2022-08-04 12:53:54 UTC
Also, we did not see any issues in our regression tests with respect to this change.

Comment 27 errata-xmlrpc 2022-08-24 13:54:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6156