Description of problem:
Liveness probe failing for osd, mds, and mon pods, causing the pods to restart very frequently; the pods are often in CrashLoopBackOff (CLBO) state.

Version-Release number of selected component (if applicable):
OCS 4.8.9

Environment:
OCP is running on vSphere

How reproducible:
In the customer's environment

Actual results:
Pods are restarting due to liveness probe failure

Expected results:
Liveness probe should not fail

Additional info:
In the next private comment
The issue of the liveness probe failing has also been reported in the past with older versions, and we have multiple fixes for it. We need to check whether this version, OCS 4.8.9, includes those fixes.
The most common reason for liveness probe failure is a lack of CPU/memory or the node being slow. Did you try increasing the liveness probe timeout to see if that helps? Also, please share a must-gather to debug further. Thanks.
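For reference, recent Rook builds expose per-daemon liveness probe overrides under the CephCluster CR's healthCheck section. A minimal sketch is below; the resource name, namespace, and timeout value are assumptions for a typical OCS/ODF deployment (the ocs-operator may reconcile manual edits back to its defaults, so this is illustrative only):

# Illustrative sketch: raise the liveness probe timeout for mon/mgr/osd daemons
# via the CephCluster healthCheck spec. Name/namespace/values are assumptions.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: ocs-storagecluster-cephcluster   # assumed name for an OCS-managed cluster
  namespace: openshift-storage           # assumed namespace
spec:
  healthCheck:
    livenessProbe:
      mon:
        probe:
          timeoutSeconds: 5   # example value, tune as needed
      mgr:
        probe:
          timeoutSeconds: 5
      osd:
        probe:
          timeoutSeconds: 5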
To verify this BZ, describe any of the Ceph pods (osd, mgr, mon) and check the `timeoutSeconds` value inside the liveness `Probe` section.
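For instance, a quick way to pull the configured timeout out of the pod specs (a sketch; the openshift-storage namespace and the rook-ceph-* app labels are assumptions based on a standard deployment):

# Sketch: print the livenessProbe timeoutSeconds of the first container of each
# mon/mgr/osd pod. Namespace and label values are assumed for a typical deployment.
oc -n openshift-storage get pods \
  -l 'app in (rook-ceph-mon,rook-ceph-mgr,rook-ceph-osd)' \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].livenessProbe.timeoutSeconds}{"\n"}{end}'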
Verified the bug by checking the `timeoutSeconds` inside the `Probe` section of the osd, mgr, and mon pods.

Cluster details -

[auth]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-07-19-104004   True        False         15d     Cluster version is 4.11.0-0.nightly-2022-07-19-104004

[auth]$ oc get csv
NAME                              DISPLAY                       VERSION   REPLACES   PHASE
mcg-operator.v4.11.0              NooBaa Operator               4.11.0               Succeeded
ocs-operator.v4.11.0              OpenShift Container Storage   4.11.0               Succeeded
odf-csi-addons-operator.v4.11.0   CSI Addons                    4.11.0               Succeeded
odf-operator.v4.11.0              OpenShift Data Foundation     4.11.0               Succeeded

Pasting here the output of oc describe of the Ceph pods.

Mgr pod -
    image: quay.io/rhceph-dev/rhceph@sha256:5adfc1fde0d2d7d63e41934186c6190fd3c55c3d23bffc70f9c6abff38c16101
    imagePullPolicy: IfNotPresent
    livenessProbe:
      exec:
        command:
        - env
        - -i
        - sh
        - -c
        - ceph --admin-daemon /run/ceph/ceph-mgr.a.asok status
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 2

Mon pods -
    livenessProbe:
      exec:
        command:
        - env
        - -i
        - sh
        - -c
        - ceph --admin-daemon /run/ceph/ceph-mon.b.asok mon_status
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 2
    name: mon

OSD pods -
    livenessProbe:
      exec:
        command:
        - env
        - -i
        - sh
        - -c
        - ceph --admin-daemon /run/ceph/ceph-osd.0.asok status
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 2
    name: osd

Hence marking it as verified.
Also, we did not see any issues in our regression tests with respect to this change.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:6156