Bug 2091951
Summary: | [GSS] OCS pods are restarting due to liveness probe failure | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Sonal <sarora> |
Component: | rook | Assignee: | Subham Rai <srai> |
Status: | CLOSED ERRATA | QA Contact: | avdhoot <asagare> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.8 | CC: | madam, mmanjuna, muagarwa, ocs-bugs, odf-bz-bot, olakra, srai, tdesala, tnielsen |
Target Milestone: | --- | ||
Target Release: | ODF 4.11.0 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
.OpenShift Data Foundation pods are restarting due to liveness probe failure
Previously, the liveness probe caused Ceph pods to restart when the probe timed out. With this update, the default timeout for the liveness probe is increased, so pods get more time before being restarted due to a liveness probe failure on nodes that are heavily loaded or have fewer CPU/memory resources.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2022-08-24 13:54:12 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 2094357 |
Description
Sonal
2022-05-31 11:58:53 UTC
The issue of the liveness probe failing has been reported in the past with older versions as well, and we already have multiple fixes for it. We need to check whether this version, OCS 4.8.9, has those fixes. The most common reason for a liveness probe failure is a lack of CPU/memory or the node being slow. Did you try increasing the timeout for the liveness probe to see if that helps? Also, please share a must-gather to debug further. Thanks.

To verify this BZ, describe any of the Ceph pods (osd, mgr, mon) and check the `TimeoutSeconds` inside the `Probe` section.

Verified the bug by checking the `TimeoutSeconds` inside the `Probe` section of the osd, mgr, and mon pods.

Cluster details:

[auth]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-07-19-104004   True        False         15d     Cluster version is 4.11.0-0.nightly-2022-07-19-104004

[auth]$ oc get csv
NAME                              DISPLAY                       VERSION   REPLACES   PHASE
mcg-operator.v4.11.0              NooBaa Operator               4.11.0               Succeeded
ocs-operator.v4.11.0              OpenShift Container Storage   4.11.0               Succeeded
odf-csi-addons-operator.v4.11.0   CSI Addons                    4.11.0               Succeeded
odf-operator.v4.11.0              OpenShift Data Foundation     4.11.0               Succeeded

Pasting here the output of oc describe of the Ceph pods.

Mgr pod:

image: quay.io/rhceph-dev/rhceph@sha256:5adfc1fde0d2d7d63e41934186c6190fd3c55c3d23bffc70f9c6abff38c16101
imagePullPolicy: IfNotPresent
livenessProbe:
  exec:
    command:
    - env
    - -i
    - sh
    - -c
    - ceph --admin-daemon /run/ceph/ceph-mgr.a.asok status
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 2

Mon pod:

livenessProbe:
  exec:
    command:
    - env
    - -i
    - sh
    - -c
    - ceph --admin-daemon /run/ceph/ceph-mon.b.asok mon_status
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 2
name: mon

OSD pod:

livenessProbe:
  exec:
    command:
    - env
    - -i
    - sh
    - -c
    - ceph --admin-daemon /run/ceph/ceph-osd.0.asok status
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 2
name: osd

Hence marking it as verified. We also did not see any issues in our regression tests with respect to this change.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6156
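
As a follow-up note on the verification steps above: the configured timeout can also be read straight from the pod specs instead of scanning the full oc describe output. This is only a sketch; it assumes the default openshift-storage namespace, the usual app=rook-ceph-<daemon> pod labels, and that the daemon container is the first container in each pod:

# Print pod name and liveness probe timeout for the mon, mgr, and osd pods.
for d in mon mgr osd; do
  oc -n openshift-storage get pods -l app=rook-ceph-$d \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].livenessProbe.timeoutSeconds}{"\n"}{end}'
done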
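
For clusters still running a build without the increased default, Rook exposes a healthCheck section on the CephCluster CR that can override the probe settings per daemon, which is one way to act on the "increase the timeout for the liveness probe" suggestion above. A minimal sketch, assuming the default ODF CephCluster name ocs-storagecluster-cephcluster and an illustrative timeoutSeconds of 10; note that in ODF this CR is managed by ocs-operator, so a manual edit may be reconciled away and should be treated as a diagnostic experiment rather than a permanent fix:

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: ocs-storagecluster-cephcluster
  namespace: openshift-storage
spec:
  healthCheck:
    livenessProbe:
      mon:
        probe:
          timeoutSeconds: 10   # illustrative value, larger than the 2s shown in the verification output
      mgr:
        probe:
          timeoutSeconds: 10
      osd:
        probe:
          timeoutSeconds: 10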