Bug 2091951
| Summary: | [GSS] OCS pods are restarting due to liveness probe failure | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Sonal <sarora> |
| Component: | rook | Assignee: | Subham Rai <srai> |
| Status: | CLOSED ERRATA | QA Contact: | avdhoot <asagare> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.8 | CC: | madam, mmanjuna, muagarwa, ocs-bugs, odf-bz-bot, olakra, srai, tdesala, tnielsen |
| Target Milestone: | --- | | |
| Target Release: | ODF 4.11.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | .OpenShift Data Foundation pods are restarting due to liveness probe failure Previously, the liveness probe on the Ceph pods caused them to restart. This release increases the default timeout for the liveness probe, so the pods now get more time before being restarted on nodes that are heavily loaded or have fewer CPU and memory resources. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-24 13:54:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2094357 | | |
Description Sonal 2022-05-31 11:58:53 UTC
The issue of the liveness probe failing has also been reported in the past with older versions, and we have multiple fixes for it. We need to check whether this version (OCS 4.8.9) includes those fixes. The most common reason for a liveness probe failure is a lack of CPU/memory or a slow node. Did you try increasing the timeout for the liveness probe to see if that helps? Also, please share a must-gather to debug further. Thanks.

To verify this BZ, describe any of the Ceph pods (osd, mgr, mon) and check `timeoutSeconds` inside the `Probe` section.
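For illustration only, a minimal sketch of how the liveness probe timeout could be raised through the CephCluster CR, assuming the upstream Rook `healthCheck.livenessProbe` override fields; the CR name and the timeout values below are placeholders, and in ODF the CephCluster is managed by the ocs-operator, so manual edits may be reconciled away:

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: ocs-storagecluster-cephcluster   # assumed ODF CephCluster name; verify with `oc get cephcluster -n openshift-storage`
  namespace: openshift-storage
spec:
  healthCheck:
    livenessProbe:
      mon:
        probe:
          timeoutSeconds: 5   # placeholder value, not the shipped default
      mgr:
        probe:
          timeoutSeconds: 5
      osd:
        probe:
          timeoutSeconds: 5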
Verified the bug by checking `timeoutSeconds` inside the `Probe` section of the osd, mgr, and mon pods.
Cluster details:
[auth]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-07-19-104004   True        False         15d     Cluster version is 4.11.0-0.nightly-2022-07-19-104004

[auth]$ oc get csv
NAME                              DISPLAY                       VERSION   REPLACES   PHASE
mcg-operator.v4.11.0              NooBaa Operator               4.11.0               Succeeded
ocs-operator.v4.11.0              OpenShift Container Storage   4.11.0               Succeeded
odf-csi-addons-operator.v4.11.0   CSI Addons                    4.11.0               Succeeded
odf-operator.v4.11.0              OpenShift Data Foundation     4.11.0               Succeeded
Pasting the output of `oc describe` for the Ceph pods here.
Mgr pod:
  image: quay.io/rhceph-dev/rhceph@sha256:5adfc1fde0d2d7d63e41934186c6190fd3c55c3d23bffc70f9c6abff38c16101
  imagePullPolicy: IfNotPresent
  livenessProbe:
    exec:
      command:
      - env
      - -i
      - sh
      - -c
      - ceph --admin-daemon /run/ceph/ceph-mgr.a.asok status
    failureThreshold: 3
    initialDelaySeconds: 10
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 2
Mon pods:
  livenessProbe:
    exec:
      command:
      - env
      - -i
      - sh
      - -c
      - ceph --admin-daemon /run/ceph/ceph-mon.b.asok mon_status
    failureThreshold: 3
    initialDelaySeconds: 10
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 2
  name: mon
OSD pods:
  livenessProbe:
    exec:
      command:
      - env
      - -i
      - sh
      - -c
      - ceph --admin-daemon /run/ceph/ceph-osd.0.asok status
    failureThreshold: 3
    initialDelaySeconds: 10
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 2
  name: osd
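For reference, the same probe timeouts can also be read without the full describe output via a jsonpath query; the namespace and labels below assume a default ODF deployment in `openshift-storage` with the standard Rook app labels, and that the daemon container is the first container in each pod:

# Print pod name and livenessProbe timeoutSeconds for each Ceph daemon type
for app in rook-ceph-mon rook-ceph-mgr rook-ceph-osd; do
  oc -n openshift-storage get pods -l app=$app \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].livenessProbe.timeoutSeconds}{"\n"}{end}'
done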
Hence, marking it as verified.
Also, we did not see any issues in our regression tests with respect to this change.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.11.0 security, enhancement, & bugfix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6156