Bug 2258021
| Summary: | OSD cpu overutilization alert is raised during normal fio workloads | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Joy John Pinto <jopinto> |
| Component: | ceph-monitoring | Assignee: | Divyansh Kamboj <dkamboj> |
| Status: | CLOSED ERRATA | QA Contact: | Joy John Pinto <jopinto> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.15 | CC: | dkamboj, ebenahar, muagarwa, odf-bz-bot |
| Target Milestone: | --- | | |
| Target Release: | ODF 4.15.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.15.0-125 | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-03-19 15:31:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Joy John Pinto
2024-01-12 07:17:00 UTC
While verifying the bug by running multiple fio jobs, I observed an OSD pod restart when memory usage crossed 80% of the total allocated memory. Since OSD_memory_target_ratio is already set to 0.8, I don't think we would hit this scenario of an OSD pod crossing 80% of its allocated memory.

[jopinto@jopinto jan30]$ kubectl top pod -n openshift-storage -l app=rook-ceph-osd
NAME                               CPU(cores)   MEMORY(bytes)
rook-ceph-osd-0-5d97dfcdfc-d9xfb   115m         2084Mi
rook-ceph-osd-1-84cc4b8d49-v8rvq   253m         2514Mi
rook-ceph-osd-2-66c5fd79d8-jv4ds   222m         1896Mi
[jopinto@jopinto jan30]$ kubectl top pod -n openshift-storage -l app=rook-ceph-osd
NAME                               CPU(cores)   MEMORY(bytes)
rook-ceph-osd-0-5d97dfcdfc-d9xfb   115m         36Mi
rook-ceph-osd-1-84cc4b8d49-v8rvq   299m         2085Mi
rook-ceph-osd-2-66c5fd79d8-jv4ds   436m         1749Mi

Apologies for the error (memory utilization) in https://bugzilla.redhat.com/show_bug.cgi?id=2258021#c5. Will retest with OSD CPU utilization crossing 80%.

Verified with OCP 4.15.0-0.nightly-2024-02-14-214710 and ODF 4.15.0-142.
Since it was difficult to reach 80% CPU utilization on the OSD pods, I retried the same scenario by creating a new alert, as suggested by dev, that triggers at 40% CPU utilization of the OSD pods sustained for 10 minutes.
I was able to reach this with a randread, 4k block size, five-replica fio job. After 10 minutes of OSD CPU utilization above 40%, an alert appeared in the UI (refer alert_osd.png).
Once the CPU utilization dropped below the 40% threshold, the alert disappeared.
Prometheus rule YAML used in the verification:
[jopinto@jopinto upgrade_ibm]$ cat alert.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: high-cpu-usage
  namespace: openshift-storage
spec:
  groups:
  - name: ceph-daemon-performance-alerts.rules
    rules:
    - alert: OSDCPULoadHigh2
      annotations:
        description: CPU usage for osd on pod {{ $labels.pod }} has exceeded 80%. Consider creating more OSDs to increase performance
        message: High CPU usage detected in OSD container on pod {{ $labels.pod}}.
        severity_level: warning
      expr: |
        pod:container_cpu_usage:sum{pod=~"rook-ceph-osd-.*"} / on(pod) kube_pod_resource_limit{resource='cpu',pod=~"rook-ceph-osd-.*"} > 0.40
      for: 10m
      labels:
        severity: warning
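
For readers less familiar with how the rule's expr and for fields interact, here is a rough Python sketch of the evaluation. It is not Prometheus code and was not part of the verification. It only models the usual behaviour where an alert is pending while the usage/limit ratio stays above the threshold and turns firing once that has held for the full duration. The Sample type, the alert_states function, the 60-second sample spacing and the 2-core CPU limit are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

THRESHOLD = 0.40       # ratio taken from the rule's expr ("> 0.40")
FOR_SECONDS = 10 * 60  # the rule's "for: 10m"


@dataclass
class Sample:
    """One hypothetical evaluation point for a single OSD pod."""
    t: int              # seconds since the start of the test
    usage_cores: float  # stand-in for pod:container_cpu_usage:sum
    limit_cores: float  # stand-in for the pod's CPU limit


def alert_states(samples, threshold=THRESHOLD, for_seconds=FOR_SECONDS):
    """Return a list of (t, state) with state in inactive/pending/firing."""
    states = []
    active_since = None  # time the ratio first exceeded the threshold
    for s in samples:
        ratio = s.usage_cores / s.limit_cores
        if ratio > threshold:
            if active_since is None:
                active_since = s.t
            held = s.t - active_since
            state = "firing" if held >= for_seconds else "pending"
        else:
            # Dropping below the threshold resets the timer and clears the alert.
            active_since = None
            state = "inactive"
        states.append((s.t, state))
    return states


if __name__ == "__main__":
    # Assumed 2-core limit: 0.9 cores is 45% (above 40%), 0.5 cores is 25%.
    load = [Sample(t, 0.9, 2.0) for t in range(0, 720, 60)]
    load.append(Sample(720, 0.5, 2.0))
    for t, state in alert_states(load):
        print(t, state)
```

With these assumed numbers the alert stays pending for the first ten minutes, fires at the 600-second sample, and clears at the first sample below 40%, which matches the sequence described in the verification above.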
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days