Description of problem (please be detailed as possible and provide log snippets):
OSD CPU overutilization alert is raised during normal fio workloads.

Version of all relevant components (if applicable):
ODF 4.15.0-103

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
NA

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
No

Steps to Reproduce:
1. Install OCP and ODF 4.15.
2. Run a fio workload.
3. Observe the CPU overutilization alert being raised when OSD CPU utilization crosses 35%.

Actual results:
During the fio workload, the CPU overutilization alert is raised when OSD CPU utilization crosses 35%.

Expected results:
The 35% threshold set for OSD CPU utilization (or memory utilization) is too low; it should be increased to above 60%. Please refer to the comments in https://issues.redhat.com/browse/RHSTOR-4881.

Additional info:
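For reference, a minimal sketch of what a relaxed alert expression could look like, reusing the metric names from the rule quoted in the verification comment below. The rule name and final threshold actually shipped are assumptions here, not confirmed values:

  pod:container_cpu_usage:sum{pod=~"rook-ceph-osd-.*"}
    / on(pod) kube_pod_resource_limit{resource='cpu',pod=~"rook-ceph-osd-.*"}
    > 0.60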
Upon verification of the bug, while running multiple fio jobs I observed an OSD pod restart when memory crossed 80% of the total allocated memory. Since osd_memory_target_ratio is already set to 0.8, I don't think we would hit the scenario of an OSD pod crossing 80% of its allocated memory.

[jopinto@jopinto jan30]$ kubectl top pod -n openshift-storage -l app=rook-ceph-osd
NAME                               CPU(cores)   MEMORY(bytes)
rook-ceph-osd-0-5d97dfcdfc-d9xfb   115m         2084Mi
rook-ceph-osd-1-84cc4b8d49-v8rvq   253m         2514Mi
rook-ceph-osd-2-66c5fd79d8-jv4ds   222m         1896Mi

[jopinto@jopinto jan30]$ kubectl top pod -n openshift-storage -l app=rook-ceph-osd
NAME                               CPU(cores)   MEMORY(bytes)
rook-ceph-osd-0-5d97dfcdfc-d9xfb   115m         36Mi
rook-ceph-osd-1-84cc4b8d49-v8rvq   299m         2085Mi
rook-ceph-osd-2-66c5fd79d8-jv4ds   436m         1749Mi
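To cross-check the effective OSD memory target on a live cluster, one option (a sketch, assuming the rook-ceph-tools deployment is available in openshift-storage) is to query the Ceph configuration directly:

  # Effective per-OSD memory target in bytes, as seen by Ceph
  kubectl -n openshift-storage exec deploy/rook-ceph-tools -- ceph config get osd osd_memory_target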
Apologies for the error (memory utilization) in https://bugzilla.redhat.com/show_bug.cgi?id=2258021#c5; I will retest with OSD CPU utilization crossing 80%.
Verified with OCP 4.15.0-0.nightly-2024-02-14-214710 and ODF 4.15.0-142.

Since it was difficult to push OSD pods past 80% CPU utilization, I retried the same scenario by creating a new alert, as suggested by dev, for 40% CPU utilization of OSD pods sustained for 10 minutes. I was able to achieve this with a randread, 4k block size, five-replica fio job; 10 minutes after OSD CPU utilization crossed 40%, the alert appeared in the UI (refer alert_osd.png). Once CPU utilization went below the 40% threshold, the alert disappeared.

Prometheus rule YAML used in the verification:

[jopinto@jopinto upgrade_ibm]$ cat alert.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: high-cpu-usage
  namespace: openshift-storage
spec:
  groups:
  - name: ceph-daemon-performance-alerts.rules
    rules:
    - alert: OSDCPULoadHigh2
      annotations:
        description: CPU usage for osd on pod {{ $labels.pod }} has exceeded 80%. Consider creating more OSDs to increase performance
        message: High CPU usage detected in OSD container on pod {{ $labels.pod }}.
        severity_level: warning
      expr: |
        pod:container_cpu_usage:sum{pod=~"rook-ceph-osd-.*"} / on(pod) kube_pod_resource_limit{resource='cpu',pod=~"rook-ceph-osd-.*"} > 0.40
      for: 10m
      labels:
        severity: warning
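For anyone repeating this verification, the rule above can be applied and the OSD load watched with standard commands (a sketch; the file name and rule name match the YAML above):

  # Apply the test alert rule and confirm it was created
  oc apply -f alert.yaml
  oc -n openshift-storage get prometheusrule high-cpu-usage

  # Watch OSD CPU while the fio job runs; the alert should fire once
  # usage stays above 40% of the pod CPU limit for 10 minutes
  kubectl top pod -n openshift-storage -l app=rook-ceph-osd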
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:1383