Bug 2258021

Summary: OSD cpu overutilization alert is raised during normal fio workloads
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Joy John Pinto <jopinto>
Component: ceph-monitoring Assignee: Divyansh Kamboj <dkamboj>
Status: CLOSED ERRATA QA Contact: Joy John Pinto <jopinto>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.15 CC: dkamboj, ebenahar, muagarwa, odf-bz-bot
Target Milestone: ---   
Target Release: ODF 4.15.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.15.0-125 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2024-03-19 15:31:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Joy John Pinto 2024-01-12 07:17:00 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
OSD cpu overutilization alert is raised during normal fio workloads

Version of all relevant components (if applicable):
ODF 4.15.0-103

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
NA

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue reproduce from the UI?
NA

If this is a regression, please provide more details to justify this:
No

Steps to Reproduce:
1. Install OCP and ODF 4.15
2. Run a fio workload until OSD CPU utilization crosses 35%
3. Observe that the CPU overutilization alert is raised


Actual results:

During the fio workload, the CPU overutilization alert is raised when OSD CPU utilization crosses 35%.

Expected results:
The 35% limit set for OSD CPU utilization or memory utilization is too low; it should be increased to above 60%. Please refer to the comments in Jira: https://issues.redhat.com/browse/RHSTOR-4881

Additional info:

Comment 5 Joy John Pinto 2024-02-06 09:26:26 UTC
While verifying the bug by running multiple fio jobs, I observed an OSD pod restart when its memory usage crossed 80% of the total allocated memory. Since OSD_memory_target_ratio is already set to 0.8, I don't think we are likely to hit this scenario of an OSD pod crossing 80% of the allocated memory.

[jopinto@jopinto jan30]$ kubectl top pod -n openshift-storage -l app=rook-ceph-osd
NAME                               CPU(cores)   MEMORY(bytes)   
rook-ceph-osd-0-5d97dfcdfc-d9xfb   115m         2084Mi          
rook-ceph-osd-1-84cc4b8d49-v8rvq   253m         2514Mi          
rook-ceph-osd-2-66c5fd79d8-jv4ds   222m         1896Mi          
[jopinto@jopinto jan30]$ kubectl top pod -n openshift-storage -l app=rook-ceph-osd
NAME                               CPU(cores)   MEMORY(bytes)   
rook-ceph-osd-0-5d97dfcdfc-d9xfb   115m         36Mi            
rook-ceph-osd-1-84cc4b8d49-v8rvq   299m         2085Mi          
rook-ceph-osd-2-66c5fd79d8-jv4ds   436m         1749Mi
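
For reference, the restart can be spotted by comparing the two `kubectl top` samples above. The following is a minimal Python sketch, not part of the verification itself. The helper names `parse_top` and `memory_drops` and the 0.5 drop factor are hypothetical choices made for illustration. It only handles the `m` and `Mi` units that appear in the output above.

```python
# Samples copied from the two `kubectl top pod` runs in this comment.
BEFORE = """
NAME                               CPU(cores)   MEMORY(bytes)
rook-ceph-osd-0-5d97dfcdfc-d9xfb   115m         2084Mi
rook-ceph-osd-1-84cc4b8d49-v8rvq   253m         2514Mi
rook-ceph-osd-2-66c5fd79d8-jv4ds   222m         1896Mi
"""

AFTER = """
NAME                               CPU(cores)   MEMORY(bytes)
rook-ceph-osd-0-5d97dfcdfc-d9xfb   115m         36Mi
rook-ceph-osd-1-84cc4b8d49-v8rvq   299m         2085Mi
rook-ceph-osd-2-66c5fd79d8-jv4ds   436m         1749Mi
"""


def parse_top(text):
    """Parse `kubectl top pod` output into {pod: (cpu_millicores, memory_mib)}."""
    usage = {}
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a 3-column data row.
        if len(fields) != 3 or fields[0] == "NAME":
            continue
        pod, cpu, mem = fields
        # Assumption: CPU is reported in millicores ("m") and memory in "Mi".
        if not (cpu.endswith("m") and mem.endswith("Mi")):
            continue
        usage[pod] = (int(cpu[:-1]), int(mem[:-2]))
    return usage


def memory_drops(before, after, factor=0.5):
    """Return pods whose memory fell below `factor` times the earlier sample.

    The 0.5 factor is a hypothetical heuristic for "the container probably
    restarted", not a value taken from ODF or Ceph.
    """
    return [
        pod
        for pod in before
        if pod in after and after[pod][1] < before[pod][1] * factor
    ]


print(memory_drops(parse_top(BEFORE), parse_top(AFTER)))
# prints ['rook-ceph-osd-0-5d97dfcdfc-d9xfb']
```

Only rook-ceph-osd-0 is flagged: its memory fell from 2084Mi to 36Mi between the two samples, which is consistent with the restart described above.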

Comment 7 Joy John Pinto 2024-02-07 11:22:22 UTC
Apologies for the error (memory utilization) in https://bugzilla.redhat.com/show_bug.cgi?id=2258021#c5. I will retest with OSD CPU utilization crossing 80%.

Comment 8 Joy John Pinto 2024-02-16 09:17:00 UTC
Verified with OCP 4.15.0-0.nightly-2024-02-14-214710 and ODF 4.15.0-142

Since it was difficult to reach 80% CPU utilization on the OSD pods, I retried the same scenario by creating a new alert, as suggested by dev, that fires at 40% CPU utilization of the OSD pods sustained for 10 minutes.

I was able to achieve this with a randread, 4k block size, five-replica fio job. After OSD CPU utilization had stayed above 40% for 10 minutes, an alert appeared in the UI (refer to alert_osd.png).

Once the CPU utilization went below the 40% threshold, the alert disappeared.

Prometheus rule YAML used in the verification:
[jopinto@jopinto upgrade_ibm]$ cat alert.yaml 
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: high-cpu-usage
  namespace: openshift-storage
spec:
  groups:
  - name: ceph-daemon-performance-alerts.rules
    rules:
    - alert: OSDCPULoadHigh2
      annotations:
        description: CPU usage for osd on pod {{ $labels.pod }} has exceeded 80%. Consider creating more OSDs to increase performance
        message: High CPU usage detected in OSD container on pod {{ $labels.pod}}.
        severity_level: warning
      expr: |
        pod:container_cpu_usage:sum{pod=~"rook-ceph-osd-.*"} / on(pod) kube_pod_resource_limit{resource='cpu',pod=~"rook-ceph-osd-.*"} > 0.40
      for: 10m
      labels:
        severity: warning
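
The rule above compares the ratio of OSD CPU usage to the pod CPU limit against 0.40 and requires the condition to hold for 10 minutes. The Python sketch below is a simplified model of that logic, not Prometheus's actual evaluation code. The function name `alert_fires`, the 60-second sampling interval, and the 2-core limit are assumptions made for illustration.

```python
def alert_fires(samples, threshold=0.40, hold_seconds=600):
    """Simplified model of `expr > threshold` combined with `for: 10m`.

    samples: list of (timestamp_seconds, usage_cores, limit_cores), sorted
    by time. Returns True if usage/limit has stayed above the threshold
    for at least hold_seconds up to and including the last sample.
    """
    pending_since = None
    for ts, usage, limit in samples:
        if usage / limit > threshold:
            # Condition holds: remember when it first became true.
            if pending_since is None:
                pending_since = ts
        else:
            # Any dip to or below the threshold resets the pending timer.
            pending_since = None
    return pending_since is not None and samples[-1][0] - pending_since >= hold_seconds


# Hypothetical data: a 2-core limit, sampled every 60 s for 10 minutes.
steady = [(t, 0.9, 2.0) for t in range(0, 601, 60)]  # ratio 0.45 throughout
dipped = [(t, 0.5 if t == 300 else 0.9, 2.0) for t in range(0, 601, 60)]

print(alert_fires(steady))  # prints True
print(alert_fires(dipped))  # prints False
```

The second case mirrors the behaviour seen during verification: once utilization drops to or below the threshold, the condition no longer holds and the 10-minute window has to start again.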

Comment 11 errata-xmlrpc 2024-03-19 15:31:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

Comment 12 Red Hat Bugzilla 2024-07-18 04:25:24 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days