Bug 2248855 - Observing ~397% CPU usage on dashboard after deploying the nvme service
Summary: Observing ~397% CPU usage on dashboard after deploying the nvme service
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Dashboard
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 7.1
Assignee: Nizamudeen
QA Contact: Rahul Lepakshi
Docs Contact: Akash Raj
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-11-09 10:14 UTC by Vidushi Mishra
Modified: 2024-06-13 14:23 UTC
CC List: 8 users

Fixed In Version: ceph-18.2.1-92.el9cp
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-06-13 14:23:02 UTC
Embargoed:
epuertat: needinfo+


Attachments (Terms of Use)
cpu_usage_1 (105.88 KB, image/png), uploaded 2023-11-13 10:09 UTC by Rahul Lepakshi
Dashboard (98.35 KB, image/png), uploaded 2024-04-29 07:45 UTC by Rahul Lepakshi


Links
GitHub ceph/ceph pull 56295: mgr/dashboard: rm warning/error threshold for cpu usage (open), last updated 2024-03-19 14:59:53 UTC
Red Hat Issue Tracker RHCEPH-7880, last updated 2023-11-09 10:17:49 UTC
Red Hat Issue Tracker RHCSDASH-1241, last updated 2024-02-01 04:35:05 UTC
Red Hat Product Errata RHSA-2024:3925, last updated 2024-06-13 14:23:05 UTC

Description Vidushi Mishra 2023-11-09 10:14:57 UTC
Description of problem:

Observing ~397% CPU usage after deploying the nvme service

Version-Release number of selected component (if applicable):

ceph version 18.2.0-117.el9cp

ceph-nvmeof:0.0.4-1

How reproducible:
1/1

Steps to Reproduce:

1. ceph orch apply nvmeof rbdpool --placement="pluto003"

[root@pluto003 ~]# ceph orch ls | grep nvme
nvmeof.rbdpool  ?:4420,5500,8009      1/1  4m ago     25m  pluto003   
[root@pluto003 ~]# 


Actual results:

After the deployment, we observed ~397% CPU usage for the reactor_0 process:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                     
1798835 root      20   0  128.3g  43960  25424 R 396.7   0.0  74:29.30 reactor_0                                                                                                                                   



Expected results:


Additional info:
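
For reference, a minimal non-interactive way to capture the same per-process figure (a sketch only, assuming psutil is installed on the gateway host; the PID is just the example value from the top output above):

# Sketch, not part of the official reproduction steps.
# Samples the reactor process for one second and prints the same
# per-process CPU% that top reports (summed across all threads).
import psutil

REACTOR_PID = 1798835  # reactor_0 process from the top output above

proc = psutil.Process(REACTOR_PID)
cpu = proc.cpu_percent(interval=1.0)  # blocks ~1s, sums all threads
print(f"reactor process CPU: {cpu:.1f}% (can exceed 100% on multi-core hosts)")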

Comment 2 Rahul Lepakshi 2023-11-13 10:09:27 UTC
Created attachment 1999083 [details]
cpu_usage_1

Comment 3 Rahul Lepakshi 2023-11-13 10:09:56 UTC
A few insights here -
The ~397% CPU usage is the sum of 4 reactor threads, each at ~100%, which is expected as they are always in polling mode - https://github.com/spdk/spdk/issues/285
But the dashboard still highlights this as high CPU usage, regardless of how this works - see attachment

[root@pluto003 ~]# top
top - 09:45:54 up 117 days,  2:18,  2 users,  load average: 5.56, 5.64, 5.76
Tasks: 394 total,   2 running, 392 sleeping,   0 stopped,   0 zombie
%Cpu(s): 13.9 us,  1.3 sy,  0.0 ni, 84.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 128284.1 total,  37664.1 free,  26288.8 used,  68023.9 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 101995.4 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
2129234 root      20   0  128.3g  34296  15864 R 393.8   0.0  17942:04 reactor_0
   1014 root      20   0  663628  36100  30796 S  56.2   0.0  36321:20 rsyslogd
    826 root      20   0  288996 171140 164972 S  12.5   0.1  17298:35 systemd-journal
2113394 ceph      20   0 2316380   1.6g  36352 S   6.2   1.3  99:26.95 ceph-osd
      1 root      20   0  174104  17992  10500 S   0.0   0.0  30:11.17 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:13.02 kthreadd

[root@pluto003 ~]# top -H -p 2129234
top - 09:46:14 up 117 days,  2:18,  2 users,  load average: 5.59, 5.64, 5.76
Threads:   6 total,   4 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu(s): 14.5 us,  1.8 sy,  0.0 ni, 83.4 id,  0.0 wa,  0.1 hi,  0.1 si,  0.0 st
MiB Mem : 128284.1 total,  37639.6 free,  26283.8 used,  68053.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 102000.3 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
2129254 root      20   0  128.3g  34296  15864 R  99.9   0.0   4494:22 reactor_3
2129234 root      20   0  128.3g  34296  15864 R  99.7   0.0   4459:59 reactor_0
2129252 root      20   0  128.3g  34296  15864 R  99.7   0.0   4494:48 reactor_1
2129253 root      20   0  128.3g  34296  15864 R  99.7   0.0   4494:13 reactor_2
2129251 root      20   0  128.3g  34296  15864 S   0.0   0.0   0:00.00 eal-intr-thread
2129255 root      20   0  128.3g  34296  15864 S   0.0   0.0   0:00.00 telemetry-v2

[root@pluto003 ~]# ^C
[root@pluto003 ~]# taskset -p 2129234
pid 2129234's current affinity mask: 1
[root@pluto003 ~]# taskset -p 2129252
pid 2129252's current affinity mask: 2
[root@pluto003 ~]# taskset -p 2129253
pid 2129253's current affinity mask: 4
[root@pluto003 ~]# taskset -p 2129254
pid 2129254's current affinity mask: 8
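
For clarity, the affinity masks above (1, 2, 4, 8) decode to one dedicated CPU per reactor thread, which is why each thread sits at ~100% on its own core. A small decoding sketch (illustration only, not ceph or SPDK code):

def cpus_from_mask(mask: int) -> list[int]:
    """Return the CPU indices set in an affinity bitmask, e.g. 0x8 -> [3]."""
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]

# Masks reported by taskset above
for name, mask in [("reactor_0", 0x1), ("reactor_1", 0x2),
                   ("reactor_2", 0x4), ("reactor_3", 0x8)]:
    print(name, "-> CPU(s)", cpus_from_mask(mask))
# reactor_0 -> CPU(s) [0] ... reactor_3 -> CPU(s) [3]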

Comment 5 Aviv Caro 2024-01-21 15:18:56 UTC
This is expected because SPDK runs on 4 cores in polling mode. I don't see what the bug is here. Can you explain?

Comment 6 Rahul Lepakshi 2024-02-01 04:34:39 UTC
Moving this BZ to the dashboard team, as this warning/alert is shown on the dashboard at all times, and it is expected behavior from the nvmeof component since the reactors are always in polling mode. I would like to hear from the dashboard team how we can handle such alerts, and whether it is okay to suppress them.

@epuertat WDYT?
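
For context, the linked PR ("mgr/dashboard: rm warning/error threshold for cpu usage") removes the warning/error threshold from the dashboard's CPU gauge. To illustrate why an absolute per-process percentage threshold misfires for a polling workload, here is a hypothetical sketch (not the dashboard's actual code) that scales the top-style figure by the cores allotted to SPDK:

# Hypothetical illustration only, not the dashboard's actual code.
# A top-style per-process CPU% is a sum over threads, so a 4-core
# polling gateway legitimately reports ~400%.
def normalized_cpu_percent(raw_percent: float, allotted_cores: int) -> float:
    """Scale a per-process CPU% (summed over threads) to 0-100 per allotted core."""
    return raw_percent / allotted_cores

print(normalized_cpu_percent(396.7, 4))  # ~99.2: the gateway is simply fully busy on its 4 cores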

Comment 10 Rahul Lepakshi 2024-04-29 07:45:03 UTC
Created attachment 2029939 [details]
Dashboard

Comment 11 Rahul Lepakshi 2024-04-29 07:47:37 UTC
Verified. See attachment 

# ceph version
ceph version 18.2.1-149.el9cp (6944266a2186e8940baeefc45140e9c798b90141) reef (stable)

Comment 12 errata-xmlrpc 2024-06-13 14:23:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:3925

