Description of problem:
Observing ~397% CPU usage after deploying the nvmeof service.

Version-Release number of selected component (if applicable):
ceph version 18.2.0-117.el9cp
ceph-nvmeof:0.0.4-1

How reproducible:
1/1

Steps to Reproduce:
1. ceph orch apply nvmeof rbdpool --placement="pluto003"

[root@pluto003 ~]# ceph orch ls | grep nvme
nvmeof.rbdpool    ?:4420,5500,8009    1/1  4m ago  25m  pluto003
[root@pluto003 ~]#

Actual results:
After the deployment, we observed ~397% CPU usage:

    PID USER  PR NI   VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
1798835 root  20  0 128.3g 43960 25424 R 396.7  0.0 74:29.30 reactor_0

Expected results:

Additional info:
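For reference, the one-liner above can also be expressed as a service spec and applied with ceph orch apply -i. A minimal sketch, assuming the standard cephadm nvmeof spec fields (verify the exact keys against your cephadm release):

# Hypothetical equivalent of the reproducer command as a service spec;
# field names assume the cephadm nvmeof service spec.
cat <<'EOF' > nvmeof.yaml
service_type: nvmeof
service_id: rbdpool
placement:
  hosts:
    - pluto003
spec:
  pool: rbdpool
EOF
ceph orch apply -i nvmeof.yaml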
Created attachment 1999083 [details] cpu_usage_1
A few insights here:

The ~397% CPU usage is the aggregate of 4 reactor threads, each at ~100%. This is expected, as SPDK reactors are always in polling mode - https://github.com/spdk/spdk/issues/285

However, the dashboard flags this as high CPU usage regardless of the fact that it is by design - see attachment.

[root@pluto003 ~]# top
top - 09:45:54 up 117 days,  2:18,  2 users,  load average: 5.56, 5.64, 5.76
Tasks: 394 total,   2 running, 392 sleeping,   0 stopped,   0 zombie
%Cpu(s): 13.9 us,  1.3 sy,  0.0 ni, 84.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 128284.1 total,  37664.1 free,  26288.8 used,  68023.9 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 101995.4 avail Mem

    PID USER  PR NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
2129234 root  20  0  128.3g  34296  15864 R 393.8  0.0  17942:04 reactor_0
   1014 root  20  0  663628  36100  30796 S  56.2  0.0  36321:20 rsyslogd
    826 root  20  0  288996 171140 164972 S  12.5  0.1  17298:35 systemd-journal
2113394 ceph  20  0 2316380   1.6g  36352 S   6.2  1.3  99:26.95 ceph-osd
      1 root  20  0  174104  17992  10500 S   0.0  0.0  30:11.17 systemd
      2 root  20  0       0      0      0 S   0.0  0.0   0:13.02 kthreadd

[root@pluto003 ~]# top -H -p 2129234
top - 09:46:14 up 117 days,  2:18,  2 users,  load average: 5.59, 5.64, 5.76
Threads:   6 total,   4 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu(s): 14.5 us,  1.8 sy,  0.0 ni, 83.4 id,  0.0 wa,  0.1 hi,  0.1 si,  0.0 st
MiB Mem : 128284.1 total,  37639.6 free,  26283.8 used,  68053.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 102000.3 avail Mem

    PID USER  PR NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
2129254 root  20  0  128.3g  34296  15864 R  99.9  0.0   4494:22 reactor_3
2129234 root  20  0  128.3g  34296  15864 R  99.7  0.0   4459:59 reactor_0
2129252 root  20  0  128.3g  34296  15864 R  99.7  0.0   4494:48 reactor_1
2129253 root  20  0  128.3g  34296  15864 R  99.7  0.0   4494:13 reactor_2
2129251 root  20  0  128.3g  34296  15864 S   0.0  0.0   0:00.00 eal-intr-thread
2129255 root  20  0  128.3g  34296  15864 S   0.0  0.0   0:00.00 telemetry-v2
[root@pluto003 ~]#

[root@pluto003 ~]# taskset -p 2129234
pid 2129234's current affinity mask: 1
[root@pluto003 ~]# taskset -p 2129252
pid 2129252's current affinity mask: 2
[root@pluto003 ~]# taskset -p 2129253
pid 2129253's current affinity mask: 4
[root@pluto003 ~]# taskset -p 2129254
pid 2129254's current affinity mask: 8

Each reactor is pinned to its own core (masks 1, 2, 4, 8 decode to cores 0-3), so the four threads together account for the ~400% total.
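To capture the same per-thread view in one pass, something like the following works - a minimal sketch; the pgrep pattern assumes the SPDK target process is named nvmf_tgt, so substitute the PID seen in top (2129234 above) if the name differs:

# List every thread of the SPDK target together with its CPU affinity
# mask. The process name "nvmf_tgt" is an assumption; fall back to the
# PID from top if pgrep finds nothing.
PID=$(pgrep -x nvmf_tgt | head -n 1)
for TASK in /proc/"$PID"/task/*; do
    TID=$(basename "$TASK")
    printf '%s %s ' "$TID" "$(cat "$TASK"/comm)"
    taskset -p "$TID" | awk -F': ' '{print "affinity=" $2}'
done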
This is expected because SPDK runs on 4 cores in polling mode. I don't see what the bug is here - can you explain?
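If the concern is the number of cores consumed rather than the polling itself, the reactor count follows the SPDK core mask. A minimal sketch of shrinking it, assuming the gateway passes extra arguments to the SPDK target via a tgt_cmd_extra_args option in its [spdk] config section (the option name and section are assumptions for this ceph-nvmeof version; -m/--cpumask itself is the standard SPDK core-mask flag):

# ceph-nvmeof.conf excerpt (hypothetical) - restrict SPDK to cores 0-1,
# leaving 2 reactors at ~100% instead of 4. Verify tgt_cmd_extra_args
# is honored by the deployed gateway version before relying on it.
[spdk]
tgt_cmd_extra_args = -m 0x3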
Moving this BZ to the dashboard team, as this is a warning/alert shown on the dashboard at all times, and it is expected behavior from the nvmeof component since the reactors are always in polling mode. I would like to hear from the dashboard team how we can handle such alerts and whether it's okay to suppress them. @epuertat WDYT?
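If suppression turns out to be acceptable, one option is an Alertmanager silence scoped to the gateway host. A minimal sketch, assuming amtool is pointed at the cluster's Alertmanager; the alertname "HostHighCpuLoad" is a placeholder - take the real name and labels from the firing alert:

# Hypothetical silence for the high-CPU alert on pluto003. Check the
# actual alert labels in Alertmanager before creating this; silences
# expire, so adjust --duration or re-create as needed.
amtool silence add \
    alertname="HostHighCpuLoad" instance=~"pluto003.*" \
    --comment="SPDK reactors poll at 100% by design (nvmeof gateway)" \
    --duration="720h"

A cleaner long-term fix would be teaching the alerting rule or the dashboard to exempt known polling-mode processes, but that is exactly the dashboard-team call being asked for here.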
Created attachment 2029939 [details] Dashboard
Verified. See attachment.

# ceph version
ceph version 18.2.1-149.el9cp (6944266a2186e8940baeefc45140e9c798b90141) reef (stable)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:3925