Description of problem:
Observing ~397% CPU usage after deploying the nvmeof service.

Version-Release number of selected component (if applicable):
ceph version 18.2.0-117.el9cp
ceph-nvmeof:0.0.4-1

How reproducible:
1/1

Steps to Reproduce:
1. ceph orch apply nvmeof rbdpool --placement="pluto003"

[root@pluto003 ~]# ceph orch ls | grep nvme
nvmeof.rbdpool    ?:4420,5500,8009    1/1  4m ago  25m  pluto003
[root@pluto003 ~]#

Actual results:
After the deployment, we observed ~397% CPU usage:

    PID USER  PR NI   VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
1798835 root  20  0 128.3g 43960 25424 R 396.7  0.0 74:29.30 reactor_0

Expected results:

Additional info:
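For reference, the one-liner above can also be expressed as a service spec and applied with ceph orch apply -i. A minimal sketch, assuming the standard cephadm nvmeof spec fields (verify the exact keys against your cephadm release):

# Hypothetical equivalent of the reproducer command as a service spec;
# field names assume the cephadm nvmeof service spec.
cat <<'EOF' > nvmeof.yaml
service_type: nvmeof
service_id: rbdpool
placement:
  hosts:
    - pluto003
spec:
  pool: rbdpool
EOF
ceph orch apply -i nvmeof.yaml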
Created attachment 1999083 [details] cpu_usage_1
A few insights here:

The ~397% CPU usage is the aggregate of 4 reactor threads, each at ~100%. This is expected, as SPDK reactors are always in polling mode - https://github.com/spdk/spdk/issues/285

However, the dashboard flags this as high CPU usage regardless of the fact that it is by design - see attachment.

[root@pluto003 ~]# top
top - 09:45:54 up 117 days,  2:18,  2 users,  load average: 5.56, 5.64, 5.76
Tasks: 394 total,   2 running, 392 sleeping,   0 stopped,   0 zombie
%Cpu(s): 13.9 us,  1.3 sy,  0.0 ni, 84.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 128284.1 total,  37664.1 free,  26288.8 used,  68023.9 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 101995.4 avail Mem

    PID USER  PR NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
2129234 root  20  0  128.3g  34296  15864 R 393.8  0.0  17942:04 reactor_0
   1014 root  20  0  663628  36100  30796 S  56.2  0.0  36321:20 rsyslogd
    826 root  20  0  288996 171140 164972 S  12.5  0.1  17298:35 systemd-journal
2113394 ceph  20  0 2316380   1.6g  36352 S   6.2  1.3  99:26.95 ceph-osd
      1 root  20  0  174104  17992  10500 S   0.0  0.0  30:11.17 systemd
      2 root  20  0       0      0      0 S   0.0  0.0   0:13.02 kthreadd

[root@pluto003 ~]# top -H -p 2129234
top - 09:46:14 up 117 days,  2:18,  2 users,  load average: 5.59, 5.64, 5.76
Threads:   6 total,   4 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu(s): 14.5 us,  1.8 sy,  0.0 ni, 83.4 id,  0.0 wa,  0.1 hi,  0.1 si,  0.0 st
MiB Mem : 128284.1 total,  37639.6 free,  26283.8 used,  68053.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 102000.3 avail Mem

    PID USER  PR NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
2129254 root  20  0  128.3g  34296  15864 R  99.9  0.0   4494:22 reactor_3
2129234 root  20  0  128.3g  34296  15864 R  99.7  0.0   4459:59 reactor_0
2129252 root  20  0  128.3g  34296  15864 R  99.7  0.0   4494:48 reactor_1
2129253 root  20  0  128.3g  34296  15864 R  99.7  0.0   4494:13 reactor_2
2129251 root  20  0  128.3g  34296  15864 S   0.0  0.0   0:00.00 eal-intr-thread
2129255 root  20  0  128.3g  34296  15864 S   0.0  0.0   0:00.00 telemetry-v2
[root@pluto003 ~]#

[root@pluto003 ~]# taskset -p 2129234
pid 2129234's current affinity mask: 1
[root@pluto003 ~]# taskset -p 2129252
pid 2129252's current affinity mask: 2
[root@pluto003 ~]# taskset -p 2129253
pid 2129253's current affinity mask: 4
[root@pluto003 ~]# taskset -p 2129254
pid 2129254's current affinity mask: 8

Each reactor is pinned to its own core (masks 1, 2, 4, 8 decode to cores 0-3), so the four threads together account for the ~400% total.
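To capture the same per-thread view in one pass, something like the following works - a minimal sketch; the pgrep pattern assumes the SPDK target process is named nvmf_tgt, so substitute the PID seen in top (2129234 above) if the name differs:

# List every thread of the SPDK target together with its CPU affinity
# mask. The process name "nvmf_tgt" is an assumption; fall back to the
# PID from top if pgrep finds nothing.
PID=$(pgrep -x nvmf_tgt | head -n 1)
for TASK in /proc/"$PID"/task/*; do
    TID=$(basename "$TASK")
    printf '%s %s ' "$TID" "$(cat "$TASK"/comm)"
    taskset -p "$TID" | awk -F': ' '{print "affinity=" $2}'
done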
This is expected because SPDK runs on 4 cores in polling mode. I don't see what the bug is here - can you explain?
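If the concern is the number of cores consumed rather than the polling itself, the reactor count follows the SPDK core mask. A minimal sketch of shrinking it, assuming the gateway passes extra arguments to the SPDK target via a tgt_cmd_extra_args option in its [spdk] config section (the option name and section are assumptions for this ceph-nvmeof version; -m/--cpumask itself is the standard SPDK core-mask flag):

# ceph-nvmeof.conf excerpt (hypothetical) - restrict SPDK to cores 0-1,
# leaving 2 reactors at ~100% instead of 4. Verify tgt_cmd_extra_args
# is honored by the deployed gateway version before relying on it.
[spdk]
tgt_cmd_extra_args = -m 0x3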
Moving this BZ to the dashboard team, as this is a warning/alert shown on the dashboard at all times, and it is expected behavior from the nvmeof component since the reactors are always in polling mode. I would like to hear from the dashboard team how we can handle such alerts and whether it's okay to suppress them. @epuertat WDYT?
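If suppression turns out to be acceptable, one option is an Alertmanager silence scoped to the gateway host. A minimal sketch, assuming amtool is pointed at the cluster's Alertmanager; the alertname "HostHighCpuLoad" is a placeholder - take the real name and labels from the firing alert:

# Hypothetical silence for the high-CPU alert on pluto003. Check the
# actual alert labels in Alertmanager before creating this; silences
# expire, so adjust --duration or re-create as needed.
amtool silence add \
    alertname="HostHighCpuLoad" instance=~"pluto003.*" \
    --comment="SPDK reactors poll at 100% by design (nvmeof gateway)" \
    --duration="720h"

A cleaner long-term fix would be teaching the alerting rule or the dashboard to exempt known polling-mode processes, but that is exactly the dashboard-team call being asked for here.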
Created attachment 2029939 [details] Dashboard
Verified. See attachment.

# ceph version
ceph version 18.2.1-149.el9cp (6944266a2186e8940baeefc45140e9c798b90141) reef (stable)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:3925