Description of problem:
CPU utilisation on the baremetal Ceph node hosting the MDS is very high: the ceph-mds process is at 59.8%.
This is seen after running the stress and scalability tests on the Consistency Group feature.
Snippet of top command output from MDS node magna24:
 PID USER PR NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
1759 167  20  0 9658228   5.8g   8892 S 59.8 18.7   5864:09 ceph-mds
 599 root 20  0  228720 130484 127536 S  1.0  0.4 569:12.47 systemd-journal
 802 root 20  0  598176  46556  34760 S  1.0  0.1 141:57.00 rsyslogd
2378 167  20  0 5008724   3.2g  10332 S  0.3 10.2 103:43.44 ceph-osd
2381 167  20  0 4896376   3.1g  10500 S  0.3  9.9 125:46.09 ceph-osd
Version-Release number of selected component (if applicable): ceph version 18.2.1-76.el9cp
How reproducible:
Steps to Reproduce:
1. Run the Consistency Group quiesce stress and scalability tests on a baremetal Ceph setup.
2. Capture CPU usage from the MDS node.
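Step 2 can be scripted so the ceph-mds figure is recorded the same way on every run. The following is a minimal Python sketch, not the method used by the test suite: it assumes the process table has the default top column layout shown in the snippet above (for example, as produced by `top -b -n 1`), and the helper names `parse_top_row` and `cpu_percent` are hypothetical.

```python
from typing import Dict, List, Optional

# Column order of the process table, as in the top snippet in this report.
# Assumption: top is using its default field layout.
TOP_COLUMNS = ["PID", "USER", "PR", "NI", "VIRT", "RES", "SHR", "S",
               "%CPU", "%MEM", "TIME+", "COMMAND"]


def parse_top_row(line: str) -> Optional[Dict[str, str]]:
    """Split one process row of top output into a column -> value dict.

    Returns None for anything that is not a process row (header line,
    blank line, summary lines).
    """
    parts = line.strip().split(None, len(TOP_COLUMNS) - 1)
    if len(parts) != len(TOP_COLUMNS) or not parts[0].isdigit():
        return None
    return dict(zip(TOP_COLUMNS, parts))


def cpu_percent(top_output: str, command: str) -> List[float]:
    """Return the %CPU value of every row whose COMMAND equals `command`."""
    rows = (parse_top_row(line) for line in top_output.splitlines())
    return [float(row["%CPU"]) for row in rows
            if row is not None and row["COMMAND"] == command]


# Sample input copied from the top snippet in this report.
SAMPLE = """\
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1759 167 20 0 9658228 5.8g 8892 S 59.8 18.7 5864:09 ceph-mds
599 root 20 0 228720 130484 127536 S 1.0 0.4 569:12.47 systemd-journal
2378 167 20 0 5008724 3.2g 10332 S 0.3 10.2 103:43.44 ceph-osd
2381 167 20 0 4896376 3.1g 10500 S 0.3 9.9 125:46.09 ceph-osd
"""

if __name__ == "__main__":
    print(cpu_percent(SAMPLE, "ceph-mds"))
```

A list is returned because a node can run several daemons with the same command name, as with the two ceph-osd rows here. Note that top may truncate long command names, so the match is against the name as displayed.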
Actual results: CPU utilisation - 59.8%
The system reports socket timeouts for any further tests, and ceph status has shown the "1 MDSs report slow requests" warning for more than a day.
Test logs for the socket timeout: http://magna002.ceph.redhat.com/cephci-jenkins/cephci-run-P4WY12
Expected results: Even when the IO load on the MDS is high because of the stress and scalability tests, the MDS daemon should limit its CPU utilisation.
Additional info: System logs will be copied.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2024:3925