Bug 1397798 - MTSH: multithreaded self heal hogs cpu consistently over 150% [NEEDINFO]
Summary: MTSH: multithreaded self heal hogs cpu consistently over 150%
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: replicate
Version: rhgs-3.2
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Mohit Agrawal
QA Contact: Vijay Avuthu
Depends On: 1484446
Blocks: 1503134
Reported: 2016-11-23 11:25 UTC by nchilaka
Modified: 2018-09-17 13:40 UTC

Fixed In Version: glusterfs-3.12.2-2
Doc Type: Bug Fix
Doc Text:
Some gluster daemons, such as glustershd, consume more CPU or memory when there is a large amount of data or a large number of entries to be healed. This results in slow consumption of resources. You can resolve this by running the control-cpu-load.sh script. This script uses control groups to regulate the CPU and memory usage of any gluster daemon.
Clone Of:
Last Closed: 2018-09-04 06:29:44 UTC
Target Upstream Version:
srmukher: needinfo? (moagrawa)
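
How a control-group CPU cap of the kind mentioned in the Doc Text works can be sketched as below. This is not the contents of control-cpu-load.sh. It is a minimal illustration that assumes the cgroup v1 CFS bandwidth files (cpu.cfs_period_us, cpu.cfs_quota_us) and a hypothetical cgroup directory. The helper names are made up for this example, and nothing is written to disk.

```python
import os


def cpu_quota_us(percent_of_one_cpu, period_us=100000):
    """Convert a CPU cap (percent of one core) into a CFS quota in microseconds."""
    if percent_of_one_cpu <= 0:
        raise ValueError("cap must be positive")
    return int(period_us * percent_of_one_cpu / 100)


def planned_writes(cgroup_dir, pid, percent_of_one_cpu, period_us=100000):
    """Return the (file, value) pairs a cgroup v1 CPU cap would need.

    Only the plan is returned; no file is touched.
    """
    quota = cpu_quota_us(percent_of_one_cpu, period_us)
    return [
        (os.path.join(cgroup_dir, "cpu.cfs_period_us"), str(period_us)),
        (os.path.join(cgroup_dir, "cpu.cfs_quota_us"), str(quota)),
        # Writing the PID to cgroup.procs moves the whole process into the group.
        (os.path.join(cgroup_dir, "cgroup.procs"), str(pid)),
    ]


if __name__ == "__main__":
    # Hypothetical directory and PID, capped at 50% of one core.
    for path, value in planned_writes("/sys/fs/cgroup/cpu/demo_gluster", 3354, 50):
        print(path, "<-", value)
```

With the default 100000 microsecond period, a quota of 50000 limits the group to half of one core per period, and a quota above the period allows more than one core.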

Attachments
log files (529.58 KB, application/x-gzip)
2016-11-23 11:32 UTC, nchilaka

System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:2607 None None None 2018-09-04 06:31:17 UTC

Description nchilaka 2016-11-23 11:25:22 UTC
Description of problem:
When we set the shd thread count to, say, 4 and wait for the heal,
we can see that the self-heal daemon consistently consumes about 150% CPU on average for as long as the heal goes on.
This can put a real strain on resources.

For that reason, I am guessing, even the machines in my systemic testing, where heals are pending and mtsh is set to 4, are so busy, with kernel hung messages, that I can't even log in.
I can only ping the machines.

I know mtsh comes with a trade-off in resources, but there must be a cap on it.
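
For reference, the self-heal daemon thread count is normally changed through the volume option cluster.shd-max-threads. The report does not state the exact command or volume used, so the sketch below assumes that option, an accepted range of 1 to 64, and a hypothetical volume name `testvol`. The helper only builds the command line; it does not run it.

```python
import shlex


def shd_threads_command(volume, threads):
    """Build the gluster CLI call that sets the self-heal daemon thread count."""
    # Assumption: 1-64 is the accepted range for cluster.shd-max-threads.
    if not 1 <= threads <= 64:
        raise ValueError("threads must be between 1 and 64")
    return [
        "gluster", "volume", "set", volume,
        "cluster.shd-max-threads", str(threads),
    ]


if __name__ == "__main__":
    # 'testvol' is a hypothetical volume name, not taken from the report.
    print(shlex.join(shd_threads_command("testvol", 4)))
```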

Version-Release number of selected component (if applicable):

Comment 2 nchilaka 2016-11-23 11:31:30 UTC
For example, I have attached some CPU log files.
You can see that 1102 is the self-heal daemon process of the source brick with the shd option at its default of 1 ===> average CPU consumption is <100%
However, with the threads set to 4, the average CPU consumption is ~150% (refer to 3354)
leg_newlog.log ==> top output of the legacy gluster processes
newlog.log ==> with mtsh set to 4
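
An average like the ones quoted above can be computed from captured top output with a short script. This is a minimal sketch, assuming the logs hold default `top -b` process lines (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND). The format of the attached files is not confirmed here, and the sample lines in the script are invented for illustration, not taken from the attachment.

```python
def average_cpu(top_lines, pid):
    """Average the %CPU column over every sample line that belongs to pid."""
    samples = []
    for line in top_lines:
        fields = line.split()
        # Assumed default top layout:
        # PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if len(fields) >= 12 and fields[0] == str(pid):
            try:
                samples.append(float(fields[8]))
            except ValueError:
                continue  # skip lines whose %CPU field is not numeric
    if not samples:
        return None
    return sum(samples) / len(samples)


if __name__ == "__main__":
    # Invented sample lines; real values come from the captured log.
    sample = [
        "3354 root 20 0 1300000 50000 4000 S 160.0 0.6 10:01.00 glusterfs",
        "3354 root 20 0 1300000 50000 4000 S 140.0 0.6 10:04.00 glusterfs",
    ]
    print(average_cpu(sample, 3354))
```

A value above 100 means the process is using more than one core, which matches how top reports multithreaded processes by default.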

Comment 3 nchilaka 2016-11-23 11:32:00 UTC
Created attachment 1223175 [details]
log files

Comment 4 Atin Mukherjee 2016-11-25 08:01:50 UTC
Dev comment: We can't fix this bug at the moment; as per the design, raising the number of threads will definitely consume more CPU. We need to loop in the Perf team to assess the hardware recommendation for MT self-heal usage, which needs to be documented. Pranith will follow up with the Perf team.

In today's triage meeting between Dev, QE & PM, a decision was reached to take this bug out of 3.2.0. More details at https://docs.google.com/spreadsheets/d/1ew4cafcvIVEWuJ4tLDuZ4ao7ZTYpsRz5NwCtQ4JVZaQ/edit#gid=0

Comment 6 Ambarish 2017-03-16 08:55:27 UTC
Seeing similar stuff on physical perf machines.

I see CPU usage shooting up to 150% (and it stays there), though I could still log in to my machines, since the ones I am using are pretty high-end, with 24 cores and 48G RAM.

Comment 8 Ravishankar N 2017-09-27 09:44:54 UTC
Mohit's patch upstream: https://review.gluster.org/#/c/18404/

Comment 12 Vijay Avuthu 2018-05-17 04:58:59 UTC
This bug has been verified as part of bug 1478395.

Changing status to Verified.

Comment 13 Srijita Mukherjee 2018-09-03 15:33:04 UTC
I have updated the doc text. Kindly review and confirm.

Comment 15 errata-xmlrpc 2018-09-04 06:29:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

