Bug 1397798 - MTSH: multithreaded self heal hogs cpu consistently over 150% [NEEDINFO]
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: replicate
Version: 3.2
Hardware: Unspecified  OS: Unspecified
Priority: high  Severity: high
Target Milestone: ---
Target Release: RHGS 3.4.0
Assigned To: Mohit Agrawal
QA Contact: Vijay Avuthu
Keywords: ZStream
Depends On: 1484446
Blocks: 1503134
 
Reported: 2016-11-23 06:25 EST by nchilaka
Modified: 2018-09-17 09:40 EDT
CC: 8 users

See Also:
Fixed In Version: glusterfs-3.12.2-2
Doc Type: Bug Fix
Doc Text:
Some gluster daemons, such as glustershd, show high CPU or memory consumption when there is a large amount of data or many entries to be healed, which puts system resources under strain. You can regulate this by running the control-cpu-load.sh script, which uses control groups (cgroups) to limit the CPU and memory consumption of any gluster daemon.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-04 02:29:44 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
srmukher: needinfo? (moagrawa)


Attachments
log files (529.58 KB, application/x-gzip)
2016-11-23 06:32 EST, nchilaka
no flags


External Trackers
Tracker ID: Red Hat Product Errata RHSA-2018:2607 | Priority: None | Status: None | Summary: None | Last Updated: 2018-09-04 02:31 EDT

Description nchilaka 2016-11-23 06:25:22 EST
Description of problem:
=========================
When we set the shd threads to, say, 4 and wait for the heal,
we can see that the self-heal daemon consistently consumes about 150% CPU on average for as long as the heal runs.
This can really put resources under strain.

For that reason, I am guessing this is also why my systemic-testing machines, where heal is pending and mtsh is set to 4, are so busy (logging kernel hung-task messages) that I can't even log in.
I can only ping the machines.

I know mtsh comes with a resource trade-off, but there must be a cap on it.
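
For reference, a setup like the one described above would typically be configured per volume with the cluster.shd-max-threads option; the volume name "testvol" below is just a placeholder:

# Raise the number of self-heal daemon threads from the default of 1 to 4 (example volume name)
gluster volume set testvol cluster.shd-max-threads 4
# Confirm the current value
gluster volume get testvol cluster.shd-max-threads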


Version-Release number of selected component (if applicable):
===========
3.8.4-5
Comment 2 nchilaka 2016-11-23 06:31:30 EST
For example, I have attached some CPU log files.
You can see that PID 1102 is the self-heal daemon process on the source brick with the shd option at its default of 1 ===> average CPU consumption is <100%.
However, with the threads set to 4, the average CPU consumption is ~150% (refer to PID 3354).
leg_newlog.log ==> top output for the legacy gluster processes
newlog.log ==> with mtsh set to 4
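
One way to collect a comparison like the one above (a sketch only; the pgrep pattern and sampling interval are assumptions, and PIDs will differ per system) is to run top in batch mode against the self-heal daemon:

# Sample the self-heal daemon's CPU usage every 5 seconds for roughly 10 minutes
SHD_PID=$(pgrep -f glustershd | head -n 1)
top -b -d 5 -n 120 -p "$SHD_PID" > shd_cpu_top.log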
Comment 3 nchilaka 2016-11-23 06:32 EST
Created attachment 1223175 [details]
log files
Comment 4 Atin Mukherjee 2016-11-25 03:01:50 EST
Dev comment: We can't fix this bug at the moment; as per the design, raising the number of threads will definitely consume more CPU. We need to loop in the Perf team to assess the hardware recommendation for MT self-heal usage, which needs to be documented. Pranith will follow up with the perf team.

In today's triage meeting between Dev, QE & PM, a decision was made to take this bug out of 3.2.0. More details at https://docs.google.com/spreadsheets/d/1ew4cafcvIVEWuJ4tLDuZ4ao7ZTYpsRz5NwCtQ4JVZaQ/edit#gid=0
Comment 6 Ambarish 2017-03-16 04:55:27 EDT
Seeing similar stuff on physical perf machines.

I see CPU usage shooting up to 150% (and it stays there), though I could still log in to my machines, since the machines I am using are fairly high-end with 24 cores and 48 GB of RAM.
Comment 8 Ravishankar N 2017-09-27 05:44:54 EDT
Mohit's patch upstream: https://review.gluster.org/#/c/18404/
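
As an illustration of the cgroup-based approach described in the Doc Text above (a minimal sketch only, not the actual control-cpu-load.sh script; the cgroup name, the 150% quota, and the pgrep pattern are arbitrary assumptions, and root privileges are required):

# Illustrative only: cap a gluster daemon's CPU with a cgroup v1 cpu controller (typical layout on RHEL 7)
SHD_PID=$(pgrep -f glustershd | head -n 1)   # PID of the self-heal daemon
CG=/sys/fs/cgroup/cpu/gluster_shd_limit      # hypothetical cgroup name
mkdir -p "$CG"
echo 100000 > "$CG/cpu.cfs_period_us"        # 100 ms scheduling period
echo 150000 > "$CG/cpu.cfs_quota_us"         # 150 ms of CPU per period, i.e. roughly 150% of one core
echo "$SHD_PID" > "$CG/cgroup.procs"         # move the daemon into the cgroup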
Comment 12 Vijay Avuthu 2018-05-17 00:58:59 EDT
This bug has been verified as part of bug 1478395.

Changing status to Verified.
Comment 13 Srijita Mukherjee 2018-09-03 11:33:04 EDT
I have updated the doc text. Kindly review and confirm.
Comment 15 errata-xmlrpc 2018-09-04 02:29:44 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607
