Bug 1397798

Summary: MTSH: multithreaded self heal hogs cpu consistently over 150%
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Nag Pavan Chilakam <nchilaka>
Component: replicate
Assignee: Mohit Agrawal <moagrawa>
Status: CLOSED ERRATA
QA Contact: Vijay Avuthu <vavuthu>
Severity: high
Docs Contact:
Priority: high
Version: rhgs-3.2
CC: amukherj, moagrawa, ravishankar, rhinduja, rhs-bugs, sheggodu, srmukher, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.12.2-2
Doc Type: Bug Fix
Doc Text:
Gluster daemons such as glustershd can consume a large amount of CPU or memory when there is a large amount of data or entries to be healed, which puts a strain on system resources. You can address this by running the control-cpu-load.sh script, which uses control groups to regulate the CPU and memory usage of any gluster daemon.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-04 06:29:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1484446    
Bug Blocks: 1503134    
Attachments:
log files (flags: none)

Description Nag Pavan Chilakam 2016-11-23 11:25:22 UTC
Description of problem:
=========================
When we set the shd threads to, say, 4 and wait for the heal, we can see that the self-heal daemon consistently consumes about 150% CPU on average for as long as the heal goes on.
This can really put a strain on resources.

For that reason, I am guessing that is why the machines in my systemic testing, where heal is pending and mtsh is set to 4, are so busy with kernel hung-task messages that I can't even log in.
I can only ping the machines.

I know mtsh comes with a resource trade-off, but there must be a cap on it.
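
For reference, the mtsh thread count mentioned above is controlled by the cluster.shd-max-threads volume option. A minimal sketch of the setup used for this kind of test, with the volume name "testvol" assumed for illustration:

  # raise the self-heal daemon worker threads from the default of 1 to 4
  gluster volume set testvol cluster.shd-max-threads 4
  # confirm the value in effect
  gluster volume get testvol cluster.shd-max-threads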


Version-Release number of selected component (if applicable):
===========
3.8.4-5

Comment 2 Nag Pavan Chilakam 2016-11-23 11:31:30 UTC
For example, I have attached some CPU log files.
You can see that PID 1102 is the self-heal daemon process of the source brick with the shd thread option at its default of 1 ===> average CPU consumption is <100%.
However, with the threads set to 4, the average CPU consumption is ~150% (refer to PID 3354).
leg_newlog.log ==> top output for the legacy gluster processes
newlog.log ==> with mtsh set to 4
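
A minimal sketch of how per-process CPU figures like these can be collected; the PID lookup via pgrep and the 5-second sampling interval are assumptions, not details taken from the attached logs:

  # glustershd runs as a glusterfs process whose command line contains "glustershd"
  SHD_PID=$(pgrep -f glustershd | head -1)
  # sample its CPU usage in batch mode and append to a log file
  top -b -d 5 -p "$SHD_PID" >> shd_cpu.log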

Comment 3 Nag Pavan Chilakam 2016-11-23 11:32:00 UTC
Created attachment 1223175 [details]
log files

Comment 4 Atin Mukherjee 2016-11-25 08:01:50 UTC
Dev comment: We can't fix this bug at the moment; by design, raising the number of threads will definitely eat up more CPU. We need to loop in the Perf team to assess the hardware recommendation for MT self-heal usage, which needs to be documented. Pranith will follow up with the perf team.

A decision was reached in today's triage meeting between Dev, QE & PM to take this bug out of 3.2.0. More details at https://docs.google.com/spreadsheets/d/1ew4cafcvIVEWuJ4tLDuZ4ao7ZTYpsRz5NwCtQ4JVZaQ/edit#gid=0

Comment 6 Ambarish 2017-03-16 08:55:27 UTC
Seeing similar stuff on physical perf machines.

I see CPU usage shooting up to 150% (and it stays there), though I could still log in to my machines, since the ones I am using are pretty high-end with 24 cores and 48 GB of RAM.

Comment 8 Ravishankar N 2017-09-27 09:44:54 UTC
Mohit's patch upstream: https://review.gluster.org/#/c/18404/
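
Per the doc text above, the fix provides a control-cpu-load.sh script that uses control groups to regulate a gluster daemon's CPU. A minimal sketch of the underlying cgroups v1 idea, capping glustershd at roughly 150% of one core; the cgroup name is illustrative and the actual script may work differently:

  # create a dedicated cpu cgroup for the self-heal daemon
  mkdir -p /sys/fs/cgroup/cpu/gluster_shd
  # allow at most 1.5 CPU cores: quota/period = 150000/100000
  echo 100000 > /sys/fs/cgroup/cpu/gluster_shd/cpu.cfs_period_us
  echo 150000 > /sys/fs/cgroup/cpu/gluster_shd/cpu.cfs_quota_us
  # move the running glustershd process into the cgroup
  echo $(pgrep -f glustershd | head -1) > /sys/fs/cgroup/cpu/gluster_shd/cgroup.procs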

Comment 12 Vijay Avuthu 2018-05-17 04:58:59 UTC
This bug has been verified as part of bug 1478395.

Changing status to Verified.

Comment 13 Srijita Mukherjee 2018-09-03 15:33:04 UTC
I have updated the doc text. Kindly review and confirm.

Comment 15 errata-xmlrpc 2018-09-04 06:29:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

Comment 16 Red Hat Bugzilla 2023-09-14 03:34:55 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days