Bug 1397798 - MTSH: multithreaded self heal hogs cpu consistently over 150%
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: replicate
Version: rhgs-3.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Mohit Agrawal
QA Contact: Vijay Avuthu
URL:
Whiteboard:
Depends On: 1484446
Blocks: 1503134
Reported: 2016-11-23 11:25 UTC by Nag Pavan Chilakam
Modified: 2023-09-14 03:34 UTC
CC: 8 users

Fixed In Version: glusterfs-3.12.2-2
Doc Type: Bug Fix
Doc Text:
Some Gluster daemons, such as glustershd, show high CPU or memory consumption when there is a large amount of data or entries to be healed, which can starve other processes of resources. You can mitigate this by running the control-cpu-load.sh script. This script uses control groups (cgroups) to regulate the CPU and memory usage of any Gluster daemon.
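For illustration, a minimal sketch of the cgroup mechanism the script relies on (not the actual contents of control-cpu-load.sh; the group name "glustershd_cap" and the 150% quota are hypothetical), assuming a cgroup v1 CPU controller mounted at /sys/fs/cgroup/cpu:

    # Create a cgroup and cap it at roughly 1.5 CPUs (150%).
    mkdir -p /sys/fs/cgroup/cpu/glustershd_cap
    echo 100000 > /sys/fs/cgroup/cpu/glustershd_cap/cpu.cfs_period_us  # 100 ms scheduling period
    echo 150000 > /sys/fs/cgroup/cpu/glustershd_cap/cpu.cfs_quota_us   # at most 150 ms runtime per period => ~150% cap
    # Move the self-heal daemon's PID(s) into the cgroup.
    for pid in $(pgrep -f glustershd); do
        echo "$pid" > /sys/fs/cgroup/cpu/glustershd_cap/tasks
    done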
Clone Of:
Environment:
Last Closed: 2018-09-04 06:29:44 UTC
Embargoed:


Attachments
log files (529.58 KB, application/x-gzip)
2016-11-23 11:32 UTC, Nag Pavan Chilakam


Links
Red Hat Product Errata RHSA-2018:2607 (Last Updated: 2018-09-04 06:31:17 UTC)

Description Nag Pavan Chilakam 2016-11-23 11:25:22 UTC
Description of problem:
=========================
When we set the shd threads to, say, 4 and wait for the heal, we can see that the self-heal daemon consistently consumes about 150% CPU on average for as long as the heal is running. This can really put resources under strain.

For that reason, I am guessing this is also why the machines in my systemic testing, where heal is pending and MTSH is set to 4, are so busy with kernel hung-task messages that I can't even log in; I can only ping them.

I know MTSH comes with a resource trade-off, but there must be a cap on it.
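For context, the MTSH thread count mentioned above is controlled per volume by the cluster.shd-max-threads option; a minimal example, with "testvol" as a placeholder volume name:

    # Raise the self-heal daemon from the default 1 thread to 4.
    gluster volume set testvol cluster.shd-max-threads 4
    # Verify the current setting.
    gluster volume get testvol cluster.shd-max-threads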


Version-Release number of selected component (if applicable):
===========
3.8.4-5

Comment 2 Nag Pavan Chilakam 2016-11-23 11:31:30 UTC
For example, I have attached some CPU log files.
You can notice that PID 1102 is the self-heal daemon process of the source brick with the shd option set to the default of 1 ===> average CPU consumption is <100%.
However, with the threads set to 4, the average CPU consumption is ~150% (refer to PID 3354).
leg_newlog.log ==> top output of the legacy gluster processes
newlog.log ==> top output with MTSH set to 4
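(Illustration, not from the original attachment.) Per-process CPU logs like the ones above can be captured with batch-mode top; the PID below is the self-heal daemon PID cited in the comment:

    # Sample CPU usage of the shd process every 5 seconds, in batch mode.
    top -b -d 5 -p 3354 >> newlog.log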

Comment 3 Nag Pavan Chilakam 2016-11-23 11:32:00 UTC
Created attachment 1223175 [details]
log files

Comment 4 Atin Mukherjee 2016-11-25 08:01:50 UTC
Dev comment: We can't fix this bug at the moment; as per the design, raising the number of threads will necessarily consume more CPU. We need to loop in the Perf team to assess the hardware recommendations for MT self-heal usage, which need to be documented. Pranith will follow up with the perf team.

A decision was reached in today's triage meeting between Dev, QE, and PM to take this bug out of 3.2.0. More details at https://docs.google.com/spreadsheets/d/1ew4cafcvIVEWuJ4tLDuZ4ao7ZTYpsRz5NwCtQ4JVZaQ/edit#gid=0

Comment 6 Ambarish 2017-03-16 08:55:27 UTC
Seeing similar behaviour on physical perf machines.

I see CPU usage shooting up to 150% (and it stays there), though I could still log in to my machines, since the ones I am using are pretty high-end, with 24 cores and 48 GB of RAM.

Comment 8 Ravishankar N 2017-09-27 09:44:54 UTC
Mohit's patch upstream: https://review.gluster.org/#/c/18404/

Comment 12 Vijay Avuthu 2018-05-17 04:58:59 UTC
This bug has been verified as part of bug 1478395.

Changing status to Verified.

Comment 13 Srijita Mukherjee 2018-09-03 15:33:04 UTC
I have updated the doc text. Kindly review and confirm.

Comment 15 errata-xmlrpc 2018-09-04 06:29:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

Comment 16 Red Hat Bugzilla 2023-09-14 03:34:55 UTC
The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 1000 days.

