Bug 1397798

Summary: MTSH: multithreaded self heal hogs cpu consistently over 150%
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Nag Pavan Chilakam <nchilaka>
Component: replicate
Assignee: Mohit Agrawal <moagrawa>
Status: CLOSED ERRATA
QA Contact: Vijay Avuthu <vavuthu>
Severity: high
Docs Contact:
Priority: high
Version: rhgs-3.2
CC: amukherj, moagrawa, ravishankar, rhinduja, rhs-bugs, sheggodu, srmukher, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.12.2-2
Doc Type: Bug Fix
Doc Text:
Gluster daemons such as glustershd can consume a large amount of CPU or memory when there is a large amount of data or entries to be healed, which puts a strain on system resources. You can address this by running the control-cpu-load.sh script, which uses control groups to regulate the CPU and memory usage of any gluster daemon.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-04 06:29:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1484446    
Bug Blocks: 1503134    
Attachments:
log files (flags: none)

Description Nag Pavan Chilakam 2016-11-23 11:25:22 UTC
Description of problem:
=========================
When we set the shd threads to, say, 4 and wait for the heal, we can see that the self-heal daemon consistently consumes about 150% CPU on average for as long as the heal goes on.
This can really put a strain on resources.

For that reason, I am guessing that is why the machines in my systemic testing, where heal is pending and mtsh is set to 4, are so busy with kernel hung-task messages that I can't even log in.
I can only ping the machines.

I know mtsh comes with a resource trade-off, but there must be a cap on it.
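
For reference, the mtsh thread count mentioned above is controlled by the cluster.shd-max-threads volume option. A minimal sketch of the setup used for this kind of test, with the volume name "testvol" assumed for illustration:

  # raise the self-heal daemon worker threads from the default of 1 to 4
  gluster volume set testvol cluster.shd-max-threads 4
  # confirm the value in effect
  gluster volume get testvol cluster.shd-max-threads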


Version-Release number of selected component (if applicable):
===========
3.8.4-5

Comment 2 Nag Pavan Chilakam 2016-11-23 11:31:30 UTC
For example, I have attached some CPU log files.
You can see that PID 1102 is the self-heal daemon process of the source brick with the shd thread option at its default of 1 ===> average CPU consumption is <100%.
However, with the threads set to 4, the average CPU consumption is ~150% (refer to PID 3354).
leg_newlog.log ==> top output for the legacy gluster processes
newlog.log ==> with mtsh set to 4
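
A minimal sketch of how per-process CPU figures like these can be collected; the PID lookup via pgrep and the 5-second sampling interval are assumptions, not details taken from the attached logs:

  # glustershd runs as a glusterfs process whose command line contains "glustershd"
  SHD_PID=$(pgrep -f glustershd | head -1)
  # sample its CPU usage in batch mode and append to a log file
  top -b -d 5 -p "$SHD_PID" >> shd_cpu.log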

Comment 3 Nag Pavan Chilakam 2016-11-23 11:32:00 UTC
Created attachment 1223175 [details]
log files

Comment 4 Atin Mukherjee 2016-11-25 08:01:50 UTC
Dev comment: We can't fix this bug at the moment; by design, raising the number of threads will definitely eat up more CPU. We need to loop in the Perf team to assess the hardware recommendation for MT self-heal usage, which needs to be documented. Pranith will follow up with the perf team.

A decision was reached in today's triage meeting between Dev, QE & PM to take this bug out of 3.2.0. More details at https://docs.google.com/spreadsheets/d/1ew4cafcvIVEWuJ4tLDuZ4ao7ZTYpsRz5NwCtQ4JVZaQ/edit#gid=0

Comment 6 Ambarish 2017-03-16 08:55:27 UTC
Seeing similar stuff on physical perf machines.

I see CPU usage shooting up to 150% (and it stays there), though I could still log in to my machines, since the ones I am using are pretty high-end with 24 cores and 48 GB of RAM.

Comment 8 Ravishankar N 2017-09-27 09:44:54 UTC
Mohit's patch upstream: https://review.gluster.org/#/c/18404/
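
Per the doc text above, the fix provides a control-cpu-load.sh script that uses control groups to regulate a gluster daemon's CPU. A minimal sketch of the underlying cgroups v1 idea, capping glustershd at roughly 150% of one core; the cgroup name is illustrative and the actual script may work differently:

  # create a dedicated cpu cgroup for the self-heal daemon
  mkdir -p /sys/fs/cgroup/cpu/gluster_shd
  # allow at most 1.5 CPU cores: quota/period = 150000/100000
  echo 100000 > /sys/fs/cgroup/cpu/gluster_shd/cpu.cfs_period_us
  echo 150000 > /sys/fs/cgroup/cpu/gluster_shd/cpu.cfs_quota_us
  # move the running glustershd process into the cgroup
  echo $(pgrep -f glustershd | head -1) > /sys/fs/cgroup/cpu/gluster_shd/cgroup.procs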

Comment 12 Vijay Avuthu 2018-05-17 04:58:59 UTC
This bug has been verified as part of bug 1478395.

Changing status to Verified.

Comment 13 Srijita Mukherjee 2018-09-03 15:33:04 UTC
I have updated the doc text. Kindly review and confirm.

Comment 15 errata-xmlrpc 2018-09-04 06:29:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

Comment 16 Red Hat Bugzilla 2023-09-14 03:34:55 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days