1478395 – Extreme Load from self-heal

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	replicate
Sub Component:
Version:	rhgs-3.2
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	RHGS 3.4.0
Assignee:	Mohit Agrawal
QA Contact:	Vijay Avuthu
Docs Contact:
URL:
Whiteboard:
Depends On:	1484446
Blocks:	RHGS-3.4-GSS-proposed-tracker 1496334 1496335 1503135

Reported:	2017-08-04 13:37 UTC by Matthew Robson
Modified:	2020-12-14 09:21 UTC
CC List:	12 users
Fixed In Version:	glusterfs-3.12.2-2
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1496334 1496335
Environment:
Last Closed:	2018-09-04 06:34:23 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2018:2607	0	None	None	None	2018-09-04 06:35:54 UTC

Description Matthew Robson 2017-08-04 13:37:17 UTC

Description of problem:

Upgraded gluster servers to RHEL 7.4 and upgraded Gluster from RHGS 3.1.3 to RHGS 3.2.

The gluster process was stopped during the gluster package upgrade.

After the upgrade, restarting gluster led to load averages over 1000 on the server.

top - 14:30:59 up  4:27,  3 users,  load average: 1003.32, 714.06, 356.68

This caused services backed by gluster to time out and lose connectivity to both gluster servers.

There are approximately 300 bricks in a 1 x 2 replica on the server, all backing OpenShift PVs.

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-12 to 3.8.4-18.6.el7rhgs

How reproducible:

Happened 4 out of 4 times when gluster was restarted.

Steps to Reproduce:
1. Shutdown gluster
2. Waiting period ~15 minutes
3. Start Gluster

Actual results:

Unmanageable load.


Expected results:

Server to manage the heals and maintain its availability to services.


Additional info:

Comment 4 Karthik U S 2017-08-07 05:11:26 UTC

Adding Mohit, since he is working on cgroup-based IO throttling, which is targeted at solving this kind of problem. He should be able to give a workaround for this.

Comment 5 Amar Tumballi 2017-08-07 07:04:21 UTC

@Matthew,

> Steps to Reproduce:
> 1. Shutdown gluster
> 2. Waiting period ~15 minutes
> 3. Start Gluster

During 2) above, was there a lot of I/O happening on clients?

Ravi/Karthik,

This is a case where there are way too many volumes and there is an SHD process for each one of them. I guess this is similar to the one Pranith was talking about earlier. Can we run just 1 shd per machine? That should technically solve this problem.

Comment 6 Ravishankar N 2017-08-07 09:38:47 UTC

(In reply to Amar Tumballi from comment #5)
> Ravi/Karthik,
> 
> This is a case where there are way too many volumes and there is an SHD
> process for each one of them. I guess this is similar to the one Pranith
> was talking about earlier. Can we run just 1 shd per machine? That should
> technically solve this problem.

There is only one instance of shd per machine (node), irrespective of the number of bricks or volumes that node is hosting. Also, you can enable or disable the shd on a per-volume basis (`gluster volume set <volname> self-heal-daemon disable/enable`). The problem is that heal is launched using the synctask framework, which causes a lot of parallel heals to be processed. This is where Pranith wanted to modify the synctask so that it heals only 1 file at a time (instead of picking up the 2nd file while we await the cbk of the 1st file, etc.).
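A minimal sketch of applying the per-volume toggle quoted above across many volumes. The helper name `shd_toggle_cmds` is hypothetical, and it only prints the commands so they can be reviewed first. Note that disabling the shd also stops automatic heals for that volume, so this is an illustration of the option rather than a recommended fix.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (do not run) one "gluster volume set" command
# per volume name read from stdin. $1 must be "enable" or "disable".
shd_toggle_cmds() {
    local action="$1" vol
    case "$action" in
        enable|disable) ;;
        *) echo "usage: shd_toggle_cmds enable|disable" >&2; return 1 ;;
    esac
    while IFS= read -r vol; do
        [ -n "$vol" ] || continue   # skip blank lines
        printf 'gluster volume set %s self-heal-daemon %s\n' "$vol" "$action"
    done
}

# Example (assumes the gluster CLI is installed on the node):
#   gluster volume list | shd_toggle_cmds disable        # review the commands
#   gluster volume list | shd_toggle_cmds disable | sh   # then run them
```

Printing the commands instead of executing them keeps the helper safe to try on a node that hosts hundreds of volumes.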

If a lot of files were modified in the 15 minutes on the 300 bricks, then high shd load is expected (but perhaps undesired) behaviour. I think controlling the CPU usage of shd using cgroups would be an effective way to regulate heal and bring down the load.

Comment 7 Ravishankar N 2017-08-07 09:42:15 UTC

(In reply to Karthik U S from comment #4)
> Adding Mohit, since he is working on cgroup-based IO throttling, which is
> targeted at solving this kind of problem. He should be able to give a
> workaround for this.

Mohit, would you be able to provide the steps to use cgroup cpu accounting to control glustershd only? Maybe a KCS article would be great. Thank you!

Comment 18 Ravishankar N 2017-09-27 09:43:56 UTC

Mohit's patch upstream: https://review.gluster.org/#/c/18404/

Comment 25 Nag Pavan Chilakam 2018-04-04 12:34:44 UTC

Test version: 3.12-2-6

Moving to verified, based on my testing in the comment at https://bugzilla.redhat.com/show_bug.cgi?id=1406363#c12

I also tested memory consumption management, as shown below.
However, I noticed that memory consumption doesn't go back to the set limit;
the kernel only reports the non-compliance, as shown below, since OOM killing is disabled by default (I will raise a new bug for this).
[root@dhcp37-174 ~]# cat /sys/fs/cgroup/memory/system.slice/glusterd.service/cgroup_gluster_26704/memory.failcnt 
39970

So the script is working as expected: once memory consumption crosses the limit, we can see it in the counter above, which was not previously possible.



[root@dhcp37-174 ~]# top -n 1 -b|grep gluster
26704 root      20   0 2532892 119024   4936 S 125.0  1.5   1:36.07 glusterfsd
 4047 root      20   0  680856  13944   4392 S   0.0  0.2   0:44.89 glusterd
26740 root      20   0 1318488  58020   3220 S   0.0  0.7   0:02.86 glusterfs
[root@dhcp37-174 ~]# cd /usr/share/
[root@dhcp37-174 share]# cd glusterfs/scripts/
[root@dhcp37-174 scripts]# ls
control-cpu-load.sh    get-gfid.sh                       schedule_georep.pyc
control-mem.sh         gsync-sync-gfid                   schedule_georep.pyo
eventsdash.py          gsync-upgrade.sh                  slave-upgrade.sh
eventsdash.pyc         post-upgrade-script-for-quota.sh  stop-all-gluster-processes.sh
eventsdash.pyo         pre-upgrade-script-for-quota.sh
generate-gfid-file.sh  schedule_georep.py
[root@dhcp37-174 scripts]# ./control-mem.sh

Enter Any gluster daemon pid for that you want to control MEMORY.
26704
If you want to continue the script to attach daeomon with new cgroup. Press (y/n)?y
yes
Creating child cgroup directory 'cgroup_gluster_26704 cgroup' for glusterd.service.
Enter Memory value in Mega bytes [100,8000000000000]:
110
Entered memory limit value is 110.
Setting 115343360 to memory.limit_in_bytes for /sys/fs/cgroup/memory/system.slice/glusterd.service/cgroup_gluster_26704.
Tasks are attached successfully specific to 26704 to cgroup_gluster_26704.
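The value written to memory.limit_in_bytes in the transcript above is consistent with 110 x 1024 x 1024. A small sketch of that arithmetic, plus a reader for the memory.failcnt counter, follows. The function names are hypothetical, and the cgroup v1 memory controller layout seen in the paths above is assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a limit in megabytes to the byte value
# that would be written to memory.limit_in_bytes.
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Hypothetical helper: print how many times a cgroup hit its memory limit.
# $1 is the cgroup directory (cgroup v1 memory controller assumed).
memcg_failcnt() {
    cat "$1/memory.failcnt"
}

# Example, using the path from the transcript above:
#   mb_to_bytes 110
#   memcg_failcnt /sys/fs/cgroup/memory/system.slice/glusterd.service/cgroup_gluster_26704
```

A failcnt that keeps growing indicates the process is repeatedly hitting the limit.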

Comment 27 Vijay Avuthu 2018-04-17 07:32:19 UTC

Update:
=========

Build Used: glusterfs-3.12.2-7.el7rhgs.x86_64

Scenario :

1. Create a 2 * 3 volume and start it
2. Kill one of the bricks
3. Create 100K files from the client
4. Bring the down brick back up
5. CPU load will be high as heal starts
6. Using cgroups, control the CPU load
7. Monitor the CPU load using the top command
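For the monitoring step, a hedged sketch of pulling the %CPU value for a single PID out of batch-mode top output. The helper name `cpu_of_pid` is hypothetical, and the default top column layout (PID in column 1, %CPU in column 9, as in the listings in this comment) is assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `top -b -n 1` output on stdin and print the
# %CPU column (9th field in the default layout) for the PID given as $1.
cpu_of_pid() {
    awk -v pid="$1" '$1 == pid { print $9 }'
}

# Example:
#   top -b -n 1 | cpu_of_pid 13962
```

Running it before and after applying the cgroup limit gives the two numbers to compare.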

> Before triggering control-cpu-load.sh, below is the CPU load


top - 03:14:08 up 7 days, 22:00,  3 users,  load average: 0.65, 0.30, 0.15
Tasks: 293 total,   1 running, 292 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.6 us,  7.2 sy,  0.0 ni, 86.8 id,  0.0 wa,  0.0 hi,  0.3 si,  0.2 st
KiB Mem :  7911208 total,  6065196 free,   362916 used,  1483096 buff/cache
KiB Swap:  8126460 total,  8126460 free,        0 used.  7060660 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND             
13962 root      20   0 1880588  70688   4808 S  56.3  0.9   1:25.44 glusterfsd 

> Applied cgroups using control-cpu-load.sh

# ./control-cpu-load.sh 
Enter gluster daemon pid for which you want to control CPU.
13962
pid 13962 is attached with glusterd.service cgroup.
If you want to continue the script to attach 13962 with new cgroup_gluster_13962 cgroup Press (y/n)?y
yes
Creating child cgroup directory 'cgroup_gluster_13962 cgroup' for glusterd.service.
Enter quota value in range [10,100]:  
10
Entered quota value is 10
Setting 10000 to cpu.cfs_quota_us for gluster_cgroup.
Tasks are attached successfully specific to 13962 to cgroup_gluster_13962.
#
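The quota of 10 entered above became 10000 in cpu.cfs_quota_us. That matches 10% of the kernel's default cpu.cfs_period_us of 100000 microseconds. The sketch below shows the arithmetic; the function name is hypothetical, and it is an assumption that the script relies on the default period.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a CPU quota percentage into a
# cpu.cfs_quota_us value.
# $1: percent of one CPU; $2: period in microseconds (defaults to 100000,
# the kernel default for cpu.cfs_period_us).
quota_us() {
    local percent="$1" period="${2:-100000}"
    echo $(( percent * period / 100 ))
}

# Example:
#   quota_us 10    # value for a 10% quota with the default period
```

With this quota the cgroup may run for 10000 microseconds in every 100000 microsecond period, which is about 10% of one CPU.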

> After triggering control-cpu-load.sh, below is CPU load


top - 03:15:45 up 7 days, 22:02,  3 users,  load average: 0.62, 0.40, 0.20
Tasks: 295 total,   2 running, 293 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.8 us,  1.1 sy,  0.0 ni, 97.9 id,  0.0 wa,  0.0 hi,  0.1 si,  0.2 st
KiB Mem :  7911208 total,  5966992 free,   387844 used,  1556372 buff/cache
KiB Swap:  8126460 total,  8126460 free,        0 used.  6995304 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND             
13962 root      20   0 1880588  70688   4808 S  10.2  0.9   2:14.68 glusterfsd  
 

Moving bug to Verified

Comment 29 errata-xmlrpc 2018-09-04 06:34:23 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607
