Bug 1520767 - 500%-600% CPU utilisation when one brick is down in EC volume
Summary: 500%-600% CPU utilisation when one brick is down in EC volume
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: disperse
Version: rhgs-3.3
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Ashish Pandey
QA Contact: nchilaka
URL:
Whiteboard:
Depends On:
Blocks: 1503137
 
Reported: 2017-12-05 05:58 UTC by Karan Sandha
Modified: 2018-09-17 11:32 UTC (History)
5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-04 06:39:49 UTC




Links
Red Hat Product Errata RHSA-2018:2607 (last updated 2018-09-04 06:41:33 UTC)

Description Karan Sandha 2017-12-05 05:58:52 UTC
Description of problem:
High CPU utilisation when one of the bricks is killed in an EC volume

Version-Release number of selected component (if applicable):
3.8.4-52

How reproducible:
Tested only once

Steps to Reproduce:
1. Create an EC volume of 24 x (4+2).
2. Start the volume and run a CCTV workload from 4 Windows clients, e.g. Milestone's XProtect.
3. Kill one brick of the volume.
4. Monitor CPU utilisation with the top command.
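The steps above can be sketched as follows. This is a sketch only: the volume name, the six server hostnames (server1..server6), and the brick paths are my assumptions, not taken from this report.

```shell
# Create a distributed-disperse volume: 24 subvolumes of (4+2) = 144 bricks,
# spread round-robin across 6 hypothetical servers.
gluster volume create ecvol disperse-data 4 redundancy 2 \
    $(for i in $(seq 1 144); do echo -n "server$(( (i-1) % 6 + 1 )):/bricks/brick$i "; done) \
    force

gluster volume start ecvol

# While the client workload runs, kill one brick process
# (find its PID in the output of volume status).
gluster volume status ecvol
kill -9 <brick-pid>

# Monitor CPU utilisation of the gluster processes.
top -b -n 1 | grep gluster
```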

Actual results:
1) 500%-600% CPU utilisation was seen.

2) Second observation: even when all the bricks are up, glusterfsd takes 150% CPU utilisation.

Expected results:
This amount of CPU utilisation shouldn't be observed. 

Additional info:

The software populates the volume with 16 MB medium-sized files from 4 Windows clients.

performance.parallel-readdir on
performance.readdir-ahead on
performance.quick-read off
performance.io-cache off
nfs.disable on
transport.address-family inet
features.cache-invalidation on
features.cache-invalidation-timeout 600
performance.stat-prefetch on
performance.cache-invalidation on
performance.md-cache-timeout 600
network.inode-lru-limit 200000
performance.nl-cache on
performance.nl-cache-timeout 600
cluster.lookup-optimize on
server.event-threads 4
client.event-threads 6
performance.cache-samba-metadata on
performance.client-io-threads on
cluster.readdir-optimize on
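The options above are tuned per volume with `gluster volume set`; a sketch, assuming the volume name `ecvol` (a few representative options shown, the rest follow the same pattern):

```shell
gluster volume set ecvol performance.parallel-readdir on
gluster volume set ecvol server.event-threads 4
gluster volume set ecvol client.event-threads 6
gluster volume set ecvol performance.client-io-threads on
# ...apply the remaining options from the list above the same way.

# Verify the effective settings.
gluster volume get ecvol all | grep -E 'event-threads|parallel-readdir'
```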

Comment 9 nchilaka 2018-05-30 10:46:34 UTC
Using the CPU control script I was able to control CPU consumption of shd
(however, note this is a workaround and not an actual fix, as already detailed above).
Moving the BZ to verified.
Test version: 3.12.2-11

[root@dhcp35-97 scripts]# 30
-bash: 30: command not found
[root@dhcp35-97 scripts]# top -n 1 -b|egrep "glusterfs$|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
14882 root      20   0 3089444 157524   3712 S 288.2  2.0   7:12.13 glusterfs
14872 root      20   0  538516   9612   3592 S   0.0  0.1   0:00.17 glusterfs
[root@dhcp35-97 scripts]# top -n 1 -b|egrep "glusterfs$|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
14882 root      20   0 3089436 153312   3712 S 244.4  1.9   7:25.34 glusterfs
14872 root      20   0  538516   9612   3592 S   0.0  0.1   0:00.17 glusterfs
[root@dhcp35-97 scripts]# ./control-cpu-load.sh 
Enter gluster daemon pid for which you want to control CPU.
^C
[root@dhcp35-97 scripts]# ./control-cpu-load.sh 
Enter gluster daemon pid for which you want to control CPU.

Entered daemon_pid is not numeric so Rerun the script.
[root@dhcp35-97 scripts]# ./control-cpu-load.sh 
Enter gluster daemon pid for which you want to control CPU.
14882
If you want to continue the script to attach 14882 with new cgroup_gluster_14882 cgroup Press (y/n)?
invalid
[root@dhcp35-97 scripts]# ./control-cpu-load.sh 
Enter gluster daemon pid for which you want to control CPU.

Entered daemon_pid is not numeric so Rerun the script.
[root@dhcp35-97 scripts]# ./control-cpu-load.sh 
Enter gluster daemon pid for which you want to control CPU.
14882
If you want to continue the script to attach 14882 with new cgroup_gluster_14882 cgroup Press (y/n)?y
yes
Creating child cgroup directory 'cgroup_gluster_14882 cgroup' for glusterd.service.
Enter quota value in range [10,100]:  
50
Entered quota value is 50
Setting 50000 to cpu.cfs_quota_us for gluster_cgroup.
Tasks are attached successfully specific to 14882 to cgroup_gluster_14882.
[root@dhcp35-97 scripts]# top -n 1 -b|egrep "glusterfs$|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
14882 root      20   0 3089488 157564   3712 S  58.8  2.0   8:27.61 glusterfs
14872 root      20   0  538516   9612   3592 S   0.0  0.1   0:00.17 glusterfs
[root@dhcp35-97 scripts]# 14882
-bash: 14882: command not found
[root@dhcp35-97 scripts]# top -n 1 -b|egrep "glusterfs$|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
14882 root      20   0 3089456 159556   3712 S  50.0  2.0   9:01.75 glusterfs
14872 root      20   0  538516   9612   3592 S   0.0  0.1   0:00.18 glusterfs
[root@dhcp35-97 scripts]# 14882
-bash: 14882: command not found
[root@dhcp35-97 scripts]# 14882
-bash: 14882: command not found
[root@dhcp35-97 scripts]# top -n 1 -b|egrep "glusterfs$|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
14882 root      20   0 3089480 153672   3712 S  33.3  1.9   9:04.91 glusterfs
14872 root      20   0  538516   9612   3592 S   0.0  0.1   0:00.18 glusterfs
[root@dhcp35-97 scripts]# pwd
/usr/share/glusterfs/scripts
[root@dhcp35-97 scripts]# ^C
[root@dhcp35-97 scripts]#
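The quota arithmetic behind the transcript is straightforward: with the kernel's default CFS period of 100000 microseconds, a quota of N percent maps to cpu.cfs_quota_us = N * 1000, which is why entering 50 above produced "Setting 50000 to cpu.cfs_quota_us". A minimal sketch of that mapping (the function name is mine, not from the script):

```python
CFS_PERIOD_US = 100000  # default kernel CFS bandwidth period, in microseconds

def quota_to_cfs_quota_us(quota_percent: int) -> int:
    """Map a percentage quota in [10, 100] to a cpu.cfs_quota_us value."""
    if not 10 <= quota_percent <= 100:
        raise ValueError("quota must be in range [10, 100]")
    return quota_percent * CFS_PERIOD_US // 100

print(quota_to_cfs_quota_us(50))  # 50000, as seen in the transcript
```

With the quota set to 50, the cgroup caps the attached glusterfs process at roughly half a CPU, consistent with the drop from ~288% to ~50% in the top output above.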

Comment 11 errata-xmlrpc 2018-09-04 06:39:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

