Bug 1496335

Summary:	Extreme Load from self-heal
Product:	[Community] GlusterFS	Reporter:	Mohit Agrawal <moagrawa>
Component:	replicate	Assignee:	Mohit Agrawal <moagrawa>
Status:	CLOSED CURRENTRELEASE	QA Contact:
Severity:	high	Docs Contact:
Priority:	high
Version:	mainline	CC:	amukherj, aspandey, atumball, bkunal, bugs, ksubrahm, moagrawa, mrobson, nchilaka, ravishankar, rhs-bugs, sheggodu, storage-qa-internal, vbellur
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	glusterfs-4.0.0	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1478395	Environment:
Last Closed:	2018-03-15 11:17:56 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1478395, 1484446
Bug Blocks:	1369781

Description Mohit Agrawal 2017-09-27 06:00:20 UTC

+++ This bug was initially created as a clone of Bug #1478395 +++

Description of problem:

Upgraded gluster servers to RHEL 7.4 and upgraded Gluster from RHGS 3.1.3 to RHGS 3.2.

The gluster process was stopped during the gluster package upgrade.

After the upgrade, restarting gluster led to load averages over 1000 on the server.

top - 14:30:59 up  4:27,  3 users,  load average: 1003.32, 714.06, 356.68

This caused services backed by gluster to time out and lose connectivity to both gluster servers.

There are approximately 300 bricks in a 1 x 2 replica on the server, all backing OpenShift PVs.

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-12 to 3.8.4-18.6.el7rhgs

How reproducible:

Happened 4 out of 4 times when gluster was restarted.

Steps to Reproduce:
1. Shutdown gluster
2. Waiting period ~15 minutes
3. Start Gluster

Actual results:

Unmanageable load.


Expected results:

Server to manage the heals and maintain its availability to services.


Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-08-04 09:37:21 EDT ---

This bug is automatically being proposed for the current release of Red Hat Gluster Storage 3 under active development, by setting the release flag 'rhgs-3.3.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Matthew Robson on 2017-08-04 09:41:11 EDT ---

This was a production outage for the BC Government's OpenShift environment.  All of the production services which have a storage dependency on Gluster-backed persistent volumes timed out until the self-heals finished and the load stabilized.

This was roughly 30 minutes in total.

Most of the bricks are 1 or 5 GB with 2 large ones at 100 and 200GB.

This also happened when bringing gluster back after restarting the server during the RHEL 7.4 upgrade.

--- Additional comment from Atin Mukherjee on 2017-08-06 14:11:05 EDT ---

Karthick - could you take a look at this issue?

--- Additional comment from Karthik U S on 2017-08-07 01:11:26 EDT ---

Adding Mohit, since he is working on cgroup-based IO throttling, which is targeted at solving this kind of problem. He should be able to suggest a workaround for this.

--- Additional comment from Amar Tumballi on 2017-08-07 03:04:21 EDT ---

@Matthew,

> Steps to Reproduce:
> 1. Shutdown gluster
> 2. Waiting period ~15 minutes
> 3. Start Gluster

During 2) above, was there a lot of I/O happening on clients?

Ravi/Karthik,

This is the case where there are way too many volumes and there is an SHD process for each one of them. I guess this is similar to the one Pranith was talking about earlier. Can we run just 1 shd per machine? That should technically solve this problem.

--- Additional comment from Ravishankar N on 2017-08-07 05:38:47 EDT ---

(In reply to Amar Tumballi from comment #5)
> Ravi/Karthik,
> 
> This was the case, where there are way too many volumes and there is an SHD
> process for each one of them. I guess this is similar to one which Pranith
> was talking earlier. Can we run just 1 shd per machine? that should
> technically solve this problem.

There is only one instance of shd per machine (node) irrespective of the number of bricks or volumes that node is hosting. Also, you can enable/disable the shd on a per-volume basis (`gluster volume set <volname> self-heal-daemon disable/enable`). The problem is that heal is launched using the synctask framework, which causes a lot of parallel heals to be processed. This is where Pranith wanted to modify the synctask so that it heals only 1 file at a time (instead of picking up the 2nd file while we await the cbk of the 1st file, etc.).

If a lot of files were modified in the 15 minutes on the 300 bricks, then high shd load is expected (but perhaps undesired) behaviour. I think controlling the CPU usage of shd using cgroups would be an effective way to regulate heal and bring down the load.

--- Additional comment from Ravishankar N on 2017-08-07 05:42:15 EDT ---

(In reply to Karthik U S from comment #4)
> Adding Mohit, since he is working on cgroup based IO throttling, which is
> targeted to solve this kind of problem. He should be able to give some
> workaround for this.

Mohit, would you be able to provide the steps to use cgroup cpu accounting to control glustershd only? Maybe a KCS article would be great. Thank you!

--- Additional comment from Mohit Agrawal on 2017-08-07 23:22:11 EDT ---

Hi,

To use cgroups for a specific gluster daemon (selfheald etc.), you first have to change the glusterd unit file and restart the services (glusterd and volumes).
   
   Before applying these steps in a production environment, please run them in a test environment first,
   and apply them to your production cluster only if they work fine there.

   1) Add the directive "Delegate=yes" to the [Service] section of the glusterd unit file, as below:

[Service]
Type=forking
PIDFile=/var/run/glusterd.pid
LimitNOFILE=65536
Delegate=yes
Environment="LOG_LEVEL=INFO"
EnvironmentFile=-${prefix}/etc/sysconfig/glusterd
ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid  --log-level $LOG_LEVEL $GLUSTERD_OPTIONS
KillMode=process
SuccessExitStatus=15
    
   2) Stop the gluster services (kill the volume processes as well).
   3) Reload the systemd configuration:
      systemctl daemon-reload
   4) Start the glusterd service.
   5) Create a cgroup under each subsystem (cpu,cpuacct, memory etc.) that you want to restrict for the specific daemon.
      Here I am controlling CPU for selfheald, so I created a mycgroup directory in that subsystem:
      mkdir -p /sys/fs/cgroup/cpu,cpuacct/system.slice/glusterd.service/mycgroup
   6) Set the quota for this cgroup as below:
      echo 25000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/glusterd.service/mycgroup/cpu.cfs_quota_us
      cpu.cfs_quota_us is the total run-time (in microseconds) available to the cgroup within one period.
      The period length is 100 ms and 25000 represents 25 ms, so the kernel will give these tasks
      25 ms of CPU time in every 100 ms.
   7) Attach the tasks of the daemon whose CPU usage you want to control.
      Here I am controlling the self-heal daemon. Suppose the pid of selfheald is 576; run the command below to move
      all selfheald tasks to mycgroup:
      for thid in `ps -T -p 576 | grep gluster | awk -F " " '{print $2}'`; do echo $thid > /sys/fs/cgroup/cpu,cpuacct/system.slice/glusterd.service/mycgroup/tasks ; done
   
   8) After attaching all selfheald tasks to mycgroup, you can verify that they were attached by checking the
      glusterd.service/tasks file in the same hierarchy: the tasks will have moved from the glusterd tasks file
      to the mycgroup tasks file.
   9) Check the top output; the CPU usage of selfheald will be lower.
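
The quota arithmetic in step 6 can be sketched as a small shell helper. This is only an illustration and not part of the original steps: the function name cpu_quota_us is hypothetical, and the period defaults to the 100 ms (100000 microseconds) described in step 6 unless another value is passed.

```shell
# Hypothetical helper (not from the original steps): convert a CPU percentage
# into a cpu.cfs_quota_us value for a given period in microseconds.
# The default period of 100000 us matches the 100 ms period described in step 6.
cpu_quota_us() {
    percent=$1
    period_us=${2:-100000}
    echo $(( percent * period_us / 100 ))
}

# Example (needs root and the mycgroup directory from step 5):
# cpu_quota_us 25 > /sys/fs/cgroup/cpu,cpuacct/system.slice/glusterd.service/mycgroup/cpu.cfs_quota_us
```

With the default period, cpu_quota_us 25 yields the 25000 used in step 6, i.e. a quarter of one CPU for all tasks in the cgroup combined.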

    Let me know if you face any problems running the steps.

   
Regards
Mohit Agrawal

--- Additional comment from Matthew Robson on 2017-08-08 12:59:59 EDT ---

That looks interesting, thanks.

1) Is it something we have tested or are looking to incorporate into a future version of gluster? The issue with the above is that they would need to take another outage to restart gluster to implement the changes.

2) How would the above apply to CNS?

3) Is it possible for QE to validate a similar scenario in the context of RHGS and/or CNS? The customer will be deploying a new CNS 3.5 cluster in the next month or so, migrating from the current standalone gluster cluster.

The fear they have is that if they evacuate or lose a pod, will the self-heal of hundreds of bricks begin to impact the entire Openshift cluster when it comes back.

Is there a good way to measure IO across all of the bricks? I can ask them to collect some data around the same time of day the upgrade was run. They were certainly all in use... Some are production and some are only dev / test.

Are there other ways to mitigate this?  If they added a 3rd node in the cluster and went 1 x 3 for their bricks... Would that in any way alleviate some load or make it worse?

If they set up 2 more gluster servers (4 in the pool), could they migrate 50% of the bricks over to the new servers to reduce overall load on each server?

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-08-09 00:40:02 EDT ---

This BZ having been considered, and subsequently not approved to be fixed at the RHGS 3.3.0 release, is being proposed for the next minor release of RHGS

--- Additional comment from Ravishankar N on 2017-08-09 04:49:24 EDT ---

(In reply to Matthew Robson from comment #9)
> That looks interesting, thanks.
> 
> 1) Is it something we have tested or are looking to incorporate into a
> future version of gluster? The issue with the above is that they would need
> to take another outage to restart gluster to implement the changes.

Mohit had tested it in his dev setup. I think the initial idea is to get a KCS article on this. See the discussion @ http://post-office.corp.redhat.com/archives/team-quine-afr/2017-August/thread.html.

> 
> 2) How would the above apply to CNS?

If cgroups is available, it should work.
> 
> 3) Is it possible for QE to validate a similar scenario in the context of
> RHGS and/or CNS. The customer will be deploying a new CNS 3.5 cluster in the
> next month or so, migrating from the current standalone gluster cluster.
> 

Same answer as 1. I think Rahul would be in a better position to answer this.

> The fear they have is that if they evacuate or lose a pod, will the
> self-heal of hundreds of bricks begin to impact the entire Openshift cluster
> when it comes back.
> 
> Is there a good way to measure IO across all of the bricks? I can ask them
> to collect some data around the same time of day the upgrade was run. There
> were certainly all in use... Some are production and some are only dev /
> test.

`gluster volume profile` can be used. See the admin guide: https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.2/html/administration_guide/chap-monitoring_red_hat_storage_workload
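
A minimal sketch of collecting profile data across all volumes follows. It assumes the `gluster volume list` and `gluster volume profile <volname> start|info|stop` subcommands are available on the node; the function name and the 60-second default sampling window are illustrative choices, not something taken from the admin guide.

```shell
# Sketch: start profiling on every volume, wait, then print and stop it.
# The function name and the default interval are illustrative assumptions.
profile_all_volumes() {
    interval=${1:-60}
    volumes=$(gluster volume list) || return 1
    for vol in $volumes; do
        gluster volume profile "$vol" start
    done
    # Let the bricks accumulate I/O statistics for the sampling window.
    sleep "$interval"
    for vol in $volumes; do
        gluster volume profile "$vol" info
        gluster volume profile "$vol" stop
    done
}

# Example: sample for five minutes and keep the output for later comparison.
# profile_all_volumes 300 > /tmp/gluster-profile.txt
```

Profiling adds some overhead of its own, so with hundreds of volumes it may be preferable to sample only a subset.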

 
> 
> Are there other ways to mitigate this?  If they added a 3rd node in the
> cluster and went 1 x 3 for their bricks... Would that in any way alleviate
> some load or make it worse?

I don't think it would make a difference to self-heal. Whatever node you stop/reboot, data that was modified needs to be synced to that brick. Except now you will have two good copies which the self-heal daemons will use to sync from.


> 
> If they set up 2 more gluster servers (4 in the pool), could they migrate
> 50% of the bricks over to the new servers to reduce overall load on each
> server?

Yes, I think distributing the data across a distributed-replicate volume (as opposed to 1 x 2) means less data to heal per brick.

--- Additional comment from Mohit Agrawal on 2017-08-16 02:57:28 EDT ---

Hi,

I discussed this with the systemd developer team, and they suggested another way to achieve the same result without restarting the daemons.
I tried it on my dev setup and found that we can use the cgroup capabilities on a process without restarting it.

To use cgroups for a specific glusterd internal daemon (e.g. selfheald), I followed the steps below:

   1) Run the command below to set the CPUShares property on glusterd (we set it to the default value so that it
      has no impact on any other daemon):
      systemctl set-property glusterd.service CPUShares=1024
      This creates a separate cgroup directory structure in the cpu controller:
       /sys/fs/cgroup/cpu,cpuacct/system.slice/glusterd.service
   2) Create a cgroup under each subsystem (cpu,cpuacct, memory etc.) that you want to restrict for the specific daemon.
      Here I am controlling CPU for selfheald, so I created a mycgroup directory in that subsystem:
      mkdir -p /sys/fs/cgroup/cpu,cpuacct/system.slice/glusterd.service/mycgroup
   3) Set the quota for this cgroup as below:
      echo 25000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/glusterd.service/mycgroup/cpu.cfs_quota_us
      cpu.cfs_quota_us is the total run-time (in microseconds) available to the cgroup within one period.
      The period length is 100 ms and 25000 represents 25 ms, so the kernel will give these tasks
      25 ms of CPU time in every 100 ms.
   4) Attach the tasks of the daemon whose CPU usage you want to control.
      Here I am controlling the self-heal daemon. Suppose the pid of selfheald is 576; run the command below to move
      all selfheald tasks to mycgroup:
      for thid in `ps -T -p 576 | grep gluster | awk -F " " '{print $2}'`; do echo $thid > /sys/fs/cgroup/cpu,cpuacct/system.slice/glusterd.service/mycgroup/tasks ; done
   
   5) After attaching all selfheald tasks to mycgroup, you can verify that they were attached by checking the
      glusterd.service/tasks file in the same hierarchy: the tasks will have moved from the glusterd tasks file
      to the mycgroup tasks file.
   6) Check the top output; the CPU usage of selfheald will be lower.
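
As an alternative to the ps/grep/awk pipeline in step 4, the thread IDs can be read directly from /proc. The sketch below is only an illustration under that assumption: the function name attach_threads and its arguments are hypothetical, and it writes one thread ID at a time to the tasks file of the target cgroup, as the loop in step 4 does.

```shell
# Sketch: move every thread of a process into a cgroup (v1) by writing each
# thread ID to the cgroup's tasks file. Names are illustrative; thread IDs are
# taken from /proc/<pid>/task instead of parsing ps output.
attach_threads() {
    pid=$1
    cgroup_dir=$2
    for task in /proc/"$pid"/task/*; do
        # Skip the unexpanded glob if the pid does not exist.
        [ -d "$task" ] || continue
        # The directory name is the thread ID; write one ID per write.
        echo "${task##*/}" > "$cgroup_dir/tasks" || return 1
    done
}

# Example (needs root; 576 is the example selfheald pid used above):
# attach_threads 576 /sys/fs/cgroup/cpu,cpuacct/system.slice/glusterd.service/mycgroup
```

Threads created after the loop has run are not covered by this sketch, so it may need to be rerun.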

    Let me know if you face any problems running the steps.

   
Regards
Mohit Agrawal

--- Additional comment from Matthew Robson on 2017-08-23 10:13:47 EDT ---

Thanks, I will see if we can test this.

@Bipin - I see you marked this for RHGS 3.4, do you know how this may translate into the product in 3.4 and will it also translate into the RHGS images for CNS?

--- Additional comment from Matthew Robson on 2017-08-23 10:23:20 EDT ---

Additional question Mohit - 

Step 4: Attach the daemon tasks for that you want to control CPU here i am trying to control selfheal daemon, suppose pid of selfheald is 576, run below command to move all selfheald tasks to mycgroup


What happens when you restart the gluster service? The issue with the massive self-heal load is after the gluster service is restarted (server reboot, gluster upgrade etc..).

When the new self-heal daemons come up, I assume they would not, by default, come up in the new 'mycgroup'?

--- Additional comment from Mohit Agrawal on 2017-08-23 10:37:23 EDT ---

Hi,

If the service is restarted, a new selfheald is spawned, so the same command needs to be rerun
to attach its tasks to mycgroup. I have not created a script for this yet, but we can create
one. It is just a workaround that a user can apply whenever they want to control the CPU usage
of any daemon.
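
One way to decide whether the attach command needs to be rerun after a restart is to look at the process's cgroup membership. The sketch below is an assumption-laden illustration, not an existing script: the function name cpu_cgroup_matches is hypothetical, and it only parses a file in the /proc/<pid>/cgroup format ("hierarchy-ID:controllers:path"), so it reflects whichever task's file is passed in.

```shell
# Sketch: succeed if a cgroup listing in /proc/<pid>/cgroup format places the
# cpu controller under a path whose last component is the given cgroup name.
# The function name and arguments are illustrative only.
cpu_cgroup_matches() {
    cgroup_file=$1
    name=$2
    # Field 2 is the comma-separated controller list, field 3 the cgroup path.
    awk -F: -v name="$name" '
        $2 ~ /(^|,)cpu(,|$)/ {
            suffix = "/" name
            if (substr($3, length($3) - length(suffix) + 1) == suffix)
                found = 1
        }
        END { exit !found }
    ' "$cgroup_file"
}

# Example (576 is the example selfheald pid used earlier in this bug):
# cpu_cgroup_matches /proc/576/cgroup mycgroup || echo "selfheald is not in mycgroup, reattach it"
```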


Regards
Mohit Agrawal

--- Additional comment from Bipin Kunal on 2017-08-23 11:29:42 EDT ---

(In reply to Matthew Robson from comment #13)
> Thanks, I will see if we can test this.
> 
> @Bipin - I see you marked this for RHGS 3.4, do you know how this may
> translate into the product in 3.4 and will it also translate into the RHGS
> images for CNS?

Hello Matt,
   I have proposed this for one of the future releases, but it has not been accepted yet. I had a discussion with the engineering team, and the suggestion was to have a tunable available at the gluster level to control resources; it might use cgroups internally.

 I have opened an RFE https://bugzilla.redhat.com/show_bug.cgi?id=1484446 for the same. I will try to get it included in RHGS-3.4, but it all depends on PM accepting it. Once it is part of gluster, it will automatically be part of CNS, I assume.

--- Additional comment from Ravishankar N on 2017-09-01 02:26:32 EDT ---

Adding this bug for 3.4.0 based on the script/steps being available in comment #3 of BZ 1484446 for QE to test and validate.

Comment 1 Worker Ant 2017-09-27 06:20:30 UTC

REVIEW: https://review.gluster.org/18404 (extras/devel-tools: Script to control CPU for selfheald after configure cpu_quota) posted (#1) for review on master by MOHIT AGRAWAL (moagrawa)

Comment 2 Worker Ant 2017-09-27 06:49:22 UTC

REVIEW: https://review.gluster.org/18404 (extras/devel-tools: Script to control CPU for selfheald after configure cpu_quota) posted (#2) for review on master by MOHIT AGRAWAL (moagrawa)

Comment 3 Worker Ant 2017-09-28 07:46:46 UTC

REVIEW: https://review.gluster.org/18404 (extras/devel-tools: Script to control CPU for selfheald after configure cpu_quota) posted (#3) for review on master by MOHIT AGRAWAL (moagrawa)

Comment 4 Karthik U S 2017-10-04 05:11:45 UTC

*** Bug 1496334 has been marked as a duplicate of this bug. ***

Comment 5 Worker Ant 2017-10-04 06:57:00 UTC

REVIEW: https://review.gluster.org/18404 (extras/devel-tools: Script to control CPU for selfheald after configure cpu_quota) posted (#4) for review on master by MOHIT AGRAWAL (moagrawa)

Comment 6 Worker Ant 2017-10-04 07:14:09 UTC

REVIEW: https://review.gluster.org/18404 (extras/devel-tools: Script to control CPU for selfheald after configure cpu_quota) posted (#5) for review on master by MOHIT AGRAWAL (moagrawa)

Comment 7 Worker Ant 2017-10-10 07:10:56 UTC

REVIEW: https://review.gluster.org/18404 (extras/devel-tools: Script to control CPU for selfheald after configure cpu_quota) posted (#6) for review on master by MOHIT AGRAWAL (moagrawa)

Comment 8 Worker Ant 2017-10-16 03:04:13 UTC

REVIEW: https://review.gluster.org/18404 (extras/devel-tools: Script to control CPU for selfheald after configure cpu_quota) posted (#7) for review on master by MOHIT AGRAWAL (moagrawa)

Comment 9 Worker Ant 2017-10-23 07:49:40 UTC

REVIEW: https://review.gluster.org/18404 (extras/devel-tools: Script to control CPU for selfheald after configure cpu_quota) posted (#8) for review on master by MOHIT AGRAWAL (moagrawa)

Comment 10 Worker Ant 2017-10-28 07:14:14 UTC

REVIEW: https://review.gluster.org/18404 (extras/devel-tools: Script to control CPU for selfheald after configure cpu_quota) posted (#9) for review on master by MOHIT AGRAWAL (moagrawa)

Comment 11 Worker Ant 2017-10-28 09:41:42 UTC

REVIEW: https://review.gluster.org/18404 (extras/devel-tools: Script to control CPU for selfheald after configure cpu_quota) posted (#10) for review on master by MOHIT AGRAWAL (moagrawa)

Comment 12 Atin Mukherjee 2017-11-03 08:01:50 UTC

Please add a public description to the bug.

Comment 13 Worker Ant 2017-11-13 11:07:36 UTC

COMMIT: https://review.gluster.org/18404 committed in master by "MOHIT AGRAWAL" <moagrawa> with a commit message- extras: scripts to control CPU/MEMORY for any gluster daemon during runtime

Problem: Sometime gluster daemons like glustershd can consume a lot of cpu and/
or memory if there is a large amount of data/ entries to be healed.

Solution: Until we have some form of throttling/ QoS mechanisms built into
gluster, we can use control groups for regulating cpu and memory of any gluster
daemon using control-cpu-load.sh and control-mem.sh scripts respectively.

Test:    To test the control-cpu-load.sh script follow below procedure:
         1) Setup distribute replica environment
         2) Selfheal daemon off
         3) Down one node from replica nodes
         4) Create millions of files from mount point
         5) Start down node
         6) Check cpu usage for shd process in top command
         7) Run script after provide shd pid with CPU quota value
         8) Check again cpu usage for shd process in top command

Note: control-mem.sh script can cap the memory usage of the process to the set
limit, beyond which the process gets blocked. It resumes either when the memory
usage comes down or if the limit is increased.

BUG: 1496335
Change-Id: Id73c36b73ca600fa9f7905d84053d1e8633c996f
Signed-off-by: Mohit Agrawal <moagrawa>

Comment 14 Shyamsundar 2018-03-15 11:17:56 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-4.0.0, please open a new bug report.

glusterfs-4.0.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2018-March/000092.html
[2] https://www.gluster.org/pipermail/gluster-users/

Comment 15 Ashish Pandey 2018-10-17 09:42:41 UTC

*** Bug 1460665 has been marked as a duplicate of this bug. ***