Red Hat Bugzilla – Bug 1478395
Extreme Load from self-heal
Last modified: 2018-02-26 00:51:46 EST
Description of problem:
Upgraded gluster servers to RHEL 7.4 and upgraded Gluster from RHGS 3.1.3 to RHGS 3.2.
The gluster process was stopped during the gluster package upgrade.
After the upgrade, restarting gluster led to load averages over 1000 on the server.
top - 14:30:59 up 4:27, 3 users, load average: 1003.32, 714.06, 356.68
This caused services backed by gluster to time out and lose connectivity to both gluster servers.
There are approximately 300 bricks in a 1 x 2 replica on the server, all backing OpenShift PVs.
Version-Release number of selected component (if applicable):
glusterfs-3.7.9-12 to 3.8.4-18.6.el7rhgs
How reproducible:
Happened 4 out of 4 times when gluster was restarted.
Steps to Reproduce:
1. Shut down gluster
2. Wait ~15 minutes
3. Start Gluster
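For reference, a minimal sketch of these steps, assuming systemd on RHEL 7. Stopping glusterd alone leaves the brick and shd processes running, hence the helper script; its path below is the one shipped by the glusterfs package and may vary by build.

# Hedged sketch of the reproducer (paths and timings illustrative):
systemctl stop glusterd                                      # stop the management daemon
/usr/share/glusterfs/scripts/stop-all-gluster-processes.sh   # also stop brick/shd processes
sleep 900                                                    # ~15 minute waiting period
systemctl start glusterd                                     # bricks and shd respawn; heals start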
Expected results:
The server manages the heals while maintaining its availability to services.
Adding Mohit, since he is working on cgroup-based IO throttling, which is targeted at solving this kind of problem. He should be able to suggest a workaround for this.
> Steps to Reproduce:
> 1. Shut down gluster
> 2. Wait ~15 minutes
> 3. Start Gluster
During 2) above, was there a lot of I/O happening on the clients?
This is a case where there are way too many volumes and there is an SHD process for each one of them. I guess this is similar to the one Pranith was talking about earlier. Can we run just 1 shd per machine? That should technically solve this problem.
(In reply to Amar Tumballi from comment #5)
> This is a case where there are way too many volumes and there is an SHD
> process for each one of them. I guess this is similar to the one Pranith
> was talking about earlier. Can we run just 1 shd per machine? That should
> technically solve this problem.
There is only one instance of shd per machine (node), irrespective of the number of bricks or volumes that node is hosting. Also, you can enable/disable shd on a per-volume basis (`gluster volume set <volname> self-heal-daemon disable/enable`). The problem is that heals are launched using the synctask framework, which processes a lot of heals in parallel. This is where Pranith wanted to modify the synctask so that it heals only 1 file at a time (instead of picking up the 2nd file while we await the cbk of the 1st, etc.).
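For instance, a hedged sketch of staggering heals with those commands; the volume name vol_pv0001 is a hypothetical placeholder.

# Temporarily stop shd crawls for one busy volume (hypothetical name):
gluster volume set vol_pv0001 self-heal-daemon disable

# Once load subsides, re-enable shd and trigger a heal for that volume only:
gluster volume set vol_pv0001 self-heal-daemon enable
gluster volume heal vol_pv0001
gluster volume heal vol_pv0001 info    # list entries still pending heal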
If a lot of files were modified during those 15 minutes across the 300 bricks, then high shd load is expected (but perhaps undesired) behaviour. I think controlling the CPU usage of shd using cgroups would be an effective way to regulate heal and bring down the load.
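A minimal sketch of that approach, assuming cgroup v1 as shipped on RHEL 7; the cgroup name and the ~25% CPU quota below are illustrative assumptions, not the steps from Mohit's patch.

# Create a cpu cgroup and cap it at ~25% of one CPU (illustrative values):
mkdir /sys/fs/cgroup/cpu/glustershd_throttle
echo 100000 > /sys/fs/cgroup/cpu/glustershd_throttle/cpu.cfs_period_us
echo 25000 > /sys/fs/cgroup/cpu/glustershd_throttle/cpu.cfs_quota_us

# Move the self-heal daemon process(es) into the cgroup:
for pid in $(pgrep -f glustershd); do
    echo $pid > /sys/fs/cgroup/cpu/glustershd_throttle/tasks
done

Writing -1 to cpu.cfs_quota_us (or removing the cgroup) lifts the cap once the heals complete.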
(In reply to Karthik U S from comment #4)
> Adding Mohit, since he is working on cgroup-based IO throttling, which is
> targeted at solving this kind of problem. He should be able to suggest a
> workaround for this.
Mohit, would you be able to provide the steps to use cgroup CPU accounting to control glustershd only? A KCS article on this would be great. Thank you!
Mohit's patch upstream: https://review.gluster.org/#/c/18404/