Description of problem: We should have a way to control self-heal for a disperse volume. When there are multiple bricks on a node and that node goes down and comes back up, several bricks start healing at once. This can lead to very high CPU usage and hamper ongoing I/O. We therefore need a mechanism to control heal activity, allowing lazy heal during peak hours and aggressive heal during periods of low load.
on_qa validation: Tested the following:

/usr/share/glusterfs/scripts/control-cpu-load.sh --> able to set a CPU load limit on a glusterfsd/shd/glusterd process. If a process was consuming 100% CPU and I set the limit to 20%, its CPU usage drops and does not cross 20% (plus a very small delta, e.g. 20.5%).

[root@dhcp35-205 scripts]# ./control-cpu-load.sh
Enter gluster daemon pid for which you want to control CPU.
23530
pid 23530 is attached with glusterd.service cgroup.
pid 23530 is not attached with cgroup_gluster_23530.
If you want to continue the script to attach 23530 with new cgroup_gluster_23530 cgroup Press (y/n)?y
yes
Creating child cgroup directory 'cgroup_gluster_23530 cgroup' for glusterd.service.
Enter quota value in range [10,100]: 30
Entered quota value is 30
Setting 30000 to cpu.cfs_quota_us for gluster_cgroup.
Tasks are attached successfully specific to 23530 to cgroup_gluster_23530.

In the above case CPU consumption was capped at a maximum of 30%, hence moving the bug to verified. If I see any specific observations I will raise new bugs, but at a high level this feature works.
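For reference, below is a minimal sketch of the cgroup mechanism the script relies on (cgroup v1 cpu controller). It is not the shipped control-cpu-load.sh; the mount path, cgroup name, and variable names are illustrative assumptions. The key point is that with the default cpu.cfs_period_us of 100000 (100 ms), writing quota*1000 to cpu.cfs_quota_us caps the group at quota percent of one CPU, which matches the "Setting 30000 to cpu.cfs_quota_us" line in the transcript above for a 30% limit.

    #!/bin/bash
    # Hedged sketch, not the actual script: cap a gluster daemon's CPU usage
    # using the cgroup v1 cpu controller assumed to be mounted at /sys/fs/cgroup/cpu.
    PID=$1          # gluster daemon pid to throttle (e.g. glusterd/glusterfsd/shd)
    QUOTA_PCT=$2    # desired CPU cap in percent, e.g. 30

    CG_ROOT=/sys/fs/cgroup/cpu
    CG_DIR="$CG_ROOT/cgroup_gluster_${PID}"   # illustrative child cgroup name

    mkdir -p "$CG_DIR"

    # With cpu.cfs_period_us at its default of 100000 us, a quota of
    # QUOTA_PCT * 1000 us per period limits the group to QUOTA_PCT% CPU.
    echo $(( QUOTA_PCT * 1000 )) > "$CG_DIR/cpu.cfs_quota_us"

    # Move the daemon's threads into the new cgroup so the quota applies to them.
    for tid in $(ls /proc/"$PID"/task); do
        echo "$tid" > "$CG_DIR/tasks"
    done

Usage would be along the lines of "./cap-cpu.sh 23530 30" to reproduce the 30% cap shown in the validation run; removing the cap amounts to writing -1 to cpu.cfs_quota_us or moving the tasks back to the parent cgroup.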
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607