Description of problem:
Upgraded gluster servers to RHEL 7.4 and upgraded Gluster from RHGS 3.1.3 to RHGS 3.2.
The gluster process was stopped during the gluster package upgrade.
After the upgrade, restarting gluster led to load averages over 1000 on the server.
top - 14:30:59 up 4:27, 3 users, load average: 1003.32, 714.06, 356.68
This caused services backed by gluster to time out and lose connectivity to both gluster servers.
There are approximately 300 bricks in a 1 x 2 replica on the server, all backing OpenShift PVs.
Version-Release number of selected component (if applicable):
glusterfs-3.7.9-12 to 3.8.4-18.6.el7rhgs
Happened 4 out of 4 times when gluster was restarted.
Steps to Reproduce:
1. Shut down gluster
2. Waiting period ~15 minutes
3. Start Gluster
Expected results:
The server should manage the heals and maintain its availability to services.
Adding Mohit, since he is working on cgroup-based IO throttling, which is targeted at solving this kind of problem. He should be able to give some workaround for this.
> Steps to Reproduce:
> 1. Shutdown gluster
> 2. Waiting period ~15 minutes
> 3. Start Gluster
During 2) above, was there a lot of I/O happening on clients?
This was the case where there are far too many volumes and there is an SHD process for each one of them. I guess this is similar to the one Pranith was talking about earlier. Can we run just 1 shd per machine? That should technically solve this problem.
(In reply to Amar Tumballi from comment #5)
> This was the case, where there are way too many volumes and there is an SHD
> process for each on of them. I guess this is similar to one which Pranith
> was talking earlier. Can we run just 1 shd per machine? that should
> technically solve this problem.
There is only one instance of shd per machine (node), irrespective of the number of bricks or volumes that node is hosting. Also, you can enable/disable the shd on a per-volume basis (`gluster volume set <volname> self-heal-daemon disable/enable`). The problem is that heal is launched using the synctask framework, which causes a lot of parallel heals to be processed. This is where Pranith wanted to modify the synctask such that it heals only 1 file at a time (instead of picking up the 2nd file while we await the cbk of the 1st file, etc.).
If a lot of files were modified in the 15 minutes on the 300 bricks, then high shd load is expected (but perhaps undesired) behaviour. I think controlling the CPU usage of shd using cgroups would be an effective way to regulate heal and bring down the load.
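A minimal sketch of what capping shd (or any gluster daemon) with a CPU cgroup could look like, assuming a cgroup v1 CPU controller and the kernel's default CFS period of 100000 us. The helper name `apply_cpu_quota`, the child directory naming, and the use of `cgroup.procs` are illustrative assumptions, not the contents of any shipped Gluster script. With the default period, a 10 percent cap works out to a quota of 10000 us.

```python
import os


def apply_cpu_quota(parent_cgroup, pid, percent, period_us=100000):
    """Cap `pid` at `percent` of one CPU using a child cgroup (cgroup v1).

    Hypothetical helper: parent_cgroup would be the cpu controller directory
    of the service that owns the process, and the caller needs root.
    """
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    child = os.path.join(parent_cgroup, "cgroup_gluster_%d" % pid)
    os.makedirs(child, exist_ok=True)
    # CFS bandwidth control: the group may run quota_us out of every period_us.
    quota_us = period_us * percent // 100
    with open(os.path.join(child, "cpu.cfs_quota_us"), "w") as f:
        f.write(str(quota_us))
    # Writing the PID to cgroup.procs moves every thread of the process.
    with open(os.path.join(child, "cgroup.procs"), "w") as f:
        f.write(str(pid))
    return child, quota_us
```

Taking the parent directory as a parameter keeps the sketch testable against a scratch directory before pointing it at the real cgroup filesystem.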
(In reply to Karthik U S from comment #4)
> Adding Mohit, since he is working on cgroup based IO throttling, which is
> targeted to solve this kind of problems. He should be able to give some
> workaround for this.
Mohit, would you be able to provide the steps to use cgroup cpu accounting to control glustershd only? Maybe a KCS article would be great. Thank you!
Mohit's patch upstream: https://review.gluster.org/#/c/18404/
moving to verified as below
based on my testing in comment at https://bugzilla.redhat.com/show_bug.cgi?id=1406363#c12
and also tested below for memory consumption management.
However, I noticed that memory consumption doesn't go back to the set limit; the kernel only reports the non-compliance, as shown below, since OOM killing is disabled by default (I will raise a new bug for that).
[root@dhcp37-174 ~]# cat /sys/fs/cgroup/memory/system.slice/glusterd.service/cgroup_gluster_26704/memory.failcnt
So the script is working as expected: once memory consumption crosses the limit, we can see it as shown above, which was not possible before this script was available.
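For repeated checks, the same counters could be read programmatically. This is a sketch only, assuming the cgroup v1 memory controller file names (`memory.limit_in_bytes`, `memory.usage_in_bytes`, `memory.failcnt`); the function names are hypothetical.

```python
import os


def read_memory_counters(cgroup_dir):
    """Return limit, usage, and failcnt for a cgroup v1 memory cgroup."""
    counters = {}
    for name in ("memory.limit_in_bytes", "memory.usage_in_bytes", "memory.failcnt"):
        with open(os.path.join(cgroup_dir, name)) as f:
            counters[name] = int(f.read().strip())
    return counters


def limit_was_hit(counters):
    # memory.failcnt counts how many times usage hit the configured limit.
    return counters["memory.failcnt"] > 0
```

The directory to pass in would be the child cgroup created for the daemon, such as the `cgroup_gluster_26704` path shown in the `cat` command above.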
[root@dhcp37-174 ~]# top -n 1 -b|grep gluster
26704 root 20 0 2532892 119024 4936 S 125.0 1.5 1:36.07 glusterfsd
4047 root 20 0 680856 13944 4392 S 0.0 0.2 0:44.89 glusterd
26740 root 20 0 1318488 58020 3220 S 0.0 0.7 0:02.86 glusterfs
[root@dhcp37-174 ~]# cd /usr/share/
[root@dhcp37-174 share]# cd glusterfs/scripts/
[root@dhcp37-174 scripts]# ls
control-cpu-load.sh get-gfid.sh schedule_georep.pyc
control-mem.sh gsync-sync-gfid schedule_georep.pyo
eventsdash.py gsync-upgrade.sh slave-upgrade.sh
eventsdash.pyc post-upgrade-script-for-quota.sh stop-all-gluster-processes.sh
[root@dhcp37-174 scripts]# ./control-mem.sh
Enter Any gluster daemon pid for that you want to control MEMORY.
If you want to continue the script to attach daeomon with new cgroup. Press (y/n)?y
Creating child cgroup directory 'cgroup_gluster_26704 cgroup' for glusterd.service.
Enter Memory value in Mega bytes [100,8000000000000]:
Entered memory limit value is 110.
Setting 115343360 to memory.limit_in_bytes for /sys/fs/cgroup/memory/system.slice/glusterd.service/cgroup_gluster_26704.
Tasks are attached successfully specific to 26704 to cgroup_gluster_26704.
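The byte value in the output above is consistent with treating one megabyte as 1024 * 1024 bytes. A small worked check, with a hypothetical function name (this is not taken from control-mem.sh itself):

```python
def megabytes_to_limit_bytes(megabytes):
    """Convert a limit in MB (1 MB = 1024 * 1024 bytes) to bytes."""
    return megabytes * 1024 * 1024


# 110 MB matches the value written to memory.limit_in_bytes in the transcript.
print(megabytes_to_limit_bytes(110))  # 115343360
```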
Build Used: glusterfs-3.12.2-7.el7rhgs.x86_64
1. Create a 2 x 3 volume and start it
2. Kill one of the bricks
3. Create 100K files from the client
4. Bring up the down brick
5. CPU load will be high as heal starts
6. Control the CPU load using cgroups
7. Monitor the CPU load using the top command
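Step 3 above could be scripted roughly as follows. This is a sketch under assumptions: the function name, the file naming scheme, and the 1 KiB file size are illustrative, and the mount point in the usage comment is hypothetical.

```python
import os


def create_files(directory, count, size_bytes=1024):
    """Create `count` small files under `directory`; return the number created."""
    os.makedirs(directory, exist_ok=True)
    payload = b"x" * size_bytes
    for i in range(count):
        with open(os.path.join(directory, "file_%06d" % i), "wb") as f:
            f.write(payload)
    return count


# Example usage on a client, with a hypothetical mount point:
# create_files("/mnt/glustervol/healtest", 100000)
```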
> Before triggering control-cpu-load.sh, below is the CPU load
top - 03:14:08 up 7 days, 22:00, 3 users, load average: 0.65, 0.30, 0.15
Tasks: 293 total, 1 running, 292 sleeping, 0 stopped, 0 zombie
%Cpu(s): 5.6 us, 7.2 sy, 0.0 ni, 86.8 id, 0.0 wa, 0.0 hi, 0.3 si, 0.2 st
KiB Mem : 7911208 total, 6065196 free, 362916 used, 1483096 buff/cache
KiB Swap: 8126460 total, 8126460 free, 0 used. 7060660 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13962 root 20 0 1880588 70688 4808 S 56.3 0.9 1:25.44 glusterfsd
> Applied cgroups using control-cpu-load.sh
Enter gluster daemon pid for which you want to control CPU.
pid 13962 is attached with glusterd.service cgroup.
If you want to continue the script to attach 13962 with new cgroup_gluster_13962 cgroup Press (y/n)?y
Creating child cgroup directory 'cgroup_gluster_13962 cgroup' for glusterd.service.
Enter quota value in range [10,100]:
Entered quota value is 10
Setting 10000 to cpu.cfs_quota_us for gluster_cgroup.
Tasks are attached successfully specific to 13962 to cgroup_gluster_13962.
> After triggering control-cpu-load.sh, below is CPU load
top - 03:15:45 up 7 days, 22:02, 3 users, load average: 0.62, 0.40, 0.20
Tasks: 295 total, 2 running, 293 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.8 us, 1.1 sy, 0.0 ni, 97.9 id, 0.0 wa, 0.0 hi, 0.1 si, 0.2 st
KiB Mem : 7911208 total, 5966992 free, 387844 used, 1556372 buff/cache
KiB Swap: 8126460 total, 8126460 free, 0 used. 6995304 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13962 root 20 0 1880588 70688 4808 S 10.2 0.9 2:14.68 glusterfsd
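To compare the before and after readings without reading top by eye, a process row can be parsed as below. This is a sketch that assumes the default `top` column order shown in the header line (PID first, %CPU ninth); the function name is hypothetical.

```python
def cpu_percent_from_top_line(line):
    """Extract (pid, %CPU) from one process row of default top output.

    Assumes the column order:
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    """
    fields = line.split()
    return int(fields[0]), float(fields[8])


before = cpu_percent_from_top_line(
    "13962 root 20 0 1880588 70688 4808 S 56.3 0.9 1:25.44 glusterfsd")
after = cpu_percent_from_top_line(
    "13962 root 20 0 1880588 70688 4808 S 10.2 0.9 2:14.68 glusterfsd")
print(before, after)  # (13962, 56.3) (13962, 10.2)
```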
Moving bug to Verified
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.