Description of problem:
During normal operation, CPU and memory usage gradually increase to 100%, consumed by a large number of glusterfs processes. The result is slowness and resource starvation. The issue started happening with GlusterFS 5.2 and did not improve with 5.5. We did not see this issue in 3.12.

Version-Release number of selected component (if applicable):
GlusterFS 5.2 and 5.5
Heketi 8.0.0
CentOS 7.6

How reproducible:
Users of the cluster hit this issue often by creating and deleting volumes in quick succession from Kubernetes (which uses Heketi to control GlusterFS). Sometimes we hit 100% resource usage several times a day.

Steps to Reproduce:
1. Create a volume
2. Delete the volume
3. Repeat quickly

Actual results:
CPU and memory usage increase, and the number of glusterfs processes grows. I have to log in to each node in the cluster and kill the old processes to make the nodes responsive again; otherwise the nodes eventually freeze from resource starvation.

Expected results:
CPU and memory usage should only spike briefly, not continue to increase, and there should be only one glusterfs process.

Additional info:
I found some issues that look similar:
* https://github.com/gluster/glusterfs/issues/625
* https://github.com/heketi/heketi/issues/1439

Log output from a time when resource usage increased (/var/log/glusterfs/glusterd.log):

[2019-04-01 12:07:23.377715] W [MSGID: 101095] [xlator.c:180:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/5.5/xlator/nfs/server.so: cannot open shared object file: No such file or directory
[2019-04-01 12:07:23.684561] I [run.c:242:runner_log] (-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6f9a) [0x7fd04cc46f9a] -->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6a65) [0x7fd04cc46a65] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fd058156955] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/create/post/S10selinux-label-brick.sh --volname=vol_45653f46dbc8953f876a009b4ea8dd26
[2019-04-01 12:07:26.931683] I [rpc-clnt.c:1000:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2019-04-01 12:07:26.932340] I [rpc-clnt.c:1000:rpc_clnt_connection_init] 0-gfproxyd: setting frame-timeout to 600
[2019-04-01 12:07:26.932667] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: nfs already stopped
[2019-04-01 12:07:26.932707] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: nfs service is stopped
[2019-04-01 12:07:26.932731] I [MSGID: 106599] [glusterd-nfs-svc.c:81:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
[2019-04-01 12:07:26.963055] I [MSGID: 106568] [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 16020
[2019-04-01 12:07:27.963708] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: glustershd service is stopped
[2019-04-01 12:07:27.963951] I [MSGID: 106567] [glusterd-svc-mgmt.c:220:glusterd_svc_start] 0-management: Starting glustershd service
[2019-04-01 12:07:28.985311] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: bitd already stopped
[2019-04-01 12:07:28.985478] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: bitd service is stopped
[2019-04-01 12:07:28.989024] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: scrub already stopped
[2019-04-01 12:07:28.989098] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: scrub service is stopped
[2019-04-01 12:07:29.299841] I [run.c:242:runner_log] (-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6f9a) [0x7fd04cc46f9a] -->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6a65) [0x7fd04cc46a65] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fd058156955] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_45653f46dbc8953f876a009b4ea8dd26 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
[2019-04-01 12:07:29.338437] E [run.c:242:runner_log] (-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6f9a) [0x7fd04cc46f9a] -->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe69c3) [0x7fd04cc469c3] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fd058156955] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_45653f46dbc8953f876a009b4ea8dd26 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
[2019-04-01 12:07:52.658922] I [run.c:242:runner_log] (-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0x3b2dd) [0x7fd04cb9b2dd] -->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6a65) [0x7fd04cc46a65] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fd058156955] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/stop/pre/S29CTDB-teardown.sh --volname=vol_c5112b1e28a7bbc96640a8572009c6f0 --last=no
[2019-04-01 12:07:52.679220] E [run.c:242:runner_log] (-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0x3b2dd) [0x7fd04cb9b2dd] -->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe69c3) [0x7fd04cc469c3] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fd058156955] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/stop/pre/S30samba-stop.sh --volname=vol_c5112b1e28a7bbc96640a8572009c6f0 --last=no
[2019-04-01 12:07:52.681081] I [MSGID: 106542] [glusterd-utils.c:8440:glusterd_brick_signal] 0-glusterd: sending signal 15 to brick with pid 27595
[2019-04-01 12:07:53.732699] I [MSGID: 106599] [glusterd-nfs-svc.c:81:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
[2019-04-01 12:07:53.791560] I [MSGID: 106568] [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 18583
[2019-04-01 12:07:53.791857] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick /var/lib/heketi/mounts/vg_799fbf11286fbf497605bbe58c3e9dfa/brick_08bfe132dad6099ab387555298466ca3/brick on port 49162
[2019-04-01 12:07:53.822032] I [MSGID: 106006] [glusterd-svc-mgmt.c:356:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd.
[2019-04-01 12:07:54.792497] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: glustershd service is stopped
[2019-04-01 12:07:54.792736] I [MSGID: 106567] [glusterd-svc-mgmt.c:220:glusterd_svc_start] 0-management: Starting glustershd service
[2019-04-01 12:07:55.812655] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: bitd already stopped
[2019-04-01 12:07:55.812837] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: bitd service is stopped
[2019-04-01 12:07:55.816580] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: scrub already stopped
[2019-04-01 12:07:55.816672] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: scrub service is stopped
[2019-04-01 12:07:59.829927] I [run.c:242:runner_log] (-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0x3b2dd) [0x7fd04cb9b2dd] -->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6a65) [0x7fd04cc46a65] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fd058156955] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/delete/pre/S10selinux-del-fcontext.sh --volname=vol_c5112b1e28a7bbc96640a8572009c6f0
[2019-04-01 12:07:59.951300] I [MSGID: 106495] [glusterd-handler.c:3118:__glusterd_handle_getwd] 0-glusterd: Received getwd req
[2019-04-01 12:07:59.967584] I [run.c:242:runner_log] (-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6f9a) [0x7fd04cc46f9a] -->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6a65) [0x7fd04cc46a65] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fd058156955] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/delete/post/S57glusterfind-delete-post --volname=vol_c5112b1e28a7bbc96640a8572009c6f0
[2019-04-01 12:07:53.732626] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: nfs already stopped
[2019-04-01 12:07:53.732677] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: nfs service is stopped

Examples of errors about deleted volumes from /var/log/glusterfs/glustershd.log - we get gigabytes of these every day:

[2019-04-02 09:57:08.997572] E [MSGID: 108006] [afr-common.c:5314:__afr_handle_child_down_event] 10-vol_3be8a34875cc37098593d4bc8740477b-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
[2019-04-02 09:57:09.033441] E [MSGID: 108006] [afr-common.c:5314:__afr_handle_child_down_event] 26-vol_2399e6ef0347ac569a0b1211f1fd109d-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
[2019-04-02 09:57:09.036003] E [MSGID: 108006] [afr-common.c:5314:__afr_handle_child_down_event] 40-vol_fafddd8a937a550fbefc6c54830ce44f-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
[2019-04-02 09:57:09.077109] E [MSGID: 108006] [afr-common.c:5314:__afr_handle_child_down_event] 2-vol_bca47201841f5b50d341eb2bedf5cd46-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
[2019-04-02 09:57:09.103495] E [MSGID: 108006] [afr-common.c:5314:__afr_handle_child_down_event] 24-vol_fafddd8a937a550fbefc6c54830ce44f-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
[2019-04-02 09:57:09.455818] E [MSGID: 108006] [afr-common.c:5314:__afr_handle_child_down_event] 30-vol_fafddd8a937a550fbefc6c54830ce44f-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
[2019-04-02 09:57:09.511070] E [MSGID: 108006] [afr-common.c:5314:__afr_handle_child_down_event] 14-vol_cf3700764dfdce40d60b89fde7e1a643-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
[2019-04-02 09:57:09.490714] E [MSGID: 108006] [afr-common.c:5314:__afr_handle_child_down_event] 0-vol_c5112b1e28a7bbc96640a8572009c6f0-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
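As a rough way to quantify this: counting the distinct volume IDs behind these errors shows how many deleted volumes glustershd is still carrying graphs for. A minimal one-liner sketch, assuming the log path and the vol_<hex> naming visible above:

  # Count distinct volumes still producing "All subvolumes are down" errors
  grep 'All subvolumes are down' /var/log/glusterfs/glustershd.log |
    grep -o 'vol_[0-9a-f]*' | sort -u | wc -l

If the count is larger than the number of volumes that actually exist, the self-heal daemon is holding on to graphs for deleted volumes.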
Example of concurrent glusterfs processes on a node:

root 4559 16.8 6.5 14882048 1060288 ? Ssl Apr01 206:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root 6507 14.7 6.1 14250324 998280 ? Ssl Apr01 178:33 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root 6743 0.0 1.2 4780344 201708 ? Ssl Apr01 0:35 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root 7660 17.0 6.3 14859244 1027432 ? Ssl Apr01 206:32 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root 7789 0.1 1.5 5390364 250200 ? Ssl Apr01 1:08 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root 9259 16.4 6.3 14841432 1029512 ? Ssl Apr01 198:12 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root 12394 14.0 5.6 13549044 918424 ? Ssl Apr01 167:46 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root 14980 9.2 4.7 11657716 778876 ? Ssl Apr01 110:10 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root 16032 8.2 4.4 11040436 716020 ? Ssl Apr01 97:39 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root 23961 6.3 3.7 9807736 610408 ? Ssl Apr01 62:03 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root 25560 2.8 3.0 8474704 503488 ? Ssl Apr01 27:33 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root 26293 3.2 1.2 4812208 200896 ? Ssl 09:26 0:35 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root 28205 1.3 1.8 5992016 300012 ? Ssl Apr01 13:31 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root 29186 1.4 2.1 6669800 352440 ? Ssl Apr01 13:59 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root 30485 0.9 0.6 3527080 101552 ? Ssl 09:35 0:05 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root 31171 1.0 0.6 3562360 104908 ? Ssl 09:35 0:05 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root 32086 0.6 0.3 2925412 54852 ? Ssl 09:35 0:03 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
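Until there is a proper fix, this is roughly the manual cleanup I run on each node. A sketch only, assuming the pid file still points at the glustershd instance glusterd currently manages (paths as in the ps output above); verify before running on a production node:

  # Kill every stale glustershd process except the one glusterd tracks
  current=$(cat /var/run/gluster/glustershd/glustershd.pid)
  for pid in $(pgrep -f 'glusterfs .*--process-name glustershd'); do
      [ "$pid" != "$current" ] && kill "$pid"
  done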
Example of how to reliably reproduce the issue from Kubernetes:

1. kubectl apply -f pvc.yaml
2. kubectl delete -f pvc.yaml

There will almost always be a few more glusterfs processes running after doing this.

pvc.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc1
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: glusterfs-replicated-2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc2
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: glusterfs-replicated-2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc3
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: glusterfs-replicated-2
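To hit the race more reliably, I wrap the two kubectl commands in a loop and then count glustershd processes on a storage node. A rough sketch; the sleep values and the node name (node1) are assumptions from my environment:

  # Repeatedly create and delete the PVCs defined in pvc.yaml
  for i in $(seq 1 10); do
      kubectl apply -f pvc.yaml
      sleep 30   # give Heketi time to create the volumes
      kubectl delete -f pvc.yaml
      sleep 30   # give Heketi time to delete them
  done
  # There should be exactly one glustershd process per node afterwards:
  ssh node1 "pgrep -c -f 'process-name glustershd'"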
I have been experimenting with setting "max_inflight_operations" to 1 in Heketi, as mentioned in https://github.com/heketi/heketi/issues/1439

Example of how to configure this: https://github.com/heketi/heketi/blob/8417f25f474b0b16e1936a66f9b63bcedfba6e4c/tests/functional/TestSmokeTest/config/heketi.json

With the value set to 1, I am no longer able to reproduce the issue. The number of glusterfs processes varies between 0 and 2 during volume changes, but always settles at a single process afterwards. This seems to be an easy workaround, but hopefully the bug will be fixed so I can re-enable concurrent operations in Heketi.
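For reference, this is roughly what the setting looks like in our heketi.json. A sketch based on the linked example config; the key sits under the "glusterfs" section there, but placement may differ between Heketi versions, and the usual executor/db settings are omitted here:

  {
    "glusterfs": {
      "max_inflight_operations": 1
    }
  }

Heketi reads its config at startup, so it has to be restarted for the change to take effect.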
Oh, I just noticed that I only listed CentOS 7.6 above. We use Red Hat Enterprise Linux 7.6 on our main servers, but the issue is the same on both CentOS and RHEL.
I see from the release notes for 5.6 that a similar issue has been resolved: https://bugzilla.redhat.com/show_bug.cgi?id=1696147. It looks like it may be the same as this one. I will test 5.6 once it is out.
Please let us know once you have tested 5.6 and whether the problem disappears.
I have tested 5.6 and have so far been unable to reproduce the problem. Looks like the problem is fixed. *** This bug has been marked as a duplicate of bug 1696147 ***