Bug 1695099 - The number of glusterfs processes keeps increasing, using all available resources
Summary: The number of glusterfs processes keeps increasing, using all available resources
Keywords:
Status: CLOSED DUPLICATE of bug 1696147
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-02 13:21 UTC by Christian Ihle
Modified: 2019-05-08 06:45 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-08 06:45:33 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Christian Ihle 2019-04-02 13:21:23 UTC
Description of problem:

During normal operations, CPU and memory usage gradually increase to 100%, consumed by a large number of glusterfs processes. The result is slowness and resource starvation. The issue started happening with GlusterFS 5.2 and did not improve with 5.5. We did not see this issue in 3.12.


Version-Release number of selected component (if applicable): 
GlusterFS 5.2 and 5.5
Heketi 8.0.0
CentOS 7.6


How reproducible:

Users of the cluster hit this issue frequently by creating and deleting volumes in quick succession from Kubernetes (which uses Heketi to control GlusterFS). Sometimes we hit 100% resource usage several times a day.


Steps to Reproduce:
1. Create volume
2. Delete volume
3. Repeat quickly
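
A minimal shell sketch of this loop using the gluster CLI directly (hedged: the host names node1/node2 and the brick paths are hypothetical placeholders, and --mode=script only skips the confirmation prompts):

# Repeatedly create and delete a replica-2 volume; hosts and brick paths are placeholders.
for i in $(seq 1 20); do
    gluster --mode=script volume create testvol$i replica 2 \
        node1:/bricks/testvol$i node2:/bricks/testvol$i force
    gluster --mode=script volume start testvol$i
    gluster --mode=script volume stop testvol$i
    gluster --mode=script volume delete testvol$i
done
# A healthy node should end up with a single glustershd process:
pgrep -af glustershd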


Actual results:

CPU usage and memory usage increase, and the number of glusterfs processes grows. I have to log in to each node in the cluster and kill old processes to make the nodes responsive again; otherwise they eventually freeze from resource starvation.
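
A hedged sketch of the manual cleanup described above, keeping only the glustershd instance recorded in the pid file (the pid file path is taken from the ps output further down):

# Kill stale self-heal daemons, keeping the one glusterd tracks via its pid file.
KEEP=$(cat /var/run/gluster/glustershd/glustershd.pid)
for pid in $(pgrep -f 'glusterfs.*glustershd'); do
    [ "$pid" != "$KEEP" ] && kill "$pid"
done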


Expected results:

CPU and memory usage should only spike briefly rather than continue to increase, and there should be only one glusterfs process.


Additional info:

I found some issues that look similar:

* https://github.com/gluster/glusterfs/issues/625
* https://github.com/heketi/heketi/issues/1439


Log output from a time where resource usage increased (/var/log/glusterfs/glusterd.log):

[2019-04-01 12:07:23.377715] W [MSGID: 101095] [xlator.c:180:xlator_volopt_dynload] 0-xlator: /usr/lib64/glusterfs/5.5/xlator/nfs/server.so: cannot open shared object file: No such file or directory
[2019-04-01 12:07:23.684561] I [run.c:242:runner_log] (-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6f9a) [0x7fd04cc46f9a] -->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6a65) [0x7fd04cc46a65] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fd058156955] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/create/post/S10selinux-label-brick.sh --volname=vol_45653f46dbc8953f876a009b4ea8dd26
[2019-04-01 12:07:26.931683] I [rpc-clnt.c:1000:rpc_clnt_connection_init] 0-snapd: setting frame-timeout to 600
[2019-04-01 12:07:26.932340] I [rpc-clnt.c:1000:rpc_clnt_connection_init] 0-gfproxyd: setting frame-timeout to 600
[2019-04-01 12:07:26.932667] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: nfs already stopped
[2019-04-01 12:07:26.932707] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: nfs service is stopped
[2019-04-01 12:07:26.932731] I [MSGID: 106599] [glusterd-nfs-svc.c:81:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
[2019-04-01 12:07:26.963055] I [MSGID: 106568] [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 16020
[2019-04-01 12:07:27.963708] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: glustershd service is stopped
[2019-04-01 12:07:27.963951] I [MSGID: 106567] [glusterd-svc-mgmt.c:220:glusterd_svc_start] 0-management: Starting glustershd service
[2019-04-01 12:07:28.985311] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: bitd already stopped
[2019-04-01 12:07:28.985478] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: bitd service is stopped
[2019-04-01 12:07:28.989024] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: scrub already stopped
[2019-04-01 12:07:28.989098] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: scrub service is stopped
[2019-04-01 12:07:29.299841] I [run.c:242:runner_log] (-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6f9a) [0x7fd04cc46f9a] -->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6a65) [0x7fd04cc46a65] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fd058156955] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=vol_45653f46dbc8953f876a009b4ea8dd26 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
[2019-04-01 12:07:29.338437] E [run.c:242:runner_log] (-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6f9a) [0x7fd04cc46f9a] -->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe69c3) [0x7fd04cc469c3] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fd058156955] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=vol_45653f46dbc8953f876a009b4ea8dd26 --first=no --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
[2019-04-01 12:07:52.658922] I [run.c:242:runner_log] (-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0x3b2dd) [0x7fd04cb9b2dd] -->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6a65) [0x7fd04cc46a65] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fd058156955] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/stop/pre/S29CTDB-teardown.sh --volname=vol_c5112b1e28a7bbc96640a8572009c6f0 --last=no
[2019-04-01 12:07:52.679220] E [run.c:242:runner_log] (-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0x3b2dd) [0x7fd04cb9b2dd] -->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe69c3) [0x7fd04cc469c3] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fd058156955] ) 0-management: Failed to execute script: /var/lib/glusterd/hooks/1/stop/pre/S30samba-stop.sh --volname=vol_c5112b1e28a7bbc96640a8572009c6f0 --last=no
[2019-04-01 12:07:52.681081] I [MSGID: 106542] [glusterd-utils.c:8440:glusterd_brick_signal] 0-glusterd: sending signal 15 to brick with pid 27595
[2019-04-01 12:07:53.732699] I [MSGID: 106599] [glusterd-nfs-svc.c:81:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
[2019-04-01 12:07:53.791560] I [MSGID: 106568] [glusterd-proc-mgmt.c:92:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 18583
[2019-04-01 12:07:53.791857] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick /var/lib/heketi/mounts/vg_799fbf11286fbf497605bbe58c3e9dfa/brick_08bfe132dad6099ab387555298466ca3/brick on port 49162
[2019-04-01 12:07:53.822032] I [MSGID: 106006] [glusterd-svc-mgmt.c:356:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd.
[2019-04-01 12:07:54.792497] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: glustershd service is stopped
[2019-04-01 12:07:54.792736] I [MSGID: 106567] [glusterd-svc-mgmt.c:220:glusterd_svc_start] 0-management: Starting glustershd service
[2019-04-01 12:07:55.812655] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: bitd already stopped
[2019-04-01 12:07:55.812837] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: bitd service is stopped
[2019-04-01 12:07:55.816580] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: scrub already stopped
[2019-04-01 12:07:55.816672] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: scrub service is stopped
[2019-04-01 12:07:59.829927] I [run.c:242:runner_log] (-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0x3b2dd) [0x7fd04cb9b2dd] -->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6a65) [0x7fd04cc46a65] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fd058156955] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/delete/pre/S10selinux-del-fcontext.sh --volname=vol_c5112b1e28a7bbc96640a8572009c6f0
[2019-04-01 12:07:59.951300] I [MSGID: 106495] [glusterd-handler.c:3118:__glusterd_handle_getwd] 0-glusterd: Received getwd req
[2019-04-01 12:07:59.967584] I [run.c:242:runner_log] (-->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6f9a) [0x7fd04cc46f9a] -->/usr/lib64/glusterfs/5.5/xlator/mgmt/glusterd.so(+0xe6a65) [0x7fd04cc46a65] -->/lib64/libglusterfs.so.0(runner_log+0x115) [0x7fd058156955] ) 0-management: Ran script: /var/lib/glusterd/hooks/1/delete/post/S57glusterfind-delete-post --volname=vol_c5112b1e28a7bbc96640a8572009c6f0
[2019-04-01 12:07:53.732626] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: nfs already stopped
[2019-04-01 12:07:53.732677] I [MSGID: 106568] [glusterd-svc-mgmt.c:253:glusterd_svc_stop] 0-management: nfs service is stopped


Examples of errors about deleted volumes from /var/log/glusterfs/glustershd.log - we get gigabytes of these every day:

[2019-04-02 09:57:08.997572] E [MSGID: 108006] [afr-common.c:5314:__afr_handle_child_down_event] 10-vol_3be8a34875cc37098593d4bc8740477b-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
[2019-04-02 09:57:09.033441] E [MSGID: 108006] [afr-common.c:5314:__afr_handle_child_down_event] 26-vol_2399e6ef0347ac569a0b1211f1fd109d-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
[2019-04-02 09:57:09.036003] E [MSGID: 108006] [afr-common.c:5314:__afr_handle_child_down_event] 40-vol_fafddd8a937a550fbefc6c54830ce44f-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
[2019-04-02 09:57:09.077109] E [MSGID: 108006] [afr-common.c:5314:__afr_handle_child_down_event] 2-vol_bca47201841f5b50d341eb2bedf5cd46-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
[2019-04-02 09:57:09.103495] E [MSGID: 108006] [afr-common.c:5314:__afr_handle_child_down_event] 24-vol_fafddd8a937a550fbefc6c54830ce44f-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
[2019-04-02 09:57:09.455818] E [MSGID: 108006] [afr-common.c:5314:__afr_handle_child_down_event] 30-vol_fafddd8a937a550fbefc6c54830ce44f-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
[2019-04-02 09:57:09.511070] E [MSGID: 108006] [afr-common.c:5314:__afr_handle_child_down_event] 14-vol_cf3700764dfdce40d60b89fde7e1a643-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
[2019-04-02 09:57:09.490714] E [MSGID: 108006] [afr-common.c:5314:__afr_handle_child_down_event] 0-vol_c5112b1e28a7bbc96640a8572009c6f0-replicate-0: All subvolumes are down. Going offline until at least one of them comes back up.
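
A quick, hedged way to gauge how fast these messages accumulate on a node (paths and message text as in the excerpt above):

# Size of the self-heal daemon logs and a count of the recurring error.
du -sh /var/log/glusterfs/glustershd.log*
grep -c 'All subvolumes are down' /var/log/glusterfs/glustershd.log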


Example of concurrent glusterfs processes on a node:

root      4559 16.8  6.5 14882048 1060288 ?    Ssl  Apr01 206:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root      6507 14.7  6.1 14250324 998280 ?     Ssl  Apr01 178:33 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root      6743  0.0  1.2 4780344 201708 ?      Ssl  Apr01   0:35 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root      7660 17.0  6.3 14859244 1027432 ?    Ssl  Apr01 206:32 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root      7789  0.1  1.5 5390364 250200 ?      Ssl  Apr01   1:08 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root      9259 16.4  6.3 14841432 1029512 ?    Ssl  Apr01 198:12 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root     12394 14.0  5.6 13549044 918424 ?     Ssl  Apr01 167:46 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root     14980  9.2  4.7 11657716 778876 ?     Ssl  Apr01 110:10 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root     16032  8.2  4.4 11040436 716020 ?     Ssl  Apr01  97:39 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root     23961  6.3  3.7 9807736 610408 ?      Ssl  Apr01  62:03 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root     25560  2.8  3.0 8474704 503488 ?      Ssl  Apr01  27:33 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root     26293  3.2  1.2 4812208 200896 ?      Ssl  09:26   0:35 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root     28205  1.3  1.8 5992016 300012 ?      Ssl  Apr01  13:31 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root     29186  1.4  2.1 6669800 352440 ?      Ssl  Apr01  13:59 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root     30485  0.9  0.6 3527080 101552 ?      Ssl  09:35   0:05 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root     31171  1.0  0.6 3562360 104908 ?      Ssl  09:35   0:05 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd
root     32086  0.6  0.3 2925412 54852 ?       Ssl  09:35   0:03 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/b1de56a0bbcc8779.socket --xlator-option *replicate*.node-uuid=24419492-0d80-4a2a-9420-1dd92515eaf1 --process-name glustershd

Comment 1 Christian Ihle 2019-04-03 07:24:59 UTC
Example of how to reliably reproduce the issue from Kubernetes.

1. kubectl apply -f pvc.yaml
2. kubectl delete -f pvc.yaml

There will almost always be a few extra glusterfs processes running after doing this (see the check below the pvc.yaml).

pvc.yaml:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc1
  namespace: default
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: glusterfs-replicated-2

---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc2
  namespace: default
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: glusterfs-replicated-2

---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc3
  namespace: default
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: glusterfs-replicated-2
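
A quick check for the leftover processes on each node after the apply/delete cycle (hedged; a healthy node should show exactly one glustershd):

# Count self-heal daemon processes; anything above 1 indicates the leak.
pgrep -cf 'glusterfs.*glustershd'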

Comment 2 Christian Ihle 2019-04-03 08:39:23 UTC
I have been experimenting with setting "max_inflight_operations" to 1 in Heketi, as mentioned in https://github.com/heketi/heketi/issues/1439
Example of how to configure this: https://github.com/heketi/heketi/blob/8417f25f474b0b16e1936a66f9b63bcedfba6e4c/tests/functional/TestSmokeTest/config/heketi.json
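
A minimal sketch of the relevant part of heketi.json (hedged: based on the linked example, with the option under the "glusterfs" section; the executor value and the surrounding keys depend on your deployment):

{
  "glusterfs": {
    "executor": "kubernetes",
    "max_inflight_operations": 1
  }
}

Heketi will likely need a restart to pick up the change.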

I am no longer able to reproduce the issue when the value is set to 1.
The number of glusterfs processes varies between 0 and 2 during volume changes, but always settles on a single process afterwards.

This seems to be an easy workaround, but hopefully the bug will be fixed so I can switch back to concurrent Heketi.

Comment 3 Christian Ihle 2019-04-09 12:57:42 UTC
I just noticed I only listed CentOS 7.6. We use Red Hat Enterprise Linux 7.6 on our main servers, but the issue is the same on both CentOS and Red Hat.

Comment 4 Christian Ihle 2019-04-10 06:21:52 UTC
I see from the release notes of 5.6 that this issue appears to be resolved: https://bugzilla.redhat.com/show_bug.cgi?id=1696147
It looks like it may be the same problem as this one. I will test 5.6 once it's out.

Comment 5 Atin Mukherjee 2019-04-29 03:02:47 UTC
Please let us know if you have tested 5.6 and see this problem disappearing.

Comment 6 Christian Ihle 2019-05-08 06:45:33 UTC
I have tested 5.6 and have so far been unable to reproduce the problem. It looks like the problem is fixed.

*** This bug has been marked as a duplicate of bug 1696147 ***

