Description of problem
======================

When RHGSWA is monitoring a Gluster trusted storage pool with volume profiling enabled, during a particular long-running workload, memory usage of the glusterd process on one storage machine grows consistently. In my case, I see growth from 50% to 70% of used memory over 16 hours; with 7821 MB of total memory, this gives a growth rate of 7821*0.2/16 MB/h ~ 98 MB/h.

Version-Release number of selected component
============================================

Storage machine:

[root@mbukatov-usm2-gl5 ~]# rpm -qa | grep glusterfs | sort
glusterfs-3.12.2-45.el7rhgs.x86_64
glusterfs-api-3.12.2-45.el7rhgs.x86_64
glusterfs-cli-3.12.2-45.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-45.el7rhgs.x86_64
glusterfs-events-3.12.2-45.el7rhgs.x86_64
glusterfs-fuse-3.12.2-45.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-45.el7rhgs.x86_64
glusterfs-libs-3.12.2-45.el7rhgs.x86_64
glusterfs-rdma-3.12.2-45.el7rhgs.x86_64
glusterfs-server-3.12.2-45.el7rhgs.x86_64

[root@mbukatov-usm2-gl5 ~]# rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.4-3.el7rhgs.noarch
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-14.el7rhgs.noarch
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch

Web Admin machine:

[root@mbukatov-usm2-server ~]# rpm -qa | grep tendrl | sort
tendrl-ansible-1.6.3-11.el7rhgs.noarch
tendrl-api-1.6.3-13.el7rhgs.noarch
tendrl-api-httpd-1.6.3-13.el7rhgs.noarch
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-21.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-3.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-21.el7rhgs.noarch
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch
tendrl-ui-1.6.3-15.el7rhgs.noarch

How reproducible
================

Noticed this during a long workload run; haven't tried to reproduce.

Steps to Reproduce
==================

1. Prepare 6 machines for RHGS, one for RHGSWA, one for a native client
2. Create a Gluster trusted storage pool with 2 volumes (arbiter, disperse)
3. Install RHGSWA, enable TLS for both etcd and Apache
4. Enable alerting (both SNMP and SMTP, receiving on the client machine)
5. Import the storage pool into RHGSWA, with volume profiling enabled
6. Mount both volumes on the client, and extract a lot of small files on both simultaneously (extracting a Wikipedia tarball)
7. Monitor status of the cluster for a few days (at least 3)

The standard usmqe setup was used (volumes beta and gama); most of this is automated.

Actual results
==============

On one storage machine (mbukatov-usm2-gl5) out of 6, memory usage grows much faster than on the rest of the storage machines. Right now, after about 3 days, I see that on the affected machine:

* glusterd consumes 73% of available memory
* total memory usage on the machine is at 88%

Expected results
================

Memory usage on the affected machine doesn't differ from the rest of the cluster (memory usage is at about 45% on the remaining machines of the cluster).

Additional info
===============

The affected machine is the RHGSWA provisioner node.

Statedumps for the affected machine (mbukatov-usm2-gl5) and another one (mbukatov-usm2-gl1) for comparison are available.

Grepping for 'get-state.*detail' entries in the cmd_history.log files of all machines shows that:

* the affected machine doesn't perform more get-state calls than the other machines
* all machines make a lot of such calls

```
$ find . -name cmd_history.log | xargs -I'{}' grep -H 'get-state.*detail' '{}' | awk -f get-state-counter.awk
/mbukatov-usm2-gl1 7805
/mbukatov-usm2-gl2 7808
/mbukatov-usm2-gl3 7802
/mbukatov-usm2-gl4 7801
/mbukatov-usm2-gl5 7418
/mbukatov-usm2-gl6 7807
```
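The get-state-counter.awk helper is not attached to this report. Assuming it simply counts matched lines per host directory (an assumption, not the attached script), a roughly equivalent pipeline without the awk file would be:

```
# Hypothetical stand-in for get-state-counter.awk: count matching lines
# per host directory, assuming paths like ./mbukatov-usm2-glN/.../cmd_history.log
find . -name cmd_history.log \
  | xargs -I'{}' grep -H 'get-state.*detail' '{}' \
  | cut -d/ -f2 | sort | uniq -c
```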
The numbers above are for logs covering about 52 hours, i.e. roughly 7800/52 ~ 150 'gluster get-state detail' calls per hour (about one every 24 seconds) on each machine.

Related Bugs
============

This is similar to other older, now addressed, memory leaks, such as BZ 1567899. In BZ 1566023, Atin suggests that WA drop the frequency of 'gluster get-state detail' calls. That said, BZ 1566023 was closed by the dev team without a direct indication of whether this was done or proposed as a future enhancement.
Created attachment 1539886 [details] memory usage chart via munin of affected machine (gl5)
Created attachment 1539887 [details] memory usage chart via munin of unaffected machine (gl1)
Created attachment 1539888 [details] Grafana RHGSWA memory chart for the whole period on the affected machine (gl5)
Daniel noticed this on his machines as well. Without any workload, memory consumption of glusterd on a RHGSWA provisioner storage node grows linearly.
Based on the observation mentioned in comment 6, this is likely a regression.
The problem is with the command `gluster volume profile ${vol} info`. I've installed a clean Gluster cluster with 3 volumes (without the WA console) and ran this command for all 3 volumes repeatedly every 5 seconds; memory consumption grew by 1 GB in 2.5 hours. But I'm not 100% sure whether it is a regression against the last GA version or not.
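A minimal reproducer sketch of the loop described above, assuming three existing volumes (placeholder names vol1..vol3) with profiling already started on each (`gluster volume profile <vol> start`):

```
#!/bin/bash
# Reproducer sketch: poll profile info for each volume every 5 seconds,
# mimicking the monitoring behavior. Volume names are placeholders.
while true; do
    for vol in vol1 vol2 vol3; do
        gluster volume profile "$vol" info > /dev/null
    done
    sleep 5
done
```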
(In reply to Daniel Horák from comment #8)
> The problem is with command `gluster volume profile ${vol} info`.

Evidence from the cmd_history.log files supports this; essentially all 'volume profile' commands in the pool are executed on the affected provisioner node:

```
$ find . -name cmd_history.log | xargs -I'{}' grep -H 'volume profile' '{}' | awk -f get-state-counter.awk
/mbukatov-usm2-gl2 2
/mbukatov-usm2-gl5 4898
```
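To tie the call counts to the leak, one could sample glusterd's resident set size while the profile-info polling runs. A simple sketch using standard procps options (the log file path is arbitrary):

```
# Log glusterd RSS (in kB) once a minute; plotting this alongside the
# 'volume profile ... info' call rate should show the linear growth.
while true; do
    echo "$(date -u +%FT%TZ) $(ps -o rss= -C glusterd)" >> /tmp/glusterd-rss.log
    sleep 60
done
```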
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0658