Bug 1684648

Summary: glusterd memory usage grows at 98 MB/h while being monitored by RHGSWA
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Martin Bukatovic <mbukatov>
Component: glusterd
Assignee: Mohit Agrawal <moagrawa>
Status: CLOSED ERRATA
QA Contact: Kshithij Iyer <kiyer>
Severity: high
Docs Contact:
Priority: high
Version: rhgs-3.4
CC: amukherj, bmekala, dahorak, kiyer, moagrawa, nchilaka, pasik, rcyriac, rhs-bugs, sankarshan, sheggodu, smali, storage-qa-internal, vbellur
Target Milestone: ---
Keywords: Regression, ZStream
Target Release: RHGS 3.4.z Batch Update 4
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.12.2-46
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1685414, 1685771 (view as bug list)
Environment:
Last Closed: 2019-03-27 03:43:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1685414, 1685771
Attachments:
 * memory usage chart via munin of affected machine (gl5)
 * memory usage chart via munin of unaffected machine (gl1)
 * Grafana RHGSWA memory chart for whole period on affected machine (gl5)

Description Martin Bukatovic 2019-03-01 18:21:39 UTC
Description of problem
======================

When RHGSWA is monitoring a Gluster trusted storage pool with volume profiling
enabled during a particular long running workload, memory usage of the glusterd
process on one storage machine grows consistently.

In my case, I see growth from 50% to 70% of used memory over 16 h. With
7821 MB of total memory, this gives a growth rate of 7821*0.2/16 MB/h ~ 98 MB/h.
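
The growth rates come from the attached munin/Grafana charts. For reference, a
rough way to track the same thing directly on a node is a loop like this (a
sketch, not part of the original setup; the interval is arbitrary):

```
# Log glusterd RSS (kB) and %MEM every 10 minutes to watch the growth
while true; do
    printf '%s ' "$(date '+%F %T')"
    ps -o rss=,pmem= -C glusterd
    sleep 600
done
```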

Version-Release number of selected component
============================================

Storage machine:

[root@mbukatov-usm2-gl5 ~]# rpm -qa | grep glusterfs | sort
glusterfs-3.12.2-45.el7rhgs.x86_64
glusterfs-api-3.12.2-45.el7rhgs.x86_64
glusterfs-cli-3.12.2-45.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-45.el7rhgs.x86_64
glusterfs-events-3.12.2-45.el7rhgs.x86_64
glusterfs-fuse-3.12.2-45.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-45.el7rhgs.x86_64
glusterfs-libs-3.12.2-45.el7rhgs.x86_64
glusterfs-rdma-3.12.2-45.el7rhgs.x86_64
glusterfs-server-3.12.2-45.el7rhgs.x86_64

[root@mbukatov-usm2-gl5 ~]# rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.4-3.el7rhgs.noarch
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-14.el7rhgs.noarch
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch

Web Admin machine:

[root@mbukatov-usm2-server ~]# rpm -qa | grep tendrl | sort
tendrl-ansible-1.6.3-11.el7rhgs.noarch
tendrl-api-1.6.3-13.el7rhgs.noarch
tendrl-api-httpd-1.6.3-13.el7rhgs.noarch
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-21.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-3.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-21.el7rhgs.noarch
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch
tendrl-ui-1.6.3-15.el7rhgs.noarch

How reproducible
================

Noticed this during a long workload run; haven't tried to reproduce it.

Steps to Reproduce
==================

1. Prepare 6 machines for RHGS, one for RHGSWA, and one for a native client
2. Create a Gluster trusted storage pool with 2 volumes (arbiter, disperse)
3. Install RHGSWA, enable TLS for both etcd and apache
4. Enable alerting (both SNMP and SMTP, receiving on the client machine)
5. Import the storage pool into RHGSWA, with volume profiling enabled
6. Mount both volumes on the client, and extract a lot of small files on them
   simultaneously (extracting a wikipedia tarball); see the sketch below.
7. Monitor the status of the cluster for a few days (at least 3).

The standard usmqe setup was used (volumes beta and gama); most of this is
automated.
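
Steps 5 and 6 roughly correspond to the commands sketched below (an
illustration only: in the actual test profiling was enabled through the RHGSWA
import, the mount server can be any node of the pool, and the tarball name is
a placeholder):

```
# Enable volume profiling on both volumes (RHGSWA does this during import
# when the profiling option is selected)
gluster volume profile beta start
gluster volume profile gama start

# On the client: mount both volumes and unpack a small-file workload on each
mount -t glusterfs mbukatov-usm2-gl1:/beta /mnt/beta
mount -t glusterfs mbukatov-usm2-gl1:/gama /mnt/gama
tar xf wikipedia.tar -C /mnt/beta &
tar xf wikipedia.tar -C /mnt/gama &
```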

Actual results
==============

On one storage machine (mbukatov-usm2-gl5) out of 6, memory usage grows much
faster than on the rest of the storage machines.

Right now, after about 3 days, I see that on the affected machine:

 * glusterd consumes 73% of available memory
 * total memory usage on the machine is at 88%
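
For reference, a snapshot check on the node would look roughly like this (a
sketch; the percentages above were read from the monitoring charts):

```
# Per-process and overall memory snapshot on the affected node
ps -o pid,pmem,rss,comm -C glusterd
free -m
```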

Expected results
================

Memory usage on the affected machine doesn't differ from the rest of the cluster
(memory usage is at about 45% on the remaining machines of the cluster).

Additional info
===============

Affected machine is RHGSWA provisioner node.

Statedumps for the affected machine (mbukatov-usm2-gl5) and another one
(mbukatov-usm2-gl1) for comparison are available.
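
For context, a glusterd statedump can be produced by sending SIGUSR1 to the
daemon; the dump file is written under /var/run/gluster. A sketch:

```
# Trigger a statedump of the running glusterd and show the newest dump file
kill -USR1 "$(pidof glusterd)"
ls -t /var/run/gluster | head -n 1
```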

Grepping for 'get-state.*detail' entries in the cmd_history.log files of
all machines shows that:

 * the affected machine doesn't perform more get-state calls than the
   other machines
 * all machines make a lot of such calls

```
$ find . -name cmd_history.log | xargs -I'{}' grep -H 'get-state.*detail' '{}' | awk -f get-state-counter.awk
/mbukatov-usm2-gl1  7805
/mbukatov-usm2-gl2  7808
/mbukatov-usm2-gl3  7802
/mbukatov-usm2-gl4  7801
/mbukatov-usm2-gl5  7418
/mbukatov-usm2-gl6  7807
```

The numbers above are produced from logs covering about 52 h.
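
The get-state-counter.awk helper is not attached here; a hypothetical
reconstruction that would produce per-host counts in the format above looks
like this:

```
# get-state-counter.awk -- hypothetical reconstruction, not the original script.
# Input comes from `grep -H`, i.e. "./<host>/.../cmd_history.log:<entry>";
# count matching entries per host directory.
BEGIN { FS = ":" }
{
    split($1, parts, "/")        # parts[2] is the host directory under "."
    count["/" parts[2]]++
}
END {
    for (h in count)
        printf "%s  %d\n", h, count[h]
}
```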

Related Bugs
============

This is similar to older, now addressed memory leaks, such as BZ 1567899.

In BZ 1566023, Atin suggests that WA should reduce the frequency of
'gluster get-state detail' calls. That said, BZ 1566023 has been closed by the
dev team without direct indication of whether this was done or proposed as a
future enhancement.

Comment 2 Martin Bukatovic 2019-03-01 18:27:24 UTC
Created attachment 1539886 [details]
memory usage chart via munin of affected machine (gl5)

Comment 3 Martin Bukatovic 2019-03-01 18:28:06 UTC
Created attachment 1539887 [details]
memory usage chart via munin of unaffected machine (gl1)

Comment 4 Martin Bukatovic 2019-03-01 18:29:23 UTC
Created attachment 1539888 [details]
Grafana RHGSWA memory chart for whole period on affected machine (gl5)

Comment 6 Martin Bukatovic 2019-03-04 08:56:45 UTC
Daniel noticed this on his machines as well. Without any workload, memory consumption of glusterd on an RHGSWA provisioner storage node grows linearly.

Comment 7 Martin Bukatovic 2019-03-04 08:59:02 UTC
Based on the observation mentioned in comment 6, this is likely a regression.

Comment 8 Daniel Horák 2019-03-04 11:28:20 UTC
The problem is with the command `gluster volume profile ${vol} info`.

I've installed a clean Gluster cluster with 3 volumes (without the WA console)
and ran this command for all 3 volumes repeatedly every 5 seconds; memory
consumption grew by 1 GB in 2.5 hours.

But I'm not 100% sure whether it is a regression against the last GA version or not.
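
The reproducer described above boils down to a loop like the following (a
sketch; volume names are placeholders):

```
# Poll `volume profile info` for every volume each 5 seconds and watch
# glusterd memory usage grow
while true; do
    for vol in vol1 vol2 vol3; do
        gluster volume profile "${vol}" info > /dev/null
    done
    sleep 5
done
```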

Comment 9 Martin Bukatovic 2019-03-04 13:44:10 UTC
(In reply to Daniel Horák from comment #8)
> The problem is with command `gluster volume profile ${vol} info`.

Evidence from cmd_history.log files supports this:

```
$ find . -name cmd_history.log | xargs -I'{}' grep -H 'volume profile' '{}' | awk -f get-state-counter.awk
/mbukatov-usm2-gl2     2
/mbukatov-usm2-gl5  4898
```

Comment 21 errata-xmlrpc 2019-03-27 03:43:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0658