Description of problem
======================

When RHGSWA is monitoring a Gluster trusted storage pool with volume profiling enabled, during a particular long-running workload, memory usage of the glusterd process on one storage machine grows consistently. In my case, I see growth from 50% to 70% of used memory over 16 hours; with 7821 MB of total memory, this gives a growth rate of 7821*0.2/16 MB/h ~ 98 MB/h.

Version-Release number of selected component
============================================

Storage machine:

[root@mbukatov-usm2-gl5 ~]# rpm -qa | grep glusterfs | sort
glusterfs-3.12.2-45.el7rhgs.x86_64
glusterfs-api-3.12.2-45.el7rhgs.x86_64
glusterfs-cli-3.12.2-45.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-45.el7rhgs.x86_64
glusterfs-events-3.12.2-45.el7rhgs.x86_64
glusterfs-fuse-3.12.2-45.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-45.el7rhgs.x86_64
glusterfs-libs-3.12.2-45.el7rhgs.x86_64
glusterfs-rdma-3.12.2-45.el7rhgs.x86_64
glusterfs-server-3.12.2-45.el7rhgs.x86_64

[root@mbukatov-usm2-gl5 ~]# rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.4-3.el7rhgs.noarch
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-14.el7rhgs.noarch
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch

Web Admin machine:

[root@mbukatov-usm2-server ~]# rpm -qa | grep tendrl | sort
tendrl-ansible-1.6.3-11.el7rhgs.noarch
tendrl-api-1.6.3-13.el7rhgs.noarch
tendrl-api-httpd-1.6.3-13.el7rhgs.noarch
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-21.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-3.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-21.el7rhgs.noarch
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch
tendrl-ui-1.6.3-15.el7rhgs.noarch

How reproducible
================

Noticed this during a long workload run; haven't tried to reproduce.

Steps to Reproduce
==================

1. Prepare 6 machines for RHGS, one for RHGSWA, one for a native client
2. Create a Gluster trusted storage pool with 2 volumes (arbiter, disperse)
3. Install RHGSWA, enable TLS for both etcd and Apache
4. Enable alerting (both SNMP and SMTP, receiving on the client machine)
5. Import the storage pool into RHGSWA, with volume profiling enabled
6. Mount both volumes on the client, and extract a lot of small files on both simultaneously (extracting a Wikipedia tarball)
7. Monitor status of the cluster for a few days (at least 3)

The standard usmqe setup was used (volumes beta and gama); most of this is automated.

Actual results
==============

On one storage machine (mbukatov-usm2-gl5) out of 6, memory usage grows much faster than on the rest of the storage machines. Right now, after about 3 days, I see that on the affected machine:

* glusterd consumes 73% of available memory
* total memory usage on the machine is at 88%

Expected results
================

Memory usage on the affected machine doesn't differ from the rest of the cluster (memory usage is at about 45% on the remaining machines of the cluster).

Additional info
===============

The affected machine is the RHGSWA provisioner node.

Statedumps for the affected machine (mbukatov-usm2-gl5) and another one (mbukatov-usm2-gl1) for comparison are available.

Grepping for 'get-state.*detail' entries in the cmd_history.log files of all machines shows that:

* the affected machine doesn't perform more get-state calls than the other machines
* all machines make a lot of such calls

```
$ find . -name cmd_history.log | xargs -I'{}' grep -H 'get-state.*detail' '{}' | awk -f get-state-counter.awk
/mbukatov-usm2-gl1 7805
/mbukatov-usm2-gl2 7808
/mbukatov-usm2-gl3 7802
/mbukatov-usm2-gl4 7801
/mbukatov-usm2-gl5 7418
/mbukatov-usm2-gl6 7807
```
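The get-state-counter.awk helper is not attached to this report. Assuming it simply counts matched lines per host directory (an assumption, not the attached script), a roughly equivalent pipeline without the awk file would be:

```
# Hypothetical stand-in for get-state-counter.awk: count matching lines
# per host directory, assuming paths like ./mbukatov-usm2-glN/.../cmd_history.log
find . -name cmd_history.log \
  | xargs -I'{}' grep -H 'get-state.*detail' '{}' \
  | cut -d/ -f2 | sort | uniq -c
```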
The numbers above are for logs covering about 52 hours, i.e. roughly 7800/52 ~ 150 'gluster get-state detail' calls per hour (about one every 24 seconds) on each machine.

Related Bugs
============

This is similar to other older, now addressed, memory leaks, such as BZ 1567899. In BZ 1566023, Atin suggests that WA drop the frequency of 'gluster get-state detail' calls. That said, BZ 1566023 was closed by the dev team without a direct indication of whether this was done or proposed as a future enhancement.
Created attachment 1539886 [details] memory usage chart via munin of affected machine (gl5)
Created attachment 1539887 [details] memory usage chart via munin of unaffected machine (gl1)
Created attachment 1539888 [details] Grafana RHGSWA memory chart for the whole period on the affected machine (gl5)
Daniel noticed this on his machines as well. Without any workload, memory consumption of glusterd on a RHGSWA provisioner storage node grows linearly.
Based on the observation mentioned in comment 6, this is likely a regression.
The problem is with the command `gluster volume profile ${vol} info`. I've installed a clean Gluster cluster with 3 volumes (without the WA console) and ran this command for all 3 volumes repeatedly every 5 seconds; memory consumption grew by 1 GB in 2.5 hours. But I'm not 100% sure whether it is a regression against the last GA version or not.
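A minimal reproducer sketch of the loop described above, assuming three existing volumes (placeholder names vol1..vol3) with profiling already started on each (`gluster volume profile <vol> start`):

```
#!/bin/bash
# Reproducer sketch: poll profile info for each volume every 5 seconds,
# mimicking the monitoring behavior. Volume names are placeholders.
while true; do
    for vol in vol1 vol2 vol3; do
        gluster volume profile "$vol" info > /dev/null
    done
    sleep 5
done
```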
(In reply to Daniel Horák from comment #8)
> The problem is with command `gluster volume profile ${vol} info`.

Evidence from the cmd_history.log files supports this; essentially all 'volume profile' commands in the pool are executed on the affected provisioner node:

```
$ find . -name cmd_history.log | xargs -I'{}' grep -H 'volume profile' '{}' | awk -f get-state-counter.awk
/mbukatov-usm2-gl2 2
/mbukatov-usm2-gl5 4898
```
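To tie the call counts to the leak, one could sample glusterd's resident set size while the profile-info polling runs. A simple sketch using standard procps options (the log file path is arbitrary):

```
# Log glusterd RSS (in kB) once a minute; plotting this alongside the
# 'volume profile ... info' call rate should show the linear growth.
while true; do
    echo "$(date -u +%FT%TZ) $(ps -o rss= -C glusterd)" >> /tmp/glusterd-rss.log
    sleep 60
done
```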
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0658