Bug 1667169

Summary: glusterd leaks about 1GB memory per day on single machine of storage pool
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Martin Bukatovic <mbukatov>
Component: glusterd
Assignee: Mohit Agrawal <moagrawa>
Status: CLOSED ERRATA
QA Contact: Bala Konda Reddy M <bmekala>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.4
CC: amukherj, fbalak, kiyer, mchangir, moagrawa, rhs-bugs, sankarshan, sheggodu, storage-qa-internal, vbellur, vdas
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.4.z Batch Update 3
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.12.2-39
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1667779 (view as bug list)
Environment:
Last Closed: 2019-02-04 07:41:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1667779
Attachments:
  screenshot of RHGSWA host dashboard, with Memory Utilization chart for 7 days (flags: none)

Description Martin Bukatovic 2019-01-17 16:04:37 UTC
Description of problem
======================

On a single machine of a trusted storage pool monitored by RHGSWA, the glusterd
process memory usage grows by about 1.3 GB per day, consuming all available
memory within a few days.

On all other nodes, the memory growth was smaller (about 80 MB/day),
which is within the limits of what has already been reported as BZ 1664046.

Version-Release number of selected component
============================================

GlusterFS:

```
# rpm -qa |grep gluster | sort
glusterfs-3.12.2-36.el7rhgs.x86_64
glusterfs-api-3.12.2-36.el7rhgs.x86_64
glusterfs-cli-3.12.2-36.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-36.el7rhgs.x86_64
glusterfs-events-3.12.2-36.el7rhgs.x86_64
glusterfs-fuse-3.12.2-36.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-36.el7rhgs.x86_64
glusterfs-libs-3.12.2-36.el7rhgs.x86_64
glusterfs-rdma-3.12.2-36.el7rhgs.x86_64
glusterfs-server-3.12.2-36.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-4.5.0-10.el7_6.3.x86_64
python2-gluster-3.12.2-36.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-13.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
```

RHGSWA:

```
# rpm -qa | grep tendrl | sort 
tendrl-collectd-selinux-1.5.4-3.el7rhgs.noarch
tendrl-commons-1.6.3-14.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-13.el7rhgs.noarch
tendrl-node-agent-1.6.3-13.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch
```

How reproducible
================

I don't know. I haven't seen this before and haven't had a chance to reproduce
it again (as I decided to collect data before retrying).

Steps to Reproduce
==================

1. Install and set up an RHGS cluster on 6 machines, with 2 volumes
   (using the standard usmqe configuration).
2. Install RHGSWA on a separate machine and import the trusted storage pool
   into RHGSWA.
3. Mount one volume on a dedicated client machine and fill it completely
   with 10 MB files, then free the space (a client-side sketch follows this list).
4. Leave the cluster operational for a few days.
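
For reference, a client-side sketch of step 3; the volume name, server hostname
and mount point below are placeholders, not the actual test setup:

```
# Mount the volume, fill it with 10 MB files until a write fails
# (volume full), then delete the files to free the space again.
mount -t glusterfs server1:/volume1 /mnt/volume1

i=0
while dd if=/dev/zero of=/mnt/volume1/file_$i bs=1M count=10 2>/dev/null; do
    i=$((i + 1))
done

rm -f /mnt/volume1/file_*
```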

Actual results
==============

The memory usage on one node grows by about 1.3 GB per day.

The machine has 7821 MB of memory, and within one day, memory
consumption jumped from 65 % to 82 % (see the screenshot from the WA
dashboard) => (7821 MB / 1024) * 0.17 ≈ 1.3 GB/day.
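
The growth rate can also be cross-checked independently of the WA dashboard by
sampling the glusterd resident set size; a minimal sketch, assuming procps ps
on the affected node:

```
# Log the glusterd RSS (in KiB) once per hour; the difference between
# samples taken a day apart gives the per-day growth rate directly.
while true; do
    echo "$(date -u '+%F %T') $(ps -o rss= -C glusterd)" >> /var/tmp/glusterd-rss.log
    sleep 3600
done
```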

Expected results
================

The memory utilization doesn't grow that rapidly.

Additional info
===============

Note: running sos report killed all the bricks => the memory was freed and
I was not able to create a proper statedump report.
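
For the next occurrence, a glusterd statedump should be captured before sos
report (or anything else) disturbs the processes; a minimal sketch, assuming
the default statedump directory /var/run/gluster:

```
# SIGUSR1 asks a gluster process to write a statedump; for glusterd the
# dump typically lands in /var/run/gluster as glusterdump.<pid>.dump.<timestamp>.
kill -SIGUSR1 "$(pidof glusterd)"
ls -l /var/run/gluster/
```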

I can't directly confirm that the node was used as the RHGSWA Provisioner Node:
when I tried to find out which node is the provisioner, no machine was assigned.
I will try to confirm this indirectly by checking the logs.

That said, it's possible that the problem is triggered by commands executed by
some RHGSWA component. See the attached cmd_history.log file.
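
A rough way to see what is being driven through the gluster CLI on that node,
assuming the default log location /var/log/glusterfs/cmd_history.log and its
usual "[timestamp] : command : status" line format:

```
# Commands per minute (timestamp prefix truncated to minute granularity),
# most recent minutes last.
cut -c1-17 /var/log/glusterfs/cmd_history.log | uniq -c | tail -n 20

# Most frequent commands overall.
awk -F' : ' '{print $2}' /var/log/glusterfs/cmd_history.log | sort | uniq -c | sort -rn | head
```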

Other memory leak BZs
=====================

At the time of reporting this bug, the following memory leak bugs were open:

* Bug 1651915
* Bug 1664046

Since I'm not sure about the reproducer, I list these bugs here as they could be
related. That said, I noticed enough differences between my case and the already
reported bugs that I created a separate bug:

* Compared to BZ 1664046, the memory growth is much faster. I see 1.3 GB/day,
  while in BZ 1664046, the rate is about 100 MB/day. Moreover, I see it on a
  single node (out of 6) only, while in BZ 1664046, all storage machines are
  affected.

* Compared to BZ 1651915, I see no "volume status" commands in cmd_history.log
  (see the check sketched after this list). The growth rate also differs, but
  the difference could be caused by differences in workloads.
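
The "volume status" check mentioned in the last point is roughly the following
(log path is the RHGS default):

```
# Count "volume status" invocations recorded by glusterd on this node;
# here the log contained none.
grep -c 'volume status' /var/log/glusterfs/cmd_history.log
```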

Comment 2 Martin Bukatovic 2019-01-17 16:07:59 UTC
Created attachment 1521305 [details]
screenshot of RHGSWA host dashboard, with Memory Utilization chart for 7 days

Comment 8 Atin Mukherjee 2019-01-21 04:18:36 UTC
*** Bug 1664046 has been marked as a duplicate of this bug. ***

Comment 26 errata-xmlrpc 2019-02-04 07:41:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0263