Bug 1667169 - glusterd leaks about 1GB memory per day on single machine of storage pool
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterd
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.4.z Batch Update 3
Assignee: Mohit Agrawal
QA Contact: Bala Konda Reddy M
URL:
Whiteboard:
Duplicates: 1664046 (view as bug list)
Depends On:
Blocks: 1667779
 
Reported: 2019-01-17 16:04 UTC by Martin Bukatovic
Modified: 2019-02-07 09:57 UTC (History)
11 users

Fixed In Version: glusterfs-3.12.2-39
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1667779 (view as bug list)
Environment:
Last Closed: 2019-02-04 07:41:44 UTC
Target Upstream Version:


Attachments (Terms of Use)
screenshot of RHGSWA host dashboard, with Memory Utilization chart for 7 days (281.49 KB, image/png)
2019-01-17 16:07 UTC, Martin Bukatovic


Links
Red Hat Product Errata RHBA-2019:0263 (last updated 2019-02-04 07:41:49 UTC)

Description Martin Bukatovic 2019-01-17 16:04:37 UTC
Description of problem
======================

On a single machine of a trusted storage pool monitored by RHGSWA, memory
usage of the glusterd process grows by about 1.3 GB per day, consuming all
available memory within a few days.
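For reference, the growth is also visible in glusterd's resident set size. A minimal sketch of how RSS can be sampled over time (assuming `ps` from procps is available; the helper name, interval, and log file are arbitrary, not part of the original setup):

```shell
# Report the resident set size (RSS, in kB) of a process by pid.
rss_kb() {
    ps -o rss= -p "$1" | tr -d ' '
}

# Example: log glusterd's RSS once a minute (pid lookup via pgrep assumed):
#   while sleep 60; do
#       echo "$(date -u +%FT%TZ) $(rss_kb "$(pgrep -x glusterd)")" >> glusterd-rss.log
#   done
echo "RSS of this shell: $(rss_kb $$) kB"
```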

On all other nodes, the memory growth was smaller (about 80 MB/day),
which is within the limits already reported as BZ 1664046.

Version-Release number of selected component
============================================

GlusterFS:

```
# rpm -qa |grep gluster | sort
glusterfs-3.12.2-36.el7rhgs.x86_64
glusterfs-api-3.12.2-36.el7rhgs.x86_64
glusterfs-cli-3.12.2-36.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-36.el7rhgs.x86_64
glusterfs-events-3.12.2-36.el7rhgs.x86_64
glusterfs-fuse-3.12.2-36.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-36.el7rhgs.x86_64
glusterfs-libs-3.12.2-36.el7rhgs.x86_64
glusterfs-rdma-3.12.2-36.el7rhgs.x86_64
glusterfs-server-3.12.2-36.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libvirt-daemon-driver-storage-gluster-4.5.0-10.el7_6.3.x86_64
python2-gluster-3.12.2-36.el7rhgs.x86_64
tendrl-gluster-integration-1.6.3-13.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
```

RHGSWA:

```
# rpm -qa | grep tendrl | sort 
tendrl-collectd-selinux-1.5.4-3.el7rhgs.noarch
tendrl-commons-1.6.3-14.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-13.el7rhgs.noarch
tendrl-node-agent-1.6.3-13.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch
```

How reproducible
================

I don't know. I haven't seen this before and haven't had a chance to
reproduce it again (I decided to collect data before retrying).

Steps to Reproduce
==================

1. Install and set up an RHGS cluster on 6 machines, with 2 volumes
   (using the standard usmqe configuration).
2. Install RHGSWA on a separate machine and import the trusted storage pool
   into RHGSWA.
3. Mount one volume on a dedicated client machine, fill it completely
   with 10 MB files, then free the space.
4. Leave the cluster operational for a few days.
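Step 3 can be sketched as below. The mount point `$MNT` and the `$MAX` safety cap are hypothetical placeholders; on the test setup the loop simply ran until `dd` failed with ENOSPC:

```shell
# Fill the mounted volume with 10 MB files until it is full, then free the
# space again. MNT and MAX are placeholders, not part of the original setup.
MNT=${MNT:-/mnt/testvol}   # hypothetical mount point of the gluster volume
MAX=${MAX:-100000}         # safety cap on the number of files
i=0
while [ "$i" -lt "$MAX" ] && \
      dd if=/dev/zero of="$MNT/file_$i" bs=1M count=10 2>/dev/null; do
    i=$((i + 1))
done
echo "wrote $i files of 10 MB each"
rm -f "$MNT"/file_*        # free the space again
```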

Actual results
==============

The memory usage on one node grows at about 1.3 GB per day.

The machine has 7821 MB of memory, and within one day, memory
consumption jumped from 65 % to 82 % (see the screenshot from the WA
dashboard) => (7821/2**10) * 0.17 = 1.3 GB/day
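The back-of-the-envelope calculation above can be reproduced directly; a sketch, with the two inputs taken from the dashboard reading:

```shell
total_mb=7821   # host memory as reported by the dashboard
delta_pct=17    # utilization went from 65 % to 82 % within one day
awk -v t="$total_mb" -v d="$delta_pct" \
    'BEGIN { printf "estimated leak: %.1f GB/day\n", (t / 1024) * (d / 100) }'
# -> estimated leak: 1.3 GB/day
```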

Expected results
================

The memory utilization doesn't grow that rapidly.

Additional info
===============

Note: running sos report killed all the bricks => memory was freed and I was
not able to create a proper statedump report.
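For reference, a statedump can be requested from glusterd with SIGUSR1 before the memory is freed (glusterfs processes write their statedump on that signal, by default under /var/run/gluster). A minimal sketch; `dump_glusterd` is a hypothetical helper, not an existing command:

```shell
# Ask glusterd to write a statedump via SIGUSR1 (assumes pgrep/procps and a
# running glusterd; the dump lands under /var/run/gluster by default).
dump_glusterd() {
    pid=$(pgrep -x glusterd | head -n 1)
    if [ -z "$pid" ]; then
        echo "glusterd is not running"
        return 1
    fi
    kill -USR1 "$pid" && echo "statedump requested for pid $pid"
}
dump_glusterd || true
```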

I can't directly confirm that the node was used as the RHGSWA Provisioner
Node: when I tried to find out which node is the provisioner, no machine was
assigned. I will try to confirm this indirectly by checking the logs.

That said, it's possible that the problem is triggered by commands executed by
some RHGSWA component. See attached cmd_history.log file.

Other memory leak BZs
=====================

At the time of reporting this bug, the following memory leak bugs were open:

* Bug 1651915
* Bug 1664046

Since I'm not sure about the reproducer, I list these bugs here as they could
be related. That said, I noticed enough differences in my case compared to
the already reported bugs that I created a separate bug:

* Compared to BZ 1664046, the memory growth is much faster. I see 1.3 GB/day,
  while in BZ 1664046 the rate is about 100 MB/day. Moreover, I see it on a
  single node (out of 6) only, while in BZ 1664046 all storage machines are
  affected.

* Compared to BZ 1651915, I see no "volume status" commands in
  cmd_history.log. The growth rate also differs, but that could be caused by
  differences in workloads.
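The "no volume status commands" observation can be checked against the attached log; a sketch, assuming the default glusterd log location:

```shell
# Count "volume status" invocations in glusterd's command history log.
# The path is the default location; adjust if glusterd logs elsewhere.
LOG=${LOG:-/var/log/glusterfs/cmd_history.log}
if [ -f "$LOG" ]; then
    grep -c 'volume status' "$LOG" || true   # grep -c exits 1 on zero matches
else
    echo "no cmd_history.log at $LOG"
fi
```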

Comment 2 Martin Bukatovic 2019-01-17 16:07:59 UTC
Created attachment 1521305 [details]
screenshot of RHGSWA host dashboard, with Memory Utilization chart for 7 days

Comment 8 Atin Mukherjee 2019-01-21 04:18:36 UTC
*** Bug 1664046 has been marked as a duplicate of this bug. ***

Comment 26 errata-xmlrpc 2019-02-04 07:41:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0263

