Bug 1593884 - glusterfs-fuse 3.12.9/10 high memory consumption
Summary: glusterfs-fuse 3.12.9/10 high memory consumption
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: fuse
Version: 3.12
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Duplicates: 1583502 1589090
Depends On:
Blocks:
 
Reported: 2018-06-21 18:46 UTC by d.webb
Modified: 2018-08-29 11:10 UTC
CC List: 13 users

Fixed In Version: glusterfs-3.12.13
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-08-29 10:59:17 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
Gluster dump of client fuse process (48.81 KB, text/plain)
2018-06-21 18:46 UTC, d.webb

Description d.webb 2018-06-21 18:46:11 UTC
Created attachment 1453582 [details]
Gluster dump of client fuse process

Description of problem:

The gluster-fuse mount process is consuming large amounts of memory over a relatively short period of time (several GB over a day) on a mount with less than 100 MB of data but a lot of churn.

Version-Release number of selected component (if applicable):

# Client side

glusterfs-3.12.10-1.el7.x86_64
glusterfs-client-xlators-3.12.10-1.el7.x86_64
glusterfs-libs-3.12.10-1.el7.x86_64
glusterfs-fuse-3.12.10-1.el7.x86_64

# Server Side:

glusterfs-cli-3.12.10-1.el7.x86_64
glusterfs-3.12.10-1.el7.x86_64
glusterfs-fuse-3.12.10-1.el7.x86_64
glusterfs-libs-3.12.10-1.el7.x86_64
glusterfs-api-3.12.10-1.el7.x86_64
glusterfs-client-xlators-3.12.10-1.el7.x86_64
glusterfs-server-3.12.10-1.el7.x86_64

# Volume setup:

# 3-node, 3-brick replica; holds KahaDB files for ActiveMQ; the mount itself has only a few MB used:

node-001:/gv_activemq  250G   84M  250G   1% /mnt/amq_broker


# vol info:

Volume Name: gv_activemq
Type: Replicate
Volume ID: d3003a1d-b07e-4996-998d-6cbe26a587e2
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: node-001:/opt/gluster_storage/gv_activemq/brick
Brick2: node-002:/opt/gluster_storage/gv_activemq/brick
Brick3: node-003:/opt/gluster_storage/gv_activemq/brick
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
performance.readdir-ahead: off
network.ping-timeout: 5
performance.cache-size: 1GB
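
# For reference, the non-default options above would normally have been applied
# with the gluster CLI; a rough sketch (run on any node in the trusted pool,
# values copied from the vol info output above):

gluster volume set gv_activemq network.ping-timeout 5
gluster volume set gv_activemq performance.cache-size 1GB
gluster volume set gv_activemq performance.readdir-ahead off

# confirm the effective values
gluster volume get gv_activemq all | grep -E 'ping-timeout|cache-size|readdir-ahead'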


How reproducible:

Memory utilisation seems to be a problem in at least 3.12.9 and 3.12.10 (I upgraded from .9 to .10 in the hope that it would fix it). I've got another cluster running 3.12.6 whose client doesn't have the same issue (it runs the other end of this ActiveMQ cluster), so the problem looks to have been introduced somewhere after 3.12.6.

Steps to Reproduce:
1. ? 
2.
3.

Expected results:


Additional info:

# 3.12.6-1 usage on a much busier mount:

43489 root      20   0  681796  53828   4328 S   0.3  0.3   1624:11 glusterfs 
node-001:/gv_amq_broker  200G   36G  165G  18% /opt/amq_broker
                                                                                                  

# 3.12.9/10 on a similar mount with less traffic:

48376 root      20   0 5164844 4.038g   4460 S   5.4 26.0  32:11.30 glusterfs                                                                                                   

node-001:/gv_activemq  250G   84M  250G   1% /mnt/amq_broker
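
# For anyone wanting to track the growth over time, a minimal sketch that
# samples the RSS of the fuse client; the pgrep pattern, log path and interval
# are examples, not taken from this report:

PID=$(pgrep -f 'glusterfs.*gv_activemq' | head -n1)
while true; do
    echo "$(date -u +%FT%TZ) $(ps -o rss=,vsz= -p "$PID")" >> /var/tmp/glusterfs-rss.log
    sleep 300
done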

Comment 1 d.webb 2018-06-21 18:51:49 UTC
Just a comment: the statedump provided was taken after the upgrade from 3.12.9 (the 4G RES figure above is from 3.12.9).

the memory footprint for the 3.12.10 version was:

 8499 root      20   0 2123592 1.436g   4124 S   0.0  9.3  18:56.34 glusterfs                                                                                                   

when the statedump was taken.
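
For completeness, a fuse-client statedump is normally triggered by sending SIGUSR1 to the client process; a sketch (the PID is the one from the top output above, and /var/run/gluster is the usual default dump location):

# trigger a statedump of the fuse client
kill -USR1 8499
# the dump should land under /var/run/gluster/ as glusterdump.<pid>.dump.<timestamp>
ls -lt /var/run/gluster/glusterdump.*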

Comment 2 Marcus Calverley 2018-06-25 20:35:14 UTC
I have the same issue with steadily increasing client memory usage on 3.12.9 under read/write load. I'm using GlusterFS as oVirt VM storage, so I tried switching oVirt to libgfapi instead of the fuse mounts, but that just moved the memory leak to the individual qemu processes instead of keeping it centralised in the fuse mount processes. That also confirmed that it is mostly database servers that trigger the issue (the oVirt hosted engine, and some other PostgreSQL servers I'm running that see a lot of traffic).

I'm currently working around the issue by migrating VMs manually between servers in the cluster, but it's annoying to have to do that twice a day to keep the servers from hitting 100% memory and having VMs killed. I'm surprised there hasn't been more discussion of this issue; is there some setting we have that's causing it?

The engine gluster volume configuration:
Volume Name: engine
Type: Replicate
Volume ID: 84f29251-619b-493c-ae1c-7da0fd27a8c1
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 5 = 5
Transport-type: tcp
Bricks:
Brick1: gluster1.management:/gluster/sda/engine
Brick2: gluster2.management:/gluster/sda/engine
Brick3: gluster3.management:/gluster/sda/engine
Brick4: gluster4.management:/gluster/sda/engine
Brick5: gluster5.management:/gluster/sdb/engine
Options Reconfigured:
server.allow-insecure: on
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
network.ping-timeout: 10
storage.owner-uid: 36
storage.owner-gid: 36
performance.flush-behind: on
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
nfs.disable: on
transport.address-family: inet
cluster.server-quorum-ratio: 51%

Comment 3 d.webb 2018-06-26 11:06:54 UTC
We've migrated the affected cluster to the new 4.1 release and it seems to have "fixed" the issue. The 3.12.9/10 releases were unusable for us, as we risked OOMing if we left the mount running too long.

Comment 4 Amar Tumballi 2018-06-26 12:03:01 UTC
Hi, thanks for the update. This helps us narrow down the specific patch, so hopefully we can fix the 3.12.x branch.

Comment 5 Marcus Calverley 2018-06-30 03:16:53 UTC
After downgrading to 3.12.6, it seems the issue is no longer present. Downgrading to 3.12.8 did not help. It doesn't look like 3.12.7 got to the CentOS repos, so I haven't tested that version.
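
For anyone else wanting to test this, a rough sketch of the downgrade on a CentOS 7 client, assuming the older builds are still available in the configured gluster repo (the package list and unmount step are examples, adjust to your setup):

# stop whatever is using the mount, then unmount it first
umount /the/gluster/mount            # example path
yum downgrade glusterfs-3.12.6 glusterfs-fuse-3.12.6 glusterfs-libs-3.12.6 glusterfs-client-xlators-3.12.6
mount -a                             # remount so the downgraded fuse client is actually used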

Comment 6 Rob Sanders 2018-07-16 08:17:51 UTC
I'm running an oVirt cluster with 3 nodes, and after around a week of use the hosted_storage gluster mount consumes around 16 GB of memory on each host.

Comment 7 Oskar Stolc 2018-07-30 13:20:53 UTC
The same is happening on Debian 9 with GlusterFS 3.12.12 - the RSS of the glusterfs process (fuse client) rises steadily until the server runs out of memory.

I tried downgrading both glusterfs-server on the servers and glusterfs-client on the client to 3.12.11, then 3.12.6, 3.12.5, 3.12.4 and 3.12.3, but none of these worked.

Currently GlusterFS 3.12.x is unusable on Debian 9.

Comment 8 Alex 2018-08-07 18:07:10 UTC
I've opened a backport request here, as it seems a commit has been made to fix this but it isn't in the current 3.12.x release:

https://bugzilla.redhat.com/show_bug.cgi?id=1613512

Comment 9 Amar Tumballi 2018-08-29 10:59:17 UTC
This is fixed with 3.12.13 now.
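
For clients already on the 3.12 series, picking up the fix should just be a package update and a remount; a minimal sketch for an EL7 client (mount point taken from the report above, repo assumed to already track the 3.12 series):

yum update 'glusterfs*'                              # pulls in 3.12.13 or later from the gluster repo
umount /mnt/amq_broker && mount /mnt/amq_broker      # remount so the updated fuse client binary is used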

Comment 10 Amar Tumballi 2018-08-29 10:59:42 UTC
*** Bug 1583502 has been marked as a duplicate of this bug. ***

Comment 11 Amar Tumballi 2018-08-29 11:10:02 UTC
*** Bug 1589090 has been marked as a duplicate of this bug. ***

