Bug 1380249 - Huge memory usage of FUSE client
Summary: Huge memory usage of FUSE client
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: fuse
Version: mainline
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Nithya Balachandran
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1369364
 
Reported: 2016-09-29 05:47 UTC by Nithya Balachandran
Modified: 2020-01-17 15:58 UTC
CC List: 8 users

Fixed In Version: glusterfs-3.10.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1369364
Environment:
Last Closed: 2017-03-06 17:28:10 UTC
Regression: ---
Mount Type: fuse
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments

Description Nithya Balachandran 2016-09-29 05:47:55 UTC
+++ This bug was initially created as a clone of Bug #1369364 +++

Description of problem:

FUSE client consumes lots of memory for volume with lots of small files (mailboxes storage for dovecot).

Here are volume options:

===
Type: Distributed-Replicate
...
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
...
network.ping-timeout: 10
features.cache-invalidation: on
performance.cache-size: 16777216
performance.cache-max-file-size: 1048576
performance.io-thread-count: 4
performance.write-behind: on
performance.flush-behind: on
performance.read-ahead: on
performance.quick-read: on
performance.stat-prefetch: on
performance.write-behind-window-size: 2097152
storage.linux-aio: on
performance.client-io-threads: off
server.event-threads: 8
network.inode-lru-limit: 4096
client.event-threads: 4
cluster.readdir-optimize: on
cluster.lookup-optimize: on
performance.readdir-ahead: on
nfs.disable: off
cluster.data-self-heal: off
cluster.metadata-self-heal: off
cluster.entry-self-heal: off
===

Here goes memory usage before drop_caches:

===
root      1049  0.0 40.1 1850040 1170912 ?     S<sl Aug19   0:28 /usr/sbin/glusterfs --fopen-keep-cache --direct-io-mode=disable --volfile-server=glusterfs.server.com --volfile-id=mail_boxes /var/spool/mail/virtual
===

And after drop_caches:

===
root      1049  0.0 40.1 1850040 1170912 ?     S<sl Aug19   0:28 /usr/sbin/glusterfs --fopen-keep-cache --direct-io-mode=disable --volfile-server=glusterfs.server.com --volfile-id=mail_boxes /var/spool/mail/virtual
===

Nothing changes after drop_caches.
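
(For context: dropping the kernel page/dentry/inode caches, presumably what was done between the two samples above, is usually performed like this on Linux:)

===
sync
echo 3 > /proc/sys/vm/drop_caches    # 3 = drop page cache plus dentries and inodes
===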

pmap of PID 1049 is attached. I see lots of suspicious entries:

===
00007fee74000000  65536K rw---   [ anon ]
===

Are those related to the I/O translator? What else could consume 64M at once?

Also, attaching statedumps before drop_caches and after drop_caches (almost no difference between them).
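
(For anyone reproducing this: a client statedump is typically taken by sending SIGUSR1 to the glusterfs process; the dump is written under /var/run/gluster/ by default, although the exact path may differ by version. <PID> is a placeholder.)

===
kill -USR1 <PID>    # ask the glusterfs client process to write a statedump
===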

Version-Release number of selected component (if applicable):

GlusterFS v3.7.14 + following patches (all of them are already merged for 3.7.15 release):

===
Aravinda VK (1):
      packaging: Remove ".py" extension from symlink target

Atin Mukherjee (1):
      rpc : build_prog_details should iterate program list inside critical section

Jiffin Tony Thottan (2):
      gfapi : Avoid double freeing of dict in glfs_*_*getxattr
      xlator/trash : append '/' at the end in trash_notify_lookup_cbk

Raghavendra G (2):
      libglusterfs/client_t: Dump the 0th client too
      storage/posix: fix inode leaks

Soumya Koduri (2):
      glfs/upcall: entries should be removed under mutex lock
      gfapi/upcall: Fix a ref leak

Susant Palai (1):
      posix: fix posix_fgetxattr to return the correct error
===

How reproducible:

Always.

Steps to Reproduce:
1. mount volume with lots of small files (mail boxes);
2. use them
3. ...
4. PROFIT!

Actual results:

Client memory leaks.

Expected results:

Client memory does not leak :).

Additional info:

Feel free to ask me for additional info.

--- Additional comment from Oleksandr Natalenko on 2016-08-25 11:03:23 EDT ---

http://termbin.com/66ud — the script I use trying to reproduce the leak.

Observing the following in pmap:

===
00007fe430000000    132K rw---   [ anon ]
00007fe430021000  65404K -----   [ anon ]
00007fe438000000   6564K rw---   [ anon ] ← this value increases over time...
00007fe438669000  58972K -----   [ anon ] ← ...while this value decreases over time
00007fe43c000000    132K rw---   [ anon ]
00007fe43c021000  65404K -----   [ anon ]
00007fe440000000   6800K rw---   [ anon ]
00007fe4406a4000  58736K -----   [ anon ]
00007fe44552c000      4K -----   [ anon ]
00007fe44552d000   8192K rw---   [ anon ]
00007fe445d2d000      4K -----   [ anon ]
00007fe445d2e000  16908K rw---   [ anon ]
00007fe446db1000      4K -----   [ anon ]
00007fe446db2000  18744K rw---   [ anon ]
00007fe448000000   9068K rw---   [ anon ]
00007fe4488db000  56468K -----   [ anon ]
00007fe44c027000  29424K rw---   [ anon ]
00007fe44dce3000     52K r-x-- meta.so
00007fe44dcf0000   2044K ----- meta.so
...snip...
===

Other anonymous allocations change over time as well; I have pointed out only one pair showing a similar pattern.
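
(A simple way, not part of the original report, to watch those anonymous regions over time; <PID> is a placeholder for the glusterfs client PID:)

===
while sleep 60; do
    date
    pmap <PID> | awk '/anon/ { sum += $2 } END { print sum " KB of anonymous mappings" }'
done
===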

--- Additional comment from Kyle Johnson on 2016-08-25 15:39:28 EDT ---

I may also be experiencing this bug.

Distributed volume w/ 3.7.14, no patches.  One server is FreeBSD 10.x, the other is CentOS 6.  Both bricks are ZFS backed.

===
[root@colossus2 ~]# gluster volume info ftp
 
Volume Name: ftp
Type: Distribute
Volume ID: f3f2b222-575c-4c8d-92f1-e640fd7edfbb
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 192.168.110.1:/tank/bricks/ftp
Brick2: 192.168.110.2:/ftp/bricks/ftp
Options Reconfigured:
performance.client-io-threads: on
cluster.weighted-rebalance: off
performance.readdir-ahead: on
nfs.disable: on
===

After approximately 24 hours of heavy reading on the client (copying 93728 files totaling 1.4TB out of the FUSE mount point to a non-gluster location elsewhere on the system), I found that the glusterfs process was consuming all system memory, including swap (approximately 34GB in total). Remounting the FUSE volume freed the memory.

Let me know if you need anything else.

--- Additional comment from Kyle Johnson on 2016-08-25 15:43:04 EDT ---

Oops.  I am on 3.7.13.  Fat fingers.

--- Additional comment from Oleksandr Natalenko on 2016-08-26 03:33:50 EDT ---

I've run another test: http://termbin.com/cgzj

The idea is to split the read/write workload across two clients. I'm attaching statedumps of both the reader and the writer.

Unfortunately, it seems that memory accounting for the FUSE translator suffers from an integer overflow (underflow?), because I get something like this:

===
[mount/fuse.fuse - usage-type gf_fuse_mt_iov_base memusage]
size=4262955128
num_allocs=1
max_size=4294967088
max_num_allocs=3
total_allocs=18389093
===

Another observation: the reader consumes more memory than the writer. Given that cache-size is 32M (and even if it is used across 5 xlators), total RAM consumption should not exceed 160-200M. However, reader RAM:

===
root     30865  8.7  3.5 1054244 433676 ?      Ssl  Aug25  61:01 /usr/sbin/glusterfs --volfile-server=gluster.server.com --volfile-id=test /mnt/net/glusterfs/test2
===

and writer RAM:

===
root     30796  3.3  2.0 722464 249352 ?       Ssl  Aug25  23:11 /usr/sbin/glusterfs --volfile-server=gluster.server.com --volfile-id=test /mnt/net/glusterfs/test1
===

--- Additional comment from Yannick Perret on 2016-08-29 12:35:51 EDT ---

I also hit the same bug with versions 3.7.13 and 3.7.14 (compiled from sources, no patches) and 3.8.1 (using the package from the Gluster download page).

Tested on an up-to-date Debian 8.

Same observations: memory keeps growing on any operation (recursive listing, reads, writes), even when reading the same files over and over, and it is not released even after days of inactivity.
Memory can grow until the OOM killer kicks in (more than 6GB by the time I let it get to that point).

--- Additional comment from Yannick Perret on 2016-08-29 13:46:30 EDT ---

Some memory data; tests were performed on Gluster 3.8.1 (both servers and the client are up-to-date 64-bit Debian 8 with default kernel options; the test volume is a replica-2 volume without any tuning; the mount is performed using "mount -t glusterfs" without any options):
What I did:
1. mounting the volume, removing all its content, umounting it
2. mounting the volume
3. performing a cp -Rp /usr/* /root/MNT
4. performing a rm -rf /root/MNT/*
5. re-doing 3, 4

VSZ/RSS of the gluster client process are, respectively:
- 381896 / 35688 just after mount
- 644040 / 309240 after 1st cp -Rp
- 644040 / 310128 after 1st rm -rf
- 709576 / 310128 after 1st kill -USR1 (it was just for testing)
- 840648 / 421964 after 2nd cp -Rp
- 840648 / 422224 after 2nd rm -rf

I created a small script to perform these operations in a loop (the ps call below is one way to capture VSZ/RSS of the glusterfs process; the exact command used was not given in the original note):
while /bin/true
do
  cp -Rp /usr/* /root/MNT/
  ps -o vsz=,rss= -C glusterfs   # get VSZ/RSS of glusterfs process
  rm -rf /root/MNT/*
  ps -o vsz=,rss= -C glusterfs   # get VSZ/RSS of glusterfs process
done

It gives:
1037256 645500
1037256 645840
1168328 757348
1168328 757620
1299400 869128
1299400 869328
1364936 980712
1364936 980944
1496008 1092384
1496008 1092404
1627080 1203796
1627080 1203996
1692616 1315572
1692616 1315504
1823688 1426812
1823688 1427340
1954760 1538716
1954760 1538772
2085832 1647676
2085832 1647708
2151368 1750392
2151368 1750708
2282440 1853864
2282440 1853764
2413512 1952668
2413512 1952704
2479048 2056500
2479048 2056712

I let it keep running and the gluster client ended up being killed by the OOM killer:
[1246854.291996] Out of memory: Kill process 931 (glusterfs) score 742 or sacrifice child
[1246854.292102] Killed process 931 (glusterfs) total-vm:3527624kB, anon-rss:3100328kB, file-rss:0kB

So on this machine, ~3.5GB of VSZ.

--- Additional comment from Oleksandr Natalenko on 2016-08-30 04:09:17 EDT ---

Attached Valgrind's Memcheck tool output for reading workload.

--- Additional comment from Nithya Balachandran on 2016-08-30 05:00:28 EDT ---

Thank you. I will take a look at this.

--- Additional comment from Oleksandr Natalenko on 2016-08-30 05:03:17 EDT ---

Nithya, please take a look at my message to the mailing list regarding the Massif tool; I'd like to profile the GlusterFS client with it as well but have run into some trouble [1] [2].

[1] https://www.gluster.org/pipermail/gluster-devel/2016-August/050665.html

[2] https://www.gluster.org/pipermail/gluster-devel/2016-August/050666.html

--- Additional comment from Oleksandr Natalenko on 2016-09-02 06:30:35 EDT ---

I've tried different kinds of workload and rechecked with 3.7.15, and it seems that the leak is reproduced reliably with the following command on a volume with lots of files:

find /mnt/volume -type f -print0 | xargs -0 stat

In half an hour RSS went from 150M to 1G.
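
(As a side note, the RSS growth during such a run can be watched with something like the following; this is not from the original comment:)

===
watch -n 10 'ps -o vsz=,rss= -C glusterfs'
===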

Valgrind's Memcheck shows nothing. I still need some help on Massif tool.

--- Additional comment from Oleksandr Natalenko on 2016-09-07 07:20:28 EDT ---

Okay, I've finally figured out how to use the Massif tool, and here are the results.

The volume layout and options are the same as given in this BZ description. After mounting, RSS is 90M; after running this command for some time (there are millions of files):

===
find /mnt/net/glusterfs/test -print0 | xargs -0 stat >/dev/null
===

RSS grows to 900M (and counting). VSZ also grows from ~400M to ~1200M (and counting). Performing drop_caches lowers RSS/VSZ usage only by ~100M.

There are 3 Massif logs attached; the main one is for PID 23664. I am also attaching a client statedump.

drop_caches was called before taking the statedump and unmounting the volume.

--- Additional comment from Pranith Kumar K on 2016-09-07 14:40:22 EDT ---

Oleksandr,
     As per the massif tool output, the heap size is only ~113MB:

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 89 500,793,720,910      112,228,416      109,224,305     3,004,111            0
 90 504,760,946,959      113,760,568      110,649,152     3,111,416            0

I also went through the major stack traces that are active during the last snapshot. Nothing seems new; most of them are related to the last inode allocations.
But you were saying the RSS was around 900M; I am wondering whether you captured this at the same time or not.

Pranith

--- Additional comment from Oleksandr Natalenko on 2016-09-08 03:11:06 EDT ---

Re-ran profiling with --pages-as-heap=yes, as requested by Pranith in #gluster-dev.

===
root     10696 15.2  7.9 1557544 969640 pts/0  Sl+  Sep07  95:47 valgrind --tool=massif --pages-as-heap=yes --run-libc-freeres=no --trace-children=yes --sim-hints=fuse-compatible /usr/sbin/glusterfs -N --volfile-server=glusterfs.example.com --volfile-id=some_volume /mnt/net/glusterfs/test
===

Main PID is 10696, attaching Massif output and statedump.

--- Additional comment from Oleksandr Natalenko on 2016-09-08 03:23:18 EDT ---

Pranith, as you can see in #c23 and #c24, all the memory now shows up in the statistics:

===
--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 50 1,164,197,215,106    1,481,912,320    1,481,912,320             0            0
 51 1,166,294,532,574    1,481,912,320    1,481,912,320             0            0
 52 1,168,332,670,161    1,481,912,320    1,481,912,320             0            0
 53 1,176,305,259,696    1,549,017,088    1,549,017,088             0            0
 54 1,176,353,806,434    1,549,021,184    1,549,021,184             0            0
===
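
(For readers following along: tables like the one above come from ms_print, which ships with Valgrind. Assuming the default output file name for the PID mentioned above, the full report can be viewed with:)

===
ms_print massif.out.10696 | less
===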

--- Additional comment from Nithya Balachandran on 2016-09-09 11:53:51 EDT ---

I can reproduce the memory rise but statedumps don't seem to contain anything useful. I have started looking through the Massif files and will continue to run the tests to see what I can find.

--- Additional comment from Nithya Balachandran on 2016-09-12 12:20:38 EDT ---

Status so far:

I don't see anything in the Massif logs that pinpoints a leak. The quick-read translator was one candidate to consider, but I can see the leak even with it disabled.

I am trying the following:

Created a single-brick distribute volume with default options, except for quick-read, which I have disabled.
Copying 100000 zero-byte files to the gluster FUSE mount: RSS goes from 30m to 69m.
Running rm -rf *: RSS jumps to 104m but then comes down to 74m once the operation is complete.
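
(A minimal sketch of that reproduction; the mount point below is a placeholder, not taken from the comment:)

===
cd /mnt/fuse-test                                # gluster FUSE mount point (placeholder)
for i in $(seq 1 100000); do : > f_$i; done      # create 100000 zero-byte files
ps -o vsz=,rss= -C glusterfs                     # RSS went from ~30m to ~69m in the test above
rm -rf ./*
ps -o vsz=,rss= -C glusterfs                     # RSS peaked (~104m) and settled back (~74m)
===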

I will take another look at the readdirp code.

--- Additional comment from Nithya Balachandran on 2016-09-20 23:24:40 EDT ---

Just letting you know I am still looking into this but have not had much luck yet.
Will update when I have something.

--- Additional comment from Nithya Balachandran on 2016-09-28 00:26:39 EDT ---

Valgrind etc. doesn't show any memory leaks so far, but I do see the memory usage rising. I am wondering if this could be due to memory fragmentation rather than memory leaks.

Still looking. All ideas welcome.
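
(One way, not suggested in the thread, to test a glibc fragmentation hypothesis is to cap the number of malloc arenas, since 64MB-aligned anonymous regions like the ones in the earlier pmap output are what per-thread glibc arenas typically look like. The volfile server, volume name and mount point below are placeholders:)

===
MALLOC_ARENA_MAX=1 /usr/sbin/glusterfs --volfile-server=<server> --volfile-id=<volume> <mountpoint>
===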

--- Additional comment from Worker Ant on 2016-09-28 12:12:23 EDT ---

REVIEW: http://review.gluster.org/15593 (protocol/client: reduce memory usage) posted (#1) for review on master by N Balachandran (nbalacha)

--- Additional comment from Nithya Balachandran on 2016-09-29 01:19:25 EDT ---

Oleksandr,

Do you see a difference if you turn off performance.readdir-ahead? Apparently this is a known memory hog.
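
(For reference, the option can be toggled with the standard volume-set command; <VOLNAME> is a placeholder:)

===
gluster volume set <VOLNAME> performance.readdir-ahead off
===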

Still looking into how to improve things.

Comment 1 Worker Ant 2016-10-14 09:34:41 UTC
REVIEW: http://review.gluster.org/15593 (protocol/client: reduce memory usage) posted (#2) for review on master by N Balachandran (nbalacha)

Comment 2 Worker Ant 2016-10-25 06:08:50 UTC
COMMIT: http://review.gluster.org/15593 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit bca6d0ba54d12d389cfb5c2b37fb8cc12a7e044b
Author: N Balachandran <nbalacha>
Date:   Wed Sep 28 21:38:57 2016 +0530

    protocol/client: reduce memory usage
    
    readdirp calls use a lot of memory in case
    of a large number of files. The dict->extra_free
    is not used here so free buf immediately.
    
    Change-Id: I097f5dde2df471f5834264152711110a3bdb7e9a
    BUG: 1380249
    Signed-off-by: N Balachandran <nbalacha>
    Reviewed-on: http://review.gluster.org/15593
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>

Comment 3 Nithya Balachandran 2016-10-25 06:35:49 UTC
(In reply to Worker Ant from comment #2)
> COMMIT: http://review.gluster.org/15593 committed in master by Pranith Kumar
> Karampuri (pkarampu) 
> ------
> commit bca6d0ba54d12d389cfb5c2b37fb8cc12a7e044b
> Author: N Balachandran <nbalacha>
> Date:   Wed Sep 28 21:38:57 2016 +0530
> 
>     protocol/client: reduce memory usage
>     
>     readdirp calls use a lot of memory in case
>     of a large number of files. The dict->extra_free
>     is not used here so free buf immediately.
>     
>     Change-Id: I097f5dde2df471f5834264152711110a3bdb7e9a
>     BUG: 1380249
>     Signed-off-by: N Balachandran <nbalacha>
>     Reviewed-on: http://review.gluster.org/15593
>     NetBSD-regression: NetBSD Build System <jenkins.org>
>     CentOS-regression: Gluster Build System <jenkins.org>
>     Smoke: Gluster Build System <jenkins.org>
>     Reviewed-by: Pranith Kumar Karampuri <pkarampu>


This is an interim fix to reduce the amount of memory that we consume. Some code paths hang on to memory until the very end even though it is no longer used.

There will be more patches posted for this.

As yet, I have not managed to find a memory leak.

Comment 4 Shyamsundar 2017-03-06 17:28:10 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.10.0, please open a new bug report.

glusterfs-3.10.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-February/030119.html
[2] https://www.gluster.org/pipermail/gluster-users/

