Bug 1369364

Summary: Huge memory usage of FUSE client
Product: [Community] GlusterFS Reporter: Oleksandr Natalenko <oleksandr>
Component: fuse    Assignee: Nithya Balachandran <nbalacha>
Status: CLOSED EOL QA Contact:
Severity: medium Docs Contact:
Priority: medium    
Version: 3.7.15    CC: bugs, geert.nijpels, glush, jbyers, kjohnson, moagrawa, nbalacha, oleksandr, olim, onatalen, pkarampu, yannick.perret
Target Milestone: ---    Keywords: Triaged
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1380249 1558379    Environment:
Last Closed: 2017-03-08 11:02:07 UTC Type: Bug
Regression: --- Mount Type: fuse
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1380249    
Bug Blocks:    
Attachments (all flags: none):
  pmap
  Statedump before drop_caches
  Statedump after drop_caches
  Statedump (reader)
  Statedump (writer)
  Valgrind Memcheck results
  Massif — Statedump for main PID
  Massif — log for main PID
  Massif — log for PID 23668
  Massif — log for PID 23669
  Massif — log for main PID 10696
  Massif — Statedump for main PID 10696
  Massif log visualization

Description Oleksandr Natalenko 2016-08-23 08:29:47 UTC
Description of problem:

FUSE client consumes lots of memory for a volume with lots of small files (mailbox storage for dovecot).

Here are the volume options:

===
Type: Distributed-Replicate
...
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
...
network.ping-timeout: 10
features.cache-invalidation: on
performance.cache-size: 16777216
performance.cache-max-file-size: 1048576
performance.io-thread-count: 4
performance.write-behind: on
performance.flush-behind: on
performance.read-ahead: on
performance.quick-read: on
performance.stat-prefetch: on
performance.write-behind-window-size: 2097152
storage.linux-aio: on
performance.client-io-threads: off
server.event-threads: 8
network.inode-lru-limit: 4096
client.event-threads: 4
cluster.readdir-optimize: on
cluster.lookup-optimize: on
performance.readdir-ahead: on
nfs.disable: off
cluster.data-self-heal: off
cluster.metadata-self-heal: off
cluster.entry-self-heal: off
===

Here is the memory usage before drop_caches:

===
root      1049  0.0 40.1 1850040 1170912 ?     S<sl Aug19   0:28 /usr/sbin/glusterfs --fopen-keep-cache --direct-io-mode=disable --volfile-server=glusterfs.server.com --volfile-id=mail_boxes /var/spool/mail/virtual
===

And after drop_caches:

===
root      1049  0.0 40.1 1850040 1170912 ?     S<sl Aug19   0:28 /usr/sbin/glusterfs --fopen-keep-cache --direct-io-mode=disable --volfile-server=glusterfs.server.com --volfile-id=mail_boxes /var/spool/mail/virtual
===

Nothing changes after drop_caches.
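
For reference, drop_caches here refers to the usual procedure (a minimal sketch; run as root on the client):

===
sync
echo 3 > /proc/sys/vm/drop_caches
===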

pmap of PID 1049 is attached. I see lots of suspicious entries:

===
00007fee74000000  65536K rw---   [ anon ]
===

Are those related to the I/O translator? What else could consume 64M at once?

Also attaching statedumps taken before and after drop_caches (there is almost no difference between them).

Version-Release number of selected component (if applicable):

GlusterFS v3.7.14 + the following patches (all of them already merged for the 3.7.15 release):

===
Aravinda VK (1):
      packaging: Remove ".py" extension from symlink target

Atin Mukherjee (1):
      rpc : build_prog_details should iterate program list inside critical section

Jiffin Tony Thottan (2):
      gfapi : Avoid double freeing of dict in glfs_*_*getxattr
      xlator/trash : append '/' at the end in trash_notify_lookup_cbk

Raghavendra G (2):
      libglusterfs/client_t: Dump the 0th client too
      storage/posix: fix inode leaks

Soumya Koduri (2):
      glfs/upcall: entries should be removed under mutex lock
      gfapi/upcall: Fix a ref leak

Susant Palai (1):
      posix: fix posix_fgetxattr to return the correct error
===

How reproducible:

Always.

Steps to Reproduce:
1. mount a volume with lots of small files (mailboxes);
2. use them;
3. ...
4. PROFIT!
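
A concrete workload that triggers the growth reliably for me is a recursive stat over the mount (see comment 16 below; a minimal sketch, with an example mount path):

===
find /mnt/volume -type f -print0 | xargs -0 stat >/dev/null
===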

Actual results:

Client memory leaks.

Expected results:

Client memory does not leak :).

Additional info:

Feel free to ask me for additional info.

Comment 1 Oleksandr Natalenko 2016-08-23 08:30:10 UTC
Created attachment 1193257 [details]
pmap

Comment 2 Oleksandr Natalenko 2016-08-23 08:30:33 UTC
Created attachment 1193258 [details]
Statedump before drop_caches

Comment 3 Oleksandr Natalenko 2016-08-23 08:30:55 UTC
Created attachment 1193259 [details]
Statedump after drop_caches

Comment 4 Oleksandr Natalenko 2016-08-25 15:03:23 UTC
http://termbin.com/66ud — the script I use to try to reproduce the leak.

Observing the following in pmap:

===
00007fe430000000    132K rw---   [ anon ]
00007fe430021000  65404K -----   [ anon ]
00007fe438000000   6564K rw---   [ anon ] ← this value increases over time...
00007fe438669000  58972K -----   [ anon ] ← ...while this value decreases over time
00007fe43c000000    132K rw---   [ anon ]
00007fe43c021000  65404K -----   [ anon ]
00007fe440000000   6800K rw---   [ anon ]
00007fe4406a4000  58736K -----   [ anon ]
00007fe44552c000      4K -----   [ anon ]
00007fe44552d000   8192K rw---   [ anon ]
00007fe445d2d000      4K -----   [ anon ]
00007fe445d2e000  16908K rw---   [ anon ]
00007fe446db1000      4K -----   [ anon ]
00007fe446db2000  18744K rw---   [ anon ]
00007fe448000000   9068K rw---   [ anon ]
00007fe4488db000  56468K -----   [ anon ]
00007fe44c027000  29424K rw---   [ anon ]
00007fe44dce3000     52K r-x-- meta.so
00007fe44dcf0000   2044K ----- meta.so
...snip...
===

Other anonymous allocations change over time as well; I have pointed out only one pair that shows this pattern.
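
For reference, one way to watch these regions over time is to sample pmap periodically (a minimal sketch; 1049 is the client PID from the description, substitute the current one):

===
while sleep 60; do
    date
    pmap 1049 | grep anon | sort -hk2 | tail -n 10
done
===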

Comment 5 Kyle Johnson 2016-08-25 19:39:28 UTC
I may also be experiencing this bug.

Distributed volume w/ 3.7.14, no patches.  One server is FreeBSD 10.x, the other is CentOS 6.  Both bricks are ZFS backed.

===
[root@colossus2 ~]# gluster volume info ftp
 
Volume Name: ftp
Type: Distribute
Volume ID: f3f2b222-575c-4c8d-92f1-e640fd7edfbb
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 192.168.110.1:/tank/bricks/ftp
Brick2: 192.168.110.2:/ftp/bricks/ftp
Options Reconfigured:
performance.client-io-threads: on
cluster.weighted-rebalance: off
performance.readdir-ahead: on
nfs.disable: on
===

After approximately 24 hours of heavy reading on the client (copying 93728 files totaling 1.4TB out of the FUSE mount point to a non-gluster location elsewhere on the system), I found that the glusterfs process was consuming all system memory, including swap, approximately 34GB.  Remounting the FUSE volume freed up the memory.

Let me know if you need anything else.

Comment 6 Kyle Johnson 2016-08-25 19:43:04 UTC
Oops.  I am on 3.7.13.  Fat fingers.

Comment 7 Oleksandr Natalenko 2016-08-26 07:33:50 UTC
I've run another test: http://termbin.com/cgzj

The idea is to split the read/write workload across two clients. I'm attaching statedumps of both the reader and the writer.

Unfortunately, it seems that memory accounting for the FUSE translator suffers from an integer overflow (underflow?), because I get something like this:

===
[mount/fuse.fuse - usage-type gf_fuse_mt_iov_base memusage]
size=4262955128
num_allocs=1
max_size=4294967088
max_num_allocs=3
total_allocs=18389093
===

Another observation: the reader consumes more memory than the writer. Given that cache-size is 32M, even if it were used across 5 xlators, total RAM consumption should not exceed 160-200M. However, reader RAM:

===
root     30865  8.7  3.5 1054244 433676 ?      Ssl  Aug25  61:01 /usr/sbin/glusterfs --volfile-server=gluster.server.com --volfile-id=test /mnt/net/glusterfs/test2
===

and writer RAM:

===
root     30796  3.3  2.0 722464 249352 ?       Ssl  Aug25  23:11 /usr/sbin/glusterfs --volfile-server=gluster.server.com --volfile-id=test /mnt/net/glusterfs/test1
===
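
For reference, client statedumps like the attached ones are typically produced by sending SIGUSR1 to the glusterfs process (a minimal sketch; the PIDs are the ones from the ps output above, and the default dump directory is an assumption):

===
kill -USR1 30865   # reader
kill -USR1 30796   # writer
ls -lt /var/run/gluster/ | head
===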

Comment 8 Oleksandr Natalenko 2016-08-26 07:38:31 UTC
Created attachment 1194221 [details]
Statedump (reader)

Comment 9 Oleksandr Natalenko 2016-08-26 07:38:56 UTC
Created attachment 1194222 [details]
Statedump (writer)

Comment 10 Yannick Perret 2016-08-29 16:35:51 UTC
I also have the same bug, using versions 3.7.13 and 3.7.14 (compiled from sources, no patches) and 3.8.1 (using the package from the gluster download page).

Tested on an up-to-date Debian 8.

Same observations: memory keeps growing on any operation (recursive listing, reads, writes), even when reading the same files over and over, and it is not released even after days of inactivity.
Memory can grow until the OOM-killer kicks in (>6GB of memory by the time I let it get to that point).

Comment 11 Yannick Perret 2016-08-29 17:46:30 UTC
Some memory data, from tests performed on gluster version 3.8.1 (both servers and the client are 64-bit Debian 8, up to date, with default kernel options; the test volume is a replica-2 volume without any tuning; the mount is performed using "mount -t glusterfs" without any options):
What I did:
1. mounting the volume, removing all its content, umounting it
2. mounting the volume
3. performing a cp -Rp /usr/* /root/MNT
4. performing a rm -rf /root/MNT/*
5. re-doing 3, 4

VSZ/RSS of the gluster client process are, respectively:
- 381896 / 35688 just after mount
- 644040 / 309240 after 1st cp -Rp
- 644040 / 310128 after 1st rm -rf
- 709576 / 310128 after 1st kill -USR1 (it was just for testing)
- 840648 / 421964 after 2nd cp -Rp
- 840648 / 422224 after 2nd rm -rf

I created a small script in order to perform these operations in a loop:
while /bin/true
do
  cp -Rp /usr/* /root/MNT/
  ps -o vsz=,rss= -C glusterfs   # get VSZ/RSS of glusterfs process
  rm -rf /root/MNT/*
  ps -o vsz=,rss= -C glusterfs   # get VSZ/RSS of glusterfs process
done

It gives:
1037256 645500
1037256 645840
1168328 757348
1168328 757620
1299400 869128
1299400 869328
1364936 980712
1364936 980944
1496008 1092384
1496008 1092404
1627080 1203796
1627080 1203996
1692616 1315572
1692616 1315504
1823688 1426812
1823688 1427340
1954760 1538716
1954760 1538772
2085832 1647676
2085832 1647708
2151368 1750392
2151368 1750708
2282440 1853864
2282440 1853764
2413512 1952668
2413512 1952704
2479048 2056500
2479048 2056712

I let it keep running and the gluster client ended up being killed by the OOM-killer:
[1246854.291996] Out of memory: Kill process 931 (glusterfs) score 742 or sacrifice child
[1246854.292102] Killed process 931 (glusterfs) total-vm:3527624kB, anon-rss:3100328kB, file-rss:0kB

So ~3.5GB of VSZ on this machine.

Comment 12 Oleksandr Natalenko 2016-08-30 08:08:30 UTC
Created attachment 1195709 [details]
Valgrind Memcheck results

Comment 13 Oleksandr Natalenko 2016-08-30 08:09:17 UTC
Attached the Valgrind Memcheck output for the reading workload.

Comment 14 Nithya Balachandran 2016-08-30 09:00:28 UTC
Thank you. I will take a look at this.

Comment 15 Oleksandr Natalenko 2016-08-30 09:03:17 UTC
Nithya, please take a look at my messages to the mailing list regarding the Massif tool; I'd like to profile the GlusterFS client with it too, but I'm having some trouble [1] [2].

[1] https://www.gluster.org/pipermail/gluster-devel/2016-August/050665.html

[2] https://www.gluster.org/pipermail/gluster-devel/2016-August/050666.html

Comment 16 Oleksandr Natalenko 2016-09-02 10:30:35 UTC
I've tried different kinds of workload and rechecked with 3.7.15, and it seems that the leak is reproduced reliably with the following command on a volume with lots of files:

find /mnt/volume -type f -print0 | xargs -0 stat

In half an hour RSS went from 150M to 1G.

Valgrind's Memcheck shows nothing. I still need some help with the Massif tool.

Comment 17 Oleksandr Natalenko 2016-09-07 11:20:28 UTC
Okay, I've finally figured out how to use the Massif tool, and here are the results.

Volume layout and options are the same as described in this BZ. After mount, RSS is 90M; after running this command for some time (there are millions of files):

===
find /mnt/net/glusterfs/test -print0 | xargs -0 stat >/dev/null
===

RSS grows to 900M (and counting). VSZ also grows from ~400M to ~1200M (and counting). Performing drop_caches lowers RSS/VSZ usage only by ~100M.

There are 3 Massif logs attached; the main one is for PID 23664. I also attach a client statedump.

drop_caches was called before taking statedump and unmounting the volume.
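
The Massif logs can be inspected with Valgrind's ms_print tool (a minimal sketch; the file name assumes the default massif.out.<pid> naming for the main PID):

===
ms_print massif.out.23664 | less
===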

Comment 18 Oleksandr Natalenko 2016-09-07 11:20:59 UTC
Created attachment 1198653 [details]
Massif — Statedump for main PID

Comment 19 Oleksandr Natalenko 2016-09-07 11:21:36 UTC
Created attachment 1198654 [details]
Massif — log for main PID

Comment 20 Oleksandr Natalenko 2016-09-07 11:21:56 UTC
Created attachment 1198655 [details]
Massif — log for PID 23668

Comment 21 Oleksandr Natalenko 2016-09-07 11:22:15 UTC
Created attachment 1198656 [details]
Massif — log for PID 23669

Comment 22 Pranith Kumar K 2016-09-07 18:40:22 UTC
Oleksandr,
     As per the Massif tool output, the heap size is only ~113MB:

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 89 500,793,720,910      112,228,416      109,224,305     3,004,111            0
 90 504,760,946,959      113,760,568      110,649,152     3,111,416            0

I also went through the major stack traces that are active during the last snapshot. Nothing seems new; most of them are related to the latest inode allocations.
But you were saying the RSS was around 900M; I am wondering whether you captured this at the same time or not.

Pranith

Comment 23 Oleksandr Natalenko 2016-09-08 07:11:06 UTC
I re-ran profiling with --pages-as-heap=yes, as requested by Pranith in #gluster-dev.

===
root     10696 15.2  7.9 1557544 969640 pts/0  Sl+  Sep07  95:47 valgrind --tool=massif --pages-as-heap=yes --run-libc-freeres=no --trace-children=yes --sim-hints=fuse-compatible /usr/sbin/glusterfs -N --volfile-server=glusterfs.example.com --volfile-id=some_volume /mnt/net/glusterfs/test
===

Main PID is 10696, attaching Massif output and statedump.

Comment 24 Oleksandr Natalenko 2016-09-08 07:11:42 UTC
Created attachment 1198918 [details]
Massif — log for main PID 10696

Comment 25 Oleksandr Natalenko 2016-09-08 07:12:20 UTC
Created attachment 1198919 [details]
Massif — Statedump for main PID 10696

Comment 26 Oleksandr Natalenko 2016-09-08 07:23:18 UTC
Pranith, as you may see in #c23 and #c24, now all the memory shows up in the statistics:

===
--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 50 1,164,197,215,106    1,481,912,320    1,481,912,320             0            0
 51 1,166,294,532,574    1,481,912,320    1,481,912,320             0            0
 52 1,168,332,670,161    1,481,912,320    1,481,912,320             0            0
 53 1,176,305,259,696    1,549,017,088    1,549,017,088             0            0
 54 1,176,353,806,434    1,549,021,184    1,549,021,184             0            0
===

Comment 27 Nithya Balachandran 2016-09-09 15:53:51 UTC
I can reproduce the memory rise, but the statedumps don't seem to contain anything useful. I have started looking through the Massif files and will continue to run the tests to see what I can find.

Comment 28 Nithya Balachandran 2016-09-12 16:20:38 UTC
Status so far:

I don't see anything in the Massif logs that pinpoints a leak. The quick-read translator was one candidate, but I can see the leak even with it disabled.

I am trying the following:

Created a single-brick distribute volume with default options, except for quick-read, which I have disabled.
Copied 100000 zero-byte files to the gluster FUSE mount. RSS goes from 30m to 69m.
Ran rm -rf *. RSS jumps to 104m but then comes down to 74m once the operation completes.
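
Roughly, the test looks like this (a sketch; host, brick, volume and mount paths are placeholders):

===
gluster volume create testvol host1:/bricks/testvol
gluster volume set testvol performance.quick-read off
gluster volume start testvol
mount -t glusterfs host1:/testvol /mnt/testvol

cd /mnt/testvol
for i in $(seq 1 100000); do touch "f$i"; done   # RSS: ~30m -> ~69m
cd / && rm -rf /mnt/testvol/*                    # RSS peaks at ~104m, settles at ~74m
===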

I will take another look at the readdirp code.

Comment 29 Nithya Balachandran 2016-09-21 03:24:40 UTC
Just letting you know I am still looking into this but have not had much luck yet.
Will update when I have something.

Comment 30 Nithya Balachandran 2016-09-28 04:26:39 UTC
Valgrind etc. doesn't show any memory leaks so far, but I do see the memory usage rising. I am wondering if this could be due to memory fragmentation rather than memory leaks.

Still looking. All ideas welcome.

Comment 31 Oleksandr Natalenko 2016-09-28 11:51:20 UTC
Created attachment 1205471 [details]
Massif log visualization

Comment 32 Worker Ant 2016-09-28 16:12:23 UTC
REVIEW: http://review.gluster.org/15593 (protocol/client: reduce memory usage) posted (#1) for review on master by N Balachandran (nbalacha)

Comment 33 Nithya Balachandran 2016-09-29 05:19:25 UTC
Oleksandr,

Do you see a difference if you turn off performance.readdir-ahead? Apparently it is a known memory hog.
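
For reference, it can be turned off per volume with (a sketch; "mail_boxes" is the volume from the original mount command):

===
gluster volume set mail_boxes performance.readdir-ahead off
===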

Still looking into how to improve things.

Comment 34 Oleksandr Natalenko 2016-10-25 10:38:41 UTC
Discussed a lot on IRC, so dropping needinfo flag.

Comment 35 Oleksandr Natalenko 2016-10-25 10:39:52 UTC
+

Comment 36 Mohit Agrawal 2016-12-05 10:58:55 UTC
Hi,

  I think I have found one code path with a memory leak in the fuse xlator, but it leaks only when a user
  runs umount and mount again. The fuse xlator code calls the eh_new function to allocate an event
  history on the translator; the maximum size of this buffer is 1024. It is a circular buffer used to store
  messages, but the memory for this buffer is not released in the fini function.
  Every time a user runs mount/umount, the messages in this circular buffer are leaked.

  To fix this leak, the fini code in the fuse xlator needs to be updated as below:

===
@@ -5778,6 +5778,9 @@ fini (xlator_t *this_xl)
         if ((priv = this_xl->private) == NULL)
                 return;
 
+        if (this_xl->history)
+                eh_destroy (this_xl->history);
+
===
  
Regards
Mohit Agrawal

Comment 37 Yannick Perret 2017-03-07 14:56:53 UTC
I tested this again with the latest version (3.10.0-1_amd64.deb). Two servers + one client, all running 3.10.0-1 from the .deb packages (from the gluster site), on up-to-date 64-bit Debian 8. One replica-2 volume was created for this test, without any options set.

From the client, performing:
0. unpack the gluster sources in the mountpoint
1. rm -rf glusterfs-3.10.0-* ;  for i in `seq 1 50`; do cp -R glusterfs-3.10.0/ glusterfs-3.10.0-$i; done
2. go to 1
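
As a sketch (paths are assumptions, reusing /root/MNT from comment 11; the ps call just records the client's VSZ/RSS after each pass):

===
cd /root/MNT
tar xf /tmp/glusterfs-3.10.0.tar.gz                 # 0. unpack the sources in the mountpoint
while true; do
    rm -rf glusterfs-3.10.0-*
    for i in `seq 1 50`; do cp -R glusterfs-3.10.0/ glusterfs-3.10.0-$i; done
    ps -o vsz=,rss= -C glusterfs
done
===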
At the same time, the memory of the glusterfs process on the client (VSZ and RSS):
431704 10568
(…)
497240 20500
(…)
693848 264396
(…)
759384 388924
(…)
955992 606768
(…)
Note: RSS grows mostly continuously, whereas VSZ grows in "jumps" (e.g. 693848, then suddenly 759384).

Note bis: removing all content, then performing syncs (and writing 3 to /proc/sys/vm/drop_caches), does not make memory (VSZ or RSS) shrink, even 1 hour later (well, not exactly: while removing all content, RSS goes from 607300 to 603920).

Regards
--
Y.

Comment 38 Kaushal 2017-03-08 11:02:07 UTC
This bug is getting closed because GlusterFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.