+++ This bug was initially created as a clone of Bug #1369364 +++

Description of problem:

FUSE client consumes lots of memory on a volume with lots of small files (mailbox storage for dovecot). Here are the volume options:

===
Type: Distributed-Replicate
...
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
...
network.ping-timeout: 10
features.cache-invalidation: on
performance.cache-size: 16777216
performance.cache-max-file-size: 1048576
performance.io-thread-count: 4
performance.write-behind: on
performance.flush-behind: on
performance.read-ahead: on
performance.quick-read: on
performance.stat-prefetch: on
performance.write-behind-window-size: 2097152
storage.linux-aio: on
performance.client-io-threads: off
server.event-threads: 8
network.inode-lru-limit: 4096
client.event-threads: 4
cluster.readdir-optimize: on
cluster.lookup-optimize: on
performance.readdir-ahead: on
nfs.disable: off
cluster.data-self-heal: off
cluster.metadata-self-heal: off
cluster.entry-self-heal: off
===

Memory usage before drop_caches:

===
root 1049 0.0 40.1 1850040 1170912 ? S<sl Aug19 0:28 /usr/sbin/glusterfs --fopen-keep-cache --direct-io-mode=disable --volfile-server=glusterfs.server.com --volfile-id=mail_boxes /var/spool/mail/virtual
===

And after drop_caches:

===
root 1049 0.0 40.1 1850040 1170912 ? S<sl Aug19 0:28 /usr/sbin/glusterfs --fopen-keep-cache --direct-io-mode=disable --volfile-server=glusterfs.server.com --volfile-id=mail_boxes /var/spool/mail/virtual
===

Nothing changes after drop_caches.

pmap of PID 1049 is attached. I see lots of suspicious entries:

===
00007fee74000000  65536K rw---   [ anon ]
===

Are those related to the I/O translator? What else could consume 64M at once?

Also attaching statedumps taken before and after drop_caches (there is almost no difference between them).
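As an aside (an editorial observation, not from the original report): 64 MiB anonymous regions in pmap output are also characteristic of glibc per-thread malloc arenas rather than any translator cache. A quick sketch to count them, run here against sample pmap-style lines (pipe real `pmap <pid>` output instead):

```shell
# Count anonymous mappings of exactly 65536K (64 MiB) in pmap-style output.
# The sample data below is illustrative, not taken from the live process.
sample='00007fee74000000  65536K rw---   [ anon ]
00007fee78000000  65536K rw---   [ anon ]
00007fee7c000000    132K rw---   [ anon ]'
printf '%s\n' "$sample" | awk '$2 == "65536K" { n++ } END { print n+0 }'
# → 2
```

If the count roughly tracks the number of client threads, the arenas (not a leak in a translator) may explain part of the resident footprint.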
Version-Release number of selected component (if applicable):

GlusterFS v3.7.14 plus the following patches (all of them already merged for the 3.7.15 release):

===
Aravinda VK (1):
      packaging: Remove ".py" extension from symlink target

Atin Mukherjee (1):
      rpc: build_prog_details should iterate program list inside critical section

Jiffin Tony Thottan (2):
      gfapi: Avoid double freeing of dict in glfs_*_*getxattr
      xlator/trash: append '/' at the end in trash_notify_lookup_cbk

Raghavendra G (2):
      libglusterfs/client_t: Dump the 0th client too
      storage/posix: fix inode leaks

Soumya Koduri (2):
      glfs/upcall: entries should be removed under mutex lock
      gfapi/upcall: Fix a ref leak

Susant Palai (1):
      posix: fix posix_fgetxattr to return the correct error
===

How reproducible:

Always.

Steps to Reproduce:
1. mount a volume with lots of small files (mail boxes);
2. use them;
3. ...
4. PROFIT!

Actual results:

Client memory leaks.

Expected results:

Client memory does not leak :).

Additional info:

Feel free to ask me for additional info.

--- Additional comment from Oleksandr Natalenko on 2016-08-23 04:30 EDT ---

--- Additional comment from Oleksandr Natalenko on 2016-08-23 04:30 EDT ---

--- Additional comment from Oleksandr Natalenko on 2016-08-23 04:30 EDT ---

--- Additional comment from Oleksandr Natalenko on 2016-08-25 11:03:23 EDT ---

http://termbin.com/66ud — the script I use to try to reproduce the leak.

Observing the following in pmap:

===
00007fe430000000    132K rw---   [ anon ]
00007fe430021000  65404K -----   [ anon ]
00007fe438000000   6564K rw---   [ anon ]  ← this value increases over time...
00007fe438669000  58972K -----   [ anon ]  ← ...while this value decreases over time
00007fe43c000000    132K rw---   [ anon ]
00007fe43c021000  65404K -----   [ anon ]
00007fe440000000   6800K rw---   [ anon ]
00007fe4406a4000  58736K -----   [ anon ]
00007fe44552c000      4K -----   [ anon ]
00007fe44552d000   8192K rw---   [ anon ]
00007fe445d2d000      4K -----   [ anon ]
00007fe445d2e000  16908K rw---   [ anon ]
00007fe446db1000      4K -----   [ anon ]
00007fe446db2000  18744K rw---   [ anon ]
00007fe448000000   9068K rw---   [ anon ]
00007fe4488db000  56468K -----   [ anon ]
00007fe44c027000  29424K rw---   [ anon ]
00007fe44dce3000     52K r-x--   meta.so
00007fe44dcf0000   2044K -----   meta.so
...snip...
===

Other anonymous allocations change over time as well; I have pointed out only one pair showing this pattern.

--- Additional comment from Kyle Johnson on 2016-08-25 15:39:28 EDT ---

I may also be experiencing this bug. Distributed volume with 3.7.14, no patches. One server is FreeBSD 10.x, the other is CentOS 6. Both bricks are ZFS-backed.

===
[root@colossus2 ~]# gluster volume info ftp

Volume Name: ftp
Type: Distribute
Volume ID: f3f2b222-575c-4c8d-92f1-e640fd7edfbb
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 192.168.110.1:/tank/bricks/ftp
Brick2: 192.168.110.2:/ftp/bricks/ftp
Options Reconfigured:
performance.client-io-threads: on
cluster.weighted-rebalance: off
performance.readdir-ahead: on
nfs.disable: on
===

After approximately 24 hours of heavy reading on the client (copying 93728 files totaling 1.4 TB out of the FUSE mount point to non-gluster space elsewhere on the system), I found that the glusterfs process was consuming all system memory, including swap (approximately 34 GB). Remounting the FUSE volume freed up the memory.

Let me know if you need anything else.

--- Additional comment from Kyle Johnson on 2016-08-25 15:43:04 EDT ---

Oops. I am on 3.7.13. Fat fingers.
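A note on the pmap pattern above (editorial arithmetic, using the numbers from the listing): each growing rw--- region sits next to a shrinking ----- region, and the two always sum to exactly 64 MiB, which is what a glibc malloc arena looks like as its committed part grows into its reserved part:

```shell
# Each committed (rw---) region plus its adjacent reserved (-----) region
# from the pmap listing above adds up to one 64 MiB (65536K) arena.
echo $(( 6564 + 58972 ))   # first pair  → 65536
echo $(( 6800 + 58736 ))   # second pair → 65536
```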
--- Additional comment from Oleksandr Natalenko on 2016-08-26 03:33:50 EDT ---

I've run another test: http://termbin.com/cgzj

The idea is to split the read/write workload across two clients. I'm attaching statedumps of both the reader and the writer.

Unfortunately, it seems memory accounting for the FUSE translator suffers from integer overflow (underflow?), because I get something like this:

===
[mount/fuse.fuse - usage-type gf_fuse_mt_iov_base memusage]
size=4262955128
num_allocs=1
max_size=4294967088
max_num_allocs=3
total_allocs=18389093
===

Another observation: the reader consumes more memory than the writer. Given that cache-size is 32M (and even if it is used across 5 xlators), total RAM consumption should not exceed 160-200M. However, reader RAM:

===
root 30865 8.7 3.5 1054244 433676 ? Ssl Aug25 61:01 /usr/sbin/glusterfs --volfile-server=gluster.server.com --volfile-id=test /mnt/net/glusterfs/test2
===

and writer RAM:

===
root 30796 3.3 2.0 722464 249352 ? Ssl Aug25 23:11 /usr/sbin/glusterfs --volfile-server=gluster.server.com --volfile-id=test /mnt/net/glusterfs/test1
===

--- Additional comment from Oleksandr Natalenko on 2016-08-26 03:38 EDT ---

--- Additional comment from Oleksandr Natalenko on 2016-08-26 03:38 EDT ---

--- Additional comment from Yannick Perret on 2016-08-29 12:35:51 EDT ---

I also have the same bug, using versions 3.7.13 and 3.7.14 (compiled from sources, no patches) and 3.8.1 (using the package from the gluster download page). Tested on an up-to-date Debian 8.

Same observations: memory keeps growing on any operation (recursive listing, reads, writes), even when always reading the same files, and even after days of inactivity. Memory can grow until the OOM killer kicks in (>6 GB by the time I let it get to that point).
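One way to sanity-check the underflow hypothesis for the statedump above (editorial arithmetic, not from the report): size=4262955128 sits just below 2^32, so reinterpreted as a signed 32-bit value it is a small negative number, which is exactly what repeated over-subtraction from an unsigned counter would produce:

```shell
# Reinterpret the suspicious statedump counter as a signed 32-bit integer:
# values >= 2^31 wrap around to negatives when the counter underflows.
size=4262955128
echo $(( size >= 2147483648 ? size - 4294967296 : size ))
# → -32012168  (roughly -30 MiB)
```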
--- Additional comment from Yannick Perret on 2016-08-29 13:46:30 EDT ---

Some memory data, tests performed with gluster 3.8.1 (both servers and the client are 64-bit Debian 8, up-to-date, with default kernel options; the test volume is a replica-2 volume without any tuning; the mount is performed using "mount -t glusterfs" without any options).

What I did:
1. mount the volume, remove all its content, umount it
2. mount the volume
3. perform a cp -Rp /usr/* /root/MNT
4. perform a rm -rf /root/MNT/*
5. re-do 3 and 4

VSZ/RSS of the gluster client process are respectively:
- 381896 / 35688 just after mount
- 644040 / 309240 after 1st cp -Rp
- 644040 / 310128 after 1st rm -rf
- 709576 / 310128 after 1st kill -USR1 (it was just for testing)
- 840648 / 421964 after 2nd cp -Rp
- 840648 / 422224 after 2nd rm -rf

I created a small script to perform these operations in a loop, logging the VSZ/RSS of the glusterfs process after each step:

===
while /bin/true; do
    cp -Rp /usr/* /root/MNT/
    ps -o vsz=,rss= -p "$(pidof glusterfs)"
    rm -rf /root/MNT/*
    ps -o vsz=,rss= -p "$(pidof glusterfs)"
done
===

It gives (VSZ RSS):

1037256 645500
1037256 645840
1168328 757348
1168328 757620
1299400 869128
1299400 869328
1364936 980712
1364936 980944
1496008 1092384
1496008 1092404
1627080 1203796
1627080 1203996
1692616 1315572
1692616 1315504
1823688 1426812
1823688 1427340
1954760 1538716
1954760 1538772
2085832 1647676
2085832 1647708
2151368 1750392
2151368 1750708
2282440 1853864
2282440 1853764
2413512 1952668
2413512 1952704
2479048 2056500
2479048 2056712

I left it running, and the gluster client was eventually killed by the OOM killer:

[1246854.291996] Out of memory: Kill process 931 (glusterfs) score 742 or sacrifice child
[1246854.292102] Killed process 931 (glusterfs) total-vm:3527624kB, anon-rss:3100328kB, file-rss:0kB

So ~3.5 GB of VSZ on this machine.

--- Additional comment from Oleksandr Natalenko on 2016-08-30 04:08 EDT ---

--- Additional comment from Oleksandr Natalenko on 2016-08-30 04:09:17 EDT ---

Attached Valgrind's Memcheck output for the reading workload.
--- Additional comment from Nithya Balachandran on 2016-08-30 05:00:28 EDT ---

Thank you. I will take a look at this.

--- Additional comment from Oleksandr Natalenko on 2016-08-30 05:03:17 EDT ---

Nithya, please take a look at my message to the ML regarding the Massif tool — I'd like to profile the GlusterFS client with that too (but have some troubles) [1] [2].

[1] https://www.gluster.org/pipermail/gluster-devel/2016-August/050665.html
[2] https://www.gluster.org/pipermail/gluster-devel/2016-August/050666.html

--- Additional comment from Oleksandr Natalenko on 2016-09-02 06:30:35 EDT ---

I've tried different kinds of workload and rechecked with 3.7.15, and it seems the leak is reproduced reliably with the following command on a volume with lots of files:

===
find /mnt/volume -type f -print0 | xargs -0 stat
===

In half an hour RSS went from 150M to 1G. Valgrind's Memcheck shows nothing. I still need some help with the Massif tool.

--- Additional comment from Oleksandr Natalenko on 2016-09-07 07:20:28 EDT ---

Okay, I've finally figured out how to use the Massif tool, and here are the results.

Volume layout and options are the same as in this BZ description. After mount, RSS is 90M; after running this command for some time (there are millions of files):

===
find /mnt/net/glusterfs/test -print0 | xargs -0 stat >/dev/null
===

RSS grows to 900M (and counting). VSZ also grows, from ~400M to ~1200M (and counting). Performing drop_caches lowers RSS/VSZ usage by only ~100M.

There are 3 Massif logs attached; the main one is for PID 23664. Also, I attach a client statedump. drop_caches was called before taking the statedump and unmounting the volume.
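For anyone trying to reproduce without an existing mail store, a minimal sketch of the stat workload above (the directory and file count here are placeholders; the report stats millions of files on a real FUSE mount, not a temp directory):

```shell
# Create many tiny files and stat them all, mimicking the find|xargs
# workload from the comment above. The temp dir stands in for the mount.
dir=$(mktemp -d)
for i in $(seq 1 1000); do : > "$dir/f$i"; done
find "$dir" -type f -print0 | xargs -0 stat >/dev/null
find "$dir" -type f | wc -l    # prints the file count
rm -rf "$dir"
```

Scaling the count up and running the loop against the mount while watching RSS (e.g. with `ps -o rss= -p <pid>`) should reproduce the growth described here.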
--- Additional comment from Oleksandr Natalenko on 2016-09-07 07:20 EDT ---

--- Additional comment from Oleksandr Natalenko on 2016-09-07 07:21 EDT ---

--- Additional comment from Oleksandr Natalenko on 2016-09-07 07:21 EDT ---

--- Additional comment from Oleksandr Natalenko on 2016-09-07 07:22 EDT ---

--- Additional comment from Pranith Kumar K on 2016-09-07 14:40:22 EDT ---

Oleksandr,

As per the Massif tool output, the heap size is only ~113MB:

===
--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 89 500,793,720,910      112,228,416      109,224,305     3,004,111            0
 90 504,760,946,959      113,760,568      110,649,152     3,111,416            0
===

I also went through the major stack traces active during the last snapshot. Nothing seems new. Most of them are related to recent inode allocations. But you were saying the RSS was around 900M; I am wondering whether you captured this at the same time or not.

Pranith

--- Additional comment from Oleksandr Natalenko on 2016-09-08 03:11:06 EDT ---

Re-ran profiling with --pages-as-heap=yes, as requested by Pranith in #gluster-dev:

===
root 10696 15.2 7.9 1557544 969640 pts/0 Sl+ Sep07 95:47 valgrind --tool=massif --pages-as-heap=yes --run-libc-freeres=no --trace-children=yes --sim-hints=fuse-compatible /usr/sbin/glusterfs -N --volfile-server=glusterfs.example.com --volfile-id=some_volume /mnt/net/glusterfs/test
===

The main PID is 10696; attaching Massif output and a statedump.
--- Additional comment from Oleksandr Natalenko on 2016-09-08 03:11 EDT ---

--- Additional comment from Oleksandr Natalenko on 2016-09-08 03:12 EDT ---

--- Additional comment from Oleksandr Natalenko on 2016-09-08 03:23:18 EDT ---

Pranith, as you may see in #c23 and #c24, now all the memory shows up in the statistics:

===
--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 50 1,164,197,215,106    1,481,912,320    1,481,912,320            0            0
 51 1,166,294,532,574    1,481,912,320    1,481,912,320            0            0
 52 1,168,332,670,161    1,481,912,320    1,481,912,320            0            0
 53 1,176,305,259,696    1,549,017,088    1,549,017,088            0            0
 54 1,176,353,806,434    1,549,021,184    1,549,021,184            0            0
===

--- Additional comment from Nithya Balachandran on 2016-09-09 11:53:51 EDT ---

I can reproduce the memory rise, but the statedumps don't seem to contain anything useful. I have started looking through the Massif files and will continue to run the tests to see what I can find.

--- Additional comment from Nithya Balachandran on 2016-09-12 12:20:38 EDT ---

Status so far: I don't see anything in the Massif logs to pinpoint a leak. The quick-read translator was one point to consider, but I can see the leak even with it disabled.

I am trying the following:
1. Created a single-brick distribute volume with default options, except for quick-read, which I have disabled.
2. Copying 100000 zero-byte files to the gluster FUSE mount: RSS goes from 30m to 69m.
3. Run rm -rf *: RSS jumps to 104m but then comes down to 74m once the operation is complete.

I will take another look at the readdirp code.

--- Additional comment from Nithya Balachandran on 2016-09-20 23:24:40 EDT ---

Just letting you know I am still looking into this but have not had much luck yet. Will update when I have something.
--- Additional comment from Nithya Balachandran on 2016-09-28 00:26:39 EDT ---

Valgrind etc. doesn't show any memory leaks so far, but I do see the memory usage rising. I am wondering if this could be caused by memory fragmentation rather than memory leaks. Still looking. All ideas welcome.

--- Additional comment from Oleksandr Natalenko on 2016-09-28 07:51 EDT ---

--- Additional comment from Worker Ant on 2016-09-28 12:12:23 EDT ---

REVIEW: http://review.gluster.org/15593 (protocol/client: reduce memory usage) posted (#1) for review on master by N Balachandran (nbalacha)

--- Additional comment from Nithya Balachandran on 2016-09-29 01:19:25 EDT ---

Oleksandr,

Do you see a difference if you turn off performance.readdir-ahead? Apparently this is a known memory hog. Still looking into how to improve things.
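On the fragmentation hypothesis raised above, one cheap experiment (an editorial suggestion under the assumption of a glibc-based client, not a confirmed fix) is to cap the number of glibc malloc arenas when starting the client and compare RSS growth. glibc honors the MALLOC_ARENA_MAX environment variable at process startup; a trivial command stands in for the glusterfs invocation here:

```shell
# MALLOC_ARENA_MAX is read by glibc when the process starts; substitute
# the real /usr/sbin/glusterfs mount command for `ls` in an actual test.
MALLOC_ARENA_MAX=2 ls /dev/null
# → /dev/null
```

If RSS growth flattens markedly with a low arena cap, that would point at fragmentation across arenas rather than a true leak.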
REVIEW: http://review.gluster.org/15593 (protocol/client: reduce memory usage) posted (#2) for review on master by N Balachandran (nbalacha)
COMMIT: http://review.gluster.org/15593 committed in master by Pranith Kumar Karampuri (pkarampu)
------
commit bca6d0ba54d12d389cfb5c2b37fb8cc12a7e044b
Author: N Balachandran <nbalacha>
Date: Wed Sep 28 21:38:57 2016 +0530

    protocol/client: reduce memory usage

    readdirp calls use a lot of memory in case of a large number of files.
    The dict->extra_free is not used here, so free buf immediately.

    Change-Id: I097f5dde2df471f5834264152711110a3bdb7e9a
    BUG: 1380249
    Signed-off-by: N Balachandran <nbalacha>
    Reviewed-on: http://review.gluster.org/15593
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
(In reply to Worker Ant from comment #2)
> COMMIT: http://review.gluster.org/15593 committed in master by Pranith Kumar
> Karampuri (pkarampu)
> ------
> commit bca6d0ba54d12d389cfb5c2b37fb8cc12a7e044b
> Author: N Balachandran <nbalacha>
> Date: Wed Sep 28 21:38:57 2016 +0530
>
>     protocol/client: reduce memory usage
>
>     readdirp calls use a lot of memory in case of a large number of files.
>     The dict->extra_free is not used here, so free buf immediately.
>
>     Change-Id: I097f5dde2df471f5834264152711110a3bdb7e9a
>     BUG: 1380249
>     Signed-off-by: N Balachandran <nbalacha>
>     Reviewed-on: http://review.gluster.org/15593
>     NetBSD-regression: NetBSD Build System <jenkins.org>
>     CentOS-regression: Gluster Build System <jenkins.org>
>     Smoke: Gluster Build System <jenkins.org>
>     Reviewed-by: Pranith Kumar Karampuri <pkarampu>

This is an interim fix to reduce the amount of memory we consume. Some code paths hang on to memory until the very end even though it is no longer used. More patches will be posted for this. As yet, I have not managed to find a memory leak.
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-3.10.0, please open a new bug report.

glusterfs-3.10.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-February/030119.html
[2] https://www.gluster.org/pipermail/gluster-users/