Bug 1623107 - FUSE client's memory leak
Summary: FUSE client's memory leak
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: 5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Sunny Kumar
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-08-28 13:28 UTC by Pavel Znamensky
Modified: 2020-07-16 13:52 UTC
CC: 10 users

Fixed In Version: glusterfs-5.3
Clone Of:
Environment:
Last Closed: 2019-01-16 16:00:10 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments (Terms of Use)
volume options (26.56 KB, text/plain), 2018-08-28 13:28 UTC, Pavel Znamensky
FUSE client statedump (4.62 MB, application/x-gzip), 2018-08-28 13:29 UTC, Pavel Znamensky
valgrind report (986.83 KB, text/plain), 2018-08-28 19:49 UTC, Pavel Znamensky
reports for v4.1 and directory with tons of files and subdirs (16.50 KB, application/zip), 2018-09-04 14:54 UTC, Pavel Znamensky
statedump output with readdir_ahead on and off with gluster 5.1 (2.59 MB, application/gzip), 2018-12-29 02:55 UTC, Yan


Links
Gluster.org Gerrit 22014: "fuse: add --lru-limit option" (Merged), last updated 2019-01-16 16:00:08 UTC

Internal Links: 1738878

Description Pavel Znamensky 2018-08-28 13:28:25 UTC
Created attachment 1479256 [details]
volume options

Description of problem:
FUSE clients consume a lot of memory.
In our production environment, a single FUSE client uses ~16G RSS after two weeks of uptime.

Version-Release number of selected component (if applicable):
4.1

How reproducible:
Set up a glusterfs replicated cluster with tons of small files.
Mount the storage with the FUSE client.
Read file metadata and file contents.

Actual results:
The RSS memory of the FUSE client grows indefinitely.

Expected results:
RSS memory doesn't grow indefinitely :)

Comment 1 Pavel Znamensky 2018-08-28 13:29:14 UTC
Created attachment 1479257 [details]
FUSE client statedump

Comment 2 Pavel Znamensky 2018-08-28 13:33:57 UTC
Please let me know if there is anything else I can do to help investigate the problem.

Comment 3 Pavel Znamensky 2018-08-28 16:22:56 UTC
Also, I summed all `size` fields in the statedump and the total is about 4G, which is already too much; but interestingly, another 12G (16G - 4G) is unaccounted for.
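A rough sketch of how such a sum can be computed from a statedump; the dump path is an illustrative assumption:

# add up every size= field in the dump and print the total in GiB
grep '^size=' /var/run/gluster/glusterdump.<pid>.dump.* \
    | awk -F= '{total += $2} END {printf "%.2f GiB\n", total / 1024 / 1024 / 1024}'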

Comment 4 Pavel Znamensky 2018-08-28 19:48:27 UTC
Finally, I tried to find the leak with valgrind:

valgrind --leak-check=full --track-origins=yes --show-leak-kinds=all --log-file=gluster_fuse.log /usr/sbin/glusterfs --read-only --process-name fuse --volfile-server=srv --volfile-id=/gl /mnt/gl

Then I ran:
find /mnt/gl -type f

Part of valgrind report (full report in attachment):

==15836== 22,504,608 bytes in 104,188 blocks are possibly lost in loss record 976 of 978
==15836==    at 0x4C2B9B5: calloc (vg_replace_malloc.c:711)
==15836==    by 0x4E85E95: __gf_calloc (in /usr/lib64/libglusterfs.so.0.0.1)
==15836==    by 0x14495FEA: ??? (in /usr/lib64/glusterfs/4.1.3/xlator/performance/md-cache.so)
==15836==    by 0x144961C3: ??? (in /usr/lib64/glusterfs/4.1.3/xlator/performance/md-cache.so)
==15836==    by 0x1449ECC7: ??? (in /usr/lib64/glusterfs/4.1.3/xlator/performance/md-cache.so)
==15836==    by 0x1407C52A: ??? (in /usr/lib64/glusterfs/4.1.3/xlator/performance/quick-read.so)
==15836==    by 0x13E69552: ??? (in /usr/lib64/glusterfs/4.1.3/xlator/performance/io-cache.so)
==15836==    by 0x13C5AA26: ??? (in /usr/lib64/glusterfs/4.1.3/xlator/performance/readdir-ahead.so)
==15836==    by 0x138364C4: ??? (in /usr/lib64/glusterfs/4.1.3/xlator/performance/write-behind.so)
==15836==    by 0x135E3CBF: ??? (in /usr/lib64/glusterfs/4.1.3/xlator/cluster/dht.so)
==15836==    by 0x132F02B5: ??? (in /usr/lib64/glusterfs/4.1.3/xlator/cluster/afr.so)
==15836==    by 0x130B0439: ??? (in /usr/lib64/glusterfs/4.1.3/xlator/protocol/client.so)
==15836== 
==15836== 25,005,360 bytes in 104,189 blocks are possibly lost in loss record 977 of 978
==15836==    at 0x4C2B9B5: calloc (vg_replace_malloc.c:711)
==15836==    by 0x4E85E95: __gf_calloc (in /usr/lib64/libglusterfs.so.0.0.1)
==15836==    by 0x13E70496: ??? (in /usr/lib64/glusterfs/4.1.3/xlator/performance/io-cache.so)
==15836==    by 0x13E68FF9: ??? (in /usr/lib64/glusterfs/4.1.3/xlator/performance/io-cache.so)
==15836==    by 0x13E693FA: ??? (in /usr/lib64/glusterfs/4.1.3/xlator/performance/io-cache.so)
==15836==    by 0x13C5AA26: ??? (in /usr/lib64/glusterfs/4.1.3/xlator/performance/readdir-ahead.so)
==15836==    by 0x138364C4: ??? (in /usr/lib64/glusterfs/4.1.3/xlator/performance/write-behind.so)
==15836==    by 0x135E3CBF: ??? (in /usr/lib64/glusterfs/4.1.3/xlator/cluster/dht.so)
==15836==    by 0x132F02B5: ??? (in /usr/lib64/glusterfs/4.1.3/xlator/cluster/afr.so)
==15836==    by 0x130B0439: ??? (in /usr/lib64/glusterfs/4.1.3/xlator/protocol/client.so)
==15836==    by 0x515306F: ??? (in /usr/lib64/libgfrpc.so.0.0.1)
==15836==    by 0x5153402: ??? (in /usr/lib64/libgfrpc.so.0.0.1)
==15836== 
==15836== 66,357,720 bytes in 127,611 blocks are possibly lost in loss record 978 of 978
==15836==    at 0x4C2B9B5: calloc (vg_replace_malloc.c:711)
==15836==    by 0x4E85E95: __gf_calloc (in /usr/lib64/libglusterfs.so.0.0.1)
==15836==    by 0x4E6A620: ??? (in /usr/lib64/libglusterfs.so.0.0.1)
==15836==    by 0x4E6BACA: inode_new (in /usr/lib64/libglusterfs.so.0.0.1)
==15836==    by 0x13067018: ??? (in /usr/lib64/glusterfs/4.1.3/xlator/protocol/client.so)
==15836==    by 0x130A695C: ??? (in /usr/lib64/glusterfs/4.1.3/xlator/protocol/client.so)
==15836==    by 0x130B02C2: ??? (in /usr/lib64/glusterfs/4.1.3/xlator/protocol/client.so)
==15836==    by 0x515306F: ??? (in /usr/lib64/libgfrpc.so.0.0.1)
==15836==    by 0x5153402: ??? (in /usr/lib64/libgfrpc.so.0.0.1)
==15836==    by 0x514F2F2: rpc_transport_notify (in /usr/lib64/libgfrpc.so.0.0.1)
==15836==    by 0x11FAA675: ??? (in /usr/lib64/glusterfs/4.1.3/rpc-transport/socket.so)
==15836==    by 0x11FAD146: ??? (in /usr/lib64/glusterfs/4.1.3/rpc-transport/socket.so)
==15836==
==15836== LEAK SUMMARY:
==15836==    definitely lost: 3,832,872 bytes in 15,543 blocks
==15836==    indirectly lost: 1,736,704 bytes in 6,784 blocks
==15836==      possibly lost: 233,130,874 bytes in 1,009,120 blocks
==15836==    still reachable: 1,792,889 bytes in 7,738 blocks
==15836==         suppressed: 0 bytes in 0 blocks
==15836==
==15836== For counts of detected and suppressed errors, rerun with: -v
==15836== ERROR SUMMARY: 2528 errors from 602 contexts (suppressed: 0 from 0)

Comment 5 Pavel Znamensky 2018-08-28 19:49:18 UTC
Created attachment 1479358 [details]
valgrind report

Comment 6 Amar Tumballi 2018-08-29 11:06:16 UTC
What is the available amount of RAM? We have previously noticed the client process consuming a huge amount of memory when there is no other pressure on client memory. That is, when there is memory pressure on the process, the kernel sends 'forget()' on the inodes it has looked up; that is the only time we see memory usage coming down. Otherwise, the kernel remembers all these inodes, so glusterfs keeps them all in memory, which results in high memory usage.

Comment 7 Pavel Znamensky 2018-08-29 11:17:25 UTC
~16G RSS was consumed on a server with 64G of RAM.
On another server it was ~8G RSS on 16G of RAM (available memory at that moment was ~1G).

After I dropped the inode cache with `echo 3 > /proc/sys/vm/drop_caches`, nothing changed.

Comment 8 Pavel Znamensky 2018-09-03 11:56:24 UTC
This issue is critical for us because the mount process keeps getting killed by the OOM killer. Can I help you track down the problem?
Debug mode, more statedumps, valgrind reports?

Comment 9 Amar Tumballi 2018-09-03 12:20:30 UTC
Znamensky, so that we can set up an almost identical environment, could you share the 'gluster volume info' output? That way we can check whether we are using the same set of options.

We will go through the valgrind report too; the ETA to get back to you is Sept 7th. In the meantime, a statedump of the client while the issue is happening would also be great, in addition to the valgrind report that is already attached.

Comment 10 Amar Tumballi 2018-09-04 04:51:40 UTC
Something unrelated to the issue here:

I noticed below snippet in statedump:
----
[mount/fuse.fuse - usage-type gf_fuse_mt_iov_base memusage]
size=3434686672
num_allocs=1
max_size=4294962768
max_num_allocs=19
total_allocs=66157424
----

This is the result of a 32-bit overflow when computing size and max_size. It made me look in the wrong place and search fuse-bridge for a leak for more than 20 minutes, just because of such a huge usage report. It needs to be fixed for sure.

I see from the valgrind report that SSL is enabled in the setup, so I am not sure whether that contributes to the issue. I wasn't able to pinpoint the leak by looking at the reports. Will keep this bug updated!

Comment 11 Amar Tumballi 2018-09-04 04:55:56 UTC
Znamensky Pavel, if this is happening frequently (i.e., memory growth followed by an OOM kill), I have one request while I am analysing the issue.

Can you try the commands below, one at a time (assuming gl is the volume name):

$ gluster volume set gl readdir-ahead disable

run your tests and watch the memory usage...

$ gluster volume set gl io-cache disable

run your tests and watch the memory usage...

$ gluster volume set gl quick-read disable

run your tests and watch the memory usage...

From your statedump I see that quick-read was very useful for this workload, mainly because the 'hit-rate' was very high compared to the misses, so I wouldn't recommend disabling it permanently; this is only a request for the test period.

Comment 12 Pavel Znamensky 2018-09-04 14:54:35 UTC
Created attachment 1480809 [details]
reports for v4.1 and directory with tons of files and subdirs

Comment 13 Pavel Znamensky 2018-09-04 14:55:00 UTC
I checked version 3.12 and it looks like it already has this problem.
Also, I ran `find -type f ...` on 4.1 on a huge directory with tons of files, and after 12 hours the valgrind process had consumed about 10G RSS on a server with 16G of RAM (I don't know valgrind's memory overhead, but the RSS definitely grew over time).
You can find all info and reports in v4.1_12hours_experiment.zip

About the last 3 tests: in each iteration I ran `find -type f ...` on the same directory (not very big, in fact) and then remounted the storage.
Results:
- without `readdir-ahead`, the gluster process consumed about 270MB RSS (nearly the same as with `readdir-ahead`)
- without `io-cache` and `readdir-ahead`: ~220MB
- without `quick-read`, `io-cache`, and `readdir-ahead`: ~220MB

Our base config:

Volume Name: gl
Type: Replicate
Volume ID: a90603ab-cf81-423d-b28f-3978e5afdcb8
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: srv1:/gl
Brick2: srv2:/gl
Options Reconfigured:
cluster.readdir-optimize: on
client.event-threads: 2
server.event-threads: 16
network.ping-timeout: 20
cluster.data-self-heal-algorithm: full
performance.io-thread-count: 16
performance.io-cache: on
performance.cache-size: 32MB
performance.quick-read: on
transport.address-family: inet6
performance.readdir-ahead: enable
nfs.disable: on

Comment 14 Pavel Znamensky 2018-09-05 12:16:12 UTC
I just tried putting glusterfs inside a cgroup with a memory limit (thanks to systemd).
And it works!
Glusterfs consumes roughly 100MB RSS (the cgroup limit is 100MB), and even the 12-hour test can't force glusterfs to overuse memory!
I'm not sure about performance in this case, but it looks like a workaround for cases when the FUSE client doesn't release memory.
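A minimal sketch of this kind of confinement using systemd-run (assuming cgroup v2); the 100M limit, the foreground flag, and the mount arguments are illustrative assumptions rather than the exact setup used here:

# run the FUSE client in a transient scope with a hard memory cap
systemd-run --scope -p MemoryMax=100M -p MemorySwapMax=0 \
    /usr/sbin/glusterfs -N --process-name fuse \
    --volfile-server=srv --volfile-id=/gl /mnt/gl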

Comment 15 Amar Tumballi 2018-09-05 12:21:33 UTC
Good to hear! It is not a bad option; let us know how it goes over a longer period!

Comment 16 Pavel Znamensky 2018-09-05 16:46:48 UTC
Unfortunately, after ~10 hours of testing, the process was killed.
But during all that time it didn't consume more than 100MB.

Comment 17 Amar Tumballi 2018-09-11 10:17:49 UTC
It would be good to try the runs with SSL enabled and see whether we find anything different.

Comment 18 Pavel Znamensky 2018-09-11 11:12:03 UTC
We have to set up SSL on our production cluster. It's not easy and it looks like it can't be done without downtime.
Can I help you in another way?

Comment 19 Amar Tumballi 2018-09-11 11:18:23 UTC
(In reply to Znamensky Pavel from comment #18)
> We have to set up SSL on our production cluster. It's not easy and it looks
> like it can't be done without downtime.
> Can I help you in another way?

No, Znamensky, let's first try to see whether we can understand the leak locally. Don't disturb the production cluster for now! I guess that with the proper workload it should be possible to identify the leak outside of the production cluster too.

Comment 20 Milind Changire 2018-09-12 05:21:04 UTC
(In reply to Amar Tumballi from comment #17)
> Would be good to try the runs with SSL enabled, and see if we find anything
> different.

Is this for me ?

Comment 21 Pavel Znamensky 2018-11-07 09:30:13 UTC
Is there any news on this?
It is still critical for our production: 46G RSS on v4.1.2 after two months of uptime.

Comment 22 Yan 2018-12-01 20:33:16 UTC
We have seen a similar issue with v4.1.4 in our environment (2-way replication). We don't have SSL enabled either. Please let us know if there is any additional info that would help with the resolution.

Comment 23 Yan 2018-12-12 18:21:17 UTC
The issue was observed in v5.0.1 as well. Can anyone help with a resolution?

Comment 24 Nithya Balachandran 2018-12-15 05:31:58 UTC
(In reply to Yan from comment #23)
> The issue was observed in v5.0.1 as well. Anyone can help with a resolution?

Hi,

Please provide the following on the v5.0.1 client:

1. gluster volume info (I want to check whether anything has changed since comment #13)
2. A statedump of the client process that is consuming a lot of memory (one that is not controlled by cgroups):
 - make sure /var/run/gluster exists on the client machine
 - kill -SIGUSR1 <gluster-fuse-process-pid>
Do this twice with an interval of maybe 30 minutes in between (see the sketch at the end of this comment).

Please send me the glusterdump.* files generated in /var/run/gluster.


After collecting the above, if not already done, disable readdir-ahead on the volume.

gluster v set <volname> performance.readdir-ahead off

and see if the memory continues rising.
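A minimal shell sketch of these collection steps; the way the PID is looked up and the exact interval are illustrative assumptions:

mkdir -p /var/run/gluster                        # statedumps are written here
PID=$(pgrep -f 'glusterfs.*process-name fuse')   # assumed way to find the FUSE client PID
kill -SIGUSR1 "$PID"                             # first statedump
sleep 1800                                       # wait ~30 minutes while memory grows
kill -SIGUSR1 "$PID"                             # second statedump
ls /var/run/gluster/glusterdump.*                # files to send back
# afterwards, disable readdir-ahead as described above:
# gluster v set <volname> performance.readdir-ahead off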

Comment 25 Nithya Balachandran 2018-12-15 06:01:31 UTC
(In reply to Znamensky Pavel from comment #1)
> Created attachment 1479257 [details]
> FUSE client statedump

I just checked this:

pool-name=dict_t                                                                
active-count=1291721   <--- this is really high and points to a leak                                                          
sizeof-type=168                                                                 
padded-sizeof=256                                                               
size=330680576                                                                  
shared-pool=0x7f709e813cc8   




This looks similar to the readdir-ahead leak fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1659432.


Please disable readdir-ahead for the time being and see if the problem goes away.

Comment 26 Yan 2018-12-27 17:33:09 UTC
After disabling the readdir-ahead flag, the memory leak is reduced but not completely eliminated. Please suggest further steps.

Comment 27 Nithya Balachandran 2018-12-28 02:51:56 UTC
(In reply to Yan from comment #26)
> After disable readdir-ahead flag, the memory leak is reduced but not
> completely eliminated. Please suggest further steps.

Please provide the following:

1. statedumps of the client process at intervals where you see an increase in memory
2. A test script to reproduce the issue if available.

Comment 28 Yan 2018-12-29 02:55:28 UTC
Created attachment 1517308 [details]
statedump output with readdir_ahead on and off with gluster 5.1

Comment 29 Yan 2018-12-29 03:13:37 UTC
Please refer to attachment 1517308 [details] for statedump output taken every half hour.

1). The test was done with 5.1. A similar issue has been observed in 3.12.13 and 4.1.4 as well.

# gluster --version
glusterfs 5.1

2). The statedump output above was collected while running a "find" operation every second on the mount directory:

#!/bin/sh
# repeatedly walk the mount; the counter increment is commented out,
# so the loop effectively runs until it is stopped manually

a=0

while [ $a -lt 36000 ]
do
    find /gserver_mount1/  -type f   > /dev/null
    sleep 1
    #a=`expr $a + 1`
done
 

3). Here's the volume info with readdir_ahead on. 

# gluster volume info
 
Volume Name: glustervol1
Type: Replicate
Volume ID: 47aecf8c-de2f-43a5-8cab-64832bd28bd1
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.76.65.246:/mnt/data1/1
Brick2: 10.76.65.247:/mnt/data1/1
Brick3: 10.76.65.248:/mnt/data1/1
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
performance.read-ahead: on
performance.readdir-ahead: on

4). Memory leakage is reduced with readdir_ahead off (refer to Bug 1659432).

5). The section below has an invalid size number:

[mount/fuse.fuse - usage-type gf_fuse_mt_iov_base memusage]
size=18446744073709524184
num_allocs=1
max_size=18446744073709551608
max_num_allocs=22
total_allocs=559978326

Comment 30 Pavel Znamensky 2018-12-29 09:12:22 UTC
I'm sorry for the delay. But I can confirm that a version patched for https://bugzilla.redhat.com/show_bug.cgi?id=1659432, with readdir-ahead disabled, still leaks at nearly the same rate as v4.1.

Comment 31 Nithya Balachandran 2018-12-31 04:18:14 UTC
(In reply to Znamensky Pavel from comment #30)
> I'm sorry for the delay. But I can confirm that patched
> https://bugzilla.redhat.com/show_bug.cgi?id=1659432 version with disabled
> readdir-ahead still leaks with near the same speed as v4.1.

Then it is likely to be because the fuse client does not invalidate inodes. Does your workload access a lot of files? The earlier statedump showed around 3 million inodes in memory. 

pool-name=inode_t
active-count=3050376  <---- Here
sizeof-type=160
padded-sizeof=256
size=780896256
shared-pool=0x5141cc8

Does the statedump without readdir ahead also show a large number of inodes?

https://review.gluster.org/#/c/glusterfs/+/19778/ has a fix to invalidate inodes but is not targeted for release 5 as yet.

Comment 32 Nithya Balachandran 2018-12-31 05:09:01 UTC
(In reply to Yan from comment #29)
> Please refer the attachment 1517308 [details] for statedump output every
> half hour. 
> 
> 1). Test is done with 5.1. Similar issue has been observed in 3.12.13, 4.1.4
> as well. 
> 
> # gluster --version
> glusterfs 5.1
> 
> 2). The above statedump output is done with a "find" operation every second
> on the mount dir. 
> 


Observations:

For the statedumps captured with readdir-ahead off:
1. There is no increase in the number of inodes. Did you see the memory rising continuously after the first run (where gluster will create and cache the inodes) while the test was run in this case? If yes, how much?
2. The number of dict_t allocs also remains constant, though high. Was readdir-ahead initially on for this volume, and was it remounted after turning readdir-ahead off? If not, the high number is due to the leak while readdir-ahead was enabled, since that memory will not be freed when readdir-ahead is disabled but the volume is not remounted.

With readdir-ahead on:
1. The inodes do not increase. dict_t does, but that is a known issue, so I would expect the memory to rise constantly.

Comment 33 Pavel Znamensky 2019-01-01 18:52:17 UTC
(In reply to Nithya Balachandran from comment #31)
> Then it is likely to be because the fuse client does not invalidate inodes.
> Does your workload access a lot of files? The earlier statedump showed
> around 3 million inodes in memory. 
>
>...
>
> https://review.gluster.org/#/c/glusterfs/+/19778/ has a fix to invalidate
> inodes but is not targeted for release 5 as yet.


Nithya, you're right!
I built glusterfs from the current master (https://github.com/gluster/glusterfs/tree/d9a8ccd354df6db94477bf9ecb09735194523665) with the new inode invalidation mechanism you mentioned, and RSS memory consumption indeed became much lower.
And, as you supposed, our apps access a lot of files quite often.
Here are two tests with clients on v6dev and v4.1 (the server is still on v4.1 and read-ahead=on).

The first test with default --lru-limit=0 (just did `find /in/big/dir -type f`):

v4.1 - ~3GB RSS:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       633  6.6 18.5 3570216 3056136 ?     Ssl  19:44   6:25 /usr/sbin/glusterfs --read-only --process-name fuse --volfile-server=srv --volfile-id=/st1 /mnt/st1

v6dev - ~1.5GB RSS:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     10851 16.5  9.2 2071036 1526456 ?     Ssl  19:45  15:50 /usr/sbin/glusterfs --read-only --process-name fuse --volfile-server=srv --volfile-id=/st1 /mnt/st1

It looks good. Let's do the next test.
The second test with --lru-limit=10_000 for v6dev:

v4.1 - ~3GB RSS:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      3589  4.7 18.6 3570216 3060364 ?     Ssl  13:11  18:40 /usr/sbin/glusterfs --process-name fuse --volfile-server=srv --volfile-id=/st1 /mnt/st1

v6dev - ~170MB RSS:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     24152 14.2  1.0 758768 173704 ?       Ssl  13:58  49:06 /usr/sbin/glusterfs --lru-limit=10000 --process-name fuse --volfile-server=srv --volfile-id=/st1 /mnt/st1

170MB vs. 3GB!
It's incredible!
Unfortunately, the new version has a drawback: CPU time increased 2.5x. At the moment this doesn't matter for us.
Anyway, I'm sure this change solves our problem, and of course we're looking forward to a stable release that includes it.
Thanks a lot!

Comment 34 Travers Carter 2019-01-02 05:51:33 UTC
We are seeing what looks like the same issue on glusterfs 4.1.5

I'm not sure if further information is still needed, given the last few comments, but I've collected client statedumps from three systems, along with the gluster volume info, here:

https://s3.amazonaws.com/public-rhbz/glusterfs-client-4.1.5-statedumps.zip

This includes 2 x client statedumps from each of 3 systems taken roughly 30 to 60 minutes apart

The "webserver" and "appserver-active" gluster clients were restarted after setting readdir-ahead to off as suggested earlier in the ticket (this didn't seem to help much in this case), but the "webserver" client has already reached about 15GiB VIRT in just over 48 hours.

We had also historically seen somewhat slower, but still significant, FUSE client memory leaks on v3.x (I think 3.11 or 3.12), but not (or at least not significantly) on 3.7.11 or 4.0.2 with very similar workloads.

Comment 35 Travers Carter 2019-01-03 05:43:36 UTC
I believe that I'm able to artificially trigger this using fs_mark, for example:

mkdir /srv/gluster/fsmark
cd /srv/gluster/fsmark
fs_mark -L 500 -d $PWD -v -S 0 -D 128 -n 1000 -s $[8*1024]


That's 500 rounds of 128 threads, each creating and deleting 1000 8KiB files in a per-thread subdirectory, where /srv/gluster is a gluster volume mounted with the FUSE client.

Comment 36 Nithya Balachandran 2019-01-03 05:58:35 UTC
(In reply to Znamensky Pavel from comment #33)
> (In reply to Nithya Balachandran from comment #31)
> > Then it is likely to be because the fuse client does not invalidate inodes.
> > Does your workload access a lot of files? The earlier statedump showed
> > around 3 million inodes in memory. 
> >
> >...
> >
> > https://review.gluster.org/#/c/glusterfs/+/19778/ has a fix to invalidate
> > inodes but is not targeted for release 5 as yet.
> 
> 
> Nithya, you're right!
> I built glusterfs from the current master
> (https://github.com/gluster/glusterfs/tree/
> d9a8ccd354df6db94477bf9ecb09735194523665) with the new invalidate inodes
> mechanism that you mentioned before, and RSS memory consumption indeed
> became much lower.
> And as you supposed our apps quite often access a lot of files.
> Here are two tests with clients on v6dev and v4.1 (the server is still on
> v4.1 and read-ahead=on)
> 
> The first test with default --lru-limit=0 (just did `find /in/big/dir -type
> f`):
> 
> v4.1 - ~3GB RSS:
> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root       633  6.6 18.5 3570216 3056136 ?     Ssl  19:44   6:25
> /usr/sbin/glusterfs --read-only --process-name fuse --volfile-server=srv
> --volfile-id=/st1 /mnt/st1
> 
> v6dev - ~1.5GB RSS:
> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root     10851 16.5  9.2 2071036 1526456 ?     Ssl  19:45  15:50
> /usr/sbin/glusterfs --read-only --process-name fuse --volfile-server=srv
> --volfile-id=/st1 /mnt/st1
> 
> It looks good. Let's do the next test.
> The second test with --lru-limit=10_000 for v6dev:
> 
> v4.1 - ~3GB RSS:
> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root      3589  4.7 18.6 3570216 3060364 ?     Ssl  13:11  18:40
> /usr/sbin/glusterfs --process-name fuse --volfile-server=srv
> --volfile-id=/st1 /mnt/st1
> 
> v6dev - ~170MB RSS:
> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root     24152 14.2  1.0 758768 173704 ?       Ssl  13:58  49:06
> /usr/sbin/glusterfs --lru-limit=10000 --process-name fuse
> --volfile-server=srv --volfile-id=/st1 /mnt/st1
> 
> 170MB vs. 3GB!
> It's incredible!
> Unfortunately, the new version has a drawback - CPU time increased 2.5x
> times. At the moment it doesn't matter for us.
> Anyway, I'm sure this change solves our problem. And of course, we're
> looking forward to a stable version with it.
> Thank you a lot!

Thank you for testing this. I'm glad to hear the patch is working as expected to keep the memory use down.

Comment 37 Amar Tumballi 2019-01-03 06:06:23 UTC
> Unfortunately, the new version has a drawback - CPU time increased 2.5x
> times. At the moment it doesn't matter for us.
> Anyway, I'm sure this change solves our problem. And of course, we're
> looking forward to a stable version with it.

While a release with this patch merged and tested is another 50 days away, we would certainly like to reduce the CPU load you are seeing as well. Whenever you get time, if you can capture CPU info with "perf record -ag --call-graph=dwarf -o perf.data -p <pid of glusterfs process>" and then check "perf report" to see what actually caused the CPU usage, it will help us resolve that too.

Also note that lru-limit=10000 may not be a good value when many files are accessed; I recommend at least something like 64k. But it depends on your memory budget too, so if you can give glusterfs 512MB - 1GB of RAM, that is better, at least for performance.
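For reference, a minimal sketch of passing a larger limit at mount time; the 65536 value and the server/volume names are illustrative, and the mount(8) option form is an assumption that may vary between releases:

# direct invocation with a larger LRU limit (same form as in the tests above)
/usr/sbin/glusterfs --lru-limit=65536 --process-name fuse \
    --volfile-server=srv --volfile-id=/gl /mnt/gl

# assumed equivalent via mount(8)
mount -t glusterfs -o lru-limit=65536 srv:/gl /mnt/gl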

Comment 38 Pavel Znamensky 2019-01-15 16:46:35 UTC
(In reply to Amar Tumballi from comment #37)
> While a release with this patch merged/tested is another 50days away, we
> surely would like to reduce the CPU load you see too. Whenever you get time,
> if you can capture CPU info with below tool "perf record -ag
> --call-graph=dwarf -o perf.data -p <pid of glusterfs process>", and then see
> "perf report" to see what actually caused the CPU usage, it will help us to
> resolve that too.

I'm sorry for the delay. I ran `find` on a directory with ~200,000 files on v6 with lru-limit, on v6 without lru-limit, and on v4.1.
`ps aux` shows this:

v6 with lru-limit = 10_000
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     30968  2.2  0.3 637936 54056 ?        Ssl  18:15   0:11 /usr/sbin/glusterfs --read-only --lru-limit=10000 --process-name fuse --volfile-server=srv --volfile-id=/st1 /mnt/st1

v6 without lru-limit
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     31193  2.0  2.0 834544 330680 ?       Ssl  18:32   0:09 /usr/sbin/glusterfs --read-only --process-name fuse --volfile-server=srv --volfile-id=/st1 /mnt/st1

v4.1
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     31416  1.3  2.0 817704 344996 ?       Ssl  18:41   0:06 /usr/sbin/glusterfs --read-only --process-name fuse --volfile-server=srv --volfile-id=/st1 /mnt/st1

Also, I've got `perf` reports, but I'm afraid I can't interpret them correctly. Unfortunately, these reports contain sensitive information, so I can't attach them to this issue. Nevertheless, if it would be helpful, I could send them to you directly.

(In reply to Amar Tumballi from comment #37)
> Also note, lru-limit=10000 while many files are accessed may not be a good
> value. I recommend something like 64k at least. But well, it depends on your
> memory needs too. So, if you can give 512MB - 1GB RAM for glusterfs, its
> better at least for performance.

Thanks for the advice!

Also, I suppose I should create a new issue for the CPU problem, shouldn't I?

Comment 39 Worker Ant 2019-01-16 16:00:10 UTC
REVIEW: https://review.gluster.org/22014 (fuse: add --lru-limit option) merged (#3) on release-5 by Shyamsundar Ranganathan

Comment 40 Shyamsundar 2019-01-22 14:08:49 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.3, please open a new bug report.

glusterfs-5.3 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2019-January/000118.html
[2] https://www.gluster.org/pipermail/gluster-users/

