Bug 1738878
| Field | Value |
|---|---|
| Summary | FUSE client's memory leak |
| Product | [Community] GlusterFS |
| Component | core |
| Status | CLOSED NEXTRELEASE |
| Severity | low |
| Priority | high |
| Version | 5 |
| Reporter | Sergey Pleshkov <s.pleshkov> |
| Assignee | Csaba Henk <csaba> |
| CC | bugs, nbalacha, pasik, s.pleshkov |
| Hardware | Unspecified |
| OS | Linux |
| Type | Bug |
| Last Closed | 2019-10-25 05:03:40 UTC |
| Attachments | statedump1 (attachment 1609209) |

Description
Sergey Pleshkov
2019-08-08 10:38:26 UTC
Shared two statedumps: https://cloud.hostco.ru/s/w9MY6jj5Hpj2qoa

Server and client OS: Red Hat Enterprise Linux Server release 7.6 (Maipo) / Red Hat Enterprise Linux Server release 7.5 (Maipo). When the client used the gluster client from the RH repo (version 3.12), the situation was the same.

If it isn't a version-specific bug, would you have suggestions as to what it could be? Which gluster volume options should be checked, and so on?

(In reply to Sergey Pleshkov from comment #2)
Please provide the gluster volume info for this volume. Do you have any script/steps we can use to reproduce the leak?

    [root@LSY-GL-01 host]# gluster volume info PROD
    Volume Name: PROD
    Type: Replicate
    Volume ID: f54a0ce9-d2ec-4d44-a1f8-c53cf1c49a52
    Status: Started
    Snapshot Count: 0
    Number of Bricks: 1 x 3 = 3
    Transport-type: tcp
    Bricks:
    Brick1: lsy-gl-01:/diskForData/prod
    Brick2: lsy-gl-02:/diskForData/prod
    Brick3: lsy-gl-03:/diskForData/prod
    Options Reconfigured:
    performance.readdir-ahead: off
    client.event-threads: 24
    server.event-threads: 24
    server.allow-insecure: on
    features.shard-block-size: 64MB
    features.shard: on
    network.ping-timeout: 5
    transport.address-family: inet
    nfs.disable: on
    performance.client-io-threads: off
    performance.io-thread-count: 24
    cluster.heal-timeout: 120

This problem arose on one production client, so I can't immediately check which steps reproduce it without interrupting business processes. I will try to reproduce the behavior on the test cluster and let you know.

Hi Sergey, can you please take statedumps at regular intervals during your test (say, every 30 minutes, but feel free to adjust in light of the dynamics of the situation) so that we can observe the progress, and tar them up and attach them to the bug?

Hi everybody, I will try to reproduce this problem in a test environment next week and will take statedumps during testing.

Created attachment 1609209 [details]
statedump1
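
(For reference, periodic FUSE client statedumps of the kind requested above can be taken by sending SIGUSR1 to the glusterfs client process; by default the dumps are written to /var/run/gluster as glusterdump.<pid>.dump.<timestamp>, the naming seen later in this bug. The following is only a sketch: the mount point, interval and dump count are illustrative assumptions, not values from this bug.)

    #!/bin/sh
    # Sketch only: take a statedump of the FUSE client every 30 minutes
    # during the test, then archive the dumps for attaching to the bug.
    # /mnt/prod, the 30-minute interval and the count of 12 are assumptions.
    MOUNT=/mnt/prod
    PID=$(pgrep -f "glusterfs.*$MOUNT")   # assumes a single glusterfs client for this mount
    i=0
    while [ $i -lt 12 ]
    do
        kill -USR1 "$PID"                 # SIGUSR1 makes glusterfs write a statedump
        sleep 1800
        i=`expr $i + 1`
    done
    tar czf statedumps.tar.gz /var/run/gluster/glusterdump."$PID".dump.*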
Hello. Yesterday I ran tests on the problem client: the find and chmod commands on the gluster share. The glusterfs process continuously eats RAM while they run and does not free it afterwards. On another client that uses glusterfs version 3.12.2 (from the RHEL 7 repo) I encountered a similar situation: the glusterfs process eats RAM and does not free it either (but it grows very slowly). On other clients that access the same gluster volume, RAM is also eaten up while the find and chmod tests run, but it is freed once the tests stop.

I collected a few statedumps from the problem client and put them in the cloud: https://cloud.hostco.ru/s/w9MY6jj5Hpj2qoa

In the near future I plan to upgrade glusterfs on the client to version 6.5 and set lru-limit (I don't know what else I can do about this problem). Do you have any advice about it?

Script to reproduce the problem:

    #!/bin/sh
    a=0
    while [ $a -lt 36000 ]
    do
        find $gluster_mount_point -type f > /dev/null
        sleep 1
        a=`expr $a + 1`
    done
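
(For reference, the lru-limit mentioned above is settable on 6.x clients, presumably via the lru-limit FUSE mount option. A minimal sketch of setting it: the server and volume names come from the volume info earlier in this bug, while the mount point and the value 65000 (the value seen in the later statedump) are assumptions.)

    # Sketch only: mount the PROD volume with a bounded inode LRU list.
    # lsy-gl-01 and PROD come from the volume info above; /mnt/prod and
    # the limit of 65000 are assumptions for illustration.
    mount -t glusterfs -o lru-limit=65000 lsy-gl-01:/PROD /mnt/prod

    # Equivalent /etc/fstab entry:
    # lsy-gl-01:/PROD  /mnt/prod  glusterfs  defaults,_netdev,lru-limit=65000  0 0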
Hello. Yesterday I upgraded the client to version 6.5 and set the lru-limit; the continuous consumption of RAM was solved by this workaround. I gathered a couple of statedumps if anybody wants to see them: https://cloud.hostco.ru/s/w9MY6jj5Hpj2qoa

But I ran into another problem after this update. The server software is version 5.5 and the client is version 6.5, and every time I write a file to a mounted shared folder I see this error in the logs (with or without the lru-limit option):

    [2019-08-30 08:31:04.763118] E [fuse-bridge.c:220:check_and_dump_fuse_W] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f361d877a3b] (--> /usr/lib64/glusterfs/6.5/xlator/mount/fuse.so(+0x81d1)[0x7f3614c261d1] (--> /usr/lib64/glusterfs/6.5/xlator/mount/fuse.so(+0x8aaa)[0x7f3614c26aaa] (--> /lib64/libpthread.so.0(+0x7dd5)[0x7f361c6b5dd5] (--> /lib64/libc.so.6(clone+0x6d)[0x7f361bf7dead] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory

Files are created in the shared folder, I can see them on other clients, and their contents can be updated. Do I need to open another bug for this issue?

Yes, this is a known phenomenon and basically harmless, if not too frequent. The effect of the lru-limit patch is that the glusterfs client asks the kernel to drop those inodes which have been inactive for a long time. (The client can't get rid of them on its own; it needs to keep the inode context as long as the kernel holds a reference to them. The kernel indicates to the glusterfs client when it abandons all references to an inode (this is called "forgetting the inode"), so this condition is known to the client; what the glusterfs client can do is ask the kernel to evict an inode from its caches (this is called "inode invalidation"), which usually implies abandoning references to it and forgetting it.) However, this is a naturally racy situation: by the time the glusterfs client sends the kernel the request to invalidate a given inode, the kernel might have already forgotten it, and the reference sent with the invalidation request is dangling. The kernel provides feedback about this situation by failing the write to /dev/fuse that carries the invalidation request with errno ENOENT, "No such file or directory". If this occurs, there is nothing wrong about it in itself. However, if this scenario proliferates, it indicates that the glusterfs client is getting overwhelmed by its own invalidation requests, so that they accumulate faster than they can be processed and written to /dev/fuse. This might result in thrashing performance (while being useless in terms of inode footprint reduction). To overcome this, we introduced a tunable that stops filing invalidation requests once the number of outstanding invalidation requests hits a threshold. This is implemented in https://review.gluster.org/23187. Do you think you need, or would be interested in trying, this patch?

Hello. Users of this client software do not complain about performance at the moment; this error just bothered me. Will this patch be included in client version 6.6? Since this client is in production, the software is installed from the repository, and there are no user complaints, I think I will wait for the next release. Thanks for the clarification of the error.

Hello. After 6 days of using client 6.5 with the lru-limit option, the client's glusterfs process is still accumulating RAM (growing from 0.9% to 1.3%). I took two statedumps, one immediately after installing the client software and one today; everything looks the same as before installing 6.5. If I'm not mistaken, the dict_t mempool is filling up. https://cloud.hostco.ru/s/w9MY6jj5Hpj2qoa

Are there any other suggestions or tips for this situation?

From the statedump:

    [mount/fuse.fuse - usage-type gf_fuse_mt_iov_base memusage]
    size=117432480
    num_allocs=212740

This is likely due to the bug that was fixed in https://review.gluster.org/#/c/glusterfs/+/23016/. That fix needs to be backported to release-6 as well.

The lru-limit has been set to 65000, so there are still over 65K inodes in memory. Those will not be freed until the entries are deleted or the client is remounted. The memory for these inodes, dentries and the associated per-inode information of the various xlators will use up memory.

    [nbalacha@dhcp35-62 Downloads]$ grep -A3 -B3 lru_size glusterdump.6947.dump.1567662323
    xlator.mount.fuse.itable.name=meta-autoload/inode
    xlator.mount.fuse.itable.lru_limit=65000
    xlator.mount.fuse.itable.active_size=4613
    xlator.mount.fuse.itable.lru_size=64996
    xlator.mount.fuse.itable.purge_size=0
    xlator.mount.fuse.itable.invalidate_size=0

You could reduce the lru-limit value further; that will lower the maximum number of inodes held in memory and the associated structures the various xlators create per inode. For the dict_t, we will need more information before we can proceed. Do you have a test script that reproduces the problem?
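
(Following the suggestion above to reduce lru-limit further, a hedged sketch of remounting with a lower value and checking the FUSE inode table counters in a fresh statedump; the mount point, the value 16384 and the dump path are assumptions, not recommendations from this bug.)

    # Sketch only: remount with a lower lru-limit, then trigger a statedump
    # and inspect the inode table sizes. /mnt/prod and 16384 are illustrative.
    umount /mnt/prod
    mount -t glusterfs -o lru-limit=16384 lsy-gl-01:/PROD /mnt/prod

    kill -USR1 $(pgrep -f "glusterfs.*/mnt/prod")   # write a new statedump
    sleep 2
    grep -E 'itable\.(lru_limit|lru_size|active_size)' /var/run/gluster/glusterdump.*.dump.*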