[Migrated from savannah BTS] - bug 26626 [https://savannah.nongnu.org/bugs/?26626]
The valgrind report shows a huge number of 'still reachable' memory blocks because glusterfs does not destroy data structures that are still in use at the time the unmount is done. These 'still reachable' reports should go away once we have a more graceful shutdown of glusterfs. -- Gowda
Wed 20 May 2009 06:28:22 AM GMT, original submission:

Running anything that scans the directory tree recursively (du, ls -R) leads to increasing memory consumption by the glusterfs client. Here is the most remarkable part of the report from running the glusterfs client under valgrind --leak-check=full --show-reachable=yes:

==15979== 104,480,064 bytes in 251,154 blocks are still reachable in loss record 73 of 73
==15979==    at 0x68EEC77: calloc (in /usr/local/lib/valgrind/x86-linux/vgpreload_memcheck.so)
==15979==    by 0x6915D22: __inode_create (inode.c:450)
==15979==    by 0x6915D7C: inode_new (inode.c:466)
==15979==    by 0x6B1B2A1: fuse_lookup (fuse-bridge.c:441)
==15979==    by 0x6B3CEE7: do_lookup (fuse_lowlevel.c:444)
==15979==    by 0x6B3E9DD: fuse_ll_process (fuse_lowlevel.c:1182)
==15979==    by 0x6B403AA: fuse_session_process (fuse_session.c:90)
==15979==    by 0x6B232BF: fuse_thread_proc (fuse-bridge.c:2486)
==15979==    by 0x6942A57: pthread_start_thread (in /lib/libpthread-0.10.so)
==15979==    by 0x6A2E2E9: clone (in /lib/libc-2.3.6.so)

I have no idea whether the roots of the error are in glusterfs or libfuse, but either way, glusterfs depends on fuse. The full report is attached as a compressed file. If you need more tests, feel free to contact me.

--------------------------------------------------------------------------------

Mon 22 Jun 2009 05:57:23 PM GMT, comment #1 by Raghavendra <raghavendra>:

Hi,

Inodes are freed when the kernel sends a forget on them. In other words, the extent to which inodes are cached in glusterfs depends on the kernel. Hence you are seeing huge memory consumption, but it is not a memory leak. Doing echo 3 > /proc/sys/vm/drop_caches should bring down the memory consumption. Btw, I did not find any attachment. Am I missing anything?

regards,
Raghavendra.

--------------------------------------------------------------------------------

Tue 23 Jun 2009 07:30:14 AM GMT, comment #2 by Krzysztof Strasburger <strasbur>:

Amar Tumballi advised me to set drop_caches to 3, but it did not help. I understand that it is not a memory leak, as valgrind claims the pointers are not forgotten. Repeating the same operation does not cause additional memory allocations, so the inodes are cached and used as needed. However, it is not good to see all your memory consumed forever only because somebody ran du on a big directory tree. My attachment is still accessible; I even tried to download it, to be sure. You can download it directly via http://savannah.nongnu.org/bugs/download.php?file_id=18169
Need verification
Verified with 3.0.1rc1
This bug still exists in 3.0.x and has been confirmed by other users. Neither the performance translators nor networking cause these excessive memory allocations. The bug can be triggered even with a trivial, serverless setup:

volume loopback
  type storage/posix
  option directory /root/loopback
end-volume

by running du or ls -R in the glusterfs-mounted directory (containing a large number of files).
experienced:
- FORGET messages are sent when the kernel garbage collects dentries, as expected
- despite FORGET-s, dentries are not freed
(In reply to comment #6)
> experienced:
> - FORGET messages are sent when the kernel garbage collects dentries, as expected
> - despite FORGET-s, dentries are not freed
                      ^^^^^^^^
(I mean inodes, on the glusterfs side)
We tried adding several counters inside the inode table and noticed that all the inodes created do get destroyed. Hence we currently think this is a memory-management fragmentation issue. It can be avoided by using our own 'mem-pool', which will be in the 3.1.x releases (though maybe not 3.1.0).
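To illustrate the idea behind such a 'mem-pool' (a minimal sketch only, not the actual GlusterFS implementation or its API): fixed-size objects are carved out of one pre-allocated slab and recycled through a free list, so frequent inode allocation/destruction never goes back to the general-purpose heap and therefore cannot fragment it.

/* Minimal fixed-size object pool sketch (illustrative only; the real
 * glusterfs mem-pool differs).  All objects come from one calloc'd slab,
 * so "freeing" an object just pushes it back onto a free list instead of
 * going through glibc's malloc/free. */
#include <stdlib.h>
#include <string.h>

struct mem_pool {
        void   *slab;       /* one big pre-allocated block            */
        void   *free_list;  /* singly linked list of returned objects */
        size_t  obj_size;   /* size of each object (>= sizeof(void*)) */
        size_t  count;      /* total number of objects in the slab    */
};

static struct mem_pool *
mem_pool_new (size_t obj_size, size_t count)
{
        struct mem_pool *pool = calloc (1, sizeof (*pool));
        size_t           i;

        if (!pool)
                return NULL;
        if (obj_size < sizeof (void *))
                obj_size = sizeof (void *);

        pool->slab = calloc (count, obj_size);
        if (!pool->slab) {
                free (pool);
                return NULL;
        }
        pool->obj_size = obj_size;
        pool->count    = count;

        /* thread every object onto the free list */
        for (i = 0; i < count; i++) {
                void *obj = (char *) pool->slab + i * obj_size;
                *(void **) obj  = pool->free_list;
                pool->free_list = obj;
        }
        return pool;
}

static void *
mem_get (struct mem_pool *pool)
{
        void *obj = pool->free_list;

        if (!obj)
                return NULL;            /* pool exhausted */
        pool->free_list = *(void **) obj;
        memset (obj, 0, pool->obj_size);
        return obj;
}

static void
mem_put (struct mem_pool *pool, void *obj)
{
        *(void **) obj  = pool->free_list;
        pool->free_list = obj;
}

A pool like this trades flexibility for predictability: the slab is sized once, so the allocator's footprint stays constant regardless of how many inode create/destroy cycles happen.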
(In reply to comment #8)
> We tried adding several counters inside the inode table and noticed that all
> the inodes created do get destroyed. Hence we currently think this is a
> memory-management fragmentation issue. It can be avoided by using our own
> 'mem-pool', which will be in the 3.1.x releases (though maybe not 3.1.0).

I can confirm. As far as I've seen, inodes get freed properly and malloc/free is used properly too. However, the glibc malloc implementation implies that a huge buffer is mmap-ed at some point... I guess an algorithm similar to the "readlink_malloc" sample function of the glibc manual is used: http://gnu.org/software/libc/manual/html_node/Symbolic-Links.html#index-readlink-1468 (i.e. double the buffer size until it suffices). I was going to look into this in detail, to see whether this conjecture is right and whether there are better alternatives... but if there is already a plan to tackle it, then I won't spend more time on this.
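For reference, the doubling pattern referred to above looks roughly like this (adapted from the glibc manual's readlink_malloc sample; a self-contained sketch, not code taken from glusterfs):

/* Doubling-buffer pattern, adapted from the "readlink_malloc" sample in
 * the glibc manual: keep growing the buffer until readlink() fits.
 * Illustrative sketch only. */
#include <stdlib.h>
#include <unistd.h>

char *
readlink_malloc (const char *filename)
{
        size_t  size   = 100;
        char   *buffer = NULL;

        while (1) {
                char *tmp = realloc (buffer, size);
                if (!tmp) {
                        free (buffer);
                        return NULL;
                }
                buffer = tmp;

                ssize_t nchars = readlink (filename, buffer, size);
                if (nchars < 0) {
                        free (buffer);
                        return NULL;
                }
                if ((size_t) nchars < size) {
                        buffer[nchars] = '\0';  /* readlink does not NUL-terminate */
                        return buffer;
                }
                size *= 2;      /* buffer too small: double and retry */
        }
}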
(In reply to comment #7)
> (In reply to comment #6)
> > experienced:
> > - FORGET messages are sent when the kernel garbage collects dentries, as expected
> > - despite FORGET-s, dentries are not freed
>                       ^^^^^^^^
> (I mean inodes, on the glusterfs side)

Further correction: inodes do get freed, just the glusterfs virtual memory size (i.e. the pool/arena used by malloc) doesn't shrink. Cf. comments #8, #9.
(In reply to comment #10)
> (In reply to comment #7)
> > (In reply to comment #6)
> > > experienced:
> > > - FORGET messages are sent when the kernel garbage collects dentries, as expected
> > > - despite FORGET-s, dentries are not freed
> >                       ^^^^^^^^
> > (I mean inodes, on the glusterfs side)
>
> Further correction: inodes do get freed, just the glusterfs virtual memory size
> (i.e. the pool/arena used by malloc) doesn't shrink. Cf. comments #8, #9.

Csaba, I think when you checked, http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=813 was not fixed in your tree yet.
> Csaba, I think when you checked,
> http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=813 was not fixed in
> your tree yet.

Krishna,

The bug you are pointing to is just an afr ref leak. Bug 61 is not a leak; it is a malloc/arena glibc issue. It is reproducible over just a posix volume by running ls -lR and du.

Such a case is fuse_thread_proc:3103:

  iov_in[0].iov_base = CALLOC (1, msg0_size);

Internally glibc does an mmap and it never gets freed during forget, as there is no explicit munmap called.

Csaba suggested using tcmalloc from Google perf tools, which would help us avoid the glibc malloc issues.
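The arena behaviour under discussion can be reproduced with a small stand-alone program (an illustrative sketch, not glusterfs code): after freeing a large number of small heap blocks, the process size typically does not shrink unless glibc is explicitly asked to trim the arena, e.g. with malloc_trim(3). The exact outcome depends on the glibc version and allocation pattern.

/* Stand-alone sketch of glibc arena retention (not glusterfs code).
 * Allocate many small blocks, free them all, and observe that the
 * process size usually stays high until malloc_trim() is called. */
#include <malloc.h>   /* malloc_trim(), glibc-specific */
#include <stdio.h>
#include <stdlib.h>

#define NBLOCKS 1000000
#define BLKSIZE 100           /* well below the mmap threshold */

int
main (void)
{
        static void *blocks[NBLOCKS];
        size_t       i;

        for (i = 0; i < NBLOCKS; i++)
                blocks[i] = calloc (1, BLKSIZE);
        printf ("allocated; check VSZ/RSS now (e.g. with ps or top)\n");
        getchar ();

        for (i = 0; i < NBLOCKS; i++)
                free (blocks[i]);
        printf ("freed; memory is usually NOT returned to the kernel yet\n");
        getchar ();

        malloc_trim (0);      /* ask glibc to give free heap pages back */
        printf ("after malloc_trim(0); VSZ/RSS should have dropped\n");
        getchar ();

        return 0;
}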
(In reply to comment #12)
> > Csaba, I think when you checked,
> > http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=813 was not fixed in
> > your tree yet.
>
> Krishna,
>
> The bug you are pointing to is just an afr ref leak. Bug 61 is not a leak; it
> is a malloc/arena glibc issue. It is reproducible over just a posix volume by
> running ls -lR and du.
>
> Such a case is fuse_thread_proc:3103:
>
>   iov_in[0].iov_base = CALLOC (1, msg0_size);
>
> Internally glibc does an mmap and it never gets freed during forget, as there
> is no explicit munmap called.
>
> Csaba suggested using tcmalloc from Google perf tools, which would help us
> avoid the glibc malloc issues.

Harsha,

Yes, I know that 813 is an inode leak and 61 is suspected to be a memory fragmentation issue. I suspected Csaba might have been hitting 813 (i.e. the kernel sends forgets but the inodes stay in the inode table and are not freed). But if he tried with just storage/posix, then 813 cannot be in the picture.
(In reply to comment #13)
> But if he tried with just storage/posix, then 813 cannot be in the picture.

Yes, Krishna, that's the case; I used a minimalistic configuration with just storage/posix.
OK, I finally found a workaround. One has to periodically write the value 2 to /proc/sys/vm/drop_caches (we want to get rid of the inode/dentry caches only, so 3 is overkill). Instead of running an update daemon like this:

  while [ 1 ]; do echo 2 > /proc/sys/vm/drop_caches; sleep 10; done

I would propose moving this task into the glusterfs daemon and making it a configurable option. If active, after reaching a (configurable) limit of inode entries, "2" would be written to the proc file (with timekeeping, so that it is not repeated too frequently).
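An illustrative sketch of that proposed option (this is not the attached patch and not glusterfs code; the names and limits below are hypothetical):

/* Hypothetical sketch: when the number of live inodes exceeds a
 * configurable limit, write "2" to /proc/sys/vm/drop_caches -- but at
 * most once per interval, so the system-wide flush is not repeated too
 * frequently. */
#include <stdio.h>
#include <time.h>

#define INODE_LIMIT        100000   /* hypothetical configurable limit */
#define DROP_MIN_INTERVAL  60       /* seconds between two drops       */

static time_t last_drop = 0;

/* Called from wherever the live inode count is known, e.g. after a lookup. */
static void
maybe_drop_caches (size_t live_inodes)
{
        time_t now = time (NULL);

        if (live_inodes < INODE_LIMIT)
                return;
        if (now - last_drop < DROP_MIN_INTERVAL)
                return;                 /* rate-limit the flushes */

        FILE *fp = fopen ("/proc/sys/vm/drop_caches", "w");
        if (!fp)
                return;                 /* needs root; fail silently here */
        fputs ("2\n", fp);              /* 2 = drop dentries and inodes   */
        fclose (fp);

        last_drop = now;
}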
Created attachment 362

This is a "proof of concept" patch against 2.0.10rc1. It works on my site :). No more excessive allocations, as the kernel is forced to send forgets instead of waiting until the whole memory is filled with cache entries.
drop_caches has a system-wide effect, not one on glusterfs alone; hence we would not want your patch to be merged into our mainline. We plan to use the fuse invalidation APIs to keep the inode table trim in the future. Thanks for your patch; if it works fine in your environment, please continue to use it.
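For reference, the libfuse low-level invalidation calls alluded to here (available in libfuse 2.8+) let the filesystem proactively evict kernel dentry/inode cache entries, which in turn triggers FORGETs for that one mount only. A minimal sketch, assuming a libfuse 2.x header and placeholder values for the channel and inode numbers (not how glusterfs actually wires this up):

/* Sketch of the FUSE 2.8+ low-level invalidation API (placeholder values;
 * not glusterfs code).  Compile against libfuse 2.x, typically with
 * `pkg-config --cflags --libs fuse`. */
#define FUSE_USE_VERSION 26
#include <fuse_lowlevel.h>
#include <string.h>

static void
invalidate_example (struct fuse_chan *ch, fuse_ino_t parent,
                    const char *name, fuse_ino_t ino)
{
        /* Drop the cached attributes/data of one inode. */
        fuse_lowlevel_notify_inval_inode (ch, ino, 0, 0);

        /* Drop the dentry "name" under "parent" from the kernel cache. */
        fuse_lowlevel_notify_inval_entry (ch, parent, name, strlen (name));
}

Unlike drop_caches, this is scoped to the fuse mount in question, which is why it is the preferred long-term fix.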
I didn't even expect you to include my patch in the mainline - it is a stopgap! The fact that it works for me confirms your earlier conclusion that the whole problem is related to memory fragmentation and possibly exposes a longstanding memory management issue in glibc. BTW, I'm aware of the side effect, but IMHO frequently flushing the inode/dentry caches system-wide is better than consuming the whole memory. Anyway, using the fuse API for invalidation would be much wiser.