[Migrated from savannah BTS] - bug 26626 [https://savannah.nongnu.org/bugs/?26626]
The valgrind report shows a huge number of 'still reachable' memory blocks because glusterfs does not destroy data structures that are still in use at the time the unmount is done. These 'still reachable' reports should go away once we have a more graceful shutdown of glusterfs. -- Gowda
Wed 20 May 2009 06:28:22 AM GMT, original submission:

Running anything that scans the directory tree recursively (du, ls -R) leads to increasing memory consumption by the glusterfs client. Here is the most remarkable part of the report from running the glusterfs client under valgrind --leak-check=full --show-reachable=yes:

==15979== 104,480,064 bytes in 251,154 blocks are still reachable in loss record 73 of 73
==15979==    at 0x68EEC77: calloc (in /usr/local/lib/valgrind/x86-linux/vgpreload_memcheck.so)
==15979==    by 0x6915D22: __inode_create (inode.c:450)
==15979==    by 0x6915D7C: inode_new (inode.c:466)
==15979==    by 0x6B1B2A1: fuse_lookup (fuse-bridge.c:441)
==15979==    by 0x6B3CEE7: do_lookup (fuse_lowlevel.c:444)
==15979==    by 0x6B3E9DD: fuse_ll_process (fuse_lowlevel.c:1182)
==15979==    by 0x6B403AA: fuse_session_process (fuse_session.c:90)
==15979==    by 0x6B232BF: fuse_thread_proc (fuse-bridge.c:2486)
==15979==    by 0x6942A57: pthread_start_thread (in /lib/libpthread-0.10.so)
==15979==    by 0x6A2E2E9: clone (in /lib/libc-2.3.6.so)

I have no idea whether the roots of the error are in glusterfs or libfuse, but either way, glusterfs depends on fuse. The full report is attached as a compressed file. If you need more tests, feel free to contact me.

--------------------------------------------------------------------------------

Mon 22 Jun 2009 05:57:23 PM GMT, comment #1 by Raghavendra <raghavendra>:

Hi,

Inodes are freed when the kernel sends a forget on them. In other words, the extent to which inodes are cached in glusterfs depends on the kernel. Hence you are seeing huge memory consumption, but it is not a memory leak. Doing echo 3 > /proc/sys/vm/drop_caches should bring down the memory consumption. Btw, I did not find any attachment. Am I missing anything?

regards,
Raghavendra.

--------------------------------------------------------------------------------

Tue 23 Jun 2009 07:30:14 AM GMT, comment #2 by Krzysztof Strasburger <strasbur>:

Amar Tumballi advised me to set drop_caches to 3, but it did not help. I understand that it is not a memory leak, as valgrind claims the pointers are not forgotten. Repeating the same operation does not cause additional memory allocations, so the inodes are cached and used as needed. However, it is not good to see all your memory consumed forever only because somebody ran du on a big directory tree. My attachment is still accessible; I even tried to download it, to be sure. You can download it directly via http://savannah.nongnu.org/bugs/download.php?file_id=18169
Need verification
Verified with 3.0.1rc1
This bug still exists in 3.0.x and has been confirmed by other users. Neither the performance translators nor networking cause these excessive memory allocations. The bug can be triggered even with a trivial, serverless setup:

volume loopback
  type storage/posix
  option directory /root/loopback
end-volume

by running du or ls -R in the glusterfs-mounted directory (containing a large number of files).
experienced:
- FORGET messages are sent when the kernel garbage collects dentries, as expected
- despite FORGET-s, dentries are not freed
(In reply to comment #6)
> experienced:
> - FORGET messages are sent when the kernel garbage collects dentries, as expected
> - despite FORGET-s, dentries are not freed
                      ^^^^^^^^
(I mean inodes, on the glusterfs side)
We tried adding several counters inside the inode table and noticed that all the inodes created do get destroyed. Hence we currently think this is a memory-management fragmentation issue. It can be avoided by using our own 'mem-pool', which will be in the 3.1.x releases (though maybe not 3.1.0).
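To illustrate the idea behind such a 'mem-pool' (a minimal sketch only, not the actual GlusterFS implementation or its API): fixed-size objects are carved out of one pre-allocated slab and recycled through a free list, so frequent inode allocation/destruction never goes back to the general-purpose heap and therefore cannot fragment it.

/* Minimal fixed-size object pool sketch (illustrative only; the real
 * glusterfs mem-pool differs).  All objects come from one calloc'd slab,
 * so "freeing" an object just pushes it back onto a free list instead of
 * going through glibc's malloc/free. */
#include <stdlib.h>
#include <string.h>

struct mem_pool {
        void   *slab;       /* one big pre-allocated block            */
        void   *free_list;  /* singly linked list of returned objects */
        size_t  obj_size;   /* size of each object (>= sizeof(void*)) */
        size_t  count;      /* total number of objects in the slab    */
};

static struct mem_pool *
mem_pool_new (size_t obj_size, size_t count)
{
        struct mem_pool *pool = calloc (1, sizeof (*pool));
        size_t           i;

        if (!pool)
                return NULL;
        if (obj_size < sizeof (void *))
                obj_size = sizeof (void *);

        pool->slab = calloc (count, obj_size);
        if (!pool->slab) {
                free (pool);
                return NULL;
        }
        pool->obj_size = obj_size;
        pool->count    = count;

        /* thread every object onto the free list */
        for (i = 0; i < count; i++) {
                void *obj = (char *) pool->slab + i * obj_size;
                *(void **) obj  = pool->free_list;
                pool->free_list = obj;
        }
        return pool;
}

static void *
mem_get (struct mem_pool *pool)
{
        void *obj = pool->free_list;

        if (!obj)
                return NULL;            /* pool exhausted */
        pool->free_list = *(void **) obj;
        memset (obj, 0, pool->obj_size);
        return obj;
}

static void
mem_put (struct mem_pool *pool, void *obj)
{
        *(void **) obj  = pool->free_list;
        pool->free_list = obj;
}

A pool like this trades flexibility for predictability: the slab is sized once, so the allocator's footprint stays constant regardless of how many inode create/destroy cycles happen.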
(In reply to comment #8)
> We tried adding several counters inside the inode table and noticed that all
> the inodes created do get destroyed. Hence we currently think this is a
> memory-management fragmentation issue. It can be avoided by using our own
> 'mem-pool', which will be in the 3.1.x releases (though maybe not 3.1.0).

I can confirm. As far as I've seen, inodes get freed properly and malloc/free is used properly too. However, the glibc malloc implementation implies that a huge buffer is mmap-ed at some point... I guess an algorithm similar to the "readlink_malloc" sample function of the glibc manual is used: http://gnu.org/software/libc/manual/html_node/Symbolic-Links.html#index-readlink-1468 (i.e. double the buffer size until it suffices). I was going to look into this in detail, to see whether this conjecture is right and whether there are better alternatives... but if there is already a plan to tackle it, then I won't spend more time on this.
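For reference, the doubling pattern referred to above looks roughly like this (adapted from the glibc manual's readlink_malloc sample; a self-contained sketch, not code taken from glusterfs):

/* Doubling-buffer pattern, adapted from the "readlink_malloc" sample in
 * the glibc manual: keep growing the buffer until readlink() fits.
 * Illustrative sketch only. */
#include <stdlib.h>
#include <unistd.h>

char *
readlink_malloc (const char *filename)
{
        size_t  size   = 100;
        char   *buffer = NULL;

        while (1) {
                char *tmp = realloc (buffer, size);
                if (!tmp) {
                        free (buffer);
                        return NULL;
                }
                buffer = tmp;

                ssize_t nchars = readlink (filename, buffer, size);
                if (nchars < 0) {
                        free (buffer);
                        return NULL;
                }
                if ((size_t) nchars < size) {
                        buffer[nchars] = '\0';  /* readlink does not NUL-terminate */
                        return buffer;
                }
                size *= 2;      /* buffer too small: double and retry */
        }
}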
(In reply to comment #7)
> (In reply to comment #6)
> > experienced:
> > - FORGET messages are sent when the kernel garbage collects dentries, as expected
> > - despite FORGET-s, dentries are not freed
>                       ^^^^^^^^
> (I mean inodes, on the glusterfs side)

Further correction: inodes do get freed, just the glusterfs virtual memory size (i.e. the pool/arena used by malloc) doesn't shrink. Cf. comments #8, #9.
(In reply to comment #10)
> (In reply to comment #7)
> > (In reply to comment #6)
> > > experienced:
> > > - FORGET messages are sent when the kernel garbage collects dentries, as expected
> > > - despite FORGET-s, dentries are not freed
> >                       ^^^^^^^^
> > (I mean inodes, on the glusterfs side)
>
> Further correction: inodes do get freed, just the glusterfs virtual memory size
> (i.e. the pool/arena used by malloc) doesn't shrink. Cf. comments #8, #9.

Csaba, I think when you checked, http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=813 was not fixed in your tree yet.
> Csaba, I think when you checked,
> http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=813 was not fixed in
> your tree yet.

Krishna,

The bug you are pointing to is just an afr ref leak. Bug 61 is not a leak; it is a malloc/arena glibc issue. It is reproducible over just a posix volume by running ls -lR and du.

Such a case is fuse_thread_proc:3103:

  iov_in[0].iov_base = CALLOC (1, msg0_size);

Internally glibc does an mmap and it never gets freed during forget, as there is no explicit munmap called.

Csaba suggested using tcmalloc from Google perf tools, which would help us avoid the glibc malloc issues.
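The arena behaviour under discussion can be reproduced with a small stand-alone program (an illustrative sketch, not glusterfs code): after freeing a large number of small heap blocks, the process size typically does not shrink unless glibc is explicitly asked to trim the arena, e.g. with malloc_trim(3). The exact outcome depends on the glibc version and allocation pattern.

/* Stand-alone sketch of glibc arena retention (not glusterfs code).
 * Allocate many small blocks, free them all, and observe that the
 * process size usually stays high until malloc_trim() is called. */
#include <malloc.h>   /* malloc_trim(), glibc-specific */
#include <stdio.h>
#include <stdlib.h>

#define NBLOCKS 1000000
#define BLKSIZE 100           /* well below the mmap threshold */

int
main (void)
{
        static void *blocks[NBLOCKS];
        size_t       i;

        for (i = 0; i < NBLOCKS; i++)
                blocks[i] = calloc (1, BLKSIZE);
        printf ("allocated; check VSZ/RSS now (e.g. with ps or top)\n");
        getchar ();

        for (i = 0; i < NBLOCKS; i++)
                free (blocks[i]);
        printf ("freed; memory is usually NOT returned to the kernel yet\n");
        getchar ();

        malloc_trim (0);      /* ask glibc to give free heap pages back */
        printf ("after malloc_trim(0); VSZ/RSS should have dropped\n");
        getchar ();

        return 0;
}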
(In reply to comment #12)
> > Csaba, I think when you checked,
> > http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=813 was not fixed in
> > your tree yet.
>
> Krishna,
>
> The bug you are pointing to is just an afr ref leak. Bug 61 is not a leak; it
> is a malloc/arena glibc issue. It is reproducible over just a posix volume by
> running ls -lR and du.
>
> Such a case is fuse_thread_proc:3103:
>
>   iov_in[0].iov_base = CALLOC (1, msg0_size);
>
> Internally glibc does an mmap and it never gets freed during forget, as there
> is no explicit munmap called.
>
> Csaba suggested using tcmalloc from Google perf tools, which would help us
> avoid the glibc malloc issues.

Harsha,

Yes, I know that 813 is an inode leak and 61 is suspected to be a memory fragmentation issue. I suspected Csaba might have been hitting 813 (i.e. the kernel sends forgets but the inodes stay in the inode table and are not freed). But if he tried with just storage/posix, then 813 cannot be in the picture.
(In reply to comment #13)
> But if he tried with just storage/posix, then 813 cannot be in the picture.

Yes, Krishna, that's the case; I used a minimalistic configuration with just storage/posix.
OK, I finally found a workaround. One has to periodically write the value 2 to /proc/sys/vm/drop_caches (we want to get rid of the inode/dentry caches only, so 3 is overkill). Instead of running an update daemon like this:

  while [ 1 ]; do echo 2 > /proc/sys/vm/drop_caches; sleep 10; done

I would propose moving this task into the glusterfs daemon and making it a configurable option. If active, after reaching a (configurable) limit of inode entries, "2" would be written to the proc file (with timekeeping, so that it is not repeated too frequently).
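An illustrative sketch of that proposed option (this is not the attached patch and not glusterfs code; the names and limits below are hypothetical):

/* Hypothetical sketch: when the number of live inodes exceeds a
 * configurable limit, write "2" to /proc/sys/vm/drop_caches -- but at
 * most once per interval, so the system-wide flush is not repeated too
 * frequently. */
#include <stdio.h>
#include <time.h>

#define INODE_LIMIT        100000   /* hypothetical configurable limit */
#define DROP_MIN_INTERVAL  60       /* seconds between two drops       */

static time_t last_drop = 0;

/* Called from wherever the live inode count is known, e.g. after a lookup. */
static void
maybe_drop_caches (size_t live_inodes)
{
        time_t now = time (NULL);

        if (live_inodes < INODE_LIMIT)
                return;
        if (now - last_drop < DROP_MIN_INTERVAL)
                return;                 /* rate-limit the flushes */

        FILE *fp = fopen ("/proc/sys/vm/drop_caches", "w");
        if (!fp)
                return;                 /* needs root; fail silently here */
        fputs ("2\n", fp);              /* 2 = drop dentries and inodes   */
        fclose (fp);

        last_drop = now;
}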
Created attachment 362

This is a "proof of concept" patch against 2.0.10rc1. It works on my site :). No more excessive allocations, as the kernel is forced to send forgets instead of waiting until the whole memory is filled with cache entries.
drop_caches has a system-wide effect, not one on glusterfs alone; hence we would not want your patch to be merged into our mainline. We plan to use the fuse invalidation APIs to keep the inode table trim in the future. Thanks for your patch; if it works fine in your environment, please continue to use it.
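For reference, the libfuse low-level invalidation calls alluded to here (available in libfuse 2.8+) let the filesystem proactively evict kernel dentry/inode cache entries, which in turn triggers FORGETs for that one mount only. A minimal sketch, assuming a libfuse 2.x header and placeholder values for the channel and inode numbers (not how glusterfs actually wires this up):

/* Sketch of the FUSE 2.8+ low-level invalidation API (placeholder values;
 * not glusterfs code).  Compile against libfuse 2.x, typically with
 * `pkg-config --cflags --libs fuse`. */
#define FUSE_USE_VERSION 26
#include <fuse_lowlevel.h>
#include <string.h>

static void
invalidate_example (struct fuse_chan *ch, fuse_ino_t parent,
                    const char *name, fuse_ino_t ino)
{
        /* Drop the cached attributes/data of one inode. */
        fuse_lowlevel_notify_inval_inode (ch, ino, 0, 0);

        /* Drop the dentry "name" under "parent" from the kernel cache. */
        fuse_lowlevel_notify_inval_entry (ch, parent, name, strlen (name));
}

Unlike drop_caches, this is scoped to the fuse mount in question, which is why it is the preferred long-term fix.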
I didn't even expect you to include my patch in the mainline - it is a stopgap! The fact that it works for me confirms your earlier conclusion that the whole problem is related to memory fragmentation and possibly exposes a longstanding memory management issue in glibc. BTW, I'm aware of the side effect, but IMHO frequently flushing the inode/dentry caches system-wide is better than consuming the whole memory. Anyway, using the fuse API for invalidation would be much wiser.