Bug 1461543
| Field | Value |
|---|---|
| Summary | [Ganesha]: Ganesha occupies increased memory (RES size) even when all the files are deleted from the mount |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Reporter | Ambarish <asoman> |
| Component | nfs-ganesha |
| Assignee | Kaleb KEITHLEY <kkeithle> |
| Status | CLOSED ERRATA |
| QA Contact | Ambarish <asoman> |
| Severity | high |
| Priority | unspecified |
| Version | rhgs-3.3 |
| CC | amukherj, asoman, bturner, dang, ffilz, jthottan, kkeithle, mbenjamin, ndevos, pgurusid, rcyriac, rgowdapp, rhinduja, rhs-bugs, skoduri, srangana, storage-qa-internal |
| Target Milestone | --- |
| Target Release | RHGS 3.3.0 |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | 3.3.0-devel-freeze-exception |
| Fixed In Version | glusterfs-3.8.4-38 |
| Doc Type | If docs needed, set a value |
| Story Points | --- |
| Clones | 1481392 1481925 1481951 (view as bug list) |
| Last Closed | 2017-09-21 04:59:42 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |
| Bug Depends On | 1477668, 1481392 |
| Bug Blocks | 1417151 |
Description
Ambarish
2017-06-14 18:11:22 UTC
Lost qa_ack in collision. Reset now.

So, most of those leaks are in gluster; however, it looks like we are leaking MDCACHE entries, but only 1.5 MB worth, so it shouldn't be causing issues... The total leak is only 150 MB, but the total unleaked is only 58 MB, so it's leaking 3x.

Could you please attach the statedump file of the ganesha/gfapi process?

@Ambarish, the ganesha process PIDs mentioned in the bug description (comment #1) are 15816 & 26055. Are you sure the statedump provided above was taken after running the tests (i.e., when the ganesha process is consuming a lot of memory)? I do not see much memory being consumed in the gfapi/gluster stack in the statedump provided.

Sorry, I did not notice that. I'll update this with one of the reproducers.

If you are reproducing the issue, could you please collect statedumps at regular intervals:
* a couple of them while running bonnie, and
* before and after removal of the mount contents.
Thanks!

In order to chase down any contribution of Ganesha to the huge memory footprint, I think it would help to have a cluster available that can reproduce the issue, along with instructions on how to run the scenario. I really want to get output from massif under the suspect workload so I can see just where all the memory is being used.

Created attachment 1292484 [details]
Summary from massif output
I have done an analysis of the massif output. At its peak, it doesn't look like 6 GB of memory is used on the heap; in fact, it only shows 172 MB of memory allocated from the heap. That doesn't account for stacks and such, so I don't know if that might be part of the 6 GB of memory, or exactly what contributes to it.
The main things I see are:
1. hashtable uses a LOT of memory all the time. This is planned to be fixed for 2.6.
2. Gluster is the next biggest user of memory.
3. mdcache uses a bunch of memory, however, this memory is all released. Planned for early 2.6 and backported to 2.5 stable is memory management for the dirent cache to limit the size it can attain.
The issue I think we might have been having before is that massif doesn't produce its output until the process is terminated. By terminating the process, I was able to get the output summarized in this attachment.
I believe this is not a bug. Here's my reasoning.

1. VSS (virtual memory used) does not equate to memory used. Lots of things (the most common example is memory-mapped files) can use virtual address space without using actual RAM.

2. In the listed output, the RES (actual memory used) is ~170 MB.

3. I've reproduced this on my cluster, seeing high VSS and normal (<200 MB) RES. I've run this under massif and verified that the actual heap use is the same as the RES.

Therefore, I conclude Ganesha is *not* using lots of RAM in these tests. (Also note that the dirents and obj cache don't even show up in the massif output, they're so small.)

I take it back; I missed that one of the RES values was huge. So there is some issue somewhere.

(In reply to Daniel Gryniewicz from comment #23)
> I believe this is not a bug. Here's my reasoning.
>
> 1. VSS (virtual memory used) does not equate to memory used. Lots of things
> (the most common example is memory-mapped files) can use virtual address
> space without using actual RAM.
>
> 2. In the listed output, the RES (actual memory used) is ~170 MB.
>
> 3. I've reproduced this on my cluster, seeing high VSS and normal (<200 MB)
> RES. I've run this under massif and verified that the actual heap use is
> the same as the RES.
>
> Therefore, I conclude Ganesha is *not* using lots of RAM in these tests.
>
> (Also note that the dirents and obj cache don't even show up in the massif
> output, they're so small)

The above considerations are valid; nevertheless, I think the memory utilization should be accounted for, as I at least don't think I have a good explanation for why we would have anything like 6G of mappings in use.

Matt

Okay, I got a snapshot from a run using 2.3G of RES. Here are the highlights; I'll attach the full output:

    ->53.65% (644,874,245B) 0x8F8B5C8: __gf_calloc (mem-pool.c:117)

The bulk is from GFAPI calls.
    ->20.55% (247,037,160B) 0x8B13B19: __new_var_obj_p (__libobj.c:36)
    | ->16.32% (196,158,024B) 0x8B11455: __acl_create_entry_obj (acl_create_entry.c:31)
    | | ->16.32% (196,158,024B) 0x8B11496: acl_create_entry (acl_create_entry.c:67)
    | ->04.23% (50,879,136B) 0x8B124C6: __acl_init_obj (acl_init.c:28)
    | | ->04.23% (50,879,136B) 0x8B1254B: acl_init (acl_init.c:66)

250 MB is from ACLs in FSAL_GLUSTER. I think these may be leaked, and I'm trying a patch out to see.

    ->06.73% (80,877,576B) 0x5201DB: gsh_calloc__ (abstract_mem.h:145)
    | ->06.73% (80,877,576B) 0x52025E: pool_alloc__ (abstract_mem.h:395)
    | | ->06.73% (80,877,576B) 0x5241DB: alloc_cache_entry (mdcache_lru.c:1277)

80 MB is from the Ganesha handle cache.

    ->04.06% (48,790,984B) 0x4E0693: gsh_calloc__ (abstract_mem.h:145)
    | ->04.06% (48,757,296B) 0x4E1086: hashtable_init (hashtable.c:341)

The standard 48 MB from Ganesha hash tables.

So, the one that may be Ganesha's fault and fixable is the ACLs.

Created attachment 1292689 [details]
Massif snapshot
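As a quick way to see the VSS/RES distinction discussed in comment #23 on a live system, /proc reports both fields directly (shown here against the current shell as a stand-in; in practice you would substitute the ganesha.nfsd PID):

```shell
# VmSize is the virtual address space (VSS/VIRT); VmRSS is resident
# memory (RES). Only VmRSS is actual RAM in use.
pid=$$   # stand-in target; in practice: pid=$(pgrep -x ganesha.nfsd)
grep -E '^Vm(Size|RSS)' "/proc/$pid/status"
```

This makes it easy to sample both values at intervals during a run, matching the before/during/after comparisons done in this bug.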
Created attachment 1292691 [details]
massif output on Dev setup, post rm
I retried this on the Dev setup with Frank and Dan (thanks, guys!).
Ganesha took ~1.25G of memory after bonnie and the subsequent rm/cleanup from the mount.
Attaching the massif o/p to the BZ.
The final output from massif does not show the ACLs, so they're not being leaked; they just take up a lot of memory.

Created attachment 1292855 [details]
massif_snaphost_idle_setup_post_io
upstream patches:
https://review.gluster.org/17662
https://review.gluster.org/17666
https://review.gluster.org/17728
https://review.gluster.org/17734

I've been looking at libglusterfs/src/mem-pool.c for cleaning up when glfs_fini() is called. This cleanup is not available yet; patches are being worked on and show up here: https://review.gluster.org/#/q/topic:bug-1470170

That means I have gained some understanding of how the new mem-pools work, but I have not focused much on the details. So, take the below with a grain of salt; it may not be correct. Writing this while reading the sources.

TL;DR: I don't see any memory leaks in the mem-pools themselves. It may take up to 2x 30 seconds before memory returned with mem_put() is released.

Because the test build (where the pool_sweeper thread gets enabled for libgfapi) showed reduced memory usage, we know that at least some of the functionality is working as expected. If this is a problem with the new mem-pool implementation, it should not occur when running these tests with a glusterfs build that has --disable-mempool in its ./configure options.

Unfortunately, the new mem-pools do not offer statedump features. That makes it difficult to debug problems in case mem_get() allocations are not free'd. (This is also the case when building with --disable-mempool.)

-- %< -- more notes about the structure of the different structures and lists --

It's really great that doc/developer-guide/datastructure-mem-pool.md was not updated with the new model :-/

There are per-thread pools; each per_thread_pool has two arrays[NPOOLS=14] of lists (hot_list and cold_list). From libglusterfs/src/mem-pool.c:

    374 #define POOL_SMALLEST 7 /* i.e. 128 */
    375 #define POOL_LARGEST 20 /* i.e. 1048576 */
    376 #define NPOOLS (POOL_LARGEST - POOL_SMALLEST + 1)

A mem_get() rounds the allocation up to the next power-of-2 size and places that allocation (plus a mem_header) in the per_thread_pool_list (which links to a per_thread_pool, mentioned above).
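A minimal toy model of these mem-pool mechanics (the power-of-2 size-class rounding, plus the hot/cold list rotation that the sweeper applies, detailed next) may help. The names `pool_power_for`, `size_pool`, `put_chunk`, and `sweep` are illustrative, not the real libglusterfs definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_SMALLEST 7   /* 2^7  = 128 bytes */
#define POOL_LARGEST  20  /* 2^20 = 1 MiB     */

/* mem_get() rounding in this model: return the power-of-2 exponent of the
 * pool that serves a request of `size` bytes, or -1 for oversized requests. */
static int pool_power_for (size_t size)
{
    int power;
    for (power = POOL_SMALLEST; power <= POOL_LARGEST; power++)
        if (((size_t)1 << power) >= size)
            return power;
    return -1; /* too big for any pool */
}

/* One (hot_list, cold_list) pair; the real per_thread_pool keeps NPOOLS of
 * these, one per size class. */
struct chunk { struct chunk *next; };
struct size_pool { struct chunk *hot_list, *cold_list; };

/* mem_put() in this model: returned chunks land on the hot_list. */
static void put_chunk (struct size_pool *p, struct chunk *c)
{
    c->next = p->hot_list;
    p->hot_list = c;
}

/* One pass of the 30-second sweeper: free the cold_list, then demote the
 * hot_list to cold_list, so an unused chunk is freed after at most two
 * passes (hence "up to 2x 30 seconds" before memory is released). */
static void sweep (struct size_pool *p)
{
    while (p->cold_list) {
        struct chunk *c = p->cold_list;
        p->cold_list = c->next;
        free (c);
    }
    p->cold_list = p->hot_list;
    p->hot_list  = NULL;
}
```

A chunk put back between two sweeps is still reusable from the cold_list for one more interval before it is returned to the allocator.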
A mem_put() reads the mem_header and uses the size of the allocation to place it on the associated per_thread_pool_list.hot_list.

Every 30 seconds, the pool_sweeper thread goes through all the per_thread_pool_lists:

- If a per_thread_pool_list belonged to a thread that exited, it will have been marked with ->poison. This causes collect_garbage() to move the per_thread_pool_list to death_row (a pool_sweeper state).
- In addition to cleaning up after exited threads, collect_garbage() marks the per_thread_pool_list.cold_lists for removal in pool_sweeper. Each per_thread_pool_list.hot_list is renamed to cold_list (and will be cleaned up on the next run of pool_sweeper).
- In pool_sweeper, all per_thread_pool_lists on death_row are free'd (everything in the hot_list and cold_list). The per_thread_pool_list itself moves to the pool_free_threads list, where it can be reused for the next new thread.
- Finally, in pool_sweeper, all cold_lists that were marked as garbage are free'd.

Created attachment 1301119 [details]
mem-pool: mem_put() returns to current thread pool
There is a possible problem that could cause memory leaks in the case where threads get started and stopped a lot. I do not know if this is the case with NFS-Ganesha though.
- a mem_get() sets up a mem_header that points to the mem-pool of the current
thread
- if the thread exits, the mem-pool of the thread is cleaned (hot_list,
cold_list, not active allocations) and moved to pool_free_list
- mem_put() takes the pointer from the mem_header and moves the allocation from
the active state to the hot_list for re-use
- inactive threads (listed in pool_free_list) are not cleaned by the pool_sweeper
- the mem_put()'d allocation will linger around indefinitely
This patch moves the allocation to the mem-pool of the currently running thread. This prevents growing the hot_list of inactive per_thread_pool_t structures, so no additional cleaning is needed.
As noted in the commit message of the patch, this theory needs to be proven and further polishing is needed as well.
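The difference the patch makes can be illustrated with a toy model of the two mem_put() placement strategies. The structures and names here are illustrative stand-ins, not the actual libglusterfs code:

```c
#include <assert.h>

/* Toy stand-ins: a pool counts chunks parked on its hot_list, and a
 * mem_header records the pool that served the mem_get(). */
struct pool { int hot_count; };
struct mem_header { struct pool *owner; };

/* Unpatched behaviour: the chunk goes back to the pool recorded in its
 * header. If that pool belongs to a thread that already exited (parked on
 * pool_free_list), the sweeper no longer cleans it, so the chunk lingers. */
static void mem_put_owner (struct mem_header *h, struct pool *current)
{
    (void) current;
    h->owner->hot_count++;
}

/* Patched behaviour: the chunk goes to the pool of the thread that calls
 * mem_put(), which is alive and still visited by the sweeper. */
static void mem_put_current (struct mem_header *h, struct pool *current)
{
    (void) h;
    current->hot_count++;
}
```

With the patched placement, the hot_list of an inactive per_thread_pool_t can never grow after its owner exits, so no extra cleanup pass is needed.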
So, since I was told that a normal dd operation reproduces the memory leak, I took some time to write a gfapi application. What it does is spawn a specified number of threads and issue glfs_pwrite() on each of those threads to create a big file.

When I ran with 100 threads, each writing 200 MB of data, and hence creating one large 20 GB file, my observations were:
1. When run on RHGS 3.3: RES memory goes to 10G while the IO is going on, but the moment the threads are done with the IO, the memory falls back down to as low as 300M.
2. When run on the gluster master branch, RES memory never crossed a few 100 MBs; hence, there are definitely patches upstream (maybe Kaleb's gfapi sweeper patch) that need to be pulled in any case.

Conclusion:
1. I think it's a good idea to try with a master build and the Ganesha test to check if the issue is already fixed.
2. It's mostly not any simple large-file write that causes the leak, as the above test case showed no leaks.

Attaching the test case.

Created attachment 1305946 [details]
Test case that didn't reproduce the issue
(In reply to Poornima G from comment #57)
> So, since I was told that a normal dd operation reproduces the memory leak,
> I took some time to write a gfapi application. What it does is spawn a
> specified number of threads and issue glfs_pwrite() on each of those
> threads to create a big file.
>
> When I ran with 100 threads, each writing 200 MB of data, and hence
> creating one large 20 GB file, my observations were:
> 1. When run on RHGS 3.3: RES memory goes to 10G while the IO is going on,
> but the moment the threads are done with the IO, the memory falls back
> down to as low as 300M.
> 2. When run on the gluster master branch, RES memory never crossed a few
> 100 MBs; hence, there are definitely patches upstream (maybe Kaleb's gfapi
> sweeper patch) that need to be pulled in any case.
>
> Conclusion:
> 1. I think it's a good idea to try with a master build and the Ganesha
> test to check if the issue is already fixed.
> 2. It's mostly not any simple large-file write that causes the leak, as
> the above test case showed no leaks.
>
> Attaching the test case.

I think we need results with corresponding builds of libgfapi and nfs-ganesha serving candidate workloads, as a starting point for decision making.

Matt

(In reply to Poornima G from comment #57)
> So, since I was told that a normal dd operation reproduces the memory leak,
> I took some time to write a gfapi application. What it does is spawn a
> specified number of threads and issue glfs_pwrite() on each of those
> threads to create a big file.
>
> When I ran with 100 threads, each writing 200 MB of data, and hence
> creating one large 20 GB file, my observations were:
> 1. When run on RHGS 3.3: RES memory goes to 10G while the IO is going on,
> but the moment the threads are done with the IO, the memory falls back
> down to as low as 300M.

@Poornima,
Thanks for the testcase. Just an observation from the test-script. I see that we are doing:

    printf ("\nspawning threads %d", threads_cnt);
    for (i = 0; i < threads_cnt; i++) {
        thcnt[i] = i;
        pthread_create (&tid[i], NULL, &thread_func, &thcnt[i]);
    }

    for (i = 0; i < threads_cnt; i++) {
        pthread_join (tid[i], NULL);
    }

    printf ("\ndone writing");
    scanf ("%d", &tmp);

Did you see the memory drop right after the writes were done, i.e., before the pthread_join of the threads? Or was the drop observed after pthread_join? If that is the case, then the threads may have exited and hence cleaned up the memory they consumed. I am sorry I haven't run the testcase, but I will give it a try... just wanted to confirm with you whether my doubts are valid. Thanks!

(In reply to Soumya Koduri from comment #61)
> Did you see the memory drop right after the writes were done, i.e., before
> the pthread_join of the threads? Or was the drop observed after
> pthread_join? If that is the case, then the threads may have exited and
> hence cleaned up the memory they consumed.
>
> I am sorry I haven't run the testcase, but I will give it a try... just
> wanted to confirm with you whether my doubts are valid. Thanks!

In each thread I have written a scanf(), so that the thread doesn't join; a bad hack, but it serves the purpose. In the thread_func:

    thread_func (void *opaque)
    {
        ...
        /* write 200MB in 1MB chunks */
        printf ("\n Done with thread:%d", *((int *)opaque));
        scanf ("%d", &j);
        return NULL;
    }

The memory dropped when all the threads were done writing and were in scanf, and not yet joined.
> In each thread i have written a scanf(), so that the thread doesn't join,
> bad hack but serves the purpose. In the thread_func:
> thread_func (void *opaque)
> {
> ...
> write 200MB in 1MB chunks
> printf ("\n Done with thread:%d", *((int *)opaque));
> scanf ("%d", &j);
> return NULL;
> }
>
> The memory dropped, when all the threads were done writing, and were in
> scanf, and not yet joined.
Oh okay..thanks for the confirmation!
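For reference, the shape of the reproducer as confirmed above can be sketched without gfapi. This is an illustrative skeleton, not Poornima's actual test program: glfs_pwrite() would replace the marked comment, and a condition variable stands in for the scanf() hack so the program can finish on its own while still exposing the "writes done, threads not yet joined" window in which RES was observed:

```c
#include <assert.h>
#include <pthread.h>

#define NTHREADS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int done_count = 0;   /* threads that finished their writes     */
static int release    = 0;   /* set by main to let the threads return  */

static void *thread_func (void *opaque)
{
    (void) opaque;
    /* ... write 200 MB in 1 MB chunks via glfs_pwrite() here ... */

    pthread_mutex_lock (&lock);
    done_count++;                    /* writes finished */
    pthread_cond_broadcast (&cond);
    while (!release)                 /* park: done, but not yet joined */
        pthread_cond_wait (&cond, &lock);
    pthread_mutex_unlock (&lock);
    return NULL;
}

static int run_reproducer (void)
{
    pthread_t tid[NTHREADS];
    int i, observed;

    for (i = 0; i < NTHREADS; i++)
        pthread_create (&tid[i], NULL, thread_func, NULL);

    pthread_mutex_lock (&lock);
    while (done_count < NTHREADS)    /* all writes done, none joined... */
        pthread_cond_wait (&cond, &lock);
    observed = done_count;           /* ...this is where RES was checked */
    release = 1;
    pthread_cond_broadcast (&cond);
    pthread_mutex_unlock (&lock);

    for (i = 0; i < NTHREADS; i++)
        pthread_join (tid[i], NULL);
    return observed;
}
```

The point of the structure is that the per-thread mem-pool state is still attached to live threads when the memory drop is observed, which is what ruled out thread-exit cleanup as the explanation.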
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774