Description of problem:
Memory leak on FreeBSD in the self-heal daemon, brick, and NFS server processes, with no data on the volume
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Start glusterd
2. Start/Create a replicated volume
3. The self-heal daemon uses a lot of memory and the FreeBSD memory manager kills the process
Expected results:
No memory leak
Created attachment 918627 [details]
Created attachment 918628 [details]
FreeBSD memory usage sampled at 1-second intervals
Created attachment 918630 [details]
FreeBSD memory usage sampled at 10-second intervals
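For anyone who wants to collect comparable numbers, here is a minimal Python sketch of periodic memory sampling. The method used to produce the attachments is not stated in this report, so everything here is an assumption: the helper names (read_rss_kb, sample_rss, growth_kb_per_sec) are hypothetical, and the resident set size is read with `ps -o rss= -p PID`, which should report the value in 1024-byte units on FreeBSD and Linux.

```python
import subprocess
import time


def read_rss_kb(pid):
    """Return the resident set size of `pid` in KiB, as printed by ps."""
    out = subprocess.run(
        ["ps", "-o", "rss=", "-p", str(pid)],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.strip())


def sample_rss(pid, interval_sec, count):
    """Collect `count` (timestamp, rss_kb) samples, `interval_sec` apart."""
    samples = []
    for _ in range(count):
        samples.append((time.monotonic(), read_rss_kb(pid)))
        time.sleep(interval_sec)
    return samples


def growth_kb_per_sec(samples):
    """Least-squares slope of RSS (KiB) against time (seconds)."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0
```

A call such as growth_kb_per_sec(sample_rss(pid, 10, 60)) on the glustershd or brick pid gives a rough growth rate. A slope that stays positive across runs would be consistent with the leak described here, but it does not say where the memory is going.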
Created attachment 918632 [details]
Valgrind output. Beware, it is 3.2 GB!
I looked for the following things in the statedumps for signs of memory allocation problems:
1) grep "pool-misses" *dump*
This tells us whether there were any objects whose allocated mem-pool wasn't sufficient
for the load it was working under.
The pool-misses were zero, which means the mem-pools we allocated were sufficient.
2) grep "hot-count" *dump*
This tells us the number of objects of each kind that were 'active' in the process when the statedump
was taken. This lets us check whether the numbers we see are explicable.
The maximum hot-count across the statedumps of all processes is 50, which is not alarming and does not point to any obvious memory leak.
The above observations indicate that either some object that is not mem-pool allocated is being leaked, or the statedump was taken 'prematurely', that is, before the memory leak had reached observably significant levels.
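The two statedump checks above can be sketched in Python for anyone who wants to repeat them over many dump files. This is only an illustration: it assumes the statedump contains plain "key=value" lines named pool-name, hot-count and pool-misses, and the function name summarize_statedump is hypothetical.

```python
import re

# Assumption: mem-pool entries appear as "key=value" lines, e.g.
# "pool-name=...", "hot-count=..." and "pool-misses=...".
KEY_VALUE = re.compile(r"^(pool-name|hot-count|pool-misses)=(.*)$")


def summarize_statedump(text):
    """Return (total pool-misses, max hot-count, pool with that hot-count)."""
    total_misses = 0
    max_hot = 0
    max_hot_pool = None
    current_pool = None
    for line in text.splitlines():
        match = KEY_VALUE.match(line.strip())
        if not match:
            continue
        key, value = match.groups()
        if key == "pool-name":
            # Remember which pool the following counters belong to.
            current_pool = value
        elif key == "pool-misses":
            total_misses += int(value)
        elif key == "hot-count" and int(value) > max_hot:
            max_hot = int(value)
            max_hot_pool = current_pool
    return total_misses, max_hot, max_hot_pool
```

Reading each file matched by *dump* and passing its contents to summarize_statedump gives the same two figures as the grep commands: a non-zero pool-misses total or an unexpectedly large hot-count would be the things to look at first.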
Created attachment 918968 [details]
Gluster Management Daemon logs
I had added some debugging logs of my own while debugging the issue of the 'glustershd' pid showing as 'N/A'. You can observe RPC_CLNT_CONNECT without RPC_CLNT_DISCONNECT. By the way, I do not always see RPC_CLNT_CONNECT in a loop.
But memory increases even without any logging. In fact, the strange part is that sometimes there is no RPC_CLNT_CONNECT notification from the self-heal daemon. If I kill the self-heal daemon, the brick memory utilization, which had been growing, stops increasing. When I enable the self-heal daemon again, brick memory usage climbs again.
So we have in fact found two issues here:
- 'N/A' is shown for a running glustershd process in volume status. glustershd doesn't register itself with the Gluster management daemon on volume start, but a restart of the Gluster management daemon fixes it.
- glustershd also goes OOM on FreeBSD. In fact, the issue fixed in the previous run is now back. In this case I do see an RPC_CLNT_DISCONNECT, and restarting the volume no longer fixes the N/A issue in volume status; you have to restart the Gluster management daemon again.
And this cycle continues!
Assigning it to the maintainer.
Closing this bug as it is filed on a version which is EOL. We have new releases available with fixes to the memory leaks. If you were able to hit this issue on the latest maintained release, please feel free to reopen the bug against that version.