Bug 1346926

Summary: [scale] - momd memory leakage
Product: [oVirt] mom
Reporter: Eldad Marciano <emarcian>
Component: Core
Assignee: Martin Sivák <msivak>
Status: CLOSED CURRENTRELEASE
QA Contact: Eldad Marciano <emarcian>
Severity: medium
Docs Contact:
Priority: medium
Version: ---
CC: bugs, dfediuck, eedri, emarcian, gklein, mgoldboi, michal.skrivanek, msivak, s.kieske
Target Milestone: ovirt-4.0.6
Keywords: Performance
Target Release: 0.5.8
Flags: rule-engine: ovirt-4.0.z+
rule-engine: ovirt-4.1+
rule-engine: blocker+
mgoldboi: planning_ack+
dfediuck: devel_ack+
gklein: testing_ack+
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: mom-0.5.8-1 and vdsm from 4.0.6
Doc Type: Bug Fix
Doc Text:
Cause: We tracked the cause to somewhere in the xmlrpc libraries (can be ssl or anything else there) in Python 2. Consequence: A memory leak in the MOM process. Fix: We moved to jsonrpc. Result: Hopefully no memory leak.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-01-18 07:27:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1373832    
Bug Blocks:    

Description Eldad Marciano 2016-06-15 16:18:18 UTC
Description of problem:
During a portion of scale-out testing we found the momd process leaking memory over time under idle conditions.

Leak rate: 100 MB per 12 hours
(about 8.3 MB per hour).

It is currently not clear whether the leak is related to the number of VMs.

Memory profiling is in progress.

Version-Release number of selected component (if applicable):
4.18.0

How reproducible:
100%

Steps to Reproduce:
1. Run momd on a host in idle conditions and monitor the process memory over time.
2.
3.

Actual results:
momd memory usage grows over time (roughly 100 MB per 12 hours in this test).

Expected results:
momd memory usage stays stable over time.

Additional info:

Comment 1 Doron Fediuck 2016-06-16 10:57:53 UTC
Which mom version did you use?
We need more details on what mom was doing- were there any running VMs?

Comment 2 Martin Sivák 2016-06-16 11:04:09 UTC
Also, how did you watch the memory and for how long? MOM will consume some memory during the first hours after a VM appears, before the stats buffers are filled; that is normal and expected behaviour.

Comment 4 Eldad Marciano 2016-06-19 12:38:21 UTC
(In reply to Doron Fediuck from comment #1)
> Which mom version did you use?
> We need more details on what mom was doing- were there any running VMs?

mom-0.5.3-1.el7ev.noarch.

Comment 7 Martin Sivák 2016-07-01 14:25:12 UTC
I was able to observe a slight increase of RSS over time (with some sudden drops). But instrumenting MOM with muppy did not show anything; according to it there is no leak in high-level Python structures.
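
For reference, the kind of high-level instrumentation meant here boils down to something like the following Pympler-based sketch (the hook placement inside MOM's monitoring loop is hypothetical, not the actual MOM code):

from pympler import tracker

# One tracker per process; each print_diff() call reports only the change in
# Python object counts/sizes since the previous call.
tr = tracker.SummaryTracker()

def on_monitor_tick():
    # Hypothetical hook invoked once per collection interval: a genuine leak
    # in Python-level structures would show up as a steadily growing delta.
    tr.print_diff()

If this shows nothing while RSS still grows, the growth is more likely below the Python object layer (C extensions or allocator behaviour).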

I was not able to reproduce 8 MiB per hour, nor even 1 MiB/h; I saw just a bit less than that using about 30 VMs.

And I saw a sudden drop from 30 MB of RSS memory to 25 MB of used memory just now. So the garbage collector is probably not freeing anything unless it thinks it will be worth it.


Do you have some additional data collected over a longer period of time? What was VDSM and the engine doing? And was that a standard python interpreter?

Comment 8 Martin Sivák 2016-07-01 14:34:26 UTC
Btw the values in the graph are totally off. MOM uses about 30 MB of RSS RAM with 30 VMs. The graph shows values around a gigabyte. Are you sure you were watching RSS? My VSZ values are about the same, but those represent the highest allocated address and not the amount of used memory.
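
To make the RSS vs. VSZ distinction concrete, a small generic helper (not part of MOM) can read both values from /proc; VmRSS is the resident memory actually in use, VmSize is the virtual size:

def rss_vsz_kib(pid):
    """Return (VmRSS, VmSize) in KiB for the given PID, read from /proc."""
    values = {}
    with open("/proc/%d/status" % pid) as status:
        for line in status:
            if line.startswith(("VmRSS:", "VmSize:")):
                key, amount = line.split(":", 1)
                values[key] = int(amount.strip().split()[0])  # value is in kB
    return values.get("VmRSS"), values.get("VmSize")

# Example (hypothetical PID of momd):
# print(rss_vsz_kib(12345))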

Comment 9 Eldad Marciano 2016-07-03 09:46:40 UTC
(In reply to Martin Sivák from comment #8)
> Btw the values in the graph are totally off. MOM uses about 30 MB of RSS RAM
> with 30 VMs. The graph shows values around a gigabyte. Are you sure you were
> watching RSS? My VSZ values are about the same, but those represent the
> highest allocated address and not the amount of used memory.

Yes, the graphs show the memory usage after a few days of monitoring.
We don't have additional data yet.
For further memory investigation we need tracemalloc here (better to ask Francesco).
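
A rough sketch of what such tracemalloc sampling could look like is below. tracemalloc is standard in Python 3; for the Python 2.7 stack used here the pytracemalloc backport (which needs a patched interpreter) would be required, and the interval and file paths are only examples:

import time
import tracemalloc

tracemalloc.start(25)   # keep up to 25 frames of traceback per allocation

INTERVAL = 1800         # one dump every 1800 s, as used later in this test
index = 0
while True:
    time.sleep(INTERVAL)
    snapshot = tracemalloc.take_snapshot()
    # Hypothetical output path; dumps can be loaded and diffed later.
    snapshot.dump("/tmp/momd-tracemalloc-%04d.dump" % index)
    index += 1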

Comment 10 Yaniv Kaul 2016-07-04 12:54:58 UTC
Do we limit mom CPU, similar to how we do it with VDSM? Perhaps we are suffering from similar issues on multiple (many) physical cores as we used to have in VDSM?

Comment 11 Roy Golan 2016-07-06 10:17:19 UTC
(In reply to Yaniv Kaul from comment #10)
> Do we limit mom CPU, similar to how we do it with VDSM? Perhaps we are
> suffering from similar issues on multiple (many) physical cores as we used
> to have in VDSM?

No CPU limiting.

Eldad, can you provide the tracemalloc info?

Comment 12 Eldad Marciano 2016-07-07 09:15:29 UTC
Yes, we can do it, but we need some help from Francesco.
Let's take it offline.

Comment 16 Martin Sivák 2016-07-11 07:59:15 UTC
I identified the leaking call. MOM itself is fine; the leak happens somewhere in the call to vdsm_api.getAllVmStats when it is made from mom/HypervisorInterfaces/vdsmxmlrpcInterface.py.

The vdsm_api variable holds a reference to an xmlrpclib.ServerProxy with the SSL transport configured by VDSM.


Caching the return value of the getAllVmStats call for 50000 seconds (a bit more than half a day) still showed a slow leak over the weekend (22 MB -> 22.2 MB).
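
For context, the caching experiment described above amounts to a wrapper along these lines (a simplified sketch with illustrative names, not the actual MOM change):

import time

class CachedVmStats(object):
    """Return a cached getAllVmStats() result unless it is older than ttl seconds."""

    def __init__(self, vdsm_api, ttl=50000):   # ~half a day, as in the experiment
        self._api = vdsm_api
        self._ttl = ttl
        self._cached = None
        self._stamp = 0.0

    def getAllVmStats(self):
        now = time.time()
        if self._cached is None or now - self._stamp > self._ttl:
            self._cached = self._api.getAllVmStats()  # the suspected leaking xml-rpc call
            self._stamp = now
        return self._cached

With a TTL that long the xml-rpc call runs only a handful of times per day, and most of the growth disappeared, which points at the call itself (the xml-rpc/SSL transport) rather than at MOM's own data handling.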

I also ran a test with normal cache (5 s), no collectors, no controllers and no policy files and that leaked as fast as the report suggests.


It remains to be seen whether the issue is in VDSM's vdscli code or in the underlying Python, because VDSM uses plain xml-rpc (over a unix socket) to talk to MOM too. So we might want to check whether that is leaking as well.

Comment 17 Eyal Edri 2016-07-12 06:55:38 UTC
Following comment #16, should we close this bug or move it to VDSM instead of mom?

Comment 18 Martin Sivák 2016-07-12 09:04:57 UTC
We do not know yet. We might update the description a bit and move it to some other component, or we can use this bug to get rid of xml-rpc in MOM. Some testing is still pending (Eldad's tracemalloc or possibly valgrind) to pinpoint the leak source exactly.

Comment 20 Yaniv Kaul 2016-08-18 07:52:31 UTC
Any reason this should not go to 4.0.3?

Comment 22 Eldad Marciano 2016-09-01 10:08:13 UTC
On closer look at Monitor.py it seems like we have a circular reference issue:
    def collect(self):
        ...
        ...
        for c in self.collectors:
            try:
                collected = c.collect()

weakref might resolve it...
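
For illustration only, the weakref idea amounts to keeping a weak back-reference so that a parent/child pair does not form a strong cycle (a toy example, not taken from Monitor.py; note also that CPython's cycle collector normally reclaims such cycles anyway, unless __del__ methods are involved on Python 2):

import weakref

class Owner(object):
    def __init__(self):
        self.children = []

    def add(self, child):
        child.owner = weakref.ref(self)    # weak: does not keep Owner alive
        self.children.append(child)

class Child(object):
    def __init__(self):
        self.owner = None

    def collect(self):
        owner = self.owner() if self.owner else None
        return owner is not None           # False once the Owner has been freed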

Comment 23 Martin Sivák 2016-09-01 11:59:29 UTC
Uh? Where do you see a circular reference?

There is no leak in the Python code according to Pympler. How were you measuring it? What were the sample points?

Both Monitor and Entity accumulate some data, but the amount should stay constant after a couple of minutes (the old entries are evicted once the cache size is reached).
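
As a rough illustration of that kind of bounded history (not MOM's actual implementation), a fixed-size buffer stops growing once it is full and simply evicts the oldest entries:

from collections import deque

class StatsHistory(object):
    def __init__(self, size=10):             # size is an arbitrary example value
        self._samples = deque(maxlen=size)   # appending beyond maxlen drops the oldest

    def append(self, sample):
        self._samples.append(sample)

    def latest(self):
        return self._samples[-1] if self._samples else None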

Comment 24 Eldad Marciano 2016-09-01 12:05:41 UTC
(In reply to Martin Sivák from comment #23)
> Uh? Where do you see a circular reference?
> 
> There is no leak in the Python code according to Pympler, how were you
> measuring it? What were the sample points?
> 
> Both Monitor and Entity accumulate some data, but the amount should stay
> constant after couple of minutes (the old entries are evicted once the cache
> size is reached).

With tracemalloc; see the screenshot attached. I took a memory dump every 1800 seconds, for 12 hours.

Comment 25 Martin Sivák 2016-09-01 12:10:19 UTC
The screenshot has no timing data; what does it show? I see an increase in memory, but is it an increase over the last half an hour, or over the 12 hours since the start?

And I was also asking where you see the circular dependency in the attached code snippet.

Comment 26 Eldad Marciano 2016-09-01 15:11:51 UTC
The screenshot shows a comparison of two memory dumps taken 12 hours apart.
In the screenshot you can also see the module name and the executed line,
e.g.
Monitor.py:102, which is this line:
data[key] = val

which is inside the code quoted above and refers to the circular reference.

Attaching the memory dumps as well.
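
For completeness, comparing two such dumps boils down to loading both snapshots and diffing them by source line (a sketch; the file names are placeholders for the attached dumps):

import tracemalloc

old = tracemalloc.Snapshot.load("/tmp/momd-tracemalloc-0000.dump")
new = tracemalloc.Snapshot.load("/tmp/momd-tracemalloc-0024.dump")

# Top 10 source lines by memory growth; entries look like "Monitor.py:102: size=... (+...)".
for stat in new.compare_to(old, "lineno")[:10]:
    print(stat)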

Comment 29 Martin Sivák 2016-09-01 16:28:51 UTC
(In reply to Eldad Marciano from comment #26)
> the screen shoot describes two memory dumps comparison, in range of 12
> hours. 

When was the first snapshot taken in terms of mom startup? How long was the process running? Is the machine still up and can you take another one?

> also in the screenshot you can see the module name and the exec line
> i.e
> Monitor.py:102 which is this line
> data[key] = val
> 
> which is inside and refer to the circular reference.

There is no circular reference. This is a plain and simple dictionary with a scalar value assigned to it. The value is rewritten every 10 seconds and the old value is garbage collected at Monitor.py:102.

> attaching the memory dumps as well

I loaded the tracemalloc dumps and the first dump showed almost no data except a couple of threads. Was it taken right after MOM was started? Then it is expected that the memory usage will grow a bit before stabilizing. And 1 MiB is hardly excessive for mom variables and caches.

How many VMs were you running?

Can you take another dump and compare it with the second one (like t+12h and t+24h) to see whether the memory is actually leaking?


The ssl and httplib seem to be eating quite a lot for what they are supposed to be doing, but still under another megabyte each.

Comment 30 Eldad Marciano 2016-09-04 08:04:06 UTC
(In reply to Martin Sivák from comment #29)
> (In reply to Eldad Marciano from comment #26)
> > the screen shoot describes two memory dumps comparison, in range of 12
> > hours. 
> 
> When was the first snapshot taken in terms of mom startup? How long was the
> process running? Is the machine still up and can you take another one?
> 
> > also in the screenshot you can see the module name and the exec line
> > i.e
> > Monitor.py:102 which is this line
> > data[key] = val
> > 
> > which is inside and refer to the circular reference.
> 
> There is no circular reference. This is pure and simple dictionary with
> scalar value assigned to it. The value is rewritten every 10 seconds and the
> old value is garbage collected at Monitor.py:102.
> 
> > attaching the memory dumps as well
> 
> I loaded the tracemalloc dumps and the first dump showed almost no data
> except couple of threads. Was it taken right after MOM was started? Then it
> is expected that the memory usage will grow a bit before stabilizing. And 1
> MiB is hardly excessive for mom variables and caches.
> 
> How many VMs were you running?
> 
> Can you take another dump and compare it with the second one (like t+12h and
> t+24h) to see whether the memory is actually leaking?
> 
> 
> The ssl and httplib seem to be eating quite a lot for what they are supposed
> to be doing, but still under another megabyte each.

Right, we see approximately the same picture for a continuous run over the weekend.
As I mentioned at the beginning, we are running around ~77 VMs; for the weekend run we had 80 VMs.
I also did a comparison with a lower number of VMs to see how it affects the leak (see comment https://bugzilla.redhat.com/show_bug.cgi?id=1346926#c3).
Anyway, this leak is on the order of a few MB and not the 100 MB mentioned at the beginning of the thread.
Not sure what changed in this version, but it is not at the same rate, which is good.

mom version:
mom-0.5.5-1.el7ev.noarch

Anyway, it is a very small leak which would be nice to solve.
Attaching dumps.

Comment 34 Eldad Marciano 2016-09-27 18:44:55 UTC
This bug reproduced on master (4.1.0-0) at a rate of 5.3 MB per 12 hours.

Comment 35 Martin Sivák 2016-09-29 12:42:50 UTC
Eldad: again, we need more information than that. Did you let MOM stabilize its caches before taking the first snapshot?

Comment 39 Martin Sivák 2016-11-04 12:27:11 UTC
The pieces in 4.0.6 (mom and vdsm) are now configured to use jsonrpc and that should solve the leak (as it was caused by something inside xmlrpc).

Comment 40 Eldad Marciano 2016-11-29 10:14:02 UTC
Verified on top of build 4.0.6-1.
Currently there is a leak of 0.37 MB (378 KB) per 12 hours,
i.e. 31.5 KB per hour.

We can close this bug and open a new one with lower priority.

Comment 42 Sven Kieske 2016-11-29 11:04:12 UTC
Hi,

so might I kindly ask where this new bug is?

Thanks

Sven

Comment 43 Eldad Marciano 2016-11-29 12:34:08 UTC
(In reply to Sven Kieske from comment #42)
> Hi,
> 
> so might I kindly ask where this new bug is?
> 
> Thanks
> 
> Sven

Sure,
here is the link:
https://bugzilla.redhat.com/show_bug.cgi?id=1399567