Bug 1346926
| Summary: | [scale] - momd memory leakage | | |
|---|---|---|---|
| Product: | [oVirt] mom | Reporter: | Eldad Marciano <emarcian> |
| Component: | Core | Assignee: | Martin Sivák <msivak> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Eldad Marciano <emarcian> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | --- | CC: | bugs, dfediuck, eedri, emarcian, gklein, mgoldboi, michal.skrivanek, msivak, s.kieske |
| Target Milestone: | ovirt-4.0.6 | Keywords: | Performance |
| Target Release: | 0.5.8 | Flags: | rule-engine: ovirt-4.0.z+, rule-engine: ovirt-4.1+, rule-engine: blocker+, mgoldboi: planning_ack+, dfediuck: devel_ack+, gklein: testing_ack+ |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | mom-0.5.8-1 and vdsm from 4.0.6 | Doc Type: | Bug Fix |

Doc Text:

- Cause: We tracked the cause to somewhere in the xmlrpc libraries (can be ssl or anything else there) in Python 2.
- Consequence: A memory leak in the MOM process.
- Fix: We moved to jsonrpc.
- Result: Hopefully no memory leak.

| Story Points: | --- | | |
|---|---|---|---|
| Clone Of: | | Environment: | |
| Last Closed: | 2017-01-18 07:27:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | SLA | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1373832 | | |
| Bug Blocks: | | | |
Description
Eldad Marciano
2016-06-15 16:18:18 UTC
Which mom version did you use? We need more details on what mom was doing - were there any running VMs? Also, how did you watch the memory and for how long? MOM will eat some memory during the first hours after a VM appears, before the stats buffers are filled; that is normal and expected behaviour.

(In reply to Doron Fediuck from comment #1)
> Which mom version did you use?
> We need more details on what mom was doing- were there any running VMs?

mom-0.5.3-1.el7ev.noarch.

I was able to observe a slight increase (but with some sudden drops) of RSS over time. However, instrumenting MOM with muppy did not show anything; according to it, there is no leak in high-level Python structures. I was not able to reproduce 8 MiB per hour. I was not able to reproduce 1 MiB/h either - just a bit less than that, using about 30 VMs. And I saw a sudden drop from 30 MB of RSS memory to 25 MB of used memory just now, so the garbage collector is probably not freeing anything unless it thinks it will be worth it.

Do you have some additional data collected over a longer period of time? What were VDSM and the engine doing? And was that a standard Python interpreter?

Btw, the values in the graph are totally off. MOM uses about 30 MB of RSS RAM with 30 VMs, while the graph shows values around a gigabyte. Are you sure you were watching RSS? My VSZ values are about the same, but those represent the highest allocated address, not the amount of used memory.

(In reply to Martin Sivák from comment #8)
> Btw the values in the graph are totally off. MOM uses about 30 MB of RSS RAM
> with 30 VMs. The graph shows values around a gigabyte. Are you sure you were
> watching RSS? My VSZ values are about the same, but those represent the
> highest allocated address and not the amount of used memory.

Yes, the graphs show the memory usage after a few days of monitoring. We don't have additional data yet. For further memory investigation we need tracemalloc here (better to ask Francesco).
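The tracemalloc investigation suggested above can be scripted in a few lines. A minimal sketch, not the actual test harness used here (the dump-path prefix and interval are illustrative; the report later mentions a dump every 1800 s for 12 hours):

```python
import time
import tracemalloc

def dump_snapshots(path_prefix, count, interval_s):
    """Take periodic tracemalloc snapshots and write them to disk for later diffing."""
    tracemalloc.start(25)  # keep up to 25 frames per allocation for readable tracebacks
    paths = []
    for i in range(count):
        snap = tracemalloc.take_snapshot()
        path = "%s-%03d.dump" % (path_prefix, i)
        snap.dump(path)  # serialized snapshot, reloadable with Snapshot.load()
        paths.append(path)
        if i + 1 < count:
            time.sleep(interval_s)
    return paths
```

The dumps can later be reloaded with `tracemalloc.Snapshot.load()` and compared pairwise, which is what the later comments in this thread do.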
Do we limit mom CPU, similar to how we do it with VDSM? Perhaps we are suffering from similar issues on multiple (many) physical cores as we used to have in VDSM?

(In reply to Yaniv Kaul from comment #10)
> Do we limit mom CPU, similar to how we do it with VDSM? Perhaps we are
> suffering from similar issues on multiple (many) physical cores as we used
> to have in VDSM?

No CPU limiting.

Eldad, can you provide the tracemalloc info?

Yes, we can do it, but we need some help from Francesco. Let's take it offline.

I identified the leaking call. MOM itself is fine; the leak happens somewhere in the call to vdsm_api.getAllVmStats when being done from mom/HypervisorInterfaces/vdsmxmlrpcInterface.py. The vdsm_api variable holds a reference to xmlrpclib.ServerProxy with the ssl transport configured by VDSM.

Caching the return value of the getAllVmStats call for 50000 seconds (a bit more than half a day) still showed a slow leak over the weekend (22 MB -> 22.2 MB). I also ran a test with the normal cache (5 s), no collectors, no controllers and no policy files, and that leaked as fast as the report suggests.

It remains to be seen whether the issue is in VDSM's vdscli code or in the underlying Python, because VDSM uses plain xml-rpc (over a Unix socket) to talk to MOM too. So we might want to check whether that is leaking as well.

Following comment #16, should we close this bug or move it to VDSM instead of mom?

We do not know yet. We might update the description a bit and move it to some other component, or we can use this bug to get rid of xml rpc in MOM. Some testing is still pending (Eldad's tracemalloc or possibly valgrind) to pinpoint the leak source exactly.

Any reason this should not go to 4.0.3?

On closer look at Monitor.py, it seems like we have a circular reference issue:

    def collect(self):
        ...
        for c in self.collectors:
            try:
                collected = c.collect()

weakref might resolve it...

Uh? Where do you see a circular reference?
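Whether a collector back-reference actually causes a leak can be tested directly with the gc module rather than guessed at. A minimal sketch with simplified stand-in classes (not the real MOM Monitor/Collector code): even if a collector holds a reference back to its monitor and forms a cycle, CPython's cyclic garbage collector still reclaims it, so a cycle alone would not explain a steadily growing RSS.

```python
import gc

class Collector:
    def __init__(self, monitor):
        self.monitor = monitor      # back-reference: monitor <-> collector cycle

class Monitor:
    def __init__(self):
        self.collectors = [Collector(self)]

def cycle_objects_freed():
    """Return how many unreachable objects the cyclic GC reclaimed."""
    gc.collect()                    # start from a clean slate
    m = Monitor()
    del m                           # refcounting alone cannot free the cycle...
    return gc.collect()             # ...but the cyclic collector can, and does
```

Replacing the back-reference with `weakref.proxy(monitor)` would avoid creating the cycle in the first place, but since the cyclic GC already handles it, that changes when reclamation happens, not whether it happens.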
There is no leak in the Python code according to Pympler. How were you measuring it? What were the sample points?

Both Monitor and Entity accumulate some data, but the amount should stay constant after a couple of minutes (the old entries are evicted once the cache size is reached).

(In reply to Martin Sivák from comment #23)
> Uh? Where do you see a circular reference?
>
> There is no leak in the Python code according to Pympler, how were you
> measuring it? What were the sample points?

By tracemalloc, see the screenshot attached. I did a memory dump every 1800 s, for 12 hours.

The screenshot has no timing data, what does it show? I see an increase in memory, but is this an increase over the last half hour, or over the 12 hours since the start? And I was also asking where you see the circular dependency in the attached code snippet.

The screenshot describes a comparison of two memory dumps, over a range of 12 hours. Also in the screenshot you can see the module name and the exec line, i.e. Monitor.py:102, which is this line:

    data[key] = val

which is inside of, and refers to, the circular reference.

Attaching the memory dumps as well.

(In reply to Eldad Marciano from comment #26)
> the screen shoot describes two memory dumps comparison, in range of 12
> hours.

When was the first snapshot taken in terms of mom startup? How long was the process running? Is the machine still up, and can you take another one?

> also in the screenshot you can see the module name and the exec line
> i.e
> Monitor.py:102 which is this line
> data[key] = val
>
> which is inside and refer to the circular reference.

There is no circular reference. This is a pure and simple dictionary with a scalar value assigned to it. The value is rewritten every 10 seconds, and the old value is garbage collected at Monitor.py:102.
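The point about `data[key] = val` can be demonstrated directly: rebinding a dictionary key drops the dictionary's reference to the previous value, which is then reclaimed by ordinary reference counting. A small CPython-specific sketch (the key name is illustrative, not taken from Monitor.py):

```python
import sys

data = {}
old = object()
data["mem.free"] = old                  # the dict now holds a reference to old
refs_with_dict = sys.getrefcount(old)
data["mem.free"] = object()             # rebinding releases the dict's reference
refs_after = sys.getrefcount(old)       # exactly one reference fewer than before
```

So the assignment at Monitor.py:102 cannot pin old values in memory by itself; each rewrite makes the previous value collectable.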
> attaching the memory dumps as well

I loaded the tracemalloc dumps, and the first dump showed almost no data except a couple of threads. Was it taken right after MOM was started? Then it is expected that the memory usage will grow a bit before stabilizing. And 1 MiB is hardly excessive for mom variables and caches.

How many VMs were you running?

Can you take another dump and compare it with the second one (like t+12h and t+24h) to see whether the memory is actually leaking?

The ssl and httplib seem to be eating quite a lot for what they are supposed to be doing, but still under another megabyte each.

(In reply to Martin Sivák from comment #29)
> When was the first snapshot taken in terms of mom startup? How long was the
> process running? Is the machine still up and can you take another one?
>
> How many VMs were you running?
>
> Can you take another dump and compare it with the second one (like t+12h and
> t+24h) to see whether the memory is actually leaking?
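The comparison asked for here (t+12h vs t+24h) is directly supported by tracemalloc. A minimal sketch, assuming the dumps were written with `Snapshot.dump()`:

```python
import tracemalloc

def top_growth(dump_a, dump_b, limit=10):
    """Load two tracemalloc dumps and return the allocation sites that grew most."""
    first = tracemalloc.Snapshot.load(dump_a)
    second = tracemalloc.Snapshot.load(dump_b)
    # compare_to() returns StatisticDiff entries sorted by size delta, largest first
    return second.compare_to(first, 'lineno')[:limit]
```

Printing each returned entry shows the file and line together with its size delta, which is how growth attributed to ssl or httplib (rather than MOM's own modules) would show up.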
Right, well, we have approximately the same picture for a continuous run over the weekend. As I mentioned at the beginning, we are running around ~77 VMs; for the weekend run we had 80 VMs. I also did some comparison with a lower number of VMs to see how it affects the leakage (see https://bugzilla.redhat.com/show_bug.cgi?id=1346926#c3).

Anyway, this leakage is in the range of a few MB, not the 100 mentioned at the beginning of the thread. Not sure what changed in this version, but it is not at the same rate, which is good.

mom version: mom-0.5.5-1.el7ev.noarch

Anyway, it is a very small leakage, which it would be nice to solve.

Attaching dumps.

This bug reproduced in master 4.1.0-0 at a rate of 5.3 MB in 12 hours.

Eldad: again, we need more information than that. Did you let MOM stabilize its caches before taking the first snapshot?

The pieces in 4.0.6 (mom and vdsm) are now configured to use jsonrpc, and that should solve the leak (as it was caused by something inside xmlrpc).

Verified on top of build 4.0.6-1. Currently there is a leakage of 0.37 MB (378 KB) over 12 hours, i.e. 31.5 KB per hour. We can close this bug and open a new one with lower priority.

Hi,

so might I kindly ask where this new bug is?

Thanks

Sven

(In reply to Sven Kieske from comment #42)
> Hi,
>
> so might I kindly ask where this new bug is?
>
> Thanks
>
> Sven

Sure, here is the link: https://bugzilla.redhat.com/show_bug.cgi?id=1399567
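The per-hour figure quoted in the verification comment is simply the observed RSS delta divided by the measurement window; a trivial helper makes the arithmetic explicit (numbers taken from the comment above):

```python
def leak_rate_kib_per_hour(delta_bytes, hours):
    """Convert an observed RSS delta over a time window into KiB per hour."""
    return delta_bytes / 1024.0 / hours

# verification run: ~378 KiB of growth over 12 hours
rate = leak_rate_kib_per_hour(378 * 1024, 12)  # -> 31.5 KiB/h
```

At that rate the residual leak amounts to well under 1 MiB per day, consistent with the decision to track it in a separate, lower-priority bug.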