Bug 1130045
| Summary: | Very high memory consumption | | |
|---|---|---|---|
| Product: | [Retired] oVirt | Reporter: | Yevgeny Zaspitsky <yzaspits> |
| Component: | ioprocess | Assignee: | Yeela Kaplan <ykaplan> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Petr Kubica <pkubica> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.5 | CC: | asegurap, bazulay, bugs, daniel.helgenberger, danken, ecohen, fromani, gklein, iheim, mgoldboi, mst, oourfali, rbalakri, s.kieske, yeylon, ykaplan |
| Target Milestone: | --- | Keywords: | Regression |
| Target Release: | 3.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | infra | | |
| Fixed In Version: | 0.12 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-10-17 12:22:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1073943 | | |
Description
Yevgeny Zaspitsky, 2014-08-14 08:37:54 UTC
Barak, Yevgeny performed a couple of setupNetworks operations, but memory consumption rose much later, during two weeks of polling by Engine. We don't know what in Vdsm introduced this leak (in the past it was an m2crypto bug), so in the meantime it's a general infrastructural issue.

I did some research yesterday on the memory consumption issue, because I stumbled on the same issue. Here's what I found so far; hope it helps somehow:

1. First, I have an old development setup of mine with Engine ~3.5.0 (something before RC1), VDSM 4.16.2 (a very recent snapshot of the 3.5.0 branch), using XML-RPC and polling, SSL *disabled*. VDSM runs on stock RHEL7, Engine on F19.
2. On that setup, I quickly and hackishly bolted heapy <http://guppy-pe.sourceforge.net/#Heapy> onto the HostStatsThread. This gave me a snapshot of the heap state every 2 seconds in a new log file (see the sketch after this list).
3. I restarted the patched VDSM and booted 10 empty VMs (no OS installed), then I waited. I sampled the memory consumption over the next couple of hours and indeed it grows slowly but steadily.
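A minimal sketch of such a hook (assuming guppy/heapy is installed; the 2-second interval and the heap logger follow the description above, the rest of the wiring is illustrative and not the actual VDSM patch):

```python
import logging
import threading
import time

from guppy import hpy


def start_heap_sampler(interval=2.0, logger_name='heap'):
    """Log a heapy partition table every `interval` seconds."""
    log = logging.getLogger(logger_name)
    h = hpy()

    def run():
        while True:
            # str(h.heap()) renders the "Partition of a set of N
            # objects" table seen in the excerpts below.
            log.info('%s', h.heap())
            time.sleep(interval)

    t = threading.Thread(target=run, name='heap-sampler')
    t.daemon = True
    t.start()
    return t
```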
Here are a couple of excerpts from that log:

```
<451 more rows. Type e.g. '_.more' to view.>
Thread-12::INFO::2014-08-26 17:23:00,689::sampling::533::heap::(run) Partition of a set of 218275 objects. Total size = 29054000 bytes.
 Index  Count   %     Size    % Cumulative   % Kind (class / dict of class)
     0  83768  38  8757872   30    8757872  30 str
     1  37991  17  3263624   11   12021496  41 tuple
     2   4254   2  2673744    9   14695240  51 dict (no owner)
     3    408   0  1434432    5   16129672  56 dict of module
     4  11263   5  1351560    5   17481232  60 function
     5   9703   4  1241984    4   18723216  64 types.CodeType
     6   1244   1  1125512    4   19848728  68 type
     7  13130   6  1083840    4   20932568  72 unicode
     8   1244   1   993056    3   21925624  75 dict of type
     9   4945   2   730888    3   22656512  78 list
<451 more rows. Type e.g. '_.more' to view.>
Thread-12::INFO::2014-08-26 17:23:03,213::sampling::533::heap::(run) Partition of a set of 218275 objects. Total size = 29054040 bytes.
 Index  Count   %     Size    % Cumulative   % Kind (class / dict of class)
     0  83768  38  8757872   30    8757872  30 str
     1  37991  17  3263624   11   12021496  41 tuple
     2   4254   2  2673744    9   14695240  51 dict (no owner)
     3    408   0  1434432    5   16129672  56 dict of module
     4  11263   5  1351560    5   17481232  60 function
     5   9703   4  1241984    4   18723216  64 types.CodeType
     6   1244   1  1125512    4   19848728  68 type
     7  13130   6  1083840    4   20932568  72 unicode
     8   1244   1   993056    3   21925624  75 dict of type
     9   4945   2   730888    3   22656512  78 list
<451 more rows. Type e.g. '_.more' to view.>
Thread-12::INFO::2014-08-26 17:23:05,961::sampling::533::heap::(run) Partition of a set of 218318 objects. Total size = 29058840 bytes.
 Index  Count   %     Size    % Cumulative   % Kind (class / dict of class)
     0  83769  38  8757984   30    8757984  30 str
     1  37993  17  3263768   11   12021752  41 tuple
     2   4256   2  2674304    9   14696056  51 dict (no owner)
     3    408   0  1434432    5   16130488  56 dict of module
     4  11263   5  1351560    5   17482048  60 function
     5   9703   4  1241984    4   18724032  64 types.CodeType
     6   1244   1  1125512    4   19849544  68 type
     7  13140   6  1084608    4   20934152  72 unicode
     8   1244   1   993056    3   21927208  75 dict of type
     9   4945   2   730888    3   22658096  78 list
<451 more rows. Type e.g. '_.more' to view.>
Thread-12::INFO::2014-08-26 17:23:08,414::sampling::533::heap::(run) Partition of a set of 218318 objects. Total size = 29058888 bytes.
 Index  Count   %     Size    % Cumulative   % Kind (class / dict of class)
     0  83769  38  8757984   30    8757984  30 str
     1  37993  17  3263768   11   12021752  41 tuple
     2   4256   2  2674304    9   14696056  51 dict (no owner)
     3    408   0  1434432    5   16130488  56 dict of module
     4  11263   5  1351560    5   17482048  60 function
     5   9703   4  1241984    4   18724032  64 types.CodeType
     6   1244   1  1125512    4   19849544  68 type
     7  13140   6  1084608    4   20934152  72 unicode
     8   1244   1   993056    3   21927208  75 dict of type
     9   4945   2   730888    3   22658096  78 list
<451 more rows. Type e.g. '_.more' to view.>
Thread-12::INFO::2014-08-26 17:23:10,826::sampling::533::heap::(run) Partition of a set of 218399 objects. Total size = 29067696 bytes.
 Index  Count   %     Size    % Cumulative   % Kind (class / dict of class)
     0  83768  38  8757872   30    8757872  30 str
     1  37991  17  3263624   11   12021496  41 tuple
     2   4260   2  2676960    9   14698456  51 dict (no owner)
     3    408   0  1434432    5   16132888  56 dict of module
     4  11263   5  1351560    5   17484448  60 function
     5   9703   4  1241984    4   18726432  64 types.CodeType
     6   1244   1  1125512    4   19851944  68 type
     7  13172   6  1087216    4   20939160  72 unicode
     8   1244   1   993056    3   21932216  75 dict of type
     9   4945   2   730888    3   22663104  78 list
<451 more rows. Type e.g. '_.more' to view.>
```

We can see here that the overall size (object count, total size) is slowly growing, but we must take the GC intervals into account. Here's a more explicit excerpt after ~50 minutes of runtime:

```
Thread-12::INFO::2014-08-26 16:36:38,242::sampling::533::heap::(run) Partition of a set of 160330 objects. Total size = 20055064 bytes.
 Index  Count   %     Size    % Cumulative   % Kind (class / dict of class)
     0  74611  47  5956520   30    5956520  30 str
     1  36086  23  3098160   15    9054680  45 tuple
     2    400   0  1407616    7   10462296  52 dict of module
     3   1863   1  1288104    6   11750400  59 dict (no owner)
     4   9579   6  1226112    6   12976512  65 types.CodeType
     5   9947   6  1193640    6   14170152  71 function
     6   1246   1  1127320    6   15297472  76 type
     7   1246   1   989392    5   16286864  81 dict of type
     8   4412   3   615904    3   16902768  84 list
     9    442   0   440176    2   17342944  86 dict of class
<414 more rows. Type e.g. '_.more' to view.>
Thread-12::INFO::2014-08-26 17:26:47,470::sampling::533::heap::(run) Partition of a set of 221002 objects. Total size = 29355984 bytes.
 Index  Count   %     Size    % Cumulative   % Kind (class / dict of class)
     0  83768  38  8757872   30    8757872  30 str
     1  37991  17  3263624   11   12021496  41 tuple
     2   4386   2  2744496    9   14765992  50 dict (no owner)
     3    408   0  1434432    5   16200424  55 dict of module
     4  11263   5  1351560    5   17551984  60 function
     5   9703   4  1241984    4   18793968  64 types.CodeType
     6  14054   6  1158112    4   19952080  68 unicode
     7   1244   1  1125512    4   21077592  72 type
     8   1244   1   993056    3   22070648  75 dict of type
     9   4945   2   730888    2   22801536  78 list
```

The comparison is not entirely fair: at 16:36 VDSM was doing nothing, while at 17:26 it was sampling 10 VMs. But here we can clearly see a huge increase in unicode objects. A slower growth of unicode objects is also noticeable in the former excerpt. The other kinds of objects may be increasing at a much slower rate, or may just be floating around a stable value. The first suspect by far is unicode, but unfortunately I can't offer more data yet.

The heapy patches I made are definitely too hacky and dirty, so I took a different approach and crafted a more reusable patch leveraging a different profiler, which seems more suitable for VDSM (rationale in the commit message): http://gerrit.ovirt.org/#/c/32019/1

According to a profile gathered using http://gerrit.ovirt.org/#/c/32019/2, it seems there is a leak of ioprocess.CmdResult objects.

*** Bug 1124451 has been marked as a duplicate of this bug. ***

fromani and I found two memory leaks in the python bindings. I updated ykaplan on them. Basically they are:

- (communicate thread): CmdResult objects are not being removed from the pendingrequests dictionary, so they will be held onto forever and they accumulate. Changing .get to .pop would fix it.
- (sendCommand <-> communicate thread): When sendCommand stops waiting for completion due to a timeout, that is, when the timeout expires and the event is not set, it should notify the communicate thread just before raising Timeout, so that the thread removes the task from pendingrequests. This is fine because communicate already handles the case of an unknown (in this case, no longer waited upon) task being received.
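A condensed sketch of what the two fixes amount to (the structure and names, e.g. `PendingRequest` and `_pending`, are illustrative stand-ins for the real ioprocess bindings; also, here the timed-out caller drops its own entry rather than notifying the communicate thread, which achieves the same cleanup):

```python
import threading


class Timeout(Exception):
    pass


class PendingRequest(object):
    def __init__(self):
        self.event = threading.Event()
        self.result = None  # will hold the CmdResult equivalent


class IOProcessClient(object):
    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}  # request id -> PendingRequest

    def _handle_response(self, req_id, response):
        # Fix 1 (communicate thread): pop() instead of get(), so the
        # entry is removed once answered and result objects can no
        # longer accumulate in the dictionary forever.
        with self._lock:
            pending = self._pending.pop(req_id, None)
        if pending is None:
            # Unknown request id: a response nobody is waiting for
            # anymore (e.g. the caller timed out). Safe to ignore.
            return
        pending.result = response
        pending.event.set()

    def send_command(self, req_id, pending, timeout):
        with self._lock:
            self._pending[req_id] = pending
        if not pending.event.wait(timeout):
            # Fix 2 (sendCommand): on timeout, make sure the pending
            # entry is discarded, so a late response takes the
            # "unknown request" path above instead of leaking.
            with self._lock:
                self._pending.pop(req_id, None)
            raise Timeout()
        return pending.result
```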
Will this be backported to the oVirt 3.4.x branch? Right now I have a cron job restarting vdsmd once a week on my hosts (I am running into BZ 1124451, which was flagged as a duplicate).

If you are seeing a memory leak with ovirt-3.4, it's another bug. Both this bug and its duplicate relate to ioprocess, which was introduced as part of ovirt-3.5. Could you explain more about your memory leak? Preferably elsewhere, since this bug is about the ioprocess issue.

Verified in 3.5.0-0.0.master.20140923231936.git42065cc.el6.

oVirt 3.5 has been released and should include the fix for this issue.
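For reference, a hypothetical way to spot-check whether CmdResult objects accumulate in a running process (not part of the fix; assumes the bindings expose `CmdResult` as the profile above suggests):

```python
import gc

from ioprocess import CmdResult  # as reported by the profile above


def count_cmd_results():
    # Count live CmdResult instances; a number that grows steadily
    # across calls indicates the leak described in this bug.
    return sum(1 for obj in gc.get_objects()
               if isinstance(obj, CmdResult))
```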