Bug 1130045

Summary: Very high memory consumption
Product: [Retired] oVirt
Reporter: Yevgeny Zaspitsky <yzaspits>
Component: ioprocess
Assignee: Yeela Kaplan <ykaplan>
Status: CLOSED CURRENTRELEASE
QA Contact: Petr Kubica <pkubica>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.5
CC: asegurap, bazulay, bugs, daniel.helgenberger, danken, ecohen, fromani, gklein, iheim, mgoldboi, mst, oourfali, rbalakri, s.kieske, yeylon, ykaplan
Target Milestone: ---
Keywords: Regression
Target Release: 3.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: infra
Fixed In Version: 0.12
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-10-17 12:22:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1073943

Description Yevgeny Zaspitsky 2014-08-14 08:37:54 UTC
Description of problem:
vdsm process consumes all available memory on the host.

Version-Release number of selected component (if applicable):
vdsm-4.16.1-0.gita4d9abf.el6
libvirt-0.10.2-29.el6_5.9
python-cpopen-1.3-1.el6
ioprocess-0.5.0-1.el6

OS: CentOS release 6.5 (Final)

How reproducible:


Steps to Reproduce:
1. connect to Engine.
2. run few VMs.
3. configure networks.
4. wait 14 days

Actual results:
"OSError: [Errno 12] Cannot allocate memory" in vdsm.log file.

[root@yzaspits-vdsc vdsm]# ps -o etime,rss,vsz 24135
    ELAPSED   RSS    VSZ
14-15:55:04 3498112 4682764

Expected results:


Additional info:

Comment 1 Dan Kenigsberg 2014-08-26 16:26:04 UTC
Barak, Yevgeny performed a couple of setupNetworks operations, but memory consumption rose much later, during two weeks of polling by Engine.

We don't know what in Vdsm introduced this leak (in the past it was an m2crypto bug), so for the time being it is treated as a general infrastructural issue.

Comment 2 Francesco Romani 2014-08-27 06:33:42 UTC
I did some research on the memory consumption issue yesterday, because I ran into it myself.

Here's what I found so far, hope it helps somehow:

1. First, I have an old development setup of mine: Engine at roughly 3.5.0 before RC1, VDSM 4.16.2 (a very recent snapshot of the 3.5.0 branch), using XML-RPC and polling, SSL *disabled*. VDSM runs on stock RHEL7, Engine on F19.

2. On that setup, I quickly and hackishly bolted heapy <http://guppy-pe.sourceforge.net/#Heapy> onto the HostStatsThread. This gave me a snapshot of the heap state every 2 seconds in a new log file (a rough sketch of this kind of instrumentation follows the list below).

3. I restarted the patched VDSM and booted 10 empty VMs (no OS installed), then I waited.
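
For reference, a minimal sketch of that kind of instrumentation (assuming the guppy package, which provides heapy, is installed; the logger name and the 2-second interval are illustrative, this is not the actual patch):

import logging
import threading
import time

from guppy import hpy

def log_heap_snapshots(interval=2.0):
    """Log a heapy partition of the live heap every `interval` seconds."""
    log = logging.getLogger('heap')
    heap = hpy()
    while True:
        # str() of the partition produces the "Partition of a set of N
        # objects. Total size = M bytes." table seen in the excerpts below.
        log.info('%s', heap.heap())
        time.sleep(interval)

# Run the sampler in a background thread so it does not block VDSM itself.
sampler = threading.Thread(target=log_heap_snapshots)
sampler.daemon = True
sampler.start()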

I sampled the memory consumption over the next couple of hours and indeed it grows slowly but steadily. Here are a couple of excerpts from that log:

---
<451 more rows. Type e.g. '_.more' to view.>
Thread-12::INFO::2014-08-26 17:23:00,689::sampling::533::heap::(run) Partition of a set of 218275 objects. Total size = 29054000 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  83768  38  8757872  30   8757872  30 str
     1  37991  17  3263624  11  12021496  41 tuple
     2   4254   2  2673744   9  14695240  51 dict (no owner)
     3    408   0  1434432   5  16129672  56 dict of module
     4  11263   5  1351560   5  17481232  60 function
     5   9703   4  1241984   4  18723216  64 types.CodeType
     6   1244   1  1125512   4  19848728  68 type
     7  13130   6  1083840   4  20932568  72 unicode
     8   1244   1   993056   3  21925624  75 dict of type
     9   4945   2   730888   3  22656512  78 list
<451 more rows. Type e.g. '_.more' to view.>
Thread-12::INFO::2014-08-26 17:23:03,213::sampling::533::heap::(run) Partition of a set of 218275 objects. Total size = 29054040 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  83768  38  8757872  30   8757872  30 str
     1  37991  17  3263624  11  12021496  41 tuple
     2   4254   2  2673744   9  14695240  51 dict (no owner)
     3    408   0  1434432   5  16129672  56 dict of module
     4  11263   5  1351560   5  17481232  60 function
     5   9703   4  1241984   4  18723216  64 types.CodeType
     6   1244   1  1125512   4  19848728  68 type
     7  13130   6  1083840   4  20932568  72 unicode
     8   1244   1   993056   3  21925624  75 dict of type
     9   4945   2   730888   3  22656512  78 list
<451 more rows. Type e.g. '_.more' to view.>
Thread-12::INFO::2014-08-26 17:23:05,961::sampling::533::heap::(run) Partition of a set of 218318 objects. Total size = 29058840 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  83769  38  8757984  30   8757984  30 str
     1  37993  17  3263768  11  12021752  41 tuple
     2   4256   2  2674304   9  14696056  51 dict (no owner)
     3    408   0  1434432   5  16130488  56 dict of module
     4  11263   5  1351560   5  17482048  60 function
     5   9703   4  1241984   4  18724032  64 types.CodeType
     6   1244   1  1125512   4  19849544  68 type
     7  13140   6  1084608   4  20934152  72 unicode
     8   1244   1   993056   3  21927208  75 dict of type
     9   4945   2   730888   3  22658096  78 list
<451 more rows. Type e.g. '_.more' to view.>
Thread-12::INFO::2014-08-26 17:23:08,414::sampling::533::heap::(run) Partition of a set of 218318 objects. Total size = 29058888 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  83769  38  8757984  30   8757984  30 str
     1  37993  17  3263768  11  12021752  41 tuple
     2   4256   2  2674304   9  14696056  51 dict (no owner)
     3    408   0  1434432   5  16130488  56 dict of module
     4  11263   5  1351560   5  17482048  60 function
     5   9703   4  1241984   4  18724032  64 types.CodeType
     6   1244   1  1125512   4  19849544  68 type
     7  13140   6  1084608   4  20934152  72 unicode
     8   1244   1   993056   3  21927208  75 dict of type
     9   4945   2   730888   3  22658096  78 list
<451 more rows. Type e.g. '_.more' to view.>
Thread-12::INFO::2014-08-26 17:23:10,826::sampling::533::heap::(run) Partition of a set of 218399 objects. Total size = 29067696 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  83768  38  8757872  30   8757872  30 str
     1  37991  17  3263624  11  12021496  41 tuple
     2   4260   2  2676960   9  14698456  51 dict (no owner)
     3    408   0  1434432   5  16132888  56 dict of module
     4  11263   5  1351560   5  17484448  60 function
     5   9703   4  1241984   4  18726432  64 types.CodeType
     6   1244   1  1125512   4  19851944  68 type
     7  13172   6  1087216   4  20939160  72 unicode
     8   1244   1   993056   3  21932216  75 dict of type
     9   4945   2   730888   3  22663104  78 list
<451 more rows. Type e.g. '_.more' to view.>
---

We can see here that the overall size (object count, total size) is slowly growing, but we must take the GC intervals into account.
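
One way to take the collector's timing out of the picture (not something done in these excerpts, just a possible refinement) is to force a full collection right before each snapshot:

import gc

from guppy import hpy

heap = hpy()

# Forcing a collection before sampling means that differences between
# consecutive snapshots reflect objects that are still reachable, not
# garbage that simply has not been collected yet.
gc.collect()
print(heap.heap())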

Here's a more explicit excerpt after ~50 minutes of runtime:

Thread-12::INFO::2014-08-26 16:36:38,242::sampling::533::heap::(run) Partition of a set of 160330 objects. Total size = 20055064 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  74611  47  5956520  30   5956520  30 str
     1  36086  23  3098160  15   9054680  45 tuple
     2    400   0  1407616   7  10462296  52 dict of module
     3   1863   1  1288104   6  11750400  59 dict (no owner)
     4   9579   6  1226112   6  12976512  65 types.CodeType
     5   9947   6  1193640   6  14170152  71 function
     6   1246   1  1127320   6  15297472  76 type
     7   1246   1   989392   5  16286864  81 dict of type
     8   4412   3   615904   3  16902768  84 list
     9    442   0   440176   2  17342944  86 dict of class
<414 more rows. Type e.g. '_.more' to view.>
 
 
Thread-12::INFO::2014-08-26 17:26:47,470::sampling::533::heap::(run) Partition of a set of 221002 objects. Total size = 29355984 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  83768  38  8757872  30   8757872  30 str
     1  37991  17  3263624  11  12021496  41 tuple
     2   4386   2  2744496   9  14765992  50 dict (no owner)
     3    408   0  1434432   5  16200424  55 dict of module
     4  11263   5  1351560   5  17551984  60 function
     5   9703   4  1241984   4  18793968  64 types.CodeType
     6  14054   6  1158112   4  19952080  68 unicode
     7   1244   1  1125512   4  21077592  72 type
     8   1244   1   993056   3  22070648  75 dict of type
     9   4945   2   730888   2  22801536  78 list

The comparison is not entirely fair: at 16:36 VDSM was doing nothing, while at 17:26 it was sampling 10 VMs. But here we can clearly see a huge increase in unicode objects. A slower growth of unicode objects is also noticeable in the earlier excerpt.

The other kinds of objects may be increasing at a much slower rate, or may just be fluctuating around a stable value.

The first suspect by far is unicode, but unfortunately I can't offer more data yet.
The heapy patches I made are definitely too hacky and dirty, so I took a different approach and crafted a more reusable patch leveraging a different profiler, which seems more suitable for VDSM (rationale in the commit message): http://gerrit.ovirt.org/#/c/32019/1

Comment 3 Francesco Romani 2014-08-27 09:23:04 UTC
According to the profile gathered using http://gerrit.ovirt.org/#/c/32019/2, it seems there is a leak of ioprocess.CmdResult objects.

Comment 4 Dan Kenigsberg 2014-08-27 12:33:04 UTC
*** Bug 1124451 has been marked as a duplicate of this bug. ***

Comment 5 Antoni Segura Puimedon 2014-08-27 19:19:02 UTC
fromani and I found two memory leaks in the python bindings. I updated ykaplan on them. Basically they are:

(communicate thread): CmdResult objects are not being removed from the pendingrequests dictionary, so they are held forever and accumulate. Changing .get to .pop would fix it.

(sendCommand <-> communicate thread): When sendCommand stops waiting for completion due to a timeout, i.e. the timeout expires and the event is not set, it should notify the communicate thread just before raising Timeout, so that the thread removes the task from pendingrequests. This is safe because communicate already handles receiving an unknown (in this case, no-longer-waited-upon) task.
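
A minimal sketch of the first fix (names such as _pendingRequests, request_id and _handle_response are placeholders for illustration, not the actual ioprocess binding code):

import threading

class CmdResult(object):
    def __init__(self):
        self.event = threading.Event()
        self.result = None

# request id -> CmdResult; filled by sendCommand, drained by communicate
_pendingRequests = {}

def _handle_response(request_id, response):
    # Leaky variant: _pendingRequests.get(request_id) returns the entry but
    # leaves it in the dictionary forever.
    # Fixed variant: pop() both fetches and removes it, so completed (or
    # abandoned) requests no longer accumulate.
    result = _pendingRequests.pop(request_id, None)
    if result is None:
        # Unknown or no-longer-waited-upon request (e.g. the caller already
        # timed out and cleaned up); nothing to deliver.
        return
    result.result = response
    result.event.set()

The second fix would make the timed-out sendCommand path hand the stale request id back to the communicate thread (which, as noted above, already tolerates unknown tasks), so the same removal happens for abandoned requests.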

Comment 6 Daniel Helgenberger 2014-09-24 15:02:19 UTC
Will this be backported to the oVirt 3.4.x branch? Right now, I have a cron job restarting vdsmd once a week on my hosts (I am running into BZ 1124451, which was flagged as a duplicate).

Comment 7 Dan Kenigsberg 2014-09-24 15:29:16 UTC
If you are seeing a memory leak with ovirt-3.4, it's another bug. Both this bug and its dup relate to ioprocess, which was introduced as part of ovirt-3.5.

Could you explain more about your memory leak? Preferably elsewhere, since this bug is about the ioprocess issue.

Comment 8 Petr Kubica 2014-10-15 12:23:46 UTC
Verified in 3.5.0-0.0.master.20140923231936.git42065cc.el6

Comment 9 Sandro Bonazzola 2014-10-17 12:22:13 UTC
oVirt 3.5 has been released and should include the fix for this issue.