Bug 1414479

Summary: [scale] - vdsm minor memory leak - 3 MB in 12 hours (running on a host with 111 VMs)
Product: [oVirt] vdsm               Reporter: Eldad Marciano <emarcian>
Component: Core                     Assignee: Francesco Romani <fromani>
Status: CLOSED INSUFFICIENT_DATA    QA Contact: eberman
Severity: low                       Docs Contact:
Priority: unspecified
Version: 4.18.17                    CC: bugs, emarcian, fromani, mperina, nsoffer, oourfali, tjelinek, ybronhei, ykaul
Target Milestone: ovirt-4.2.0       Flags: rule-engine: ovirt-4.2+
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:                   Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:                           Environment:
Last Closed: 2017-04-24 15:38:19 UTC    Type: Bug
Regression: ---                     Mount Type: ---
Documentation: ---                  CRM:
Verified Versions:                  Category: ---
oVirt Team: Virt                    RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---                Target Upstream Version:
Embargoed:

Description Eldad Marciano 2017-01-18 15:46:32 UTC
Description of problem:
When running a scale-up test with 111 idle VMs, vdsm leaks roughly 3 MB of memory over 12 hours.

We need to rerun the test with vdsm profiling logs enabled, or investigate it with tracemalloc.
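A minimal sketch of how such a tracemalloc rerun could look (illustrative only, not vdsm code; tracemalloc is in the Python 3.4+ standard library, so the Python 2 vdsm of this era would need the pytracemalloc backport):

import time
import tracemalloc

tracemalloc.start(25)                    # keep up to 25 frames per allocation
baseline = tracemalloc.take_snapshot()   # snapshot once all VMs are up

while True:
    time.sleep(3600)                     # sample hourly over the 12h run
    snapshot = tracemalloc.take_snapshot()
    # report the call sites whose allocations grew the most since the baseline
    for stat in snapshot.compare_to(baseline, 'lineno')[:10]:
        print(stat)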

For now, I am just filing the bug.

assets profile:
1 cluster
1 NFS SD
1 host
111 VMs with 1 disk each.

test profile:
ramp up to 111 VMs and let them run for 12 hours.

Version-Release number of selected component (if applicable):
4.1-4 zstream 

How reproducible:
100%

Steps to Reproduce:
1. Run 111 VMs for 12 hours.


Actual results:
vdsm resident memory grows by ~3 MB over 12 hours.

Expected results:
Stable memory utilization once the VMs are up and running.

Additional info:

Comment 2 Yaniv Kaul 2017-01-18 19:03:41 UTC
Is the rate correlated with the number of VMs?
How much is VDSM consuming overall?
Does restarting VDSM bring it back to the initial value?

Comment 3 Eldad Marciano 2017-01-19 12:27:10 UTC
(In reply to Yaniv Kaul from comment #2)
> Is the rate correlated with the number of VMs?
We need to test it to provide an answer.
but once all the VMs were populated and running, vdsm started leaking.
In other words, the leak may come from vdsm monitoring.

> How much is VDSM consuming overall?
182 MB (the last sample taken during the test).

after 3 days:
-=>>ps -eo pid,rss |grep 15189 |awk '{print $2 / 1024}'
194.59

vdsm resident memory utilization (over the test period):
min 79 MB, max 185 MB, avg 173 MB, last 182 MB.

> Does restarting VDSM bring it back to the initial value?
No, it comes back to roughly (±) the same amount of memory.

before restart:
-=>>ps -eo pid,rss |grep 15189 |awk '{print $2 / 1024}'
194.59

after restart:
-=>>ps -eo pid,rss,cmd |grep '/usr/bin/python2 /usr/share/vdsm/vdsm' |awk '{print $2 / 1024}'
187.422

That brings us back to your first question:
the initial memory footprint is correlated with the number of VMs:
- no VMs = 79 MB
- 111 VMs = 187 MB
But that does not change the fact that vdsm has a leak.
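As a rough back-of-the-envelope estimate from the two endpoints above (assuming the baseline footprint scales linearly with VM count):

(187 MB - 79 MB) / 111 VMs ≈ 0.97 MB of baseline footprint per VM

which is separate from the ~3 MB growth over 12 hours reported as the suspected leak.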

Comment 4 Oved Ourfali 2017-01-23 13:13:36 UTC
Moving to Virt as it seems related to the number of VMs.
Francesco, can you explore this one?
Move back to infra if you feel that it is an infra issue.

Comment 5 Tomas Jelinek 2017-01-25 10:13:59 UTC
Well, it is hard to see from this sample if this is even a leak. Especially considering that vdsm is eating here about 200MB, that 3MB difference can be caused by lots of factors.

Could you please try to let it run for about a week and make a sample every 12h (more often would be better) to see if it really grows the whole time or is it just going up and down?

If it indeed grows, please provide all due logs so we can look at the issue.
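
For example, a small sampler along these lines could collect that data (illustrative sketch only; the pgrep pattern matches the vdsm command line shown in comment 3, and the output path is an arbitrary choice):

import subprocess
import time

OUT = '/var/tmp/vdsm_rss.csv'        # illustrative output path
PATTERN = '/usr/share/vdsm/vdsm'     # same pattern as the ps one-liners above
INTERVAL = 12 * 3600                 # one sample every 12 hours

def vdsm_rss_kib():
    # first pid whose command line matches the vdsm main process
    pid = subprocess.check_output(['pgrep', '-f', PATTERN]).split()[0].decode()
    with open('/proc/%s/status' % pid) as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])  # value is reported in kB
    return 0

while True:
    with open(OUT, 'a') as out:
        out.write('%d,%d\n' % (time.time(), vdsm_rss_kib()))
    time.sleep(INTERVAL)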

Comment 6 Eldad Marciano 2017-01-26 09:48:48 UTC
(In reply to Tomas Jelinek from comment #5)
> Well, it is hard to see from this sample if this is even a leak. Especially
> considering that vdsm is eating here about 200MB, that 3MB difference can be
> caused by lots of factors.
> 
> Could you please try to let it run for about a week and make a sample every
> 12h (more often would be better) to see if it really grows the whole time or
> is it just going up and down?
> 
> If it indeed grows, please provide all due logs so we can look at the issue.

Sure, we can do that.
A week seems like too much, though; we cannot occupy the lab for that long.

Let's start with a weekend, and if that is not enough we can extend it.

Comment 7 Tomas Jelinek 2017-02-01 08:18:10 UTC
well, not sure if one weekend will help but lets try. Putting needinfo back

Comment 8 Eldad Marciano 2017-02-01 13:30:41 UTC
(In reply to Tomas Jelinek from comment #7)
> well, not sure if one weekend will help but lets try. Putting needinfo back

please set target release, severity, priority

Comment 9 Tomas Jelinek 2017-02-08 07:48:20 UTC
(In reply to Eldad Marciano from comment #8)
> (In reply to Tomas Jelinek from comment #7)
> > well, not sure if one weekend will help but lets try. Putting needinfo back
> 
> please set target release, severity, priority

I cannot set any of these until I get some data from which I can assess whether it is a bug and how severe it is.

Comment 10 Yaniv Kaul 2017-04-19 16:45:13 UTC
Please update the bug with latest findings.

Comment 11 Eldad Marciano 2017-04-24 15:01:10 UTC
(In reply to Yaniv Kaul from comment #10)
> Please update the bug with latest findings.

Currently it is targeted to 4.2, so we will handle it there.
If you think it needs higher priority, let's coordinate retargeting it to 4.1.x.