Bug 1364149

Summary: VDSM memory leak in logging warnings
Product: [oVirt] vdsm
Reporter: guy chen <guchen>
Component: General
Assignee: Francesco Romani <fromani>
Status: CLOSED CURRENTRELEASE
QA Contact: eberman
Severity: urgent
Priority: urgent
Version: 4.18.6
CC: bugs, fromani, gklein, guchen, michal.skrivanek, mkalinin, nsoffer, oourfali, pkliczew, ybronhei, ylavi
Target Milestone: ovirt-4.0.2
Keywords: Regression
Target Release: ---
Flags: rule-engine: ovirt-4.0.z+, rule-engine: blocker+, ylavi: planning_ack+, michal.skrivanek: devel_ack+, gklein: testing_ack+
Hardware: x86_64
OS: Linux
Doc Type: If docs needed, set a value
Last Closed: 2016-08-22 12:30:54 UTC
Type: Bug
oVirt Team: Virt
Attachments:
  28 of June Active VMS vs. VDSM RAM
  24 of July Active VMS vs. VDSM RAM

Description guy chen 2016-08-04 14:35:15 UTC
Description of problem:

When running the load with statistics enabled, the VDSM process memory grew over time and eventually reached 19 GB of RAM (out of 64 GB), so total server RAM utilization hit 100% and swap thrashing started to cause system failures (VMs crashed, etc.).

Version-Release number of selected component (if applicable):

vdsm-python-4.18.6-1.el7ev.noarch
vdsm-hook-vmfex-dev-4.18.6-1.el7ev.noarch
vdsm-infra-4.18.6-1.el7ev.noarch
vdsm-xmlrpc-4.18.6-1.el7ev.noarch
vdsm-jsonrpc-4.18.6-1.el7ev.noarch
vdsm-4.18.6-1.el7ev.x86_64
vdsm-cli-4.18.6-1.el7ev.noarch
vdsm-yajsonrpc-4.18.6-1.el7ev.noarch
vdsm-api-4.18.6-1.el7ev.noarch

Steps to Reproduce:
1. Started the load by creating 120 idle VMs on one host (see the illustrative sketch after this list)
2. VDSM process memory rose over time to 19 GB of RAM
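
For reference, creating such a load can be scripted; below is a minimal sketch, assuming the Python oVirt SDK v3 (ovirtsdk) of that era, with placeholder engine URL, credentials, cluster and template names:

# Illustrative load-creation sketch (not part of the original report),
# assuming the oVirt Python SDK v3 (ovirtsdk); URL, credentials, cluster
# and template names are placeholders.
from ovirtsdk.api import API
from ovirtsdk.xml import params

api = API(url="https://engine.example.com/ovirt-engine/api",
          username="admin@internal",
          password="secret",
          insecure=True)

for i in range(120):
    name = "leak-test-vm-%03d" % i
    api.vms.add(params.VM(name=name,
                          memory=512 * 1024 * 1024,  # 512 MB per VM
                          cluster=params.Cluster(name="Default"),
                          template=params.Template(name="Blank")))
    # In a real run you may need to wait for the VM to reach the Down
    # state before starting it; the VMs are then simply left idle.
    api.vms.get(name).start()

api.disconnect()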

Actual results:

Server memory reached 100% utilization

Expected results:

VDSM memory usage should remain stable, and server memory utilization should not reach 100%

Comment 1 Yaniv Bronhaim 2016-08-04 19:56:31 UTC
Hi Guy,
What does "run the load with statistics open" mean? please provide the exact steps you did as part of the test. 
I can only assume that you meant that you configured metrics_enabled to true, if that's the case - in the version you use we send metrics only about host info and not about vms - so can't be that because of the metrics_enabled config you got the leak.
In addition, when you run a vm you start qemu process which locates memory in the amount that you declare for the vm. currently I see in your system 111 qemu processes - each defined for 512mb ram ,means 50GB for all together - can be that this is the memory usage you refer to?
If not, did you try to change the metrics_enable config to false and got different result?

Comment 3 guy chen 2016-08-06 20:40:49 UTC
I configured metrics_enabled to true with statsd according to the steps provided in the blog post https://bronhaim.wordpress.com/2016/06/26/ovirt-metrics/.

In /etc/vdsm/vdsm.conf I configured:

[vars]
ssl = true

[addresses]
management_port = 54321

[reports]
# Enable metric collector (default: false)
enabled = true

# Address to remote metric collector (default: localhost)
collector_address = bkr-hv01.dsal.lab.eng.rdu.redhat.com

# Type of metric collector (supporting: statsd, hawkular)
collector_type = statsd

[devel]
# Enable Vdsm health monitoring.
health_monitor_enable = true
	
# Number of seconds to wait between health checks.
health_check_interval = 60
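
As a quick sanity check for this kind of setup (not something from the blog post), a throwaway listener can confirm that the host actually emits statsd metrics; a minimal sketch, assuming the collector uses the default statsd UDP port 8125:

# Throwaway statsd-style UDP listener (illustrative only, not part of VDSM);
# assumes the default statsd UDP port 8125.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 8125))
while True:
    data, addr = sock.recvfrom(65535)
    print("%s -> %s" % (addr[0], data.decode("utf-8", "replace").strip()))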

I then created 111 VMs on one host and turned them on, and the system stayed up for 10 days.

After that period of time the VDSM process memory went up to 19 GB - not the qemu processes, but VDSM itself.

Then, on Thursday, August 4, I configured metrics_enabled to false, by changing the parameters below to false and restarting the vdsmd service.

[reports]
# Enable metric collector (default: false)
enabled = false

[devel]
# Enable Vdsm health monitoring.
health_monitor_enable = false

Then again, I created 111 VMs on one host and turned them on.
After the VMs were up, the VDSM process took 430 MB of RAM.
Now, 54 hours later, VDSM takes 4.5 GB, as seen below:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                      
115737 vdsm       0 -20 10.833g 4.5g  11272 S  61.6  7.1 994:29.72 vdsm   
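
For reference, this kind of growth can be tracked over time with a small sampler; a minimal sketch (not VDSM code) that reads VmRSS from /proc for a given PID:

# Minimal RSS sampler (illustrative only, not VDSM code): print the resident
# set size of a process once a minute, e.g. "python rss_sampler.py 115737".
import sys
import time

def rss_kb(pid):
    with open("/proc/%s/status" % pid) as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value is reported in kB
    return 0

if __name__ == "__main__":
    pid = sys.argv[1]
    while True:
        print("%s VmRSS=%s kB" % (time.strftime("%F %T"), rss_kb(pid)))
        time.sleep(60)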

So I suspect the memory leak is not due to metrics being enabled.

Comment 4 guy chen 2016-08-07 05:45:05 UTC
Created attachment 1188322 [details]
28 of June Active VMS vs. VDSM RAM

28 of June, load test of first 4.0 version, Active VMS vs. VDSM RAM utilization

Comment 5 guy chen 2016-08-07 05:48:26 UTC
Created attachment 1188323 [details]
24 of July Active VMS vs. VDSM RAM

24 of July, load test of 4.0.1.1-0.1.el7ev version, Active VMS vs. VDSM RAM utilization

Comment 6 guy chen 2016-08-07 06:02:11 UTC
I have attached two comparison graphs: one from the load test of June 28, and one from the load test over the weekend with metrics disabled. We can see clearly that after the VMs are up, on the first load the VDSM memory stabilizes around 350 MB, while on the second load it continues rising, up to 1.25 GB.

Comment 7 Oved Ourfali 2016-08-07 08:35:18 UTC
Michal - can your team investigate?
I know you've had fixes around this area in the not so distant past.
Perhaps something got broken?

Comment 8 Francesco Romani 2016-08-08 12:59:21 UTC
(In reply to Oved Ourfali from comment #7)
> Michal - can you team investigate?
> I know you've had fixes around this one in the not so long past.
> Perhaps something got broken?

Last big thing I can remember is https://bugzilla.redhat.com/show_bug.cgi?id=1269424

But 4.18.6 has all the patches. So it could be a new thing.

Does anyone know if this happens with _all_ the 4.18.z releases? For example, does it also happen with 4.18.2?

Comment 9 Francesco Romani 2016-08-08 13:00:17 UTC
(In reply to guy chen from comment #0)
> Description of problem:
> 
> run the load with statistics open, the VDSM process memory raised over time
> and eventually got to 19 GB RAM usage (out of 64 GB RAM) , thus total server
> RAM utilization was 100% and SWAP trashing started to cause system failures
> ( VM's crashed etc` ).
> 
> Version-Release number of selected component (if applicable):
> 
> vdsm-python-4.18.6-1.el7ev.noarch
> vdsm-hook-vmfex-dev-4.18.6-1.el7ev.noarch
> vdsm-infra-4.18.6-1.el7ev.noarch
> vdsm-xmlrpc-4.18.6-1.el7ev.noarch
> vdsm-jsonrpc-4.18.6-1.el7ev.noarch
> vdsm-4.18.6-1.el7ev.x86_64
> vdsm-cli-4.18.6-1.el7ev.noarch
> vdsm-yajsonrpc-4.18.6-1.el7ev.noarch
> vdsm-api-4.18.6-1.el7ev.noarch
> 
> Steps to Reproduce:
> 1. started load creating 120 idle VMS on 1 host
> 2. VDSM process memory raised over time to 19 GB RAM
> 
> Actual results:
> 
> Server memory got to 100% utilization
> 
> Expected results:
> 
> Server memory should not exceed 100%

Hi Guy, do you know by any chance if this happens also with older 4.18.z releases?

Thanks,

Comment 10 guy chen 2016-08-08 14:03:43 UTC
We do not have a 4.18.z comparison, but the attached graphs show that when we ran the same load a month ago on version 4.0.19, it did not show the memory leak.

Comment 11 Francesco Romani 2016-08-08 14:39:45 UTC
(In reply to guy chen from comment #10)
> we do not have 4.18.z comparison but we see by the attached graph's at
> version 4.0.19 we did a month ago the same load and it didn't show the
> memory leak.

Thanks Guy, this is valuable information.
Only one more question, for my understanding: as far as I understood from the graphs, oVirt 4.0.1 behaved correctly with respect to memory consumption. Could you please confirm this?

If I am right, then we can also say that Vdsm 4.18.4 behaved OK, because we shipped that version in oVirt 4.0.1.

So the faulty change should be somewhere between Vdsm 4.18.4 and 4.18.6.

Comment 12 guy chen 2016-08-09 08:14:42 UTC
Yes, that is correct

Comment 13 Francesco Romani 2016-08-09 08:29:43 UTC
Thanks Guy. Taking the bug to do further investigation. We will dispatch it to another vertical should we discover this is not a virt issue.

Comment 14 Francesco Romani 2016-08-09 12:19:00 UTC
Confirmed with a (shorter) independent check that resource usage of Vdsm 4.18.4.1 does *NOT* grow constantly over time.

So the faulty change definitely lies between 4.18.4.1 and 4.18.6.

Comment 15 Michal Skrivanek 2016-08-09 14:48:58 UTC
we confirmed a memory leak, likely introduced by https://gerrit.ovirt.org/#/c/59512/

Comment 16 Francesco Romani 2016-08-09 15:07:59 UTC
(In reply to Michal Skrivanek from comment #15)
> we confirmed a memory leak, likely introduced by
> https://gerrit.ovirt.org/#/c/59512/

To see why it could cause a leak, please read
https://github.com/farcepest/MySQLdb1/issues/108
It is a completely unrelated module that Vdsm doesn't use, but the scenario and the flow are the same.

Crosscheck with the code:
https://hg.python.org/cpython/file/2.7/Lib/warnings.py#l233
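
To illustrate the mechanism behind those references: in CPython 2.7, warn_explicit() records a (text, category, lineno) key in the calling module's __warningregistry__ even when the matching filter action is "ignore" (the "always" action, by contrast, does not touch the registry), so a warning whose message text varies on every call grows that dict without bound. A minimal standalone sketch, not VDSM code:

# Standalone demonstration (not VDSM code) of the per-module warning
# registry growing when ignored warnings carry a unique message each time.
import warnings

warnings.simplefilter("ignore", DeprecationWarning)

def deprecated_call(i):
    # The message text changes on every call, so each call adds a new
    # (text, category, lineno) key to this module's __warningregistry__.
    warnings.warn("call %d uses a deprecated API" % i, DeprecationWarning)

for i in range(1000):
    deprecated_call(i)

# On Python 2.7 this prints 1000: one registry entry per unique message,
# even though nothing was ever logged or shown.
print(len(__warningregistry__))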


For definitive proof, I'm about to run v4.18.6 without that patch.

Comment 17 Francesco Romani 2016-08-10 09:09:04 UTC
(In reply to Francesco Romani from comment #16)
> (In reply to Michal Skrivanek from comment #15)
> > we confirmed a memory leak, likely introduced by
> > https://gerrit.ovirt.org/#/c/59512/
> 
> To see the reason because it could cause a leak, please read
> https://github.com/farcepest/MySQLdb1/issues/108
> It is completely unrelated module Vdsm doesn't use, but the scenario is the
> same and the flow is the same.
> 
> Crosscheck with the code:
> https://hg.python.org/cpython/file/2.7/Lib/warnings.py#l233
> 
> 
> To have the definitive proof, I'm about to run v4.18.6 without that patch

Confirmed. A special version of 4.18.6, just without patch 59512, successfully ran for ~10 hours with stable resource consumption, like 4.18.4 did.

We are now working on a mitigation patch and on a later, definitive fix.

There is one workaround available: actually enabling the warnings by adding

  [devel]
  python_warnings_enable = true

in /etc/vdsm/vdsm.conf makes the leak disappear, but it also makes the logs much more verbose.
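
Independent of what the mitigation patch ends up doing, one generic way to bound this kind of growth is to periodically drop the per-module registries; a minimal sketch, not the actual VDSM patch:

# Generic mitigation sketch (an assumption, not the actual VDSM patch):
# periodically clear every module's __warningregistry__ so that ignored
# warnings with unique messages cannot accumulate forever.
import sys
import threading

def clear_warning_registries(interval=300.0):
    for module in list(sys.modules.values()):
        registry = getattr(module, "__warningregistry__", None)
        if registry:
            registry.clear()
    timer = threading.Timer(interval, clear_warning_registries, args=(interval,))
    timer.daemon = True
    timer.start()

clear_warning_registries()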

Comment 18 Francesco Romani 2016-08-10 09:25:10 UTC
A better mitigation patch, 62186, has been posted.

Comment 19 Francesco Romani 2016-08-10 14:04:56 UTC
Nope, still POST: I need to backport the fix to ovirt-4.0.

Comment 20 Francesco Romani 2016-08-10 14:26:04 UTC
nope, we need https://gerrit.ovirt.org/#/c/62198/

Comment 21 Francesco Romani 2016-08-10 15:38:53 UTC
If we manage to ship this fix in 4.0.2, as planned, then we don't need doc_text, because this bug will affect only unreleased software; the documentation provided in the comments above is sufficient.

Comment 22 Francesco Romani 2016-08-18 09:08:59 UTC
The bug was introduced in Vdsm 4.18.5.1 and fixed in Vdsm 4.18.11.

Please note that no released version should have this bug, only betas and RCs.
Vdsm 4.18.4 was released in oVirt 4.0.1 final.
Vdsm 4.18.11 was released in oVirt 4.0.2 final.

Comment 23 guy chen 2016-08-19 07:07:42 UTC
Tested on 4.18.11 and verified fixed.
VDSM memory after 20 hours is 330 MB, whereas before the fix it was about 1.5 GB after 20 hours.