Description of problem: run the load with statistics open, the VDSM process memory raised over time and eventually got to 19 GB RAM usage (out of 64 GB RAM) , thus total server RAM utilization was 100% and SWAP trashing started to cause system failures ( VM's crashed etc` ). Version-Release number of selected component (if applicable): vdsm-python-4.18.6-1.el7ev.noarch vdsm-hook-vmfex-dev-4.18.6-1.el7ev.noarch vdsm-infra-4.18.6-1.el7ev.noarch vdsm-xmlrpc-4.18.6-1.el7ev.noarch vdsm-jsonrpc-4.18.6-1.el7ev.noarch vdsm-4.18.6-1.el7ev.x86_64 vdsm-cli-4.18.6-1.el7ev.noarch vdsm-yajsonrpc-4.18.6-1.el7ev.noarch vdsm-api-4.18.6-1.el7ev.noarch Steps to Reproduce: 1. started load creating 120 idle VMS on 1 host 2. VDSM process memory raised over time to 19 GB RAM Actual results: Server memory got to 100% utilization Expected results: Server memory should not exceed 100%
Hi Guy, What does "run the load with statistics open" mean? please provide the exact steps you did as part of the test. I can only assume that you meant that you configured metrics_enabled to true, if that's the case - in the version you use we send metrics only about host info and not about vms - so can't be that because of the metrics_enabled config you got the leak. In addition, when you run a vm you start qemu process which locates memory in the amount that you declare for the vm. currently I see in your system 111 qemu processes - each defined for 512mb ram ,means 50GB for all together - can be that this is the memory usage you refer to? If not, did you try to change the metrics_enable config to false and got different result?
I have configured metrics_enabled to true with statsd according to the steps provided at the blog https://bronhaim.wordpress.com/2016/06/26/ovirt-metrics/, at /etc/vdsm/vdsm.conf i have configured : [vars] ssl = true [addresses] management_port = 54321 [reports] # Enable metric collector (default: false) enabled = true # Address to remote metric collector (default: localhost) collector_address = bkr-hv01.dsal.lab.eng.rdu.redhat.com # Type of metric collector (supporting: stastd, hawkular) collector_type = statsd [devel] # Enable Vdsm health monitoring. health_monitor_enable = true # Number of seconds to wait between health checks. health_check_interval = 60 I have then created 111 VM's on 1 host and turned them on, and the system stayed up for 10 day's. After that period of time the VDSM process memory went up to 19 GB, not the qemu processes,but the VDSM itself. Then, on Thursday August 4 , i configured metrics enabled to false, by changing the below parameters to false and restart the vdsmd service. [reports] # Enable metric collector (default: false) enabled = false [devel] # Enable Vdsm health monitoring. health_monitor_enable = false Then again, i have created 111 VMS on 1 host and turned them on . After the VMS where up, the VDSM process took 430 MB of RAM. Now, 54 hours later, the VDSM takes 4.5 GB as seen below : PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 115737 vdsm 0 -20 10.833g 4.5g 11272 S 61.6 7.1 994:29.72 vdsm So, i suspect the memory leak is not due to metrics been enabled.
Created attachment 1188322 [details] 28 of June Active VMS vs. VDSM RAM 28 of June, load test of first 4.0 version, Active VMS vs. VDSM RAM utilization
Created attachment 1188323 [details] 24 of July Active VMS vs. VDSM RAM 24 of July, load test of 4.0.1.1-0.1.el7ev version, Active VMS vs. VDSM RAM utilization
I have extracted 2 comparisons graphs that are attached, one is from the load test from the 28 of June, and one is from load test from the weekend with the metrics disabled, we can see clearly that after the VMS are up, on the first load, the VDSM memory stabilizes around 350 MB, and on the second load, it continue to raises up to 1.25 GB.
Michal - can you team investigate? I know you've had fixes around this one in the not so long past. Perhaps something got broken?
(In reply to Oved Ourfali from comment #7) > Michal - can you team investigate? > I know you've had fixes around this one in the not so long past. > Perhaps something got broken? Last big thing I can remember is https://bugzilla.redhat.com/show_bug.cgi?id=1269424 But 4.18.6 has all the patches. So it could be a new thing. Does anyone know if this happens with _all_ the 4.18.z releases? Let's say, it also happens with 4.18.2?
(In reply to guy chen from comment #0) > Description of problem: > > run the load with statistics open, the VDSM process memory raised over time > and eventually got to 19 GB RAM usage (out of 64 GB RAM) , thus total server > RAM utilization was 100% and SWAP trashing started to cause system failures > ( VM's crashed etc` ). > > Version-Release number of selected component (if applicable): > > vdsm-python-4.18.6-1.el7ev.noarch > vdsm-hook-vmfex-dev-4.18.6-1.el7ev.noarch > vdsm-infra-4.18.6-1.el7ev.noarch > vdsm-xmlrpc-4.18.6-1.el7ev.noarch > vdsm-jsonrpc-4.18.6-1.el7ev.noarch > vdsm-4.18.6-1.el7ev.x86_64 > vdsm-cli-4.18.6-1.el7ev.noarch > vdsm-yajsonrpc-4.18.6-1.el7ev.noarch > vdsm-api-4.18.6-1.el7ev.noarch > > Steps to Reproduce: > 1. started load creating 120 idle VMS on 1 host > 2. VDSM process memory raised over time to 19 GB RAM > > Actual results: > > Server memory got to 100% utilization > > Expected results: > > Server memory should not exceed 100% Hi Guy, do you know by any chance if this happens also with older 4.18.z releases? Thanks,
we do not have 4.18.z comparison but we see by the attached graph's at version 4.0.19 we did a month ago the same load and it didn't show the memory leak.
(In reply to guy chen from comment #10) > we do not have 4.18.z comparison but we see by the attached graph's at > version 4.0.19 we did a month ago the same load and it didn't show the > memory leak. Thanks Guy, this is valuable information. Only one more question, for my understanding: as far as I understood from the graphs, oVirt 4.0.1 behaved correctly with respect to memory consumption, could you please confirm this? If I am right, then we could say also that Vdsm 4.18.4 behaved OK, because we shipped that version in oVirt 4.0.1. So the faulty change should be in between to Vdsm 4.18.4 and 4.18.6
Yes, that is correct
Thanks Guy. Taking the bug to do further investigation. Will dispatch to other vertical should we discover this is not virt issue.
Confirmed with (shorter) independent check that resource usage of Vdsm 4.18.4.1 doe *NOT* grow constantly over time. So the faulty change definitely lies in between 4.18.4.1 and 4.18.6.
we confirmed a memory leak, likely introduced by https://gerrit.ovirt.org/#/c/59512/
(In reply to Michal Skrivanek from comment #15) > we confirmed a memory leak, likely introduced by > https://gerrit.ovirt.org/#/c/59512/ To see the reason because it could cause a leak, please read https://github.com/farcepest/MySQLdb1/issues/108 It is completely unrelated module Vdsm doesn't use, but the scenario is the same and the flow is the same. Crosscheck with the code: https://hg.python.org/cpython/file/2.7/Lib/warnings.py#l233 To have the definitive proof, I'm about to run v4.18.6 without that patch
(In reply to Francesco Romani from comment #16) > (In reply to Michal Skrivanek from comment #15) > > we confirmed a memory leak, likely introduced by > > https://gerrit.ovirt.org/#/c/59512/ > > To see the reason because it could cause a leak, please read > https://github.com/farcepest/MySQLdb1/issues/108 > It is completely unrelated module Vdsm doesn't use, but the scenario is the > same and the flow is the same. > > Crosscheck with the code: > https://hg.python.org/cpython/file/2.7/Lib/warnings.py#l233 > > > To have the definitive proof, I'm about to run v4.18.6 without that patch Confirmed. A special version of 4.18.6, just without patch 59512, succesfully run for ~10 hrs with stable resource consumption, like 4.18.4 did. We are now working on a mitigation patch and on a later, definitive fix. There is one workaround available: actually enabling the warnings adding [devel] python_warnings_enable = true in /etc/vdsm/vdsm.conf makes the leak disappear, but also makes the logs much more verbose.
better mitigation patch 62186 posted
Nope, still POST: I need to backport the fix to ovirt-4.0.
nope, we need https://gerrit.ovirt.org/#/c/62198/
if we manage to ship this fix in 4.0.2, as it is planned, then we don't need doc_text, because this bug will affect only unreleased software, hence the documentation provided in the above comments is sufficient
The bug was introduced in Vdsm 4.18.5.1 and fixed in Vdsm 4.18.11 Please note that no released version should have this bug, only betas and RCs. Vdsm 4.18.4 was released in oVirt 4.0.1 final. Vdsm 4.18.11 was released in oVirt 4.0.2 final.
Tested on 4.18.11 and fixed. VDSM memory after 20 hours is 330 MB, while before the fix, after 20 hours it was about 1.5 GB.