Created attachment 941828 [details]
pmap -x of VDSM, run every 6 hrs on both hosts

Description of problem:
The vdsm process memory size grows constantly, by about 91.5 MB per 6 hrs. If unchecked, this will cause a breakdown of the host and guests (OOM).

Version-Release number of selected component (if applicable):
oVirt 3.4.4 Hosted Engine HA Node
Hosts: CentOS 6.5 - 2.6.32-431.29.2.el6.x86_64 #1 SMP Tue Sep 9 21:36:05 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
vdsm-4.14.17-0.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1. Add a host to oVirt
2. Start a VM on the host
3. Note the VDSM RES size
4. Wait a few days
5. Compare the VDSM RES size

Actual results:
The VDSM process grows in memory size, eating up all the host's memory, causing OOM conditions and ultimately a complete breakdown of the host.

Expected results:
VDSM memory size behaving more civilized, i.e. staying stable.

Workaround:
Restart the VDSM service (via a weekly cron script).

Additional info:
IIRC this was introduced with oVirt 3.4.2.

If it helps, I sampled the vdsm process with

  pmap -x $VDSM_PID

every six hours (a sketch of the sampling loop follows at the end of this report). Find the dull output attached.

grep -w '2014\|total' vdsm-sample-nodehv0*
vdsm-sample-nodehv01.lab.mbox.loc:Fr 26. Sep 15:23:14 CEST 2014
vdsm-sample-nodehv01.lab.mbox.loc:total kB  4442432  836276  826000
vdsm-sample-nodehv01.lab.mbox.loc:Fr 26. Sep 21:23:14 CEST 2014
vdsm-sample-nodehv01.lab.mbox.loc:total kB  4639040  927756  917480
vdsm-sample-nodehv01.lab.mbox.loc:Sa 27. Sep 03:23:14 CEST 2014
vdsm-sample-nodehv01.lab.mbox.loc:total kB  4770112 1019288 1009012
vdsm-sample-nodehv01.lab.mbox.loc:Sa 27. Sep 09:23:14 CEST 2014
vdsm-sample-nodehv01.lab.mbox.loc:total kB  4770112 1110808 1100532
vdsm-sample-nodehv02.lab.mbox.loc:Fr 26. Sep 15:21:56 CEST 2014
vdsm-sample-nodehv02.lab.mbox.loc:total kB  4777292  805392  795148
vdsm-sample-nodehv02.lab.mbox.loc:Fr 26. Sep 21:21:56 CEST 2014
vdsm-sample-nodehv02.lab.mbox.loc:total kB  4908364  900368  890124
vdsm-sample-nodehv02.lab.mbox.loc:Sa 27. Sep 03:21:56 CEST 2014
vdsm-sample-nodehv02.lab.mbox.loc:total kB  5039436  995936  985692
vdsm-sample-nodehv02.lab.mbox.loc:Sa 27. Sep 09:21:56 CEST 2014
vdsm-sample-nodehv02.lab.mbox.loc:total kB  5104972 1088712 1078468

VDSM and related rpms:

rpm -qa | grep vdsm
vdsm-hook-vmdisk-4.14.11.2-0.el6.noarch
vdsm-hook-sriov-4.14.11.2-0.el6.noarch
vdsm-hook-promisc-4.14.11.2-0.el6.noarch
vdsm-hook-smbios-4.14.11.2-0.el6.noarch
vdsm-hook-scratchpad-4.14.11.2-0.el6.noarch
vdsm-xmlrpc-4.14.17-0.el6.noarch
vdsm-cli-4.14.17-0.el6.noarch
vdsm-hook-vmfex-4.14.11.2-0.el6.noarch
vdsm-hook-hostusb-4.14.11.2-0.el6.noarch
vdsm-hook-pincpu-4.14.11.2-0.el6.noarch
vdsm-hook-faqemu-4.14.11.2-0.el6.noarch
vdsm-hook-isolatedprivatevlan-4.14.11.2-0.el6.noarch
vdsm-hook-checkimages-4.14.11.2-0.el6.noarch
vdsm-hook-directlun-4.14.11.2-0.el6.noarch
vdsm-hook-openstacknet-4.14.11.2-0.el6.noarch
vdsm-hook-macspoof-4.14.11.2-0.el6.noarch
vdsm-hook-fileinject-4.14.11.2-0.el6.noarch
vdsm-python-4.14.17-0.el6.x86_64
vdsm-4.14.17-0.el6.x86_64
vdsm-hook-vmfex-dev-4.14.17-0.el6.noarch
vdsm-hook-qos-4.14.11.2-0.el6.noarch
vdsm-hook-hugepages-4.14.11.2-0.el6.noarch
vdsm-hook-qemucmdline-4.14.11.2-0.el6.noarch
vdsm-hook-floppy-4.14.11.2-0.el6.noarch
vdsm-hook-numa-4.14.11.2-0.el6.noarch
vdsm-python-zombiereaper-4.14.17-0.el6.noarch
vdsm-hook-extnet-4.14.17-0.el6.noarch
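For reference, the sampling was nothing more than a loop around pmap; a minimal sketch follows. The PID lookup via pgrep and the log file name are assumptions for illustration, any way of obtaining the vdsm PID works:

  #!/bin/sh
  # sample the VDSM memory map every six hours
  VDSM_PID=$(pgrep -f /usr/share/vdsm/vdsm | head -n1)
  while true; do
      date >> vdsm-sample-$(hostname).log
      pmap -x "$VDSM_PID" >> vdsm-sample-$(hostname).log
      sleep 21600   # 6 hours
  done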
Can you check if the leak goes on when Engine does not poll the host (i.e. iptables blocks the port)? Does it happen when there are no VMs running? Is the leak rate related to the number of VMs? Is it specific to one type of storage?
I'm very interested in this issue, so here is some additional information for implementing Dan's suggestions in https://bugzilla.redhat.com/show_bug.cgi?id=1147148#c1

* To disable polling from Engine, first set things up as needed (activate the host, run a VM or whatever), then either shut down Engine (brutal yet effective!) or block traffic to port 54321 on the hypervisor host running VDSM.

* You can test whether the leak is related to the amount of *sampling* done by VDSM, as opposed to the *polling* Engine does on VDSM to collect those samples (the two run at different rates), by tuning these options in /etc/vdsm/vdsm.conf (see the example fragment after this list):

  vm_sample_cpu_interval
  vm_sample_disk_interval
  vm_sample_disk_latency_interval
  vm_sample_net_interval
  vm_sample_balloon_interval

The value is the number of _seconds_ between polls: the lower, the more frequent the sampling. Do not use zero! Be aware that lowering these values may significantly impact the hypervisor host.
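A hypothetical /etc/vdsm/vdsm.conf fragment for experimenting with the sampling rates. The [vars] section name and the interval values below are illustrative assumptions, not the shipped defaults; check your installed vdsm.conf for the real ones:

  [vars]
  # seconds between VDSM-internal sampling runs; lower = more frequent
  vm_sample_cpu_interval = 60
  vm_sample_disk_interval = 60
  vm_sample_disk_latency_interval = 60
  vm_sample_net_interval = 60
  vm_sample_balloon_interval = 60

A restart of vdsmd is needed for the changes to take effect.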
Just a heads-up: running NO VMs, the VDSM process grows at a rate of ~ +15 MB/h.

Procedure:
1. Set the host to maintenance
2. Reboot the host
3. Activate the host
4. Start sampling

I will block port 54321 on one host now and provide additional info later on. Sorry, I need the engine in production and cannot disrupt it right now.
OK, after blocking Engine with iptables:

  iptables -I INPUT 1 -p tcp --destination-port 54321 -j REJECT

and Engine reporting the host as unresponsive, the vdsm process RSS size is NOT GROWING any more. To make sure, I will leave the sampling on overnight and report back tomorrow in case the RSS size is merely growing very slowly (but I expect not; it has sat there exactly the same, to the last byte, for the last three hours).
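For completeness, a sketch of how to undo the block and keep an eye on RSS directly; the rule spec matches the insert above, and $VDSM_PID stands for whatever method you use to find the vdsm PID:

  # remove the REJECT rule again once testing is done
  iptables -D INPUT -p tcp --destination-port 54321 -j REJECT

  # watch RSS straight from /proc, once a minute
  while true; do grep VmRSS /proc/$VDSM_PID/status; sleep 60; done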
After disabling SSL, as suggested by Dan (suspecting M2Crypto) with the help of Piotr, basically following the procedure at [1]:

On the engine:

  # sudo -u postgres psql -U postgres engine -c "update vdc_options set \
    option_value = 'false' where option_name = 'EncryptHostCommunication';"

and subsequently on the host, in vdsm.conf:

  ssl = false

in libvirtd.conf:

  listen_tcp = 1
  auth_tcp = "none"

and in qemu.conf:

  spice_tls = 0

followed by a restart of the engine as well as vdsm on the host (commands sketched below), the process seems NOT to grow significantly any more. However, a small increase is still being observed (~100 - 200 KiB/h). In particular it looks like this, RSS being the middle value:

HOST1 (running guests):
Di 30. Sep 13:18:43 CEST 2014  total kB  3731000  78540  68912
Di 30. Sep 14:18:43 CEST 2014  total kB  3731000  78708  69080
Di 30. Sep 15:18:43 CEST 2014  total kB  3731000  78776  69148
Di 30. Sep 16:18:43 CEST 2014  total kB  3731000  78808  69180
Di 30. Sep 17:18:43 CEST 2014  total kB  3731000  78840  69212

HOST2 (no guests running):
Di 30. Sep 12:08:54 CEST 2014  total kB  2994688  55596  46128
Di 30. Sep 13:08:54 CEST 2014  total kB  3277324  57112  47644
Di 30. Sep 14:08:54 CEST 2014  total kB  3277324  58132  48664
Di 30. Sep 15:08:54 CEST 2014  total kB  3277324  58268  48800
Di 30. Sep 16:08:54 CEST 2014  total kB  3277324  57572  48104
Di 30. Sep 17:08:54 CEST 2014  total kB  3277324  57708  48240

[1] http://www.ovirt.org/Developers_All_In_One
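For reference, the restarts on EL6 would be along these lines (a sketch; standard oVirt EL6 service names assumed):

  # on the engine machine
  service ovirt-engine restart

  # on the hypervisor host
  service libvirtd restart
  service vdsmd restart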
Could you specify the full version of your m2crypto.rpm? We'd need to find a light-weight reproducer of this leak, and clone this bug to m2crypto.
No problem:

  # rpm -qa | grep m2crypt
  m2crypto-0.20.2-9.el6.x86_64
What OS are you running on? CentOS? Fedora? Which version?
Please see above; the hosts are CentOS 6.5:

Hosts: CentOS 6.5 - 2.6.32-431.29.2.el6.x86_64 #1 SMP Tue Sep 9 21:36:05 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Engine runs on EL6, same kernel (if that should matter).

Please note, in the process of upgrading to oVirt 3.5 I have also upgraded the hosts to vdsm-4.16.7. However, there I have the same issue [1]; I suspect this might also be related to m2crypto?

[1] BZ1154624
Just some additional information from an oVirt 3.4 installation with FC20 hypervisors:

VDSM was started 8 weeks ago:

  ps -ef | grep 2773 | grep -v remote
  vdsm  2773  1  11 Sep01 ?  6-09:39:18 /usr/bin/python /usr/share/vdsm/vdsm

Memory consumption is 6 GB - correct me if I'm wrong.

  pmap -x 2773 | sort -n -k 2
  ----------------  -------  -------  -------
  2773:   /usr/bin/python /usr/share/vdsm/vdsm
  Address           Kbytes      RSS    Dirty Mode   Mapping
  total kB          6865196   180592   159980
  ...
  00007f5518021000    65404        0        0 ----- [ anon ]
  00007f551c021000    65404        0        0 ----- [ anon ]
  00007f5524021000    65404        0        0 ----- [ anon ]
  00007f5530021000    65404        0        0 ----- [ anon ]
  00007f5538021000    65404        0        0 ----- [ anon ]
  00007f553c021000    65404        0        0 ----- [ anon ]
  00007f5590021000    65404        0        0 ----- [ anon ]
  00007f55d8021000    65404        0        0 ----- [ anon ]
  00007f5608021000    65404        0        0 ----- [ anon ]
  00007f5610021000    65404        0        0 ----- [ anon ]
  00007f5618021000    65404        0        0 ----- [ anon ]
  00007f561c021000    65404        0        0 ----- [ anon ]
  00007f5624021000    65404        0        0 ----- [ anon ]
  00007f5634021000    65404        0        0 ----- [ anon ]
  00007f565c021000    65404        0        0 ----- [ anon ]
  00007f566c021000    65404        0        0 ----- [ anon ]
  00007f5674021000    65404        0        0 ----- [ anon ]
  00007f5680021000    65404        0        0 ----- [ anon ]

vdsm is 4.14.11.2:

  yum list installed | grep vdsm
  vdsm.x86_64                      4.14.11.2-0.fc20  @ovirt-3.4-stable
  vdsm-cli.noarch                  4.14.11.2-0.fc20  @ovirt-3.4-stable
  vdsm-python.x86_64               4.14.11.2-0.fc20  @ovirt-3.4-stable
  vdsm-python-zombiereaper.noarch  4.14.11.2-0.fc20  @ovirt-3.4-stable
  vdsm-xmlrpc.noarch               4.14.11.2-0.fc20  @ovirt-3.4-stable

and m2crypto is 0.21.1-13:

  yum list installed | grep crypto
  m2crypto.x86_64  0.21.1-13.fc20  @updates

That would make 5 MB/hour.

@Daniel: Did I get it right that the upgrade did not make the memory leak worse?
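Side note: those identical 65404 kB '-----' (no-access) anonymous mappings carry zero RSS, so they inflate the virtual size without actually consuming memory; they look like reserved glibc malloc arenas. A quick sketch for counting them, assuming the pmap -x column layout shown above:

  pmap -x 2773 | awk '$2 == 65404 && $5 == "-----"' | wc -l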
Grmpf... I just realized you are speaking about RSS. At only 180 MB, that is negligible in our environment after 8 weeks of uptime.
No problem, Markus - and thanks! Knowing that m2crypto on FC20 does not misbehave is a step forward. I went ahead, compiled the latest FC20 m2crypto SRPM, and installed it on one of my hosts. I will report back as soon as there is any news.
*** Bug 1154624 has been marked as a duplicate of this bug. ***
I am happy to report the issue got fixed in m2crypto somewhere on the way between m2crypto-0.20.2 and m2crypto-0.21.1.

I think it may be advisable to file this against the RHEL6 m2crypto package, as it seems not to be oVirt's fault. However, I have no subscription and cannot check the version in the RH repos. In the meantime, it might be advisable to add m2crypto-0.21.1-12.el6.x86_64.rpm to the oVirt 3.4 / 3.5 EL6 repos.

Sampling was done hourly in the same timeframe; Engine was polling the hosts via SSL, and the same number of VMs was running on both hosts.

Host A, running m2crypto-0.21.1-12.el6.x86_64.rpm:

            Kbytes      RSS    Dirty
  total kB  2188076    46400    36444
  total kB  4182932    68536    58420
  total kB  4182932    68852    58736
  total kB  4182932    69120    59004
  total kB  4182932    69244    59128
  total kB  4248468    70088    59972
  total kB  4248468    70312    60196
  total kB  4248468    70712    60596
  total kB  4248468    70908    60792
  total kB  4248468    71176    61060
  total kB  4248468    71392    61276
  total kB  4248468    71636    61520
  total kB  4248468    72000    61884

Host B, running CentOS 6.5 m2crypto-0.20.2-9.el6.x86_64:

            Kbytes      RSS    Dirty
  total kB  4355928    82944    72840
  total kB  4552536    99084    88980
  total kB  4552536   111532   101428
  total kB  4552536   125024   114920
  total kB  4552536   136944   126840
  total kB  4552536   149496   139392
  total kB  4618072   163068   152964
  total kB  4618072   175624   165520
  total kB  4618072   187192   177088
  total kB  4618072   199676   189572
  total kB  4618072   212248   202144
  total kB  4683608   224696   214592
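To turn samples like these into a growth rate, a quick sketch; it assumes the hourly 'total kB' lines were collected into a hypothetical samples.txt, with RSS as the fourth field as in the tables above:

  grep 'total kB' samples.txt | \
      awk '{rss[NR] = $4} END {printf "%.1f MB/h\n", (rss[NR]-rss[1])/(NR-1)/1024}'

For Host B above this yields roughly (224696 - 82944) / 11 / 1024 ≈ 12.6 MB/h.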
I can confirm this bug already existed in oVirt 3.3.2. Here is an example ps output, RSS memory being in column number 6:

  vdsm  7101 15.7  0.3 14043004 844796 ?  S<l  Jun17 30162:41 \_ /usr/bin/python /usr/share/vdsm/vdsm --pidfile /var/run/vdsm/vdsmd.pid

that being roughly 840 MB!

Versions:

  rpm -q m2crypto
  m2crypto-0.20.2-9.el6.x86_64
  rpm -q vdsm
  vdsm-4.13.3-3.el6.x86_64

I hadn't noticed this yet, as I have many VMs consuming way more RAM than those 840 MB and the machine has plenty of RAM, so I never looked into this detail.

Thanks to Daniel for pointing this bug out; I hope it gets fixed ASAP.
(In reply to Sven Kieske from comment #15)
> I can confirm this bug already existed in ovirt 3.3.2

So it is at least in some way oVirt-related. But Sven, I think you are encountering a different, unfixed leak. As you can see from my one-hour samples above, even with the 'new' m2crypto, VDSM still grows slowly, at about 300 KiB/h. A quick calculation (133 [days since June 17th] * 24 * 0.3 MiB = 957.6) gets me roughly to your 840 MiB RSS.

The problem (at least for me) got much more severe in oVirt 3.4.2+. While 840 MiB accumulated since June 17th might be negligible, since then the process has been growing at a rate of 15 MiB/h, as you can see in my samples: after only 11 hrs the RSS size was already 224 MiB.

> Thanks to Daniel for pointing this bug out, I hope it get's fixed asap.

Any time!
Created attachment 951432 [details]
RPM which fixed the issue for me, built from the FC20 SRPM.
Contrary to my previous comment, the issue still persists for me, though the growth is slightly lower (13 MB/h now). I spent the last two days trying to recreate the situation in which I was not experiencing this issue, so as to report something useful at least; so far without any success. I will update this BZ as soon as I have more news, and I suppose this bug should be re-targeted against VDSM again. I am really sorry for the confusion; it is easy to jump to conclusions when they fit one's hopes.
Dan - seems like the m2crypto removal work will fix that, right?
Of course. In a Gordian-knot kind of "fix".
This is an automated message. oVirt 3.6.0 RC1 has been released. This bug has no target release and still has its target milestone set to 3.6.0-rc. Please review this bug and set the target milestone and release to one of the next releases.
Yaniv, I can see that you removed the target release. Can you please set it back?
Piotr - the target release is now the vdsm tag which will include the patch; it will probably be part of 4.17.10. Once the patch gets in, you need to update that.
OK, the code change is in 4.17.11; in any case, no leaks seen.
According to verification status and target milestone this issue should be fixed in oVirt 3.6.1. Closing current release.