Created attachment 941828 [details]
pmap -x of VDSM, run every 6 hrs on both hosts

Description of problem:
The vdsm process memory size grows constantly, by about 91.5 MB per 6 hrs. If unchecked, this will cause a breakdown of the host and guests (OOM).

Version-Release number of selected component (if applicable):
oVirt 3.4.4 Hosted Engine HA Node
Hosts: CentOS 6.5 - 2.6.32-431.29.2.el6.x86_64 #1 SMP Tue Sep 9 21:36:05 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
vdsm-4.14.17-0.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1. Add a host to oVirt
2. Start a VM on the host
3. Note the VDSM RES size
4. Wait a few days
5. Compare the VDSM RES size

Actual results:
The VDSM process grows in memory size, eating up all the host's memory, causing OOM conditions and ultimately a complete breakdown of the host.

Expected results:
VDSM memory size behaving more civilized, i.e. staying stable.

Workaround:
Restart the VDSM service (via a weekly cron script).

Additional info:
IIRC this was introduced with oVirt 3.4.2.

If it helps, I sampled the vdsm process with

  pmap -x $VDSM_PID

every six hours (a sketch of the sampling loop follows at the end of this report). Find the dull output attached.

grep -w '2014\|total' vdsm-sample-nodehv0*
vdsm-sample-nodehv01.lab.mbox.loc:Fr 26. Sep 15:23:14 CEST 2014
vdsm-sample-nodehv01.lab.mbox.loc:total kB  4442432  836276  826000
vdsm-sample-nodehv01.lab.mbox.loc:Fr 26. Sep 21:23:14 CEST 2014
vdsm-sample-nodehv01.lab.mbox.loc:total kB  4639040  927756  917480
vdsm-sample-nodehv01.lab.mbox.loc:Sa 27. Sep 03:23:14 CEST 2014
vdsm-sample-nodehv01.lab.mbox.loc:total kB  4770112 1019288 1009012
vdsm-sample-nodehv01.lab.mbox.loc:Sa 27. Sep 09:23:14 CEST 2014
vdsm-sample-nodehv01.lab.mbox.loc:total kB  4770112 1110808 1100532
vdsm-sample-nodehv02.lab.mbox.loc:Fr 26. Sep 15:21:56 CEST 2014
vdsm-sample-nodehv02.lab.mbox.loc:total kB  4777292  805392  795148
vdsm-sample-nodehv02.lab.mbox.loc:Fr 26. Sep 21:21:56 CEST 2014
vdsm-sample-nodehv02.lab.mbox.loc:total kB  4908364  900368  890124
vdsm-sample-nodehv02.lab.mbox.loc:Sa 27. Sep 03:21:56 CEST 2014
vdsm-sample-nodehv02.lab.mbox.loc:total kB  5039436  995936  985692
vdsm-sample-nodehv02.lab.mbox.loc:Sa 27. Sep 09:21:56 CEST 2014
vdsm-sample-nodehv02.lab.mbox.loc:total kB  5104972 1088712 1078468

VDSM and related rpms:

rpm -qa | grep vdsm
vdsm-hook-vmdisk-4.14.11.2-0.el6.noarch
vdsm-hook-sriov-4.14.11.2-0.el6.noarch
vdsm-hook-promisc-4.14.11.2-0.el6.noarch
vdsm-hook-smbios-4.14.11.2-0.el6.noarch
vdsm-hook-scratchpad-4.14.11.2-0.el6.noarch
vdsm-xmlrpc-4.14.17-0.el6.noarch
vdsm-cli-4.14.17-0.el6.noarch
vdsm-hook-vmfex-4.14.11.2-0.el6.noarch
vdsm-hook-hostusb-4.14.11.2-0.el6.noarch
vdsm-hook-pincpu-4.14.11.2-0.el6.noarch
vdsm-hook-faqemu-4.14.11.2-0.el6.noarch
vdsm-hook-isolatedprivatevlan-4.14.11.2-0.el6.noarch
vdsm-hook-checkimages-4.14.11.2-0.el6.noarch
vdsm-hook-directlun-4.14.11.2-0.el6.noarch
vdsm-hook-openstacknet-4.14.11.2-0.el6.noarch
vdsm-hook-macspoof-4.14.11.2-0.el6.noarch
vdsm-hook-fileinject-4.14.11.2-0.el6.noarch
vdsm-python-4.14.17-0.el6.x86_64
vdsm-4.14.17-0.el6.x86_64
vdsm-hook-vmfex-dev-4.14.17-0.el6.noarch
vdsm-hook-qos-4.14.11.2-0.el6.noarch
vdsm-hook-hugepages-4.14.11.2-0.el6.noarch
vdsm-hook-qemucmdline-4.14.11.2-0.el6.noarch
vdsm-hook-floppy-4.14.11.2-0.el6.noarch
vdsm-hook-numa-4.14.11.2-0.el6.noarch
vdsm-python-zombiereaper-4.14.17-0.el6.noarch
vdsm-hook-extnet-4.14.17-0.el6.noarch
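For reference, the sampling was nothing more than a loop around pmap; a minimal sketch follows. The PID lookup via pgrep and the log file name are assumptions for illustration, any way of obtaining the vdsm PID works:

  #!/bin/sh
  # sample the VDSM memory map every six hours
  VDSM_PID=$(pgrep -f /usr/share/vdsm/vdsm | head -n1)
  while true; do
      date >> vdsm-sample-$(hostname).log
      pmap -x "$VDSM_PID" >> vdsm-sample-$(hostname).log
      sleep 21600   # 6 hours
  done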
Can you check if the leak goes on when Engine does not poll the host (i.e. iptables blocks the port)? Does it happen when there are no VMs running? Is the leak rate related to the number of VMs? Is it specific to one type of storage?
I'm very interested in this issue, so here is some additional information for implementing Dan's suggestions in https://bugzilla.redhat.com/show_bug.cgi?id=1147148#c1

* To disable polling from Engine, first set things up as needed (activate the host, run a VM or whatever), then either shut down Engine (brutal yet effective!) or block traffic to port 54321 on the hypervisor host running VDSM.

* You can test whether the leak is related to the amount of *sampling* done by VDSM, as opposed to the *polling* Engine does on VDSM to collect those samples (the two run at different rates), by tuning these options in /etc/vdsm/vdsm.conf (see the example fragment after this list):

  vm_sample_cpu_interval
  vm_sample_disk_interval
  vm_sample_disk_latency_interval
  vm_sample_net_interval
  vm_sample_balloon_interval

The value is the number of _seconds_ between polls: the lower, the more frequent the sampling. Do not use zero! Be aware that lowering these values may significantly impact the hypervisor host.
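A hypothetical /etc/vdsm/vdsm.conf fragment for experimenting with the sampling rates. The [vars] section name and the interval values below are illustrative assumptions, not the shipped defaults; check your installed vdsm.conf for the real ones:

  [vars]
  # seconds between VDSM-internal sampling runs; lower = more frequent
  vm_sample_cpu_interval = 60
  vm_sample_disk_interval = 60
  vm_sample_disk_latency_interval = 60
  vm_sample_net_interval = 60
  vm_sample_balloon_interval = 60

A restart of vdsmd is needed for the changes to take effect.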
Just a heads-up: running NO VMs, the VDSM process grows at a rate of ~ +15 MB/h.

Procedure:
1. Set the host to maintenance
2. Reboot the host
3. Activate the host
4. Start sampling

I will block port 54321 on one host now and provide additional info later on. Sorry, I need the engine in production and cannot disrupt it right now.
OK, after blocking Engine with iptables:

  iptables -I INPUT 1 -p tcp --destination-port 54321 -j REJECT

and Engine reporting the host as unresponsive, the vdsm process RSS size is NOT GROWING any more. To make sure, I will leave the sampling on overnight and report back tomorrow in case the RSS size is merely growing very slowly (but I expect not; it has sat there exactly the same, to the last byte, for the last three hours).
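For completeness, a sketch of how to undo the block and keep an eye on RSS directly; the rule spec matches the insert above, and $VDSM_PID stands for whatever method you use to find the vdsm PID:

  # remove the REJECT rule again once testing is done
  iptables -D INPUT -p tcp --destination-port 54321 -j REJECT

  # watch RSS straight from /proc, once a minute
  while true; do grep VmRSS /proc/$VDSM_PID/status; sleep 60; done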
After disabling SSL, as suggested by Dan (suspecting M2Crypto) with the help of Piotr, basically following the procedure at [1]:

On the engine:

  # sudo -u postgres psql -U postgres engine -c "update vdc_options set \
    option_value = 'false' where option_name = 'EncryptHostCommunication';"

and subsequently on the host, in vdsm.conf:

  ssl = false

in libvirtd.conf:

  listen_tcp = 1
  auth_tcp = "none"

and in qemu.conf:

  spice_tls = 0

followed by a restart of the engine as well as vdsm on the host (commands sketched below), the process seems NOT to grow significantly any more. However, a small increase is still being observed (~100 - 200 KiB/h). In particular it looks like this, RSS being the middle value:

HOST1 (running guests):
Di 30. Sep 13:18:43 CEST 2014  total kB  3731000  78540  68912
Di 30. Sep 14:18:43 CEST 2014  total kB  3731000  78708  69080
Di 30. Sep 15:18:43 CEST 2014  total kB  3731000  78776  69148
Di 30. Sep 16:18:43 CEST 2014  total kB  3731000  78808  69180
Di 30. Sep 17:18:43 CEST 2014  total kB  3731000  78840  69212

HOST2 (no guests running):
Di 30. Sep 12:08:54 CEST 2014  total kB  2994688  55596  46128
Di 30. Sep 13:08:54 CEST 2014  total kB  3277324  57112  47644
Di 30. Sep 14:08:54 CEST 2014  total kB  3277324  58132  48664
Di 30. Sep 15:08:54 CEST 2014  total kB  3277324  58268  48800
Di 30. Sep 16:08:54 CEST 2014  total kB  3277324  57572  48104
Di 30. Sep 17:08:54 CEST 2014  total kB  3277324  57708  48240

[1] http://www.ovirt.org/Developers_All_In_One
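For reference, the restarts on EL6 would be along these lines (a sketch; standard oVirt EL6 service names assumed):

  # on the engine machine
  service ovirt-engine restart

  # on the hypervisor host
  service libvirtd restart
  service vdsmd restart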
Could you specify the full version of your m2crypto.rpm? We'd need to find a light-weight reproducer of this leak, and clone this bug to m2crypto.
No problem:

  # rpm -qa | grep m2crypt
  m2crypto-0.20.2-9.el6.x86_64
What OS are you running on? CentOS? Fedora? Which version?
Please see above; the hosts are CentOS 6.5:

Hosts: CentOS 6.5 - 2.6.32-431.29.2.el6.x86_64 #1 SMP Tue Sep 9 21:36:05 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Engine runs on EL6, same kernel (if that should matter).

Please note, in the process of upgrading to oVirt 3.5 I have also upgraded the hosts to vdsm-4.16.7. However, there I have the same issue [1]; I suspect this might also be related to m2crypto?

[1] BZ1154624
Just some additional information from an oVirt 3.4 installation with FC20 hypervisors:

VDSM was started 8 weeks ago:

  ps -ef | grep 2773 | grep -v remote
  vdsm  2773  1  11 Sep01 ?  6-09:39:18 /usr/bin/python /usr/share/vdsm/vdsm

Memory consumption is 6 GB - correct me if I'm wrong.

  pmap -x 2773 | sort -n -k 2
  ----------------  -------  -------  -------
  2773:   /usr/bin/python /usr/share/vdsm/vdsm
  Address           Kbytes      RSS    Dirty Mode   Mapping
  total kB          6865196   180592   159980
  ...
  00007f5518021000    65404        0        0 ----- [ anon ]
  00007f551c021000    65404        0        0 ----- [ anon ]
  00007f5524021000    65404        0        0 ----- [ anon ]
  00007f5530021000    65404        0        0 ----- [ anon ]
  00007f5538021000    65404        0        0 ----- [ anon ]
  00007f553c021000    65404        0        0 ----- [ anon ]
  00007f5590021000    65404        0        0 ----- [ anon ]
  00007f55d8021000    65404        0        0 ----- [ anon ]
  00007f5608021000    65404        0        0 ----- [ anon ]
  00007f5610021000    65404        0        0 ----- [ anon ]
  00007f5618021000    65404        0        0 ----- [ anon ]
  00007f561c021000    65404        0        0 ----- [ anon ]
  00007f5624021000    65404        0        0 ----- [ anon ]
  00007f5634021000    65404        0        0 ----- [ anon ]
  00007f565c021000    65404        0        0 ----- [ anon ]
  00007f566c021000    65404        0        0 ----- [ anon ]
  00007f5674021000    65404        0        0 ----- [ anon ]
  00007f5680021000    65404        0        0 ----- [ anon ]

vdsm is 4.14.11.2:

  yum list installed | grep vdsm
  vdsm.x86_64                      4.14.11.2-0.fc20  @ovirt-3.4-stable
  vdsm-cli.noarch                  4.14.11.2-0.fc20  @ovirt-3.4-stable
  vdsm-python.x86_64               4.14.11.2-0.fc20  @ovirt-3.4-stable
  vdsm-python-zombiereaper.noarch  4.14.11.2-0.fc20  @ovirt-3.4-stable
  vdsm-xmlrpc.noarch               4.14.11.2-0.fc20  @ovirt-3.4-stable

and m2crypto is 0.21.1-13:

  yum list installed | grep crypto
  m2crypto.x86_64  0.21.1-13.fc20  @updates

That would make 5 MB/hour.

@Daniel: Did I get it right that the upgrade did not make the memory leak worse?
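Side note: those identical 65404 kB '-----' (no-access) anonymous mappings carry zero RSS, so they inflate the virtual size without actually consuming memory; they look like reserved glibc malloc arenas. A quick sketch for counting them, assuming the pmap -x column layout shown above:

  pmap -x 2773 | awk '$2 == 65404 && $5 == "-----"' | wc -l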
Grmpf... I just realized you are speaking about RSS. At only 180 MB, that is negligible in our environment after 8 weeks of uptime.
No problem, Markus - and thanks! Knowing that m2crypto on FC20 does not misbehave is a step forward. I went ahead, compiled the latest FC20 m2crypto SRPM, and installed it on one of my hosts. I will report back as soon as there is any news.
*** Bug 1154624 has been marked as a duplicate of this bug. ***
I am happy to report the issue got fixed in m2crypto somewhere on the way between m2crypto-0.20.2 and m2crypto-0.21.1.

I think it may be advisable to file this against the RHEL6 m2crypto package, as it seems not to be oVirt's fault. However, I have no subscription and cannot check the version in the RH repos. In the meantime, it might be advisable to add m2crypto-0.21.1-12.el6.x86_64.rpm to the oVirt 3.4 / 3.5 EL6 repos.

Sampling was done hourly in the same timeframe; Engine was polling the hosts via SSL, and the same number of VMs was running on both hosts.

Host A, running m2crypto-0.21.1-12.el6.x86_64.rpm:

            Kbytes      RSS    Dirty
  total kB  2188076    46400    36444
  total kB  4182932    68536    58420
  total kB  4182932    68852    58736
  total kB  4182932    69120    59004
  total kB  4182932    69244    59128
  total kB  4248468    70088    59972
  total kB  4248468    70312    60196
  total kB  4248468    70712    60596
  total kB  4248468    70908    60792
  total kB  4248468    71176    61060
  total kB  4248468    71392    61276
  total kB  4248468    71636    61520
  total kB  4248468    72000    61884

Host B, running CentOS 6.5 m2crypto-0.20.2-9.el6.x86_64:

            Kbytes      RSS    Dirty
  total kB  4355928    82944    72840
  total kB  4552536    99084    88980
  total kB  4552536   111532   101428
  total kB  4552536   125024   114920
  total kB  4552536   136944   126840
  total kB  4552536   149496   139392
  total kB  4618072   163068   152964
  total kB  4618072   175624   165520
  total kB  4618072   187192   177088
  total kB  4618072   199676   189572
  total kB  4618072   212248   202144
  total kB  4683608   224696   214592
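To turn samples like these into a growth rate, a quick sketch; it assumes the hourly 'total kB' lines were collected into a hypothetical samples.txt, with RSS as the fourth field as in the tables above:

  grep 'total kB' samples.txt | \
      awk '{rss[NR] = $4} END {printf "%.1f MB/h\n", (rss[NR]-rss[1])/(NR-1)/1024}'

For Host B above this yields roughly (224696 - 82944) / 11 / 1024 ≈ 12.6 MB/h.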
I can confirm this bug already existed in oVirt 3.3.2. Here is an example ps output, RSS memory being in column number 6:

  vdsm  7101 15.7  0.3 14043004 844796 ?  S<l  Jun17 30162:41 \_ /usr/bin/python /usr/share/vdsm/vdsm --pidfile /var/run/vdsm/vdsmd.pid

that being roughly 840 MB!

Versions:

  rpm -q m2crypto
  m2crypto-0.20.2-9.el6.x86_64
  rpm -q vdsm
  vdsm-4.13.3-3.el6.x86_64

I hadn't noticed this yet, as I have many VMs consuming way more RAM than those 840 MB and the machine has plenty of RAM, so I never looked into this detail.

Thanks to Daniel for pointing this bug out; I hope it gets fixed ASAP.
(In reply to Sven Kieske from comment #15)
> I can confirm this bug already existed in ovirt 3.3.2

So it is at least in some way oVirt-related. But Sven, I think you are encountering a different, unfixed leak. As you can see from my one-hour samples above, even with the 'new' m2crypto, VDSM still grows slowly, at about 300 KiB/h. A quick calculation (133 [days since June 17th] * 24 * 0.3 MiB = 957.6) gets me roughly to your 840 MiB RSS.

The problem (at least for me) got much more severe in oVirt 3.4.2+. While 840 MiB accumulated since June 17th might be negligible, since then the process has been growing at a rate of 15 MiB/h, as you can see in my samples: after only 11 hrs the RSS size was already 224 MiB.

> Thanks to Daniel for pointing this bug out, I hope it get's fixed asap.

Any time!
Created attachment 951432 [details]
RPM which fixed the issue for me, built from the FC20 SRPM.
Contrary to my previous comment, the issue still persists for me, though the growth is slightly lower (13 MB/h now). I spent the last two days trying to recreate the situation in which I was not experiencing this issue, so as to report something useful at least; so far without any success. I will update this BZ as soon as I have more news, and I suppose this bug should be re-targeted against VDSM again. I am really sorry for the confusion; it is easy to jump to conclusions when they fit one's hopes.
Dan - seems like the m2crypto removal work will fix that, right?
Of course. In a Gordian-knot kind of "fix".
This is an automated message. oVirt 3.6.0 RC1 has been released. This bug has no target release and still has its target milestone set to 3.6.0-rc. Please review this bug and set the target milestone and release to one of the next releases.
Yaniv, I can see that you removed the target release. Can you please set it back?
Piotr - the target release is now the vdsm tag which will include the patch; it will probably be part of 4.17.10. Once the patch gets in, you need to update that.
OK, the code change is in 4.17.11; in any case, no leaks seen.
According to verification status and target milestone this issue should be fixed in oVirt 3.6.1. Closing current release.