| Summary: | Memory leak when starting and stopping a VM |
|---|---|
| Product: | [oVirt] vdsm |
| Component: | Core |
| Version: | 4.18.0 |
| Status: | CLOSED INSUFFICIENT_DATA |
| Severity: | medium |
| Priority: | unspecified |
| Reporter: | Nir Soffer <nsoffer> |
| Assignee: | Dan Kenigsberg <danken> |
| QA Contact: | sefi litmanovich <slitmano> |
| CC: | bugs, emarcian, fsayd, michal.skrivanek, mzamazal, pstehlik, s.kieske, slitmano, tjelinek |
| Flags: | pstehlik: testing_plan_complete?, rule-engine: planning_ack?, rule-engine: devel_ack?, pstehlik: testing_ack+ |
| Target Milestone: | --- |
| Target Release: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Doc Type: | If docs needed, set a value |
| Story Points: | --- |
| Type: | Bug |
| Regression: | --- |
| oVirt Team: | Virt |
| Last Closed: | 2016-11-18 07:47:43 UTC |
Description (Nir Soffer, 2016-09-01 07:48:36 UTC)
Created attachment 1196611 [details]
Vdsm log for this run
Created attachment 1196614 [details]
Engine log for this run
Created attachment 1196870 [details]
Object dump created by objects.py
This is a dump of all objects in vdsm, created by:

    import objects
    objects.dump('objects.out')

The format is:

    type count
    object repr...
Created attachment 1196871 [details]
Script for dumping all objects in python process
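For reference, a rough sketch of the approach the script takes (the exact code is in the attachment above; the snippet below is only an illustration of the described "type count / object repr..." output format, not the attachment's verbatim contents):

    import gc
    from collections import defaultdict

    def dump(path):
        # Group every object tracked by the garbage collector by type name.
        by_type = defaultdict(list)
        for obj in gc.get_objects():
            by_type[type(obj).__name__].append(obj)
        with open(path, 'w') as f:
            # Most common types first, matching the "type count" format.
            for name, objs in sorted(by_type.items(), key=lambda kv: -len(kv[1])):
                f.write('%s %d\n' % (name, len(objs)))
                for obj in objs:
                    try:
                        f.write('    %r\n' % (obj,))
                    except Exception:
                        f.write('    <repr failed>\n')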
Forgot to mention that the object dump was created using manhole:
1. Enable manhole:

       [devel]
       manhole_enable = true

2. Restart vdsm.

3. Open a manhole shell:

       $ nc -U /run/vdsm/vdsmd.manhole
       > import objects
       > objects.dump('objects.out')
I tried to find the leak but have failed so far. All I can reliably observe is the already known Python multiprocessing leak (related to supervdsm communication), which should be much smaller than the supposed leak here: a couple of tens of KB per VM start-stop action.
After stopping MoM and inserting gc.collect() plus a reset of the leaking Python multiprocessing data structures into API.VM.create (to get a more predictable environment), I'm having trouble reproducing the leak at all. While it's not uncommon for RSS to increase by ~1 MB per VM start-stop cycle, it doesn't happen every time, and RSS sometimes even drops after the action. Although memory usage grows initially, I wasn't able to exceed 140 MB even after more than 200 VM start-stop cycles of different kinds (jsonrpc calls, REST API calls, manual actions in the Engine UI).
I also compared counts of Python objects between single VM start-stop cycles where RSS increased. All I could see was a single leaking lambda instance from the multiprocessing leak. I don't think one lambda can waste anything close to ~1 MB of space, so I suspected something like a C-level object leak. I therefore tried to grow a really big Vdsm process to make the leaked data clearly visible, but I couldn't.
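Such a between-cycles comparison can be done with a small gc-based helper along these lines (a sketch assuming plain type-name counting is enough; not the exact code used):

    import gc
    from collections import Counter

    def snapshot():
        # Force a collection first so only genuinely live objects are counted.
        gc.collect()
        return Counter(type(o).__name__ for o in gc.get_objects())

    def diff(before, after):
        # Types whose live-instance count changed between the two snapshots.
        return {name: after[name] - before[name]
                for name in set(before) | set(after)
                if after[name] != before[name]}

Taking snapshot() before and after a start-stop cycle and printing diff() is how a single leaked lambda shows up.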
I did once get a Vdsm instance of more than 400 MB, but that was in a rather messed-up environment where I had spawned too many VMs in parallel by mistake and many stale (not actually running) VM instances were present in Vdsm.
Can you reproduce the leak to the extent of, say, at least 300 MB RSS, even with MoM stopped and the following code inserted at the beginning of API.VM.create?
    import multiprocessing.util
    import weakref
    # Replace the registry that the known multiprocessing leak keeps
    # filling up, so its entries can no longer accumulate.
    multiprocessing.util._afterfork_registry = \
        weakref.WeakValueDictionary()

    import gc
    # Collect garbage so RSS readings reflect genuine growth only.
    gc.collect()
I originally reported this issue of Vdsm and a possible memory leak. This is an extract of vdsm logs from a production system. I have noticed an increase in RSS when VMs are migrated between hosts. For approximately one day the host hasn't run any VM, and RSS hasn't grown since. Please tell me if you need more info about my setup.

$ xzcat vdsm.log.13.xz | grep rss=
...
Thread-33::DEBUG::2016-09-01 10:32:26,269::health::122::health::(_check_resources) user=2.95%, sys=1.47%, rss=590596 kB (+0), threads=52
Thread-33::DEBUG::2016-09-01 10:33:26,327::health::122::health::(_check_resources) user=39.57%, sys=2.80%, rss=749108 kB (+158512), threads=55
Thread-33::DEBUG::2016-09-01 10:34:26,386::health::122::health::(_check_resources) user=2.00%, sys=1.07%, rss=749020 kB (-88), threads=50
Thread-33::DEBUG::2016-09-01 10:35:26,441::health::122::health::(_check_resources) user=1.98%, sys=1.05%, rss=748948 kB (-72), threads=48
Thread-33::DEBUG::2016-09-01 10:36:26,477::health::122::health::(_check_resources) user=1.15%, sys=0.93%, rss=746516 kB (-2432), threads=34
Thread-33::DEBUG::2016-09-01 10:37:26,496::health::122::health::(_check_resources) user=0.62%, sys=0.20%, rss=746516 kB (+0), threads=34
Thread-33::DEBUG::2016-09-01 10:38:26,518::health::122::health::(_check_resources) user=1.57%, sys=1.25%, rss=747252 kB (+736), threads=48
Thread-33::DEBUG::2016-09-01 10:39:26,549::health::122::health::(_check_resources) user=1.73%, sys=1.17%, rss=743516 kB (-3736), threads=47
Thread-33::DEBUG::2016-09-01 10:40:26,588::health::122::health::(_check_resources) user=4.43%, sys=1.70%, rss=743516 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 10:41:26,618::health::122::health::(_check_resources) user=3.85%, sys=1.63%, rss=743668 kB (+152), threads=45
Thread-33::DEBUG::2016-09-01 10:42:26,649::health::122::health::(_check_resources) user=1.55%, sys=0.87%, rss=743668 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 10:43:26,681::health::122::health::(_check_resources) user=1.78%, sys=1.27%, rss=743700 kB (+32), threads=47
Thread-33::DEBUG::2016-09-01 10:44:26,712::health::122::health::(_check_resources) user=1.85%, sys=0.95%, rss=743700 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 10:45:26,743::health::122::health::(_check_resources) user=1.80%, sys=1.00%, rss=743700 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 10:46:26,773::health::122::health::(_check_resources) user=1.53%, sys=0.92%, rss=743700 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 10:47:26,804::health::122::health::(_check_resources) user=1.73%, sys=0.90%, rss=743700 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 10:48:26,837::health::122::health::(_check_resources) user=1.88%, sys=1.27%, rss=743708 kB (+8), threads=47
Thread-33::DEBUG::2016-09-01 10:49:26,976::health::122::health::(_check_resources) user=22.77%, sys=1.62%, rss=847896 kB (+104188), threads=50
Thread-33::DEBUG::2016-09-01 10:50:27,037::health::122::health::(_check_resources) user=6.47%, sys=1.33%, rss=868908 kB (+21012), threads=46
Thread-33::DEBUG::2016-09-01 10:51:27,074::health::122::health::(_check_resources) user=1.03%, sys=0.98%, rss=868884 kB (-24), threads=45
Thread-33::DEBUG::2016-09-01 10:52:27,105::health::122::health::(_check_resources) user=1.72%, sys=1.28%, rss=868900 kB (+16), threads=48
Thread-33::DEBUG::2016-09-01 10:53:27,141::health::122::health::(_check_resources) user=3.23%, sys=1.98%, rss=868912 kB (+12), threads=48
Thread-33::DEBUG::2016-09-01 10:54:27,201::health::122::health::(_check_resources) user=2.83%, sys=1.37%, rss=868880 kB (-32), threads=45

$ xzcat vdsm.log.11.xz | grep rss=
...
Thread-33::DEBUG::2016-09-01 13:01:31,100::health::122::health::(_check_resources) user=1.02%, sys=1.07%, rss=991372 kB (-20), threads=45
Thread-33::DEBUG::2016-09-01 13:02:31,123::health::122::health::(_check_resources) user=0.82%, sys=0.98%, rss=991372 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 13:03:31,146::health::122::health::(_check_resources) user=0.83%, sys=0.98%, rss=991372 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 13:04:31,186::health::122::health::(_check_resources) user=2.58%, sys=2.13%, rss=980888 kB (-10484), threads=49
Thread-33::DEBUG::2016-09-01 13:05:31,310::health::122::health::(_check_resources) user=6.00%, sys=2.77%, rss=978896 kB (-1992), threads=47
Thread-33::DEBUG::2016-09-01 13:06:31,366::health::122::health::(_check_resources) user=8.48%, sys=3.43%, rss=979020 kB (+124), threads=47
Thread-33::DEBUG::2016-09-01 13:07:31,436::health::122::health::(_check_resources) user=5.92%, sys=2.68%, rss=979168 kB (+148), threads=47
Thread-33::DEBUG::2016-09-01 13:08:31,516::health::122::health::(_check_resources) user=2.98%, sys=1.90%, rss=973556 kB (-5612), threads=47
Thread-33::DEBUG::2016-09-01 13:09:31,581::health::122::health::(_check_resources) user=2.50%, sys=1.97%, rss=975640 kB (+2084), threads=47
Thread-33::DEBUG::2016-09-01 13:10:31,638::health::122::health::(_check_resources) user=2.37%, sys=1.35%, rss=973572 kB (-2068), threads=45
Thread-33::DEBUG::2016-09-01 13:11:31,699::health::122::health::(_check_resources) user=2.28%, sys=1.27%, rss=973588 kB (+16), threads=45
Thread-33::DEBUG::2016-09-01 13:12:32,319::health::122::health::(_check_resources) user=80.22%, sys=5.18%, rss=1172364 kB (+198776), threads=64
Thread-33::DEBUG::2016-09-01 13:13:32,500::health::122::health::(_check_resources) user=26.02%, sys=3.07%, rss=1331032 kB (+158668), threads=60
Thread-33::DEBUG::2016-09-01 13:14:32,603::health::122::health::(_check_resources) user=2.15%, sys=2.28%, rss=1328332 kB (-2700), threads=58
Thread-33::DEBUG::2016-09-01 13:15:32,690::health::122::health::(_check_resources) user=1.78%, sys=1.85%, rss=1326188 kB (-2144), threads=52
Thread-33::DEBUG::2016-09-01 13:16:32,767::health::122::health::(_check_resources) user=1.73%, sys=1.82%, rss=1324104 kB (-2084), threads=48
Thread-33::DEBUG::2016-09-01 13:17:32,812::health::122::health::(_check_resources) user=1.07%, sys=1.60%, rss=1324068 kB (-36), threads=46
Thread-33::DEBUG::2016-09-01 13:18:32,843::health::122::health::(_check_resources) user=0.92%, sys=1.30%, rss=1324044 kB (-24), threads=45
Thread-33::DEBUG::2016-09-01 13:19:32,867::health::122::health::(_check_resources) user=0.92%, sys=1.67%, rss=1324044 kB (+0), threads=47
Thread-33::DEBUG::2016-09-01 13:20:32,894::health::122::health::(_check_resources) user=1.43%, sys=1.67%, rss=1324060 kB (+16), threads=47
Thread-33::DEBUG::2016-09-01 13:21:32,938::health::122::health::(_check_resources) user=5.35%, sys=2.85%, rss=1324088 kB (+28), threads=47
Thread-33::DEBUG::2016-09-01 13:22:32,996::health::122::health::(_check_resources) user=7.47%, sys=3.92%, rss=1324112 kB (+24), threads=47
Thread-33::DEBUG::2016-09-01 13:23:33,051::health::122::health::(_check_resources) user=3.85%, sys=2.22%, rss=1324116 kB (+4), threads=47
Thread-33::DEBUG::2016-09-01 13:24:33,115::health::122::health::(_check_resources) user=4.75%, sys=3.02%, rss=1326248 kB (+2132), threads=49
Thread-33::DEBUG::2016-09-01 13:25:33,177::health::122::health::(_check_resources) user=3.47%, sys=2.25%, rss=1324684 kB (-1564), threads=47
Thread-33::DEBUG::2016-09-01 13:26:33,247::health::122::health::(_check_resources) user=2.63%, sys=1.60%, rss=1313864 kB (-10820), threads=45
Thread-33::DEBUG::2016-09-01 13:27:33,308::health::122::health::(_check_resources) user=2.22%, sys=1.63%, rss=1313864 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 13:28:33,370::health::122::health::(_check_resources) user=2.18%, sys=1.63%, rss=1313864 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 13:29:33,433::health::122::health::(_check_resources) user=2.53%, sys=2.10%, rss=1313864 kB (+0), threads=47
Thread-33::DEBUG::2016-09-01 13:30:33,493::health::122::health::(_check_resources) user=2.48%, sys=1.65%, rss=1313864 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 13:31:33,558::health::122::health::(_check_resources) user=2.55%, sys=1.68%, rss=1313864 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 13:32:33,618::health::122::health::(_check_resources) user=2.35%, sys=1.63%, rss=1313864 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 13:33:33,679::health::122::health::(_check_resources) user=2.22%, sys=1.58%, rss=1313864 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 13:34:33,742::health::122::health::(_check_resources) user=2.25%, sys=2.10%, rss=1313864 kB (+0), threads=47
Thread-33::DEBUG::2016-09-01 13:35:33,803::health::122::health::(_check_resources) user=2.18%, sys=1.67%, rss=1313864 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 13:36:33,863::health::122::health::(_check_resources) user=2.35%, sys=1.62%, rss=1313880 kB (+16), threads=45
Thread-33::DEBUG::2016-09-01 13:37:33,923::health::122::health::(_check_resources) user=2.50%, sys=1.58%, rss=1313880 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 13:38:34,043::health::122::health::(_check_resources) user=12.50%, sys=2.05%, rss=1333268 kB (+19388), threads=52
Thread-33::DEBUG::2016-09-01 13:39:34,193::health::122::health::(_check_resources) user=91.82%, sys=6.33%, rss=1378920 kB (+45652), threads=64
Thread-33::DEBUG::2016-09-01 13:40:34,356::health::122::health::(_check_resources) user=3.77%, sys=1.98%, rss=1375128 kB (-3792), threads=58
Thread-33::DEBUG::2016-09-01 13:41:34,454::health::122::health::(_check_resources) user=1.98%, sys=1.87%, rss=1373112 kB (-2016), threads=55
Thread-33::DEBUG::2016-09-01 13:42:34,537::health::122::health::(_check_resources) user=1.70%, sys=1.93%, rss=1373040 kB (-72), threads=52
Thread-33::DEBUG::2016-09-01 13:43:34,604::health::122::health::(_check_resources) user=1.47%, sys=1.75%, rss=1370920 kB (-2120), threads=48
Thread-33::DEBUG::2016-09-01 13:44:34,643::health::122::health::(_check_resources) user=1.12%, sys=2.05%, rss=1370900 kB (-20), threads=47
Thread-33::DEBUG::2016-09-01 13:45:34,667::health::122::health::(_check_resources) user=0.85%, sys=1.30%, rss=1370876 kB (-24), threads=45
Thread-33::DEBUG::2016-09-01 13:46:34,691::health::122::health::(_check_resources) user=0.87%, sys=1.28%, rss=1370876 kB (+0), threads=45
...
Thread-33::DEBUG::2016-09-01 14:34:35,817::health::122::health::(_check_resources) user=0.88%, sys=1.78%, rss=1370880 kB (+0), threads=47
Thread-33::DEBUG::2016-09-01 14:35:35,838::health::122::health::(_check_resources) user=0.87%, sys=1.28%, rss=1370880 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 14:36:35,862::health::122::health::(_check_resources) user=0.88%, sys=1.27%, rss=1368176 kB (-2704), threads=45
Thread-33::DEBUG::2016-09-01 14:37:35,886::health::122::health::(_check_resources) user=0.83%, sys=1.30%, rss=1368176 kB (+0), threads=45
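As an aside, the rss series can be pulled out of these health-monitor lines for plotting with a few lines of Python (a sketch that assumes the exact log format shown above):

    import re
    import sys

    # Matches e.g. "... rss=749108 kB (+158512), threads=55"
    RSS_RE = re.compile(r'rss=(\d+) kB \(([+-]?\d+)\), threads=(\d+)')

    for line in sys.stdin:
        m = RSS_RE.search(line)
        if m:
            rss_kb, delta_kb, threads = map(int, m.groups())
            print('%d\t%d\t%d' % (rss_kb, delta_kb, threads))

For example: xzcat vdsm.log.11.xz | grep rss= | python rss_series.py (rss_series.py being a hypothetical name for the snippet above).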
I checked migrations on my system and they occasionally increase RSS by ~40 MB. But once it happens 2-3 times, RSS no longer increases. So although migrations increase memory usage a few times initially, it doesn't look like a leak on my system, and I'm still unable to reproduce the leak.

(In reply to Federico Sayd from comment #7)
> Please tell me if you need more info about my setup

Unfortunately we have been unable to reproduce the leak so far. Let's try to get closer to your environment. Could you please tell us on what setup you experience the problem, namely:
- What Vdsm version do you use?
- How long has your Vdsm instance been running?
- What kind of storage do you use (iSCSI, NFS, ...)?
- How many VMs are you running on the host?
- Can you observe significant memory increases on occasions other than migrations?
- Any other information that might help us reproduce the leak?
Thank you.

Federico, please also share:

    /etc/vdsm/vdsm.conf

If you run 4.0, please also share:

    /etc/vdsm/vdsm.conf.d/*.conf
    /usr/lib/vdsm/vdsm.conf.d/*.conf

(In reply to Milan Zamazal from comment #9)
> - What Vdsm version do you use?
vdsm-4.18.11-1.el7.centos; other hosts with vdsm-4.18.12-1.el7 show the same behavior.
> - How long has your Vdsm instance been running?
Host running vdsm 4.18.11-1 since Sep 2, 17:59.
> - What kind of storage do you use (iSCSI, NFS, ...)?
FC (4 FC domains); ISO domain on NFS, connected using NFSv4.
> - How many VMs are you running on the host?
20
> - Can you observe significant memory increases on occasions other than migrations?
3 VMs powered up on this host:
Thread-13::DEBUG::2016-09-12 09:38:09,625::health::122::health::(_check_resources) user=2.77%, sys=1.77%, rss=910560 kB (+8), threads=45
Thread-13::DEBUG::2016-09-12 09:39:09,696::health::122::health::(_check_resources) user=2.52%, sys=1.65%, rss=910564 kB (+4), threads=47
Thread-13::DEBUG::2016-09-12 09:40:09,769::health::122::health::(_check_resources) user=2.92%, sys=2.20%, rss=910572 kB (+8), threads=45
Thread-13::DEBUG::2016-09-12 09:41:09,843::health::122::health::(_check_resources) user=2.58%, sys=1.60%, rss=910580 kB (+8), threads=45
Thread-13::DEBUG::2016-09-12 09:42:09,915::health::122::health::(_check_resources) user=2.47%, sys=1.50%, rss=910588 kB (+8), threads=45
...
Thread-13::DEBUG::2016-09-12 09:45:10,132::health::122::health::(_check_resources) user=2.65%, sys=2.10%, rss=910600 kB (+12), threads=45
...
Thread-13::DEBUG::2016-09-12 09:50:10,510::health::122::health::(_check_resources) user=2.48%, sys=2.22%, rss=910864 kB (+264), threads=45
> - Any other information that might help us reproduce the leak?
1 Datacenter, 1 Cluster (compatibility version 4.0), 8 hosts in the cluster (all full CentOS Linux release 7.2.1511 installs), 43 logical networks.
> Thank you.

Created attachment 1200251 [details]
vdsm.conf
vdsm.conf of one of the hosts
Created attachment 1200252 [details]
50-health.conf, only file in /etc/vdsm/vdsm.conf.d
This is the only file in /etc/vdsm/vdsm.conf.d
Created to debug vdsm memory consumption
Thanks Federico; according to attachment 1200251 [details] and attachment 1200252 [details], vdsm is using the default configuration.

@Eldad: have you seen anything like this in your scale tests? Any chance you could try to simulate a setup similar to the reported one and check whether it reproduces?

If the problem still exists, could you please provide us with a full Vdsm log? The places where the number of threads temporarily rises and the memory usage grows permanently at the same moment are the most interesting. Logs taken after several days of running and with very high memory usage are the most valuable, to exclude cases where memory usage grows initially but not afterwards. The logs should help us find out what happens when the number of threads and the memory usage grow.

@Federico: any chance you could provide us with this info?

Created attachment 1217981 [details]
vdsm.log 1
Created attachment 1217982 [details]
vdsm.log 2
Created attachment 1217983 [details]
vdsm.log 3
Created attachment 1217987 [details]
vdsm.log 4
Sorry for the late answer. I added four files with vdsm logs covering approximately 5 days.
Thank you for the logs. If I'm reading them correctly, Vdsm memory usage doesn't exceed 200 MB. This is consistent with what I observed when I tried to (unsuccessfully) reproduce the leak. Does that mean the problem is gone, or do you still have Vdsm instances with very high RAM consumption? If you still have problems, we would need the logs exposing the high memory usage.

@Sefi, any chance you could try to reproduce this?

Decreasing severity since it doesn't seem to cause many issues (if it even reproduces while running long-term).

So far I haven't been able to reproduce the leak either. I am running 5 VMs, starting and stopping them constantly with 1 minute between the actions (waiting for the VMs to go up, and then down); a sketch of such a loop follows at the end of this report. I'm using the vdsm version which Federico reported, vdsm-4.18.12-1.el7. After 1.5 days of running, vdsm memory consumption stays within a 2 MB range and does not gradually increase. The script is still running in case we want to give it more time, but it doesn't seem like we're going to hit the bug.

Considering that the latest logs provided by Federico don't contain any memory leak and no one was able to reproduce it, we have no way of going forward with this issue. Closing now as insufficient data; please reopen in case you hit it again, providing all the information as stated in Comment 17. Federico, thank you for the cooperation!
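For completeness, the start-stop loop described above could look roughly like the following sketch. It assumes the oVirt Python SDK (ovirtsdk4); the engine URL, credentials, VM name, and timings are placeholders, and the actual script used is not attached here.

    import time
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='password',
        insecure=True,  # lab setup; use a CA file in production
    )
    vms_service = connection.system_service().vms_service()
    vm = vms_service.list(search='name=leak-test-1')[0]
    vm_service = vms_service.vm_service(vm.id)

    def wait_for(status):
        # Poll until the VM reaches the requested status.
        while vm_service.get().status != status:
            time.sleep(5)

    try:
        while True:
            vm_service.start()
            wait_for(types.VmStatus.UP)
            time.sleep(60)  # 1 minute between actions
            vm_service.stop()
            wait_for(types.VmStatus.DOWN)
            time.sleep(60)
    finally:
        connection.close()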