| Summary: | Memory leak when starting and stopping a VM |
|---|---|
| Product: | [oVirt] vdsm |
| Component: | Core |
| Version: | 4.18.0 |
| Status: | CLOSED INSUFFICIENT_DATA |
| Severity: | medium |
| Priority: | unspecified |
| Reporter: | Nir Soffer <nsoffer> |
| Assignee: | Dan Kenigsberg <danken> |
| QA Contact: | sefi litmanovich <slitmano> |
| CC: | bugs, emarcian, fsayd, michal.skrivanek, mzamazal, pstehlik, s.kieske, slitmano, tjelinek |
| Flags: | pstehlik: testing_plan_complete?, rule-engine: planning_ack?, rule-engine: devel_ack?, pstehlik: testing_ack+ |
| Target Milestone: | --- |
| Target Release: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Doc Type: | If docs needed, set a value |
| Story Points: | --- |
| Type: | Bug |
| Regression: | --- |
| oVirt Team: | Virt |
| Last Closed: | 2016-11-18 07:47:43 UTC |
Description (Nir Soffer, 2016-09-01 07:48:36 UTC)
Created attachment 1196611 [details]
Vdsm log for this run
Created attachment 1196614 [details]
Engine log for this run
Created attachment 1196870 [details]
Object dump created by objects.py
This is a dump of all objects in vdsm, created by:

    import objects
    objects.dump('objects.out')

The format is:

    type count
    object repr...
Created attachment 1196871 [details]
Script for dumping all objects in python process
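For reference, a rough sketch of the approach the script takes (the exact code is in the attachment above; the snippet below is only an illustration of the described "type count / object repr..." output format, not the attachment's verbatim contents):

    import gc
    from collections import defaultdict

    def dump(path):
        # Group every object tracked by the garbage collector by type name.
        by_type = defaultdict(list)
        for obj in gc.get_objects():
            by_type[type(obj).__name__].append(obj)
        with open(path, 'w') as f:
            # Most common types first, matching the "type count" format.
            for name, objs in sorted(by_type.items(), key=lambda kv: -len(kv[1])):
                f.write('%s %d\n' % (name, len(objs)))
                for obj in objs:
                    try:
                        f.write('    %r\n' % (obj,))
                    except Exception:
                        f.write('    <repr failed>\n')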
Forgot to mention that the object dump was created using manhole:
1. Enable manhole:

       [devel]
       manhole_enable = true

2. Restart vdsm.

3. Open a manhole shell:

       $ nc -U /run/vdsm/vdsmd.manhole
       > import objects
       > objects.dump('objects.out')
I tried to find the leak but have failed so far. All I can reliably observe is the already known Python multiprocessing leak (related to supervdsm communication), which should be much smaller than the supposed leak here: a couple of tens of KB per VM start-stop action.
After stopping MoM and inserting gc.collect() plus a reset of the leaking Python multiprocessing data structures into API.VM.create (to get a more predictable environment), I'm having trouble reproducing the leak at all. While it's not uncommon for RSS to increase by ~1 MB per VM start-stop cycle, it doesn't happen every time, and RSS sometimes even drops after the action. Although memory usage grows initially, I wasn't able to exceed 140 MB even after more than 200 VM start-stop cycles of different kinds (jsonrpc calls, REST API calls, manual actions in the Engine UI).
I also compared counts of Python objects between single VM start-stop cycles where RSS increased. All I could see was a single leaking lambda instance from the multiprocessing leak. I don't think one lambda can waste anything close to ~1 MB of space, so I suspected something like a C-level object leak. I therefore tried to grow a really big Vdsm process to make the leaked data clearly visible, but I couldn't.
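Such a between-cycles comparison can be done with a small gc-based helper along these lines (a sketch assuming plain type-name counting is enough; not the exact code used):

    import gc
    from collections import Counter

    def snapshot():
        # Force a collection first so only genuinely live objects are counted.
        gc.collect()
        return Counter(type(o).__name__ for o in gc.get_objects())

    def diff(before, after):
        # Types whose live-instance count changed between the two snapshots.
        return {name: after[name] - before[name]
                for name in set(before) | set(after)
                if after[name] != before[name]}

Taking snapshot() before and after a start-stop cycle and printing diff() is how a single leaked lambda shows up.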
I did once get a Vdsm instance of more than 400 MB, but that was in a rather messed-up environment where I had spawned too many VMs in parallel by mistake and many stale (not actually running) VM instances were present in Vdsm.
Can you reproduce the leak to the extent of, say, at least 300 MB RSS, even with MoM stopped and the following code inserted at the beginning of API.VM.create?
    import multiprocessing.util
    import weakref
    # Replace the registry that the known multiprocessing leak keeps
    # filling up, so its entries can no longer accumulate.
    multiprocessing.util._afterfork_registry = \
        weakref.WeakValueDictionary()

    import gc
    # Collect garbage so RSS readings reflect genuine growth only.
    gc.collect()
I originally reported this issue of Vdsm and a possible memory leak. This is an extract of vdsm logs from a production system. I have noticed an increase in RSS when VMs are migrated between hosts. For approximately one day the host hasn't run any VM, and RSS hasn't grown since. Please tell me if you need more info about my setup.

$ xzcat vdsm.log.13.xz | grep rss=
...
Thread-33::DEBUG::2016-09-01 10:32:26,269::health::122::health::(_check_resources) user=2.95%, sys=1.47%, rss=590596 kB (+0), threads=52
Thread-33::DEBUG::2016-09-01 10:33:26,327::health::122::health::(_check_resources) user=39.57%, sys=2.80%, rss=749108 kB (+158512), threads=55
Thread-33::DEBUG::2016-09-01 10:34:26,386::health::122::health::(_check_resources) user=2.00%, sys=1.07%, rss=749020 kB (-88), threads=50
Thread-33::DEBUG::2016-09-01 10:35:26,441::health::122::health::(_check_resources) user=1.98%, sys=1.05%, rss=748948 kB (-72), threads=48
Thread-33::DEBUG::2016-09-01 10:36:26,477::health::122::health::(_check_resources) user=1.15%, sys=0.93%, rss=746516 kB (-2432), threads=34
Thread-33::DEBUG::2016-09-01 10:37:26,496::health::122::health::(_check_resources) user=0.62%, sys=0.20%, rss=746516 kB (+0), threads=34
Thread-33::DEBUG::2016-09-01 10:38:26,518::health::122::health::(_check_resources) user=1.57%, sys=1.25%, rss=747252 kB (+736), threads=48
Thread-33::DEBUG::2016-09-01 10:39:26,549::health::122::health::(_check_resources) user=1.73%, sys=1.17%, rss=743516 kB (-3736), threads=47
Thread-33::DEBUG::2016-09-01 10:40:26,588::health::122::health::(_check_resources) user=4.43%, sys=1.70%, rss=743516 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 10:41:26,618::health::122::health::(_check_resources) user=3.85%, sys=1.63%, rss=743668 kB (+152), threads=45
Thread-33::DEBUG::2016-09-01 10:42:26,649::health::122::health::(_check_resources) user=1.55%, sys=0.87%, rss=743668 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 10:43:26,681::health::122::health::(_check_resources) user=1.78%, sys=1.27%, rss=743700 kB (+32), threads=47
Thread-33::DEBUG::2016-09-01 10:44:26,712::health::122::health::(_check_resources) user=1.85%, sys=0.95%, rss=743700 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 10:45:26,743::health::122::health::(_check_resources) user=1.80%, sys=1.00%, rss=743700 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 10:46:26,773::health::122::health::(_check_resources) user=1.53%, sys=0.92%, rss=743700 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 10:47:26,804::health::122::health::(_check_resources) user=1.73%, sys=0.90%, rss=743700 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 10:48:26,837::health::122::health::(_check_resources) user=1.88%, sys=1.27%, rss=743708 kB (+8), threads=47
Thread-33::DEBUG::2016-09-01 10:49:26,976::health::122::health::(_check_resources) user=22.77%, sys=1.62%, rss=847896 kB (+104188), threads=50
Thread-33::DEBUG::2016-09-01 10:50:27,037::health::122::health::(_check_resources) user=6.47%, sys=1.33%, rss=868908 kB (+21012), threads=46
Thread-33::DEBUG::2016-09-01 10:51:27,074::health::122::health::(_check_resources) user=1.03%, sys=0.98%, rss=868884 kB (-24), threads=45
Thread-33::DEBUG::2016-09-01 10:52:27,105::health::122::health::(_check_resources) user=1.72%, sys=1.28%, rss=868900 kB (+16), threads=48
Thread-33::DEBUG::2016-09-01 10:53:27,141::health::122::health::(_check_resources) user=3.23%, sys=1.98%, rss=868912 kB (+12), threads=48
Thread-33::DEBUG::2016-09-01 10:54:27,201::health::122::health::(_check_resources) user=2.83%, sys=1.37%, rss=868880 kB (-32), threads=45

$ xzcat vdsm.log.11.xz | grep rss=
...
Thread-33::DEBUG::2016-09-01 13:01:31,100::health::122::health::(_check_resources) user=1.02%, sys=1.07%, rss=991372 kB (-20), threads=45
Thread-33::DEBUG::2016-09-01 13:02:31,123::health::122::health::(_check_resources) user=0.82%, sys=0.98%, rss=991372 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 13:03:31,146::health::122::health::(_check_resources) user=0.83%, sys=0.98%, rss=991372 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 13:04:31,186::health::122::health::(_check_resources) user=2.58%, sys=2.13%, rss=980888 kB (-10484), threads=49
Thread-33::DEBUG::2016-09-01 13:05:31,310::health::122::health::(_check_resources) user=6.00%, sys=2.77%, rss=978896 kB (-1992), threads=47
Thread-33::DEBUG::2016-09-01 13:06:31,366::health::122::health::(_check_resources) user=8.48%, sys=3.43%, rss=979020 kB (+124), threads=47
Thread-33::DEBUG::2016-09-01 13:07:31,436::health::122::health::(_check_resources) user=5.92%, sys=2.68%, rss=979168 kB (+148), threads=47
Thread-33::DEBUG::2016-09-01 13:08:31,516::health::122::health::(_check_resources) user=2.98%, sys=1.90%, rss=973556 kB (-5612), threads=47
Thread-33::DEBUG::2016-09-01 13:09:31,581::health::122::health::(_check_resources) user=2.50%, sys=1.97%, rss=975640 kB (+2084), threads=47
Thread-33::DEBUG::2016-09-01 13:10:31,638::health::122::health::(_check_resources) user=2.37%, sys=1.35%, rss=973572 kB (-2068), threads=45
Thread-33::DEBUG::2016-09-01 13:11:31,699::health::122::health::(_check_resources) user=2.28%, sys=1.27%, rss=973588 kB (+16), threads=45
Thread-33::DEBUG::2016-09-01 13:12:32,319::health::122::health::(_check_resources) user=80.22%, sys=5.18%, rss=1172364 kB (+198776), threads=64
Thread-33::DEBUG::2016-09-01 13:13:32,500::health::122::health::(_check_resources) user=26.02%, sys=3.07%, rss=1331032 kB (+158668), threads=60
Thread-33::DEBUG::2016-09-01 13:14:32,603::health::122::health::(_check_resources) user=2.15%, sys=2.28%, rss=1328332 kB (-2700), threads=58
Thread-33::DEBUG::2016-09-01 13:15:32,690::health::122::health::(_check_resources) user=1.78%, sys=1.85%, rss=1326188 kB (-2144), threads=52
Thread-33::DEBUG::2016-09-01 13:16:32,767::health::122::health::(_check_resources) user=1.73%, sys=1.82%, rss=1324104 kB (-2084), threads=48
Thread-33::DEBUG::2016-09-01 13:17:32,812::health::122::health::(_check_resources) user=1.07%, sys=1.60%, rss=1324068 kB (-36), threads=46
Thread-33::DEBUG::2016-09-01 13:18:32,843::health::122::health::(_check_resources) user=0.92%, sys=1.30%, rss=1324044 kB (-24), threads=45
Thread-33::DEBUG::2016-09-01 13:19:32,867::health::122::health::(_check_resources) user=0.92%, sys=1.67%, rss=1324044 kB (+0), threads=47
Thread-33::DEBUG::2016-09-01 13:20:32,894::health::122::health::(_check_resources) user=1.43%, sys=1.67%, rss=1324060 kB (+16), threads=47
Thread-33::DEBUG::2016-09-01 13:21:32,938::health::122::health::(_check_resources) user=5.35%, sys=2.85%, rss=1324088 kB (+28), threads=47
Thread-33::DEBUG::2016-09-01 13:22:32,996::health::122::health::(_check_resources) user=7.47%, sys=3.92%, rss=1324112 kB (+24), threads=47
Thread-33::DEBUG::2016-09-01 13:23:33,051::health::122::health::(_check_resources) user=3.85%, sys=2.22%, rss=1324116 kB (+4), threads=47
Thread-33::DEBUG::2016-09-01 13:24:33,115::health::122::health::(_check_resources) user=4.75%, sys=3.02%, rss=1326248 kB (+2132), threads=49
Thread-33::DEBUG::2016-09-01 13:25:33,177::health::122::health::(_check_resources) user=3.47%, sys=2.25%, rss=1324684 kB (-1564), threads=47
Thread-33::DEBUG::2016-09-01 13:26:33,247::health::122::health::(_check_resources) user=2.63%, sys=1.60%, rss=1313864 kB (-10820), threads=45
Thread-33::DEBUG::2016-09-01 13:27:33,308::health::122::health::(_check_resources) user=2.22%, sys=1.63%, rss=1313864 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 13:28:33,370::health::122::health::(_check_resources) user=2.18%, sys=1.63%, rss=1313864 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 13:29:33,433::health::122::health::(_check_resources) user=2.53%, sys=2.10%, rss=1313864 kB (+0), threads=47
Thread-33::DEBUG::2016-09-01 13:30:33,493::health::122::health::(_check_resources) user=2.48%, sys=1.65%, rss=1313864 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 13:31:33,558::health::122::health::(_check_resources) user=2.55%, sys=1.68%, rss=1313864 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 13:32:33,618::health::122::health::(_check_resources) user=2.35%, sys=1.63%, rss=1313864 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 13:33:33,679::health::122::health::(_check_resources) user=2.22%, sys=1.58%, rss=1313864 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 13:34:33,742::health::122::health::(_check_resources) user=2.25%, sys=2.10%, rss=1313864 kB (+0), threads=47
Thread-33::DEBUG::2016-09-01 13:35:33,803::health::122::health::(_check_resources) user=2.18%, sys=1.67%, rss=1313864 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 13:36:33,863::health::122::health::(_check_resources) user=2.35%, sys=1.62%, rss=1313880 kB (+16), threads=45
Thread-33::DEBUG::2016-09-01 13:37:33,923::health::122::health::(_check_resources) user=2.50%, sys=1.58%, rss=1313880 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 13:38:34,043::health::122::health::(_check_resources) user=12.50%, sys=2.05%, rss=1333268 kB (+19388), threads=52
Thread-33::DEBUG::2016-09-01 13:39:34,193::health::122::health::(_check_resources) user=91.82%, sys=6.33%, rss=1378920 kB (+45652), threads=64
Thread-33::DEBUG::2016-09-01 13:40:34,356::health::122::health::(_check_resources) user=3.77%, sys=1.98%, rss=1375128 kB (-3792), threads=58
Thread-33::DEBUG::2016-09-01 13:41:34,454::health::122::health::(_check_resources) user=1.98%, sys=1.87%, rss=1373112 kB (-2016), threads=55
Thread-33::DEBUG::2016-09-01 13:42:34,537::health::122::health::(_check_resources) user=1.70%, sys=1.93%, rss=1373040 kB (-72), threads=52
Thread-33::DEBUG::2016-09-01 13:43:34,604::health::122::health::(_check_resources) user=1.47%, sys=1.75%, rss=1370920 kB (-2120), threads=48
Thread-33::DEBUG::2016-09-01 13:44:34,643::health::122::health::(_check_resources) user=1.12%, sys=2.05%, rss=1370900 kB (-20), threads=47
Thread-33::DEBUG::2016-09-01 13:45:34,667::health::122::health::(_check_resources) user=0.85%, sys=1.30%, rss=1370876 kB (-24), threads=45
Thread-33::DEBUG::2016-09-01 13:46:34,691::health::122::health::(_check_resources) user=0.87%, sys=1.28%, rss=1370876 kB (+0), threads=45
...
Thread-33::DEBUG::2016-09-01 14:34:35,817::health::122::health::(_check_resources) user=0.88%, sys=1.78%, rss=1370880 kB (+0), threads=47
Thread-33::DEBUG::2016-09-01 14:35:35,838::health::122::health::(_check_resources) user=0.87%, sys=1.28%, rss=1370880 kB (+0), threads=45
Thread-33::DEBUG::2016-09-01 14:36:35,862::health::122::health::(_check_resources) user=0.88%, sys=1.27%, rss=1368176 kB (-2704), threads=45
Thread-33::DEBUG::2016-09-01 14:37:35,886::health::122::health::(_check_resources) user=0.83%, sys=1.30%, rss=1368176 kB (+0), threads=45
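As an aside, the rss series can be pulled out of these health-monitor lines for plotting with a few lines of Python (a sketch that assumes the exact log format shown above):

    import re
    import sys

    # Matches e.g. "... rss=749108 kB (+158512), threads=55"
    RSS_RE = re.compile(r'rss=(\d+) kB \(([+-]?\d+)\), threads=(\d+)')

    for line in sys.stdin:
        m = RSS_RE.search(line)
        if m:
            rss_kb, delta_kb, threads = map(int, m.groups())
            print('%d\t%d\t%d' % (rss_kb, delta_kb, threads))

For example: xzcat vdsm.log.11.xz | grep rss= | python rss_series.py (rss_series.py being a hypothetical name for the snippet above).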
I checked migrations on my system and they occasionally increase RSS by ~40 MB. But once it happens 2-3 times, RSS no longer increases. So although migrations increase memory usage a few times initially, it doesn't look like a leak on my system, and I'm still unable to reproduce the leak.

(In reply to Federico Sayd from comment #7)
> Please tell me if you need more info about my setup

Unfortunately we have been unable to reproduce the leak so far. Let's try to get closer to your environment. Could you please tell us on what setup you experience the problem, namely:
- What Vdsm version do you use?
- How long has your Vdsm instance been running?
- What kind of storage do you use (iSCSI, NFS, ...)?
- How many VMs are you running on the host?
- Can you observe significant memory increases on occasions other than migrations?
- Any other information that might help us reproduce the leak?
Thank you.

Federico, please also share:

    /etc/vdsm/vdsm.conf

If you run 4.0, please also share:

    /etc/vdsm/vdsm.conf.d/*.conf
    /usr/lib/vdsm/vdsm.conf.d/*.conf

(In reply to Milan Zamazal from comment #9)
> - What Vdsm version do you use?
vdsm-4.18.11-1.el7.centos; other hosts with vdsm-4.18.12-1.el7 show the same behavior.
> - How long has your Vdsm instance been running?
Host running vdsm 4.18.11-1 since Sep 2, 17:59.
> - What kind of storage do you use (iSCSI, NFS, ...)?
FC (4 FC domains); ISO domain on NFS, connected using NFSv4.
> - How many VMs are you running on the host?
20
> - Can you observe significant memory increases on occasions other than migrations?
3 VMs powered up on this host:
Thread-13::DEBUG::2016-09-12 09:38:09,625::health::122::health::(_check_resources) user=2.77%, sys=1.77%, rss=910560 kB (+8), threads=45
Thread-13::DEBUG::2016-09-12 09:39:09,696::health::122::health::(_check_resources) user=2.52%, sys=1.65%, rss=910564 kB (+4), threads=47
Thread-13::DEBUG::2016-09-12 09:40:09,769::health::122::health::(_check_resources) user=2.92%, sys=2.20%, rss=910572 kB (+8), threads=45
Thread-13::DEBUG::2016-09-12 09:41:09,843::health::122::health::(_check_resources) user=2.58%, sys=1.60%, rss=910580 kB (+8), threads=45
Thread-13::DEBUG::2016-09-12 09:42:09,915::health::122::health::(_check_resources) user=2.47%, sys=1.50%, rss=910588 kB (+8), threads=45
...
Thread-13::DEBUG::2016-09-12 09:45:10,132::health::122::health::(_check_resources) user=2.65%, sys=2.10%, rss=910600 kB (+12), threads=45
...
Thread-13::DEBUG::2016-09-12 09:50:10,510::health::122::health::(_check_resources) user=2.48%, sys=2.22%, rss=910864 kB (+264), threads=45
> - Any other information that might help us reproduce the leak?
1 Datacenter, 1 Cluster (compatibility version 4.0), 8 hosts in the cluster (all full CentOS Linux release 7.2.1511 installs), 43 logical networks.
> Thank you.

Created attachment 1200251 [details]
vdsm.conf
vdsm.conf of one of the hosts
Created attachment 1200252 [details]
50-health.conf, only file in /etc/vdsm/vdsm.conf.d
This is the only file in /etc/vdsm/vdsm.conf.d
Created to debug vdsm memory consumption
Thanks Federico; according to attachment 1200251 [details] and attachment 1200252 [details], vdsm is using the default configuration.

@Eldad: have you seen anything like this in your scale tests? Any chance you could try to simulate a setup similar to the reported one and check whether it reproduces?

If the problem still exists, could you please provide us with a full Vdsm log? The places where the number of threads temporarily rises and the memory usage grows permanently at the same moment are the most interesting. Logs taken after several days of running and with very high memory usage are the most valuable, to exclude cases where memory usage grows initially but not afterwards. The logs should help us find out what happens when the number of threads and the memory usage grow.

@Federico: any chance you could provide us with this info?

Created attachment 1217981 [details]
vdsm.log 1
Created attachment 1217982 [details]
vdsm.log 2
Created attachment 1217983 [details]
vdsm.log 3
Created attachment 1217987 [details]
vdsm.log 4
Sorry for the late answer. I added four files with vdsm logs covering approximately 5 days.
Thank you for the logs. If I'm reading them correctly, Vdsm memory usage doesn't exceed 200 MB. This is consistent with what I observed when I tried to (unsuccessfully) reproduce the leak. Does that mean the problem is gone, or do you still have Vdsm instances with very high RAM consumption? If you still have problems, we would need the logs exposing the high memory usage.

@Sefi, any chance you could try to reproduce this?

Decreasing severity since it doesn't seem to cause many issues (if it even reproduces while running long-term).

So far I haven't been able to reproduce the leak either. I am running 5 VMs, starting and stopping them constantly with 1 minute between the actions (waiting for the VMs to go up, and then down); a sketch of such a loop follows at the end of this report. I'm using the vdsm version which Federico reported, vdsm-4.18.12-1.el7. After 1.5 days of running, vdsm memory consumption stays within a 2 MB range and does not gradually increase. The script is still running in case we want to give it more time, but it doesn't seem like we're going to hit the bug.

Considering that the latest logs provided by Federico don't contain any memory leak and no one was able to reproduce it, we have no way of going forward with this issue. Closing now as insufficient data; please reopen in case you hit it again, providing all the information as stated in Comment 17. Federico, thank you for the cooperation!
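For completeness, the start-stop loop described above could look roughly like the following sketch. It assumes the oVirt Python SDK (ovirtsdk4); the engine URL, credentials, VM name, and timings are placeholders, and the actual script used is not attached here.

    import time
    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',
        username='admin@internal',
        password='password',
        insecure=True,  # lab setup; use a CA file in production
    )
    vms_service = connection.system_service().vms_service()
    vm = vms_service.list(search='name=leak-test-1')[0]
    vm_service = vms_service.vm_service(vm.id)

    def wait_for(status):
        # Poll until the VM reaches the requested status.
        while vm_service.get().status != status:
            time.sleep(5)

    try:
        while True:
            vm_service.start()
            wait_for(types.VmStatus.UP)
            time.sleep(60)  # 1 minute between actions
            vm_service.stop()
            wait_for(types.VmStatus.DOWN)
            time.sleep(60)
    finally:
        connection.close()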