Bug 1570840

Summary: Memory leak in vdsmd
Product: [oVirt] vdsm
Reporter: Chris Adams <linux>
Component: Core
Assignee: Marcin Sobczyk <msobczyk>
Status: CLOSED WORKSFORME
QA Contact: Nikolai Sednev <nsednev>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.20.23
CC: bugs, fromani, linux, michal.skrivanek, mperina, msobczyk, nsednev, nsoffer, pkliczew
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-01-29 13:50:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments (flags: none):
- vdsm log from node1
- vdsm log from node2
- vdsm log from node3
- process_log.py utility
- 'node2' logs grouped by delta and occurrence of log lines

Description Chris Adams 2018-04-23 14:04:47 UTC
I have a two-node dev cluster, running CentOS 7.4 and oVirt 4.2.2, updated last week. It is using hosted engine and iSCSI storage. On one node (the node currently hosting the engine), the vdsmd process is using over 900MB RAM and is growing.

Comment 1 Michal Skrivanek 2018-04-24 04:30:06 UTC
How fast is it growing? Only on one of them? Any more concrete evidence? Anything in logs?

Comment 2 Chris Adams 2018-04-24 12:51:06 UTC
Due to an unfortunate power failure, followed by a breaker failure, followed by a generator failure... my dev cluster was power-cycled yesterday afternoon (UPS batteries weren't expected to last through that much failure).

I'm not sure why, but I ended up with SPM and all VMs running on the same node after the power returned (and after I fixed the storage - the iSCSI is a "home brew" SAN, Linux running targetcli, and it came back from power cycle with the LUNs not attached to the storage).  After about 14 hours, vdsmd on node1 (no VMs) is using about 98MB, and vdsmd on node2 (SPM, 6 VMs including hosted engine) is up to almost 220MB.

Nothing jumps out at me in the logs - let me know what logs you'd like to see and I'll attach them.  It's been a while since I needed to debug a memory leak, so I'm not sure of the best way to see where the RAM is going.

I have several production oVirt clusters with similar setups (except the iSCSI is to a real SAN); they're still on 4.1.9.  I see similar vdsmd memory leaks on a couple of them; it seems to follow the node with the hosted engine.  On one, for example, I restarted vdsmd last week (April 16), and it is currently using 1.8GB RAM.

Comment 3 Michal Skrivanek 2018-04-24 17:30:39 UTC
(In reply to Chris Adams from comment #2)

"using" is too vague. Are you looking at RSS? Python allocation works in mysterious ways...and it's normal for it to grow initially (for hours/days), as long as it stops eventually, and it's also ok to see a very high total virtual memory size (tens of GB)
There are also few known leaks but they are considered small enough to be relevant.
It's not easy to profile remotely I'm afraid...you'd have to do it yourself or provide some repeatable scenario where we can see that in our lab.

Comment 4 Chris Adams 2018-04-24 17:57:12 UTC
Sorry - I'm looking at resident size, not virtual.  I had a vdsmd last week on one node on one of the 4.1 clusters that was sitting at 15GB resident (I didn't open a ticket then because I needed to restart it ASAP and because I didn't figure 4.1 was getting much attention at this point).

The 4.2.2 dev cluster has vdsmd resident up to 314MB now on the one node, so it has grown over 90MB in 4 hours.  It seems to grow fairly steadily when I've watched it.

If you can give me some tips, I can try to see what's happening (I just need some pointers because it's been a while since I had to do this).

I'm guessing it must be something related to how I installed these, since it seems to happen on multiple clusters for me, but I thought I went pretty much right by the guides, so... I don't know?  These were all installed at previous versions and updated over time (all 3.5 or 3.6 initially, then upgraded major release by major release).

Comment 5 Chris Adams 2018-04-25 00:01:30 UTC
Just to see if I could narrow down causes, I wrote a script to write the RSS of vdsmd to a file every five minutes.  node1, with no VMs, was flat, while node2, with 6 VMs (including the engine) and also the SPM, increased every five minutes.

I moved SPM to node1, and it took one jump, but then stayed level, while node2 continued increasing.

I moved the engine VM to node1, and it started increasing steadily, while node2 almost leveled off (still increased a small amount).  node1 vdsmd RSS is up about 14MB in the last hour.

I'm still reading how to diagnose python memory leaks - python is not a language I'm strong in.
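
A minimal sketch of such a sampler, for illustration only (a hypothetical reconstruction, not the script that was actually used; it reads VmRSS, which /proc reports in kB):

    #!/usr/bin/env python
    # Hypothetical RSS sampler: append the resident set size of a PID to a
    # file every five minutes. Not the original script, just an illustration.
    import sys
    import time

    def rss_kb(pid):
        # /proc/<pid>/status reports VmRSS in kB
        with open('/proc/%s/status' % pid) as status:
            for line in status:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])

    pid = sys.argv[1]
    while True:
        with open('vdsmd-rss.log', 'a') as out:
            out.write('%s %s\n' % (time.strftime('%F %T'), rss_kb(pid)))
        time.sleep(300)  # five minutes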

Comment 6 Michal Skrivanek 2018-04-25 15:53:22 UTC
(In reply to Chris Adams from comment #5)
> Just to see if I could narrow down causes, I wrote a script to write the RSS
> of vdsmd to a file every five minutes.  node1, with no VMs, was flat, while
> node2, with 6 VMs (including the engine) and also the SPM, increased every
> five minutes.
> 
> I moved SPM to node1, and it took one jump, but then stayed level, while
> node2 continued increasing.

SPM does a lot more than regular nodes. Maybe you can try to see whether the pattern is really related to that role, or rather to how many VMs the node runs, or to whether it runs the HostedEngine VM. A node running the HE VM is also doing a bit more/other work than a regular non-SPM node.
 
> I moved the engine VM to node1, and it started increasing steadily, while
> node2 almost leveled off (still increased a small amount).  node1 vdsmd RSS
> is up about 14MB in the last hour.

This would still sound OK, as long as it stops in about a day if you leave it like that.
 
> I'm still reading how to diagnose python memory leaks - python is not a
> language I'm strong in.

It's not easy. Most of what you can find around does not work on vdsm, because it's much more complex (especially the threading) than the average Python app.

Nir Soffer and Francesco Romani were hunting leaks about a year or two ago. Please bring it up on the users/devel list; that's easier to follow up on than here.

Comment 7 Chris Adams 2018-05-16 13:20:11 UTC
I brought this up on the devel list and was referred back to BZ.  Per the list, I enabled the vdsm health checker, and it shows the RSS for vdsm increasing every minute, by an average of about 190K, but only on the node with the hosted engine.

As a test, I put the engine HA in global maintenance mode, and even stopped ovirt-ha-broker and ovirt-ha-agent on both nodes.  Watching that for a short time seemed to show the RSS increase slowing; most minutes it was 24K-44K (but it still increased every minute).

I am not sure what information you'd need to try to replicate my setup - please let me know.  This two-node dev cluster was installed with CentOS 7 + oVirt 3.5 (with the engine on CentOS 6), then upgraded 3.5->3.6->4.0->4.1->4.2 (with the CentOS 7 engine appliance deployed when CentOS 6 engine support was dropped).  It is using iSCSI storage, with separate LUNs for hosted engine and regular VMs (the regular VM storage is using multipath).

Alternately, if someone can send me notes/patches/etc. to try to dump where vdsm is using RAM, I can give that a try.  I'm just not good enough with Python to do it myself, I don't think (I was reading about the tracemalloc module, but I'm not sure where to start with it).
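
For anyone in the same position, a minimal tracemalloc sketch (tracemalloc is part of the Python 3 standard library; the vdsm of this era runs on Python 2.7, where the pytracemalloc backport would be needed, so treat this purely as an illustration of the approach):

    import time
    import tracemalloc

    tracemalloc.start(25)            # keep up to 25 frames per allocation
    baseline = tracemalloc.take_snapshot()

    time.sleep(600)                  # let the suspected leak accumulate

    snapshot = tracemalloc.take_snapshot()
    # show the 20 call sites whose allocations grew the most since the baseline
    for stat in snapshot.compare_to(baseline, 'lineno')[:20]:
        print(stat)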

Comment 8 Chris Adams 2018-05-16 17:39:09 UTC
And after a few hours, vdsm RSS is back to growing at a rate of 200-450K per minute.

Comment 9 Michal Skrivanek 2018-08-15 17:37:27 UTC
technically this is Infra

but I'm not sure who can help with this. Nir, Francesco, I'm afraid you'd need to look at it or we have to close this. I do not see any more details in the mail thread to be able to recreate this in the lab. 200-400 KB per minute sounds serious enough to look at. OTOH no one else is reporting such a problem.

Comment 10 Chris Adams 2018-08-15 17:45:29 UTC
Just to add - I'm guessing it is something about how I set things up, since I see it on multiple clusters... just not sure what that could have been (sorry for being a typical clueless end-user in that way!).  My dev cluster is out of service right now, but I do see this in production on several 4.1.9 setups (completely independent systems).

If there is any specific info I can provide, please just ask.

Here's how they were set up:

- installed CentOS 7 from a basic kickstart, added oVirt 3.6 repos, installed oVirt with hosted engine on iSCSI (installed the engine VM from the same basic kickstart when prompted by "hosted-engine --deploy")
- updated within 3.6 a few times, then upgraded to 4.0 and then 4.1

They're each 2-3 servers with iSCSI storage (one has multipath but the others don't, so it's not that).  I'm not sure of anything "unique" about how I set them up (I wish I knew).

Comment 11 Martin Perina 2018-08-27 12:46:34 UTC
(In reply to Michal Skrivanek from comment #9)
> technically this is Infra
> 
> but I'm not sure who can help with this. Nir, Francesco, I'm afraid you'd
> need to look at it or we have to close this. I do not see any more details
> in the mail thread to be able to recreate this in the lab. 200-400 KB per
> minute sounds serious enough to look at. OTOH no one else is reporting such
> a problem.

Nir/Francesco, did you have time to take a look at it?

Comment 12 Francesco Romani 2018-08-27 12:48:14 UTC
(In reply to Martin Perina from comment #11)
> (In reply to Michal Skrivanek from comment #9)
> > technically this is Infra
> > 
> > but I'm not sure who can help with this. Nir, Francesco, I'm afraid you'd
> > need to look at it or we have to close this. I do not see any more details
> > in the mail thread to be able to recreate this in the lab. 200-400 KB per
> > minute sounds serious enough to look at. OTOH no one else is reporting such
> > a problem.
> 
> Nir/Francesco, did you have time to take a look at it?

Nope, but I will have some time hopefully later today or tomorrow.

Comment 13 Nir Soffer 2018-08-27 13:14:15 UTC
(In reply to Martin Perina from comment #11)
> Nir/Francesco, did you have time to take a look at it?

No, and I will not have time for this. If this leak is in storage, please move
the bug to storage.

Comment 14 Francesco Romani 2018-08-27 14:04:59 UTC
(In reply to Chris Adams from comment #5)
> Just to see if I could narrow down causes, I wrote a script to write the RSS
> of vdsmd to a file every five minutes.  node1, with no VMs, was flat, while
> node2, with 6 VMs (including the engine) and also the SPM, increased every
> five minutes.
> 
> I moved SPM to node1, and it took one jump, but then stayed level, while
> node2 continued increasing.
> 
> I moved the engine VM to node1, and it started increasing steadily, while
> node2 almost leveled off (still increased a small amount).  node1 vdsmd RSS
> is up about 14MB in the last hour.
> 
> I'm still reading how to diagnose python memory leaks - python is not a
> language I'm strong in.

No need for custom scripts, btw. You can use the built-in Vdsm "health" reporting: 
https://github.com/oVirt/vdsm/blob/master/lib/vdsm/common/config.py.in#L565

The health reporting will periodically dump the memory stats, including deltas from the last check, the thread count and other info. Find them in the logs.
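
For reference, enabling it should just be a matter of switching the health-monitor option on in /etc/vdsm/vdsm.conf and restarting vdsmd. A sketch, with the option names assumed from the config.py.in linked above (verify them there):

    # /etc/vdsm/vdsm.conf -- option names assumed; check config.py.in
    [devel]
    # enable the periodic health report in vdsm.log
    health_monitor_enable = true
    # seconds between health reports
    health_check_interval = 60

    # afterwards: systemctl restart vdsmd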

Comment 15 Francesco Romani 2018-08-27 14:07:47 UTC
From what I've learned from the comments, it seems somehow related to hosted engine. Chris, could you confirm this behaviour shows up only with hosted engine? In other words, does it happen if you use a non-hosted engine?

Comment 16 Chris Adams 2018-08-27 20:15:02 UTC
I haven't used oVirt with a non-hosted engine, so I don't know about that and don't have a good way to test it.  I did run a test with the engine HA agent/broker shut down, and the RSS increase appeared to slow somewhat, but it still continued to grow, so I don't know if that was just coincidental or what.

The vdsm RSS growth does seem to be on the node hosting the engine, so if it isn't the engine HA agent/broker, could it be related to the engine storage domain?  I can't really think of much else that would be different between a node hosting the engine and other nodes.  I'm trying to remember if I did anything "off script" in setting up the engine storage, but I don't think I did.  It is an iSCSI LUN, dedicated to just the engine.

Comment 17 Nir Soffer 2018-08-27 20:23:50 UTC
Chris, can you enable the health monitor, run with DEBUG log level for a couple of days, and attach vdsm logs to the bug?

Comment 18 Chris Adams 2018-08-28 17:38:04 UTC
Just an FYI: I did see your request, but I'm on vacation for the next week. I will look at that when I get back.

I did enable the health monitor on my dev cluster a while back - it did show the steadily growing RSS (but not really any additional info).

Comment 19 Nir Soffer 2018-08-28 17:45:35 UTC
(In reply to Chris Adams from comment #18)
Great, please attach logs with the health monitor enabled to the bug. It will help
us understand the nature of this issue.

Comment 20 Chris Adams 2018-09-12 19:29:18 UTC
Created attachment 1482831 [details]
vdsm log from node1

Comment 21 Chris Adams 2018-09-12 19:29:40 UTC
Created attachment 1482832 [details]
vdsm log from node2

Comment 22 Chris Adams 2018-09-12 19:30:00 UTC
Created attachment 1482833 [details]
vdsm log from node3

Comment 23 Chris Adams 2018-09-12 19:31:15 UTC
I enabled the health checker on one cluster; here are the logs from around the same time (they didn't rotate at the same time, so they're not exactly lined up).  node2 is the node currently hosting the Hosted Engine VM and is the one with vdsm's RSS growing.

Comment 24 Chris Adams 2018-10-09 14:36:41 UTC
Just checking in - was that sufficient log info, or do I need to send more?

Comment 31 Michal Skrivanek 2018-11-20 12:58:37 UTC
Nikolai, it would be best if you ran it for some time (a few hours at least); please enable health monitoring (comment #14) and attach vdsm.log. Thanks

Comment 32 Nikolai Sednev 2018-11-20 14:56:25 UTC
systemctl restart vdsmd

puma18 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          32065       17679       10706         489        3679       13077
Swap:          2047           0        2047


Every 2.0s: ps -o etime,rss,vsz 1618                                                          Tue Nov 20 15:16:06 2018

    ELAPSED   RSS    VSZ
      00:21 82356 4413804


puma18 ~]# while sleep 60; do echo; date; ps -o rsz= 1618; done

Tue Nov 20 15:18:17 IST 2018
98724


puma18 ~]# top
top - 15:19:24 up 5 days, 21:15,  2 users,  load average: 1.26, 1.03, 0.90
Tasks: 327 total,   1 running, 326 sleeping,   0 stopped,   0 zombie
%Cpu(s): 13.0 us,  2.6 sy,  0.0 ni, 83.9 id,  0.5 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 32835032 total, 10974984 free, 18091540 used,  3768508 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used. 13404044 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                           
 1618 vdsm       0 -20 4635000  98760  13540 S   4.3  0.3   0:12.36 vdsmd                                             
18808 root      15  -5 1547364  36516   9936 S   4.0  0.1  60:07.53 supervdsmd 

puma18 ~]# while sleep 360; do echo; date; ps -o rsz= 1618; done

Tue Nov 20 15:24:32 IST 2018
99076

Tue Nov 20 16:24:33 IST 2018
100092


Tue Nov 20 15:30:32 IST 2018
99136

Tue Nov 20 16:30:33 IST 2018
100104


Tue Nov 20 15:42:32 IST 2018
99292

Tue Nov 20 16:42:33 IST 2018
100132

Comment 34 Marcin Sobczyk 2019-01-18 12:59:46 UTC
I can't really reproduce the issue. I have my HE environment up and running, there are some VMs launched, and VDSM's memory usage climbs to ~100 MB, but after ~30 hours it levels off and stays there.

I was trying to analyze the 'node2' log that Chris provided. I wrote a script that:

- searches for lines containing the delta information given by the health check (e.g. "rss=408532 kB (+152)")
- takes all the log lines that were printed between the current check and the previous one
- groups the lines by occurrence count, so we get something like this:

  31 x (Reactor thread) [Broker.StompAdapter] Processing CONNECT request (stompreactor:103)
  31 x (JsonRpc (StompReactor)) [Broker.StompAdapter] Subscribe command received (stompreactor:132)
   4 x (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:539)
 ...

- prints the grouped data into files named after the delta value (e.g. "440", "320", etc.) with contents like the above

The span of the deltas is huge - the smallest are ~80 kB and the largest ~1000 kB. I was hoping that some pattern would emerge (e.g. that for the big ones the log file would contain calls to something that doesn't occur for the small ones), but I can't find a correlation.

Nikolai's log unfortunately doesn't have the health check info, so I can't really work with it.
'node2', on the other hand, seems to have been written with the default logging levels, so there are no debug messages.
If someone could use my utility in an environment that has both all loggers set to debug and the health checks enabled, that would be great.

I'm attaching this utility along with the statistics I've created for the 'node2' log file.
The usage is simple:

 ./process_log.py vdsm.log

Please note that this utility will create *a lot* of files in the directory you're currently in.
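
For readers without access to the attachment, a rough sketch of the same idea (this is not the attached process_log.py; the health-check line format and the vdsm.log line layout are assumed from the examples above):

    #!/usr/bin/env python
    # Rough sketch, not the attached utility: split vdsm.log at each health-check
    # line, count how often each log message repeats in the preceding interval,
    # and write the counts to a file named after the reported RSS delta.
    import re
    from collections import Counter

    DELTA_RE = re.compile(r'rss=\d+ kB \(\+?(-?\d+)\)')

    interval = []
    with open('vdsm.log') as log:
        for line in log:
            match = DELTA_RE.search(line)
            if match:
                with open(match.group(1), 'w') as out:
                    for message, count in Counter(interval).most_common():
                        out.write('%5d x %s\n' % (count, message))
                interval = []
            else:
                # drop the timestamp and level, keep "(thread) [logger] message"
                fields = line.rstrip().split(None, 3)
                if len(fields) == 4:
                    interval.append(fields[3])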

Comment 35 Marcin Sobczyk 2019-01-18 13:02:00 UTC
Created attachment 1521512 [details]
process_log.py utility

Comment 36 Marcin Sobczyk 2019-01-18 13:03:09 UTC
Created attachment 1521513 [details]
'node2' logs grouped by delta and occurence of log lines

Comment 37 Martin Perina 2019-01-18 13:36:08 UTC
(In reply to Marcin Sobczyk from comment #34)
> [...]

Nikolai, could you please try to reproduce in your environment with debug log enabled on all hosts?

Comment 38 Nikolai Sednev 2019-01-20 07:46:26 UTC
(In reply to Martin Perina from comment #37)
> (In reply to Marcin Sobczyk from comment #34)
> > [...]
> 
> Nikolai, could you please try to reproduce in your environment with debug
> log enabled on all hosts?

I already provided all the relevant logs, including debug logs, in comment #33. At this point, due to capacity, I can't supply any more data.

Comment 39 Nikolai Sednev 2019-01-21 07:42:13 UTC
Tested on the latest 4.3.0-0.8 with vdsm-4.30.6-1.el7ev.x86_64; the RSS stabilized at 106600 kB:
alma03 ~]# while sleep 60; do echo; date; ps -o rsz= 23584; done

Sun Jan 20 15:57:55 IST 2019
105504

Sun Jan 20 16:06:56 IST 2019
105540

Sun Jan 20 16:55:01 IST 2019
105588

Mon Jan 21 07:54:35 IST 2019
106596

Mon Jan 21 07:55:35 IST 2019
106600

Mon Jan 21 07:56:35 IST 2019
106600

Mon Jan 21 09:39:46 IST 2019
106600

alma03 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          32010       19749        5255          97        7005       11584
Swap:          2047           0        2047

I think that at this point it has stopped where it is and is not going to rise.
Please consider closing this bug, as it works as designed.

Comment 41 Marcin Sobczyk 2019-01-29 13:50:44 UTC
Since the issue is not reproducible by me or QE, we're closing the bug.