Bug 1142647 - supervdsm leaks memory when using glusterfs
Summary: supervdsm leaks memory when using glusterfs
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: vdsm
Version: 3.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.5.1
Assignee: Darshan
QA Contact: Gil Klein
URL:
Whiteboard: gluster
Depends On: 1093594
Blocks:
 
Reported: 2014-09-17 07:15 UTC by zhang guoqing
Modified: 2016-02-10 19:28 UTC
CC List: 17 users

Fixed In Version: ovirt-3.5.1_rc1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-01-21 16:02:54 UTC
oVirt Team: Gluster
Embargoed:


Attachments
vdsm-logs and some pngs (12.95 MB, application/octet-stream)
2014-09-18 01:01 UTC, zhang guoqing


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 33312 0 master MERGED gluster: Temporary fix for supervdsm memory leak. Never
oVirt gerrit 33950 0 ovirt-3.5 MERGED gluster: Temporary fix for supervdsm memory leak. Never

Description zhang guoqing 2014-09-17 07:15:36 UTC
Description of problem:
I have deployed an oVirt test bed: a cluster with a GlusterFS storage domain. On every vdsm node the supervdsmServer daemon appears to leak memory. A cluster using an NFS storage domain behaves normally. Please see the reproduction steps below for details.

Version-Release number of selected component (if applicable):
[root@node-01 ~]# rpm -qa | grep vdsm
vdsm-python-zombiereaper-4.16.4-0.el6.noarch
vdsm-xmlrpc-4.16.4-0.el6.noarch
vdsm-jsonrpc-4.16.4-0.el6.noarch
vdsm-4.16.4-0.el6.x86_64
vdsm-cli-4.16.4-0.el6.noarch
vdsm-python-4.16.4-0.el6.noarch
vdsm-yajsonrpc-4.16.4-0.el6.noarch
vdsm-gluster-4.16.4-0.el6.noarch

[root@node-01 ~]# rpm -qa | grep gluster
glusterfs-cli-3.5.2-1.el6.x86_64
glusterfs-libs-3.5.2-1.el6.x86_64
glusterfs-3.5.2-1.el6.x86_64
glusterfs-rdma-3.5.2-1.el6.x86_64
glusterfs-server-3.5.2-1.el6.x86_64
glusterfs-api-3.5.2-1.el6.x86_64
glusterfs-fuse-3.5.2-1.el6.x86_64
vdsm-gluster-4.16.4-0.el6.noarch

[root@node-01 ~]# rpm -qa | grep ioprocess
ioprocess-0.12.0-2.el6.x86_64
python-ioprocess-0.12.0-2.el6.noarch
Bugs already ruled out as unrelated:
https://bugzilla.redhat.com/show_bug.cgi?id=1130045
https://bugzilla.redhat.com/show_bug.cgi?id=1124369


How reproducible:

Steps to Reproduce:
1. The data center's cluster was created with "Enable Gluster Service" checked.
2. Add two nodes through the ovirt-engine dashboard.
3. Create two storage domains: a Data (Master) domain of type GlusterFS and an ISO domain of type POSIX compliant FS.
4. Create some VMs.
5. Wait a few minutes and watch the memory usage of the supervdsmServer daemon on any node with the top command (see the sketch after this list).
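
For step 5, instead of watching top by hand, the sketch below samples the daemon's resident set size so a leak shows up as a steadily growing number. This is a minimal illustration only; the pgrep pattern and /proc layout are assumptions about a typical Linux host, and the script is not part of vdsm.

#!/usr/bin/env python
# Minimal sketch: sample the RSS of supervdsmServer every 30 seconds so a
# leak shows up as a steadily growing number. Assumes a Linux /proc
# filesystem and a daemon command line containing "supervdsmServer".
import time
import subprocess

def find_pid(pattern="supervdsmServer"):
    # pgrep -f matches against the full command line
    out = subprocess.Popen(["pgrep", "-f", pattern],
                           stdout=subprocess.PIPE).communicate()[0]
    return int(out.split()[0])

def rss_kib(pid):
    # The VmRSS line in /proc/<pid>/status is reported in kB
    with open("/proc/%d/status" % pid) as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

if __name__ == "__main__":
    pid = find_pid()
    while True:
        print("%s pid=%d rss=%d kB" % (time.strftime("%H:%M:%S"),
                                       pid, rss_kib(pid)))
        time.sleep(30)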

Actual results:
As time goes on, the supervdsmServer daemon occupies more and more system memory until nothing more can be allocated. As a result the node's status changes to "NonOperational" in the ovirt-engine web UI, and nothing useful can be done on the cluster anymore.

It is worth noting that restarting the vdsm and supervdsmServer daemons while a host is "NonOperational" brings it back to normal operation, but the same situation recurs as time goes on.

Also, if the cluster has no GlusterFS storage domain, everything works as expected.

Expected results:


Additional info:
I could not find any helpful information in vdsm.log or supervdsm.log.

Comment 1 Dan Kenigsberg 2014-09-17 09:35:18 UTC
Could you attach supervdsm.log anyway? Do you spot anything different in the log, relative to the cluster that has no glusterFS?

Comment 2 zhang guoqing 2014-09-18 01:01:40 UTC
Created attachment 938700 [details]
vdsm-logs and some pngs

Thanks, first of all!
I really could not find a helpful log, so I am attaching all of the nodes' vdsm logs here, along with some PNGs that may help in analyzing bug 1142647.

Comment 3 Dan Kenigsberg 2014-09-18 08:54:12 UTC
I see that supervdsm is asked to call

   /usr/sbin/gluster --mode=script volume info --xml

every 5 seconds. Is this expected?


Also (and unrelated to the leak),

 MainProcess|Thread-51::DEBUG::2014-09-17 10:16:59,274::supervdsmServer::101::SuperVdsm.ServerCallback::(wrapper) call wrapper with (None,) {}

does not show the called function name, only "wrapper".
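
Side note on that log line: a decorator that logs only "wrapper" usually just lacks functools.wraps, which copies the wrapped function's name onto the wrapper. A minimal illustrative sketch of the pattern (not the actual supervdsmServer code; volumeInfo here is a hypothetical callback):

import functools
import logging

log = logging.getLogger("SuperVdsm.ServerCallback")

def logDecorator(func):
    # functools.wraps copies __name__ (and __doc__) from func onto wrapper,
    # so the debug line shows the real callback name instead of "wrapper".
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.debug("call %s with %s %s", func.__name__, args, kwargs)
        return func(*args, **kwargs)
    return wrapper

@logDecorator
def volumeInfo(volumeName=None):
    # hypothetical callback used only to demonstrate the decorator
    return {}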

Comment 4 Darshan 2014-09-19 05:13:00 UTC
supervdsm calling "/usr/sbin/gluster --mode=script volume info --xml"
every 5 seconds is expected behaviour.
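
For context, that monitoring cycle boils down to running the gluster CLI and parsing its XML output every few seconds. Below is a minimal sketch of such a polling loop, using the command quoted in comment 3; the parsing and interval handling are simplified assumptions, not the vdsm-gluster code.

import time
import subprocess
import xml.etree.cElementTree as etree

def volume_info_xml():
    # The same command supervdsm is asked to run on every cycle (comment 3).
    proc = subprocess.Popen(
        ["/usr/sbin/gluster", "--mode=script", "volume", "info", "--xml"],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError("gluster failed: %s" % err)
    return etree.fromstring(out)

if __name__ == "__main__":
    while True:
        tree = volume_info_xml()
        names = [elem.text for elem in tree.findall(".//volume/name")]
        print("volumes: %s" % names)
        time.sleep(5)  # the 5-second interval mentioned in comment 3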

Comment 5 zhang guoqing 2014-09-29 08:53:40 UTC
Please note that the nodes are themselves VMs created in my OpenStack environment, whose hypervisor is also KVM. On the node (VM), "cat /proc/cpuinfo | grep vmx" produces output, so I use it as an oVirt node (nested virtualization).

Therefore, I am not sure whether this setup makes any difference to bug 1142647.

Thanks all!

Comment 6 Dan Kenigsberg 2014-10-08 21:06:43 UTC
Darshan, can you post this to the ovirt-3.5 branch? It's a nasty regression that I'd like to avoid.

Comment 7 Darshan 2014-10-09 11:20:07 UTC
(In reply to Dan Kenigsberg from comment #6)
> Darshan, can you post this to the ovirt-3.5 branch? It's a nasty regression
> that I'd like to avoid.

Done.

Comment 8 gabicr 2014-10-22 08:12:36 UTC
I am also using glusterfs.

After upgrading from 3.4.4 to 3.5.0, I can see on all my 3 nodes:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND


nod1 (SPM), running 0 VMs
 753 root           15  -5 17.208g 7.737g  10832 S   0.0       49.4   1:28.46 supervdsmServer

nod2, running 3 VMs
 641 root      15  -5 17.573g 7.888g  10768 S   0.0 33.5   1:17.09 supervdsmServer

nod3, running 2 VMs
 6391 root      15  -5 19.072g 8.646g  10844 S   9.3 44.1  38:17.05 supervdsmServer


So the supervdsmServer daemon alone occupies around 33-49% of memory!

Also, here is the output of "systemctl status supervdsmd":


supervdsmd.service - "Auxiliary vdsm service for running helper functions as root"
   Loaded: loaded (/usr/lib/systemd/system/supervdsmd.service; static)
   Active: active (running) since Tue 2014-10-21 11:32:40 EEST; 23h ago
 Main PID: 753 (supervdsmServer)
   CGroup: name=systemd:/system/supervdsmd.service
           └─753 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock

Oct 21 11:39:51 nod1 daemonAdapter[753]: Process Process-4:
Oct 21 11:39:51 nod1 daemonAdapter[753]: Traceback (most recent call last):
Oct 21 11:39:51 nod1 daemonAdapter[753]: File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Oct 21 11:39:51 nod1 daemonAdapter[753]: self.run()
Oct 21 11:39:51 nod1 daemonAdapter[753]: File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
Oct 21 11:39:51 nod1 daemonAdapter[753]: self._target(*self._args, **self._kwargs)
Oct 21 11:39:51 nod1 daemonAdapter[753]: File "/usr/share/vdsm/supervdsmServer", line 242, in child
Oct 21 11:39:51 nod1 daemonAdapter[753]: pipe.recv()
Oct 21 11:39:51 nod1 daemonAdapter[753]: IOError: [Errno 4] Interrupted system call
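
The traceback above is the classic EINTR case: a signal arriving while the child is blocked in pipe.recv() interrupts the system call. The usual guard is to retry on EINTR; the following is a minimal illustrative sketch, not the supervdsmServer code path.

import errno

def recv_retry(pipe):
    # Retry a blocking recv() that was interrupted by a signal. On Python 2,
    # multiprocessing raises IOError with errno EINTR in that case; any other
    # error is re-raised to the caller.
    while True:
        try:
            return pipe.recv()
        except IOError as e:
            if e.errno != errno.EINTR:
                raise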

Comment 9 Dan Kenigsberg 2014-10-22 09:38:38 UTC
Thanks for your report. This bug is destined to be hacked away in the ovirt-3.5.1 release.
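
For readers following the linked gerrit changes ("gluster: Temporary fix for supervdsm memory leak"): the general shape of such a workaround is to push the leak-prone call into a short-lived child process, so whatever memory it leaks is handed back to the kernel when the child exits. Below is a minimal sketch of that pattern as an illustration of the idea, not the merged patch itself.

import multiprocessing

def _child(func, queue, args, kwargs):
    # Runs in the child; any memory leaked here dies with the process.
    try:
        queue.put((True, func(*args, **kwargs)))
    except Exception as e:
        queue.put((False, str(e)))

def run_isolated(func, *args, **kwargs):
    # Run func in a throw-away process and hand back its result.
    # (Sketch only: assumes the child always puts a result on the queue.)
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_child,
                                   args=(func, queue, args, kwargs))
    proc.start()
    ok, result = queue.get()
    proc.join()
    if not ok:
        raise RuntimeError(result)
    return result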

Comment 10 Sandro Bonazzola 2015-01-15 14:25:39 UTC
This is an automated message: 
This bug should be fixed in oVirt 3.5.1 RC1, moving to QA

Comment 11 Sandro Bonazzola 2015-01-21 16:02:54 UTC
oVirt 3.5.1 has been released. If problems still persist, please make note of it in this bug report.

