Bug 1142647

Summary: supervdsm leaks memory when using glusterfs
Product: [Retired] oVirt
Component: vdsm
Reporter: zhang guoqing <zhangguoqingas>
Assignee: Darshan <dnarayan>
QA Contact: Gil Klein <gklein>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Version: 3.5
Target Release: 3.5.1
Fixed In Version: ovirt-3.5.1_rc1
Hardware: x86_64
OS: Linux
Keywords: Regression
Whiteboard: gluster
oVirt Team: Gluster
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-01-21 16:02:54 UTC
CC: amureini, bazulay, bugs, danken, dnarayan, ecohen, gabicr, gklein, iheim, mgoldboi, mmorgan, rbalakri, sabose, s.kieske, tjeyasin, yeylon, zhangguoqingas
Bug Depends On: 1093594
Attachments:
vdsm-logs and some pngs

Description zhang guoqing 2014-09-17 07:15:36 UTC
Description of problem:
I have deployed an oVirt test bed. In a cluster with a GlusterFS Storage Domain, the supervdsmServer daemon on every vdsm node appears to leak memory; a cluster using an NFS Storage Domain behaves normally. Please see "How reproducible" for details.

Version-Release number of selected component (if applicable):
[root@node-01 ~]# rpm -qa | grep vdsm
vdsm-python-zombiereaper-4.16.4-0.el6.noarch
vdsm-xmlrpc-4.16.4-0.el6.noarch
vdsm-jsonrpc-4.16.4-0.el6.noarch
vdsm-4.16.4-0.el6.x86_64
vdsm-cli-4.16.4-0.el6.noarch
vdsm-python-4.16.4-0.el6.noarch
vdsm-yajsonrpc-4.16.4-0.el6.noarch
vdsm-gluster-4.16.4-0.el6.noarch

[root@node-01 ~]# rpm -qa | grep gluster
glusterfs-cli-3.5.2-1.el6.x86_64
glusterfs-libs-3.5.2-1.el6.x86_64
glusterfs-3.5.2-1.el6.x86_64
glusterfs-rdma-3.5.2-1.el6.x86_64
glusterfs-server-3.5.2-1.el6.x86_64
glusterfs-api-3.5.2-1.el6.x86_64
glusterfs-fuse-3.5.2-1.el6.x86_64
vdsm-gluster-4.16.4-0.el6.noarch

[root@node-01 ~]# rpm -qa | grep ioprocess
ioprocess-0.12.0-2.el6.x86_64
python-ioprocess-0.12.0-2.el6.noarch
Bugs already ruled out:
https://bugzilla.redhat.com/show_bug.cgi?id=1130045
https://bugzilla.redhat.com/show_bug.cgi?id=1124369


How reproducible:

Steps to Reproduce:
1. Create the data center's cluster with "Enable Gluster Service" checked.
2. Add two nodes through the ovirt-engine dashboard.
3. Create two Storage Domains: a Data (Master) domain of type GlusterFS and an ISO domain of type POSIX compliant FS.
4. Create some VMs.
5. Wait a few minutes and watch the memory of the supervdsmServer daemon on any node with the top command (or with the monitoring sketch below).
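For reference, here is a minimal sketch of how the memory growth can be sampled outside of top. This is a hypothetical helper script, not part of vdsm; it assumes a single supervdsmServer process and reads its RSS from /proc:

    #!/usr/bin/env python
    # Hypothetical monitoring helper (not shipped with vdsm): logs the
    # resident set size of supervdsmServer once a minute so the growth
    # described in this bug can be charted over time.
    import subprocess
    import time

    def supervdsm_pid():
        # Assumes exactly one supervdsmServer process is running.
        out = subprocess.check_output(['pgrep', '-f', 'supervdsmServer'])
        return int(out.split()[0])

    def rss_kb(pid):
        # The VmRSS line of /proc/<pid>/status holds resident memory in kB.
        with open('/proc/%d/status' % pid) as status:
            for line in status:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])
        return -1

    if __name__ == '__main__':
        pid = supervdsm_pid()
        while True:
            print('%s rss_kb=%d' % (time.strftime('%F %T'), rss_kb(pid)))
            time.sleep(60)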

Actual results:
As time goes on, the supervdsmServer daemon occupies more and more system memory until nothing more can be allocated. As a result, the node's status changes to "NonOperational" in the ovirt-engine Web UI, and I cannot do anything useful on the cluster.

It is also worth noting that if I restart the vdsm and supervdsmServer daemons while a host is "NonOperational", the host runs normally again, but the problem recurs as time goes on.

Finally, if the cluster has no GlusterFS Storage Domain, everything behaves normally.

Expected results:


Additional info:
I cannot find any helpful information in vdsm.log or supervdsm.log.

Comment 1 Dan Kenigsberg 2014-09-17 09:35:18 UTC
Could you attach supervdsm.log anyway? Do you spot anything different in the log, relative to the cluster that has no glusterFS?

Comment 2 zhang guoqing 2014-09-18 01:01:40 UTC
Created attachment 938700 [details]
vdsm-logs and some pngs

Thanks, first of all!
I really cannot find a helpful log, so I have attached all the nodes' vdsm logs here, plus some PNGs that may help analyze bug 1142647.

Comment 3 Dan Kenigsberg 2014-09-18 08:54:12 UTC
I see that supervdsm is asked to call

   /usr/sbin/gluster --mode=script volume info --xml

every 5 seconds. Is this expected?


Also (and unrelated to the leak),

 MainProcess|Thread-51::DEBUG::2014-09-17 10:16:59,274::supervdsmServer::101::SuperVdsm.ServerCallback::(wrapper) call wrapper with (None,) {}

does not show the called function name, only "wrapper".
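For what it's worth, that symptom usually comes from a logging decorator that does not copy the wrapped function's metadata. An illustrative fragment (not vdsm's actual code) of the difference functools.wraps makes:

    # Illustrative only, not vdsm's code: the server-side log prints the
    # callable's __name__; if the registered callable is a plain wrapper,
    # that name is "wrapper". functools.wraps copies the original name
    # (and docstring) onto the wrapper so the log shows the real function.
    import functools
    import logging

    def exported(func):
        @functools.wraps(func)            # without this, __name__ == 'wrapper'
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
        return wrapper

    def call(registered, *args, **kwargs):
        # Mimics the kind of log line quoted above.
        logging.debug('call %s with %s %s', registered.__name__, args, kwargs)
        return registered(*args, **kwargs)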

Comment 4 Darshan 2014-09-19 05:13:00 UTC
supervdsm calling "/usr/sbin/gluster --mode=script volume info --xml"
every 5 sec is expected behaviour.
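For context, a rough sketch of the kind of polling described (this is not vdsm's implementation): the CLI is run as a short-lived child every 5 seconds and its XML output is parsed, so the call itself, with the child properly waited on, should not accumulate memory in the parent.

    # Rough sketch of periodic 'gluster volume info --xml' polling; this is
    # not vdsm's code. check_output() waits for the child, so each 5-second
    # cycle starts from a clean slate in the parent process.
    import subprocess
    import time
    import xml.etree.ElementTree as ET

    CMD = ['/usr/sbin/gluster', '--mode=script', 'volume', 'info', '--xml']

    def volume_names():
        out = subprocess.check_output(CMD)
        tree = ET.fromstring(out)
        # Path assumes the usual <volInfo><volumes><volume><name> layout.
        return [e.text for e in tree.findall('.//volumes/volume/name')]

    if __name__ == '__main__':
        while True:
            print(volume_names())
            time.sleep(5)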

Comment 5 zhang guoqing 2014-09-29 08:53:40 UTC
Please note that the nodes are VMs created in my OpenStack environment, where the hypervisor is also KVM. "cat /proc/cpuinfo | grep vmx" on these node VMs returns output, so I use them as oVirt nodes.

Therefore, I am not sure whether this setup makes any difference to bug 1142647.

Thanks all!

Comment 6 Dan Kenigsberg 2014-10-08 21:06:43 UTC
Darshan, can you post this to the ovirt-3.5 branch? It's a nasty regression that I'd like to avoid.

Comment 7 Darshan 2014-10-09 11:20:07 UTC
(In reply to Dan Kenigsberg from comment #6)
> Darshan, can you post this to the ovirt-3.5 branch? It's a nasty regression
> that I'd like to avoid.

Done.

Comment 8 gabicr 2014-10-22 08:12:36 UTC
I also have glusterfs.

After upgrading from 3.4.4 to 3.5.0 I can see the following on all my 3 nodes:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND


nod1 SPM  running  0 VM
 753 root           15  -5 17.208g 7.737g  10832 S   0.0       49.4   1:28.46 supervdsmServer

nod2  running 3 VM
 641 root      15  -5 17.573g 7.888g  10768 S   0.0 33.5   1:17.09 supervdsmServer

nod3  running 2 VM
 6391 root      15  -5 19.072g 8.646g  10844 S   9.3 44.1  38:17.05 supervdsmServer


So the supervdsm server alone occupies around 33-49% of memory!

I also get the following from "systemctl status supervdsmd":


supervdsmd.service - "Auxiliary vdsm service for running helper functions as root"
   Loaded: loaded (/usr/lib/systemd/system/supervdsmd.service; static)
   Active: active (running) since Tue 2014-10-21 11:32:40 EEST; 23h ago
 Main PID: 753 (supervdsmServer)
   CGroup: name=systemd:/system/supervdsmd.service
           └─753 /usr/bin/python /usr/share/vdsm/supervdsmServer --sockfile /var/run/vdsm/svdsm.sock

Oct 21 11:39:51 nod1 daemonAdapter[753]: Process Process-4:
Oct 21 11:39:51 nod1 daemonAdapter[753]: Traceback (most recent call last):
Oct 21 11:39:51 nod1 daemonAdapter[753]: File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Oct 21 11:39:51 nod1 daemonAdapter[753]: self.run()
Oct 21 11:39:51 nod1 daemonAdapter[753]: File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
Oct 21 11:39:51 nod1 daemonAdapter[753]: self._target(*self._args, **self._kwargs)
Oct 21 11:39:51 nod1 daemonAdapter[753]: File "/usr/share/vdsm/supervdsmServer", line 242, in child
Oct 21 11:39:51 nod1 daemonAdapter[753]: pipe.recv()
Oct 21 11:39:51 nod1 daemonAdapter[753]: IOError: [Errno 4] Interrupted system call
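Side note on that traceback: under Python 2, a signal arriving while the child process is blocked in pipe.recv() surfaces as IOError with errno EINTR. A sketch of the usual retry pattern, not necessarily what vdsm does:

    # Illustrative EINTR handling, not vdsm's actual fix: retry recv() when
    # it is interrupted by a signal instead of letting the child loop die.
    import errno

    def recv_retry(pipe):
        while True:
            try:
                return pipe.recv()
            except IOError as e:
                if e.errno != errno.EINTR:
                    raise
                # Interrupted by a signal; retry the recv().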

Comment 9 Dan Kenigsberg 2014-10-22 09:38:38 UTC
Thanks for your report. This bug is slated to be fixed in the oVirt 3.5.1 release.

Comment 10 Sandro Bonazzola 2015-01-15 14:25:39 UTC
This is an automated message: 
This bug should be fixed in oVirt 3.5.1 RC1, moving to QA

Comment 11 Sandro Bonazzola 2015-01-21 16:02:54 UTC
oVirt 3.5.1 has been released. If problems still persist, please make note of it in this bug report.