Bug 871355

Summary: 3.1 - [vdsm] Zombie VDSM processes remain when Export NFS or ISO NFS domain is blocked
Product: Red Hat Enterprise Linux 6
Reporter: vvyazmin <vvyazmin>
Component: vdsm
Assignee: Saggi Mizrahi <smizrahi>
Status: CLOSED ERRATA
QA Contact: vvyazmin <vvyazmin>
Severity: urgent
Priority: unspecified
Version: 6.3
CC: abaron, bazulay, dyasny, hateya, iheim, ilvovsky, jbiddle, lpeer, Rhev-m-bugs, sgrinber, smizrahi, thildred, ybronhei, yeylon, ykaul
Target Milestone: rc
Keywords: Regression, ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard: infra
Fixed In Version: vdsm-4.9.6-41.0
Doc Type: Bug Fix
Doc Text:
Previously, blocking export NFS or ISO NFS domains caused zombie processes that would eventually overrun VDSM and crash the whole data center. This patch corrects the issue, allowing export domains to be blocked and defunct processes to be cleaned up automatically.
Last Closed: 2012-12-04 19:13:43 UTC
Type: Bug
Attachments:
## Logs vdsm, rhevm

Description vvyazmin@redhat.com 2012-10-30 10:15:16 UTC
Created attachment 635463 [details]
## Logs vdsm, rhevm

Description of problem: Zombie VDSM processes are created when the Export NFS or ISO NFS domain is blocked


Version-Release number of selected component (if applicable):
RHEVM 3.1 - SI22

RHEVM: rhevm-3.1.0-22.el6ev.noarch
VDSM: vdsm-4.9.6-39.0.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6_3.2.x86_64
SANLOCK: sanlock-2.3-4.el6_3.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an iSCSI DC with 2 hosts, one iSCSI SD, and an Export NFS domain
2. Block the Export NFS domain via iptables on both hosts to simulate an Export NFS domain disconnection (an example rule is sketched below)
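For example, a rule like the following on each host drops all traffic to the storage server backing the export domain (a sketch, not from the original report; <nfs-server-ip> is a placeholder for the actual NFS server address, and the rule is removed later with -D instead of -A):

iptables -A OUTPUT -d <nfs-server-ip> -j DROP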
  
Actual results:
After 15 hours there are 1025 vdsm processes in defunct status.
A new defunct VDSM process appears roughly every 2 minutes.
After about 3 days the DC crashes.
The hosts become CPU-overloaded.
Defunct VDSM processes are not cleaned up by restarting the VDSMD service (see the diagnostic commands below).
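As a diagnostic (not part of the original report), the defunct processes and the parent that fails to reap them can be located with standard ps/awk. Zombies ignore all signals, including kill -9, and disappear only once their parent wait()s on them or the parent itself exits and init reaps them, which is why killing them directly has no effect:

ps -eo pid,ppid,stat,user,comm | awk '$3 ~ /^Z/ && $4 == "vdsm"'   # list zombies with their parent PID
ps -eo stat,user | awk '$1 ~ /^Z/ && $2 == "vdsm"' | wc -l         # count them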


Expected results:
If an iSCSI or FC DC is configured with an Export NFS or ISO NFS domain, a disconnection of that NFS domain should have minimal impact on the rest of the system.
If the system accumulates a large number of defunct processes, it should kill / clear all of them.
The system should continue to function normally even while the Export NFS or ISO NFS domain is disconnected.
Events or warnings should appear in the UI (a verification one-liner is sketched below).
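One way to check this expectation on a fixed build (a sketch, not from the original report) is to watch the defunct count stay flat while the domain remains blocked:

while true; do date; ps -eo stat,user | awk '$1 ~ /^Z/ && $2 == "vdsm"' | wc -l; sleep 120; done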

Workaround:
Reboot all hosts in the DC

Additional info:

Attached logs were gathered with the following commands:
mount
ps aux | grep vdsm
ps -elf | grep vdsm

[root@cougar08 ~]# date
Tue Oct 30 11:08:17 IST 2012
[root@cougar08 ~]# ps aux | grep vdsm   | wc -l
1027
[root@cougar08 ~]# date
Tue Oct 30 11:09:41 IST 2012
[root@cougar08 ~]# ps aux | grep vdsm   | wc -l
1028

[root@cougar08 ~]# ps aux | grep vdsm
vdsm       346  0.0  0.0      0     0 ?        Z<   Oct29   0:00 [python] <defunct>
vdsm       381  0.0  0.0      0     0 ?        Z<   01:27   0:00 [python] <defunct>
vdsm       408  0.0  0.0      0     0 ?        Z<   07:07   0:00 [python] <defunct>
vdsm       415  0.0  0.0      0     0 ?        Z<   07:07   0:00 [python] <defunct>
vdsm       438  0.0  0.0      0     0 ?        Z<   Oct29   0:00 [python] <defunct>
vdsm       467  0.0  0.0      0     0 ?        Z<   01:28   0:00 [python] <defunct>
vdsm       516  0.0  0.0      0     0 ?        Z<   07:08   0:00 [python] <defunct>
vdsm       544  0.0  0.0      0     0 ?        Z<   Oct29   0:00 [python] <defunct>
vdsm       578  0.0  0.0      0     0 ?        Z<   01:29   0:00 [python] <defunct>
vdsm       604  0.0  0.0      0     0 ?        Z<   07:10   0:00 [python] <defunct>
vdsm       638  0.0  0.0      0     0 ?        Z<   Oct29   0:00 [python] <defunct>
vdsm       675  0.0  0.0      0     0 ?        Z<   01:30   0:00 [python] <defunct>
vdsm       714  0.0  0.0      0     0 ?        Z<   07:11   0:00 [python] <defunct>
vdsm       734  0.0  0.0      0     0 ?        Z<   Oct29   0:00 [python] <defunct>
vdsm       765  0.0  0.0      0     0 ?        Z<   01:31   0:00 [python] <defunct>
vdsm       809  0.0  0.0      0     0 ?        Z<   07:12   0:00 [python] <defunct>
vdsm       820  0.0  0.0      0     0 ?        Z<   Oct29   0:00 [python] <defunct>
vdsm       861  0.0  0.0      0     0 ?        Z<   01:32   0:00 [python] <defunct>
vdsm       904  0.0  0.0      0     0 ?        Z<   07:13   0:00 [python] <defunct>
vdsm       910  0.0  0.0      0     0 ?        Z<   Oct29   0:00 [python] <defunct>
vdsm       947  0.0  0.0      0     0 ?        Z<   01:33   0:00 [python] <defunct>
vdsm      1005  0.0  0.0      0     0 ?        Z<   Oct29   0:00 [python] <defunct>
vdsm      1047  0.0  0.0      0     0 ?        Z<   01:34   0:00 [python] <defunct>
vdsm      1090  0.0  0.0      0     0 ?        Z<   07:15   0:00 [python] <defunct>
vdsm      1097  0.0  0.0      0     0 ?        Z<   Oct29   0:00 [python] <defunct>
vdsm      1099  0.0  0.0      0     0 ?        Z<   07:15   0:00 [python] <defunct>
vdsm      1142  0.0  0.0      0     0 ?        Z<   01:35   0:00 [python] <defunct>
vdsm      1203  0.0  0.0      0     0 ?        Z<   07:16   0:00 [python] <defunct>
vdsm      1214  0.0  0.0      0     0 ?        Z<   Oct29   0:00 [python] <defunct>
vdsm      1255  0.0  0.0      0     0 ?        Z<   01:36   0:00 [python] <defunct>
vdsm      1299  0.0  0.0      0     0 ?        Z<   07:17   0:00 [python] <defunct>


Thread-107827::WARNING::2012-10-30 04:01:50,525::remoteFileHandler::185::Storage.CrabRPCProxy::(callCrabRPCFunction) Problem with handler, treating as timeout
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 177, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 143, in _recvAll
    raise Timeout()
Timeout
Thread-79712::ERROR::2012-10-30 04:01:50,527::domainMonitor::208::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 27fedd2e-d04e-4a16-a9f7-714f2931e6d3 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 186, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 121, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 152, in _findDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'27fedd2e-d04e-4a16-a9f7-714f2931e6d3',)
Thread-107837::DEBUG::2012-10-30 04:01:50,527::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-107837::DEBUG::2012-10-30 04:01:50,528::__init__::1164::Storage.Misc.excCmd::(_log) u'/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \\"a%3514f0c5610000080|3514f0c5610000081|3514f0c5610000082|3514f0c5610000083|3514f0c5610000084|3514f0c5610000087|3514f0c5610000088%\\", \\"r%.*%\\" ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free 27fedd2e-d04e-4a16-a9f7-714f2931e6d3' (cwd None)

Comment 2 vvyazmin@redhat.com 2012-10-30 14:52:08 UTC
Yes, it's a regression. I ran the same scenario on RHEVM 3.0 - IC158.2, and no problems were found.

Comment 5 Saggi Mizrahi 2012-10-30 16:56:43 UTC
http://gerrit.ovirt.org/#/c/8907/

Comment 8 vvyazmin@redhat.com 2012-11-06 10:14:45 UTC
Verified on RHEVM 3.1 - SI24

RHEVM: rhevm-3.1.0-26.el6ev.noarch
VDSM: vdsm-4.9.6-41.0.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6_3.4.x86_64
SANLOCK: sanlock-2.3-4.el6_3.x86_64

Comment 10 errata-xmlrpc 2012-12-04 19:13:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html