Bug 871355 - 3.1 - [vdsm] Zombie VDSM processes remain when Export NFS or ISO NFS domain is blocked
Summary: 3.1 - [vdsm] Zombie VDSM processes remain when Export NFS or ISO NFS domain is blocked
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: vdsm
Version: 6.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Assignee: Saggi Mizrahi
QA Contact: vvyazmin@redhat.com
URL:
Whiteboard: infra
Depends On:
Blocks:
 
Reported: 2012-10-30 10:15 UTC by vvyazmin@redhat.com
Modified: 2022-07-09 05:40 UTC
CC List: 15 users

Fixed In Version: vdsm-4.9.6-41.0
Doc Type: Bug Fix
Doc Text:
Previously, blocking export NFS or ISO NFS domains caused zombie processes that would eventually overrun VDSM and crash the whole data center. This patch corrects the issue, allowing export domains to be blocked and defunct processes to be cleaned up automatically.
Clone Of:
Environment:
Last Closed: 2012-12-04 19:13:43 UTC
Target Upstream Version:
Embargoed:


Attachments
## Logs vdsm, rhevm (3.04 MB, application/x-gzip)
2012-10-30 10:15 UTC, vvyazmin@redhat.com


Links
Red Hat Product Errata RHSA-2012:1508 (priority normal, status SHIPPED_LIVE): Important: rhev-3.1.0 vdsm security, bug fix, and enhancement update. Last updated 2012-12-04 23:48:05 UTC.

Description vvyazmin@redhat.com 2012-10-30 10:15:16 UTC
Created attachment 635463
## Logs vdsm, rhevm

Description of problem: Zombie VDSM processes appear when the Export NFS or ISO NFS domain is blocked


Version-Release number of selected component (if applicable):
RHEVM 3.1 - SI22

RHEVM: rhevm-3.1.0-22.el6ev.noarch
VDSM: vdsm-4.9.6-39.0.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6_3.2.x86_64
SANLOCK: sanlock-2.3-4.el6_3.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an iSCSI DC with 2 hosts, one iSCSI SD, and an Export NFS domain
2. Block the Export NFS domain via iptables (on both hosts) to simulate an Export NFS domain disconnection (a sketch of the blocking commands follows below)
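A minimal sketch of step 2 (not part of the original report); <export-nfs-server-ip> is a placeholder for the actual Export/ISO NFS server address and must be substituted on both hosts:

# Block all outgoing traffic to the NFS server to simulate the domain disconnection.
iptables -A OUTPUT -d <export-nfs-server-ip> -j DROP
# Remove the rule afterwards to restore connectivity.
iptables -D OUTPUT -d <export-nfs-server-ip> -j DROP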
  
Actual results:
After 15 hours there are 1025 vdsm processes in defunct status
Every 2 minutes a new VDSM process in defunct status appears
After 3 days the DC crashes
The host's CPU becomes overloaded
VDSM processes in defunct status are not cleaned up by restarting the vdsmd service (a diagnostic sketch follows below)
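For reference, a diagnostic sketch (not taken from the attached logs): listing the defunct processes together with their parent PIDs shows whether they are unreaped children of the running VDSM process:

[root@cougar08 ~]# ps -eo pid,ppid,stat,user,comm | awk '$3 ~ /Z/ && $4 == "vdsm"'

If the PPID column points at the live vdsm parent, the children have exited but were never waited on, which matches the defunct entries in the ps output further below.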


Expected results:
If an iSCSI or FC DC is configured with an Export NFS or ISO NFS domain, a disconnection of that NFS domain should have minimal influence on the rest of the system.
If the system accumulates a huge number of processes in defunct status, it should kill / clear all of those processes.
The system should continue to function normally even though the Export NFS or ISO NFS domain is disconnected.
Events or warnings should be shown in the UI.

Workaround:
Reboot all hosts in the DC

Additional info:

Logs attached; collected with the following commands:
mount
ps aux | grep vdsm
ps -elf | grep vdsm

[root@cougar08 ~]# date
Tue Oct 30 11:08:17 IST 2012
[root@cougar08 ~]# ps aux | grep vdsm   | wc -l
1027
[root@cougar08 ~]# date
Tue Oct 30 11:09:41 IST 2012
[root@cougar08 ~]# ps aux | grep vdsm   | wc -l
1028

[root@cougar08 ~]# ps aux | grep vdsm
vdsm       346  0.0  0.0      0     0 ?        Z<   Oct29   0:00 [python] <defunct>
vdsm       381  0.0  0.0      0     0 ?        Z<   01:27   0:00 [python] <defunct>
vdsm       408  0.0  0.0      0     0 ?        Z<   07:07   0:00 [python] <defunct>
vdsm       415  0.0  0.0      0     0 ?        Z<   07:07   0:00 [python] <defunct>
vdsm       438  0.0  0.0      0     0 ?        Z<   Oct29   0:00 [python] <defunct>
vdsm       467  0.0  0.0      0     0 ?        Z<   01:28   0:00 [python] <defunct>
vdsm       516  0.0  0.0      0     0 ?        Z<   07:08   0:00 [python] <defunct>
vdsm       544  0.0  0.0      0     0 ?        Z<   Oct29   0:00 [python] <defunct>
vdsm       578  0.0  0.0      0     0 ?        Z<   01:29   0:00 [python] <defunct>
vdsm       604  0.0  0.0      0     0 ?        Z<   07:10   0:00 [python] <defunct>
vdsm       638  0.0  0.0      0     0 ?        Z<   Oct29   0:00 [python] <defunct>
vdsm       675  0.0  0.0      0     0 ?        Z<   01:30   0:00 [python] <defunct>
vdsm       714  0.0  0.0      0     0 ?        Z<   07:11   0:00 [python] <defunct>
vdsm       734  0.0  0.0      0     0 ?        Z<   Oct29   0:00 [python] <defunct>
vdsm       765  0.0  0.0      0     0 ?        Z<   01:31   0:00 [python] <defunct>
vdsm       809  0.0  0.0      0     0 ?        Z<   07:12   0:00 [python] <defunct>
vdsm       820  0.0  0.0      0     0 ?        Z<   Oct29   0:00 [python] <defunct>
vdsm       861  0.0  0.0      0     0 ?        Z<   01:32   0:00 [python] <defunct>
vdsm       904  0.0  0.0      0     0 ?        Z<   07:13   0:00 [python] <defunct>
vdsm       910  0.0  0.0      0     0 ?        Z<   Oct29   0:00 [python] <defunct>
vdsm       947  0.0  0.0      0     0 ?        Z<   01:33   0:00 [python] <defunct>
vdsm      1005  0.0  0.0      0     0 ?        Z<   Oct29   0:00 [python] <defunct>
vdsm      1047  0.0  0.0      0     0 ?        Z<   01:34   0:00 [python] <defunct>
vdsm      1090  0.0  0.0      0     0 ?        Z<   07:15   0:00 [python] <defunct>
vdsm      1097  0.0  0.0      0     0 ?        Z<   Oct29   0:00 [python] <defunct>
vdsm      1099  0.0  0.0      0     0 ?        Z<   07:15   0:00 [python] <defunct>
vdsm      1142  0.0  0.0      0     0 ?        Z<   01:35   0:00 [python] <defunct>
vdsm      1203  0.0  0.0      0     0 ?        Z<   07:16   0:00 [python] <defunct>
vdsm      1214  0.0  0.0      0     0 ?        Z<   Oct29   0:00 [python] <defunct>
vdsm      1255  0.0  0.0      0     0 ?        Z<   01:36   0:00 [python] <defunct>
vdsm      1299  0.0  0.0      0     0 ?        Z<   07:17   0:00 [python] <defunct>


Thread-107827::WARNING::2012-10-30 04:01:50,525::remoteFileHandler::185::Storage.CrabRPCProxy::(callCrabRPCFunction) Problem with handler, treating as timeout
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 177, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 143, in _recvAll
    raise Timeout()
Timeout
Thread-79712::ERROR::2012-10-30 04:01:50,527::domainMonitor::208::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 27fedd2e-d04e-4a16-a9f7-714f2931e6d3 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 186, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 121, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 152, in _findDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'27fedd2e-d04e-4a16-a9f7-714f2931e6d3',)
Thread-107837::DEBUG::2012-10-30 04:01:50,527::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-107837::DEBUG::2012-10-30 04:01:50,528::__init__::1164::Storage.Misc.excCmd::(_log) u'/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \\"a%3514f0c5610000080|3514f0c5610000081|3514f0c5610000082|3514f0c5610000083|3514f0c5610000084|3514f0c5610000087|3514f0c5610000088%\\", \\"r%.*%\\" ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free 27fedd2e-d04e-4a16-a9f7-714f2931e6d3' (cwd None)

Comment 2 vvyazmin@redhat.com 2012-10-30 14:52:08 UTC
Yes, it's a regression. I ran the same scenario on RHEVM 3.0 - IC158.2, and no problems were found.

Comment 5 Saggi Mizrahi 2012-10-30 16:56:43 UTC
http://gerrit.ovirt.org/#/c/8907/

Comment 8 vvyazmin@redhat.com 2012-11-06 10:14:45 UTC
Verified on RHEVM 3.1 - SI24

RHEVM: rhevm-3.1.0-26.el6ev.noarch
VDSM: vdsm-4.9.6-41.0.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6_3.4.x86_64
SANLOCK: sanlock-2.3-4.el6_3.x86_64
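A sketch of one way to double-check during verification (assuming the same iptables blocking scenario is repeated): count the defunct vdsm-owned processes over time and confirm the number stays flat instead of growing every couple of minutes:

[root@cougar08 ~]# ps -eo stat,user | awk '$1 ~ /Z/ && $2 == "vdsm"' | wc -l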

Comment 10 errata-xmlrpc 2012-12-04 19:13:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html

