Previously, blocking the export NFS or ISO NFS domain left zombie (defunct) VDSM processes behind that would accumulate, eventually overloading the host and crashing the whole data center. This patch corrects the issue: when an export or ISO domain is blocked, the defunct processes are now cleaned up automatically.
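A defunct entry stays in the process table until its parent collects the exit status with a wait() call. Below is a minimal illustrative sketch of that general reaping pattern; it is not the actual VDSM patch, and reap_children is a hypothetical helper name:

import errno
import os

def reap_children():
    """Collect the exit status of any finished child processes so they
    do not linger in the process table as <defunct> entries."""
    while True:
        try:
            # Non-blocking wait for any child (-1) that has already exited.
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except OSError as e:
            if e.errno == errno.ECHILD:  # no children left to reap
                return
            raise
        if pid == 0:  # children exist, but none have exited yet
            return

Calling a helper like this periodically, or from a SIGCHLD handler, is the general mechanism that prevents <defunct> entries from piling up.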
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html
Created attachment 635463 [details]
Logs vdsm, rhevm

Description of problem:
Zombie VDSM processes are born when the Export NFS or ISO NFS domain is blocked.

Version-Release number of selected component (if applicable):
RHEVM 3.1 - SI22
RHEVM: rhevm-3.1.0-22.el6ev.noarch
VDSM: vdsm-4.9.6-39.0.el6_3.x86_64
LIBVIRT: libvirt-0.9.10-21.el6_3.5.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.295.el6_3.2.x86_64
SANLOCK: sanlock-2.3-4.el6_3.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an iSCSI DC with 2 hosts, one iSCSI SD, and an Export NFS domain.
2. Block the Export NFS domain via iptables (on both hosts) to simulate disconnection of the Export NFS domain (see the sketch after the log excerpt below).

Actual results:
After 15 hours there are 1025 VDSM processes in defunct status.
Every 2 minutes a new defunct VDSM process is born.
After 3 days the DC crashes.
The host goes into CPU overload.
Defunct VDSM processes are not cleaned up by restarting the vdsmd service.

Expected results:
If an iSCSI or FC DC is configured with an Export NFS or ISO NFS domain, a disconnection of that NFS domain should have minimal influence on the rest of the system. If the system accumulates a huge number of defunct processes, it should kill/clear all of them. The system should continue to function normally even though the Export NFS or ISO NFS domain is disconnected, and events or warnings should appear in the UI.

Workaround:
Reboot all hosts in the DC.

Additional info:
Logs attached, together with the output of the following commands:
mount
ps aux | grep vdsm
ps -elf | grep vdsm

[root@cougar08 ~]# date
Tue Oct 30 11:08:17 IST 2012
[root@cougar08 ~]# ps aux | grep vdsm | wc -l
1027
[root@cougar08 ~]# date
Tue Oct 30 11:09:41 IST 2012
[root@cougar08 ~]# ps aux | grep vdsm | wc -l
1028
[root@cougar08 ~]# ps aux | grep vdsm
vdsm 346 0.0 0.0 0 0 ? Z< Oct29 0:00 [python] <defunct>
vdsm 381 0.0 0.0 0 0 ? Z< 01:27 0:00 [python] <defunct>
vdsm 408 0.0 0.0 0 0 ? Z< 07:07 0:00 [python] <defunct>
vdsm 415 0.0 0.0 0 0 ? Z< 07:07 0:00 [python] <defunct>
vdsm 438 0.0 0.0 0 0 ? Z< Oct29 0:00 [python] <defunct>
vdsm 467 0.0 0.0 0 0 ? Z< 01:28 0:00 [python] <defunct>
vdsm 516 0.0 0.0 0 0 ? Z< 07:08 0:00 [python] <defunct>
vdsm 544 0.0 0.0 0 0 ? Z< Oct29 0:00 [python] <defunct>
vdsm 578 0.0 0.0 0 0 ? Z< 01:29 0:00 [python] <defunct>
vdsm 604 0.0 0.0 0 0 ? Z< 07:10 0:00 [python] <defunct>
vdsm 638 0.0 0.0 0 0 ? Z< Oct29 0:00 [python] <defunct>
vdsm 675 0.0 0.0 0 0 ? Z< 01:30 0:00 [python] <defunct>
vdsm 714 0.0 0.0 0 0 ? Z< 07:11 0:00 [python] <defunct>
vdsm 734 0.0 0.0 0 0 ? Z< Oct29 0:00 [python] <defunct>
vdsm 765 0.0 0.0 0 0 ? Z< 01:31 0:00 [python] <defunct>
vdsm 809 0.0 0.0 0 0 ? Z< 07:12 0:00 [python] <defunct>
vdsm 820 0.0 0.0 0 0 ? Z< Oct29 0:00 [python] <defunct>
vdsm 861 0.0 0.0 0 0 ? Z< 01:32 0:00 [python] <defunct>
vdsm 904 0.0 0.0 0 0 ? Z< 07:13 0:00 [python] <defunct>
vdsm 910 0.0 0.0 0 0 ? Z< Oct29 0:00 [python] <defunct>
vdsm 947 0.0 0.0 0 0 ? Z< 01:33 0:00 [python] <defunct>
vdsm 1005 0.0 0.0 0 0 ? Z< Oct29 0:00 [python] <defunct>
vdsm 1047 0.0 0.0 0 0 ? Z< 01:34 0:00 [python] <defunct>
vdsm 1090 0.0 0.0 0 0 ? Z< 07:15 0:00 [python] <defunct>
vdsm 1097 0.0 0.0 0 0 ? Z< Oct29 0:00 [python] <defunct>
vdsm 1099 0.0 0.0 0 0 ? Z< 07:15 0:00 [python] <defunct>
vdsm 1142 0.0 0.0 0 0 ? Z< 01:35 0:00 [python] <defunct>
vdsm 1203 0.0 0.0 0 0 ? Z< 07:16 0:00 [python] <defunct>
vdsm 1214 0.0 0.0 0 0 ? Z< Oct29 0:00 [python] <defunct>
vdsm 1255 0.0 0.0 0 0 ? Z< 01:36 0:00 [python] <defunct>
vdsm 1299 0.0 0.0 0 0 ? Z< 07:17 0:00 [python] <defunct>

Excerpt from vdsm.log:

Thread-107827::WARNING::2012-10-30 04:01:50,525::remoteFileHandler::185::Storage.CrabRPCProxy::(callCrabRPCFunction) Problem with handler, treating as timeout
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 177, in callCrabRPCFunction
    rawLength = self._recvAll(LENGTH_STRUCT_LENGTH, timeout)
  File "/usr/share/vdsm/storage/remoteFileHandler.py", line 143, in _recvAll
    raise Timeout()
Timeout
Thread-79712::ERROR::2012-10-30 04:01:50,527::domainMonitor::208::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 27fedd2e-d04e-4a16-a9f7-714f2931e6d3 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 186, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 121, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 152, in _findDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'27fedd2e-d04e-4a16-a9f7-714f2931e6d3',)
Thread-107837::DEBUG::2012-10-30 04:01:50,527::lvm::352::OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-107837::DEBUG::2012-10-30 04:01:50,528::__init__::1164::Storage.Misc.excCmd::(_log) u'/usr/bin/sudo -n /sbin/lvm vgs --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \\"a%3514f0c5610000080|3514f0c5610000081|3514f0c5610000082|3514f0c5610000083|3514f0c5610000084|3514f0c5610000087|3514f0c5610000088%\\", \\"r%.*%\\" ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 } backup { retain_min = 50 retain_days = 0 } " --noheadings --units b --nosuffix --separator | -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free 27fedd2e-d04e-4a16-a9f7-714f2931e6d3' (cwd None)
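For step 2 in the reproduction above, the Export NFS outage can be simulated by dropping traffic to the NFS server on every host. A minimal sketch using iptables via Python's subprocess module; the address 192.0.2.10 is a placeholder and is not taken from the attached logs:

import subprocess

NFS_SERVER = "192.0.2.10"  # placeholder: address of the Export NFS server

def block_export_nfs():
    # Drop all outgoing traffic to the NFS server (run on every host in the DC).
    subprocess.check_call(
        ["iptables", "-I", "OUTPUT", "-d", NFS_SERVER, "-j", "DROP"])

def unblock_export_nfs():
    # Remove the rule to restore connectivity to the Export NFS domain.
    subprocess.check_call(
        ["iptables", "-D", "OUTPUT", "-d", NFS_SERVER, "-j", "DROP"])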