Created attachment 708891 [details]
vdsm log

Description of problem:

RHEVM - sf9
vdsm-4.10.2-10.0.el6ev.x86_64
libvirt-0.10.2-18.el6.x86_64
sanlock-2.6-2.el6.x86_64

[rhevm][scale] Attaching a large number of NFS storage domains fails.

Scale environment testing the maximum number of NFS storage domains.
We created 100 NFS storage domains; creation was successful.
The attach operation failed after 67 domains were successfully attached.

How reproducible:
100%

Steps to Reproduce:
1. Create 100 NFS storage domains
2. Attach them

Actual results:
Attach SD operation failed. Not all 100 domains were attached.

Expected results:
All 100 domains successfully attached.

Additional info:
vdsm.log attached.

Thread-121754::ERROR::2013-03-11 01:01:01,695::sdc::150::Storage.StorageDomainCache::(_findDomain) Error while looking for domain `0c69c276-c7c1-4ab9-9fce-e7b57fbce2d5`
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 145, in _findDomain
    return mod.findDomain(sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 1225, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/blockSD.py", line 1195, in findDomainPath
    vg = lvm.getVG(sdUUID)
  File "/usr/share/vdsm/storage/lvm.py", line 815, in getVG
    vg = _lvminfo.getVg(vgName)  # returns single VG namedtuple
  File "/usr/share/vdsm/storage/lvm.py", line 547, in getVg
    vgs = self._reloadvgs(vgName)
  File "/usr/share/vdsm/storage/lvm.py", line 402, in _reloadvgs
    self._vgs.pop((staleName), None)
  File "/usr/lib64/python2.6/contextlib.py", line 34, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/share/vdsm/storage/misc.py", line 1204, in acquireContext
    yield self
  File "/usr/share/vdsm/storage/lvm.py", line 374, in _reloadvgs
    rc, out, err = self.cmd(cmd)
  File "/usr/share/vdsm/storage/lvm.py", line 310, in cmd
    rc, out, err = misc.execCmd(finalCmd, sudo=True)
  File "/usr/share/vdsm/storage/misc.py", line 198, in execCmd
    p = BetterPopen(command, close_fds=True, cwd=cwd, env=env)
  File "/usr/lib64/python2.6/site-packages/vdsm/betterPopen/__init__.py", line 46, in __init__
    stderr=PIPE)
  File "/usr/lib64/python2.6/subprocess.py", line 632, in __init__
    errread, errwrite) = self._get_handles(stdin, stdout, stderr)
  File "/usr/lib64/python2.6/subprocess.py", line 1055, in _get_handles
    p2cread, p2cwrite = os.pipe()
OSError: [Errno 24] Too many open files

Thread-289095::ERROR::2013-03-11 01:01:01,696::domainMonitor::208::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain cf2146f6-7c86-42f1-b479-9db6a0495686 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 186, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 121, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 152, in _findDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('cf2146f6-7c86-42f1-b479-9db6a0495686',)

Thread-121754::ERROR::2013-03-11 01:01:01,697::domainMonitor::208::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 0c69c276-c7c1-4ab9-9fce-e7b57fbce2d5 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 186, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 121, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 152, in _findDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('0c69c276-c7c1-4ab9-9fce-e7b57fbce2d5',)
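The EMFILE failure in the first traceback can be reproduced in isolation. A minimal sketch (not vdsm code; it assumes a POSIX system with Python's standard `resource` module) that lowers the per-process soft fd limit and then drives `os.pipe()` into the same `[Errno 24]` error:

```python
import os
import resource

# Lower the soft limit on open file descriptors for this process only;
# the hard limit is left untouched so the original limit can be restored.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (16, hard))

pipes = []
err = None
try:
    while True:
        pipes.append(os.pipe())  # each pipe consumes two descriptors
except OSError as e:
    err = e  # expect errno 24 (EMFILE), exactly as in the vdsm log
finally:
    for r, w in pipes:
        os.close(r)
        os.close(w)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))

print(err.errno, err.strerror)
```

This is the same mechanism that bites vdsm above: every subprocess spawned for an LVM command needs fresh pipes, and once the process is at its `nofile` limit the `os.pipe()` call inside `subprocess` fails.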
Any chance of getting engine.log as well? Where did the "too many open files" error occur, on the rhevm machine or on vdsm? (Sorry, your title is a bit confusing.)
(In reply to comment #1)
> Any chance of getting engine.log as well?
> Where did the "too many files" error occur? rhevm machine or vdsm?

Did you look at the above exception? It's in Python, I know ;-)

...
  File "/usr/lib64/python2.6/subprocess.py", line 632, in __init__
    errread, errwrite) = self._get_handles(stdin, stdout, stderr)
  File "/usr/lib64/python2.6/subprocess.py", line 1055, in _get_handles
    p2cread, p2cwrite = os.pipe()
OSError: [Errno 24] Too many open files

> (Sorry, your title is a bit confusing).

Fixed.
What are the official numbers we need to support here? How many NFS SDs?
What is the current number of open file descriptors allowed in VDSM? If we need to support 100 NFS SDs, then we must increase it (in case this is the only problem).
Yes, the limit is the issue here. The vdsm user's limit for open files is 4096, and we can increase it.
Danken, Saggi, any reason not to increase the ulimit? It's an easy thing to do, but harder to test.
Is Bug 922517 a duplicate?
(In reply to comment #8)
> Danken, Saggi any reason not to increase the ulimit?
> It's an easy thing to do harder to test.

Maybe it's time to be more prudent about file descriptors, but I do not see a problem with updating vdsm/limits.conf. When you do that, please amend the comment there to help end users tweak it better to support more than 100 SDs.
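For readers unfamiliar with the mechanism: the vdsm/limits.conf referred to here is applied through pam_limits, using the standard limits.conf syntax. An illustrative entry raising the open-file limit for the vdsm user would look roughly like the fragment below; the exact file path, limit value, and comment shipped by vdsm may differ, so treat this as a sketch, not the packaged file.

```
# Illustrative only -- see the limits.conf shipped by the vdsm package
# for the authoritative value. Each attached storage domain consumes
# file descriptors, so raise nofile when supporting many (100+) SDs.
vdsm    -    nofile    8192
```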
Please test the bug scenario with this patch:

http://gerrit.ovirt.org/12869

This should alleviate this particular scenario (the findDomain() call will not use LVM operations).

In addition, please _don't_ increase the fd limit without a clear understanding of why we reach such large numbers.

A rough calculation shows that 4096 fds for 100 attached domains is ~40 fds per SD attach. Why such a large number of fds per attach operation?
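To answer "where do the ~40 fds per attach go", one can snapshot the process's descriptor count between operations via the Linux /proc filesystem. A minimal sketch (the helper name is ours, not part of vdsm; Linux-only):

```python
import os

def count_open_fds(pid="self"):
    """Return the number of open file descriptors for a process.

    Relies on the Linux /proc filesystem; pass a numeric pid (as a
    string) to inspect another process, given suitable permissions.
    """
    return len(os.listdir("/proc/%s/fd" % pid))

before = count_open_fds()
r, w = os.pipe()  # stand-in for an operation that consumes descriptors
after = count_open_fds()
os.close(r)
os.close(w)
print(after - before)  # a pipe costs two descriptors
```

Running such a probe before and after each attach (or listing `/proc/<vdsm-pid>/fd` with `ls -l` to see what the descriptors point at) would show whether the fds are leaked or legitimately held per domain.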
*** Bug 922517 has been marked as a duplicate of this bug. ***
For 3.2 we only intend to increase ulimit. For 3.3 I have opened bug 948214.
(In reply to comment #11)
> Please test the bug scenario with patch:
>
> http://gerrit.ovirt.org/12869
>
> This should alleviate this particular scenario (findDomain() call will not
> use lvm operations.)

Tested, findDomain does not fail anymore.
No issues found. Verified on RHEVM 3.2 - SF15 environment:

RHEVM: rhevm-3.2.0-10.21.master.el6ev.noarch
VDSM: vdsm-4.10.2-17.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.4.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64

Test done with the following actions: attach, activate, maintenance, detach, and delete of 100 NFS SDs in the DC.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-0886.html