+++ This bug was initially created as a clone of Bug #920532 +++

Created attachment 708891 [details]
vdsm log

Description of problem:
RHEVM - sf9
vdsm-4.10.2-10.0.el6ev.x86_64
libvirt-0.10.2-18.el6.x86_64
sanlock-2.6-2.el6.x86_64

[rhevm][scale] Attaching a big number of NFS Storage Domains fails.

Scale environment testing the maximum number of NFS storage domains.
We created 100 NFS Storage Domains. Creation was successful.
The attach operation failed after 67 domains were successfully attached.

How reproducible:
100%

Steps to Reproduce:
1. Create 100 NFS Storage Domains
2. Attach them

Actual results:
Attach SD operation failed. Not all 100 domains were attached.

Expected results:
All 100 domains successfully attached.

Additional info:
vdsm.log attached.

Thread-121754::ERROR::2013-03-11 01:01:01,695::sdc::150::Storage.StorageDomainCache::(_findDomain) Error while looking for domain `0c69c276-c7c1-4ab9-9fce-e7b57fbce2d5`
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 145, in _findDomain
    return mod.findDomain(sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 1225, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/blockSD.py", line 1195, in findDomainPath
    vg = lvm.getVG(sdUUID)
  File "/usr/share/vdsm/storage/lvm.py", line 815, in getVG
    vg = _lvminfo.getVg(vgName)  # returns single VG namedtuple
  File "/usr/share/vdsm/storage/lvm.py", line 547, in getVg
    vgs = self._reloadvgs(vgName)
  File "/usr/share/vdsm/storage/lvm.py", line 402, in _reloadvgs
    self._vgs.pop((staleName), None)
  File "/usr/lib64/python2.6/contextlib.py", line 34, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/share/vdsm/storage/misc.py", line 1204, in acquireContext
    yield self
  File "/usr/share/vdsm/storage/lvm.py", line 374, in _reloadvgs
    rc, out, err = self.cmd(cmd)
  File "/usr/share/vdsm/storage/lvm.py", line 310, in cmd
    rc, out, err = misc.execCmd(finalCmd, sudo=True)
  File "/usr/share/vdsm/storage/misc.py", line 198, in execCmd
    p = BetterPopen(command, close_fds=True, cwd=cwd, env=env)
  File "/usr/lib64/python2.6/site-packages/vdsm/betterPopen/__init__.py", line 46, in __init__
    stderr=PIPE)
  File "/usr/lib64/python2.6/subprocess.py", line 632, in __init__
    errread, errwrite) = self._get_handles(stdin, stdout, stderr)
  File "/usr/lib64/python2.6/subprocess.py", line 1055, in _get_handles
    p2cread, p2cwrite = os.pipe()
OSError: [Errno 24] Too many open files

Thread-289095::ERROR::2013-03-11 01:01:01,696::domainMonitor::208::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain cf2146f6-7c86-42f1-b479-9db6a0495686 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 186, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 121, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 152, in _findDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('cf2146f6-7c86-42f1-b479-9db6a0495686',)

Thread-121754::ERROR::2013-03-11 01:01:01,697::domainMonitor::208::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 0c69c276-c7c1-4ab9-9fce-e7b57fbce2d5 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 186, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 121, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 152, in _findDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('0c69c276-c7c1-4ab9-9fce-e7b57fbce2d5',)

--- Additional comment from Yair Zaslavsky on 2013-03-13 06:22:38 EDT ---

Any chance of getting engine.log as well?
Where did the "too many files" error occur? rhevm machine or vdsm?
(Sorry, your title is a bit confusing).

--- Additional comment from Yaniv Kaul on 2013-03-13 06:24:27 EDT ---

(In reply to comment #1)
> Any chance of getting engine.log as well?
> Where did the "too many files" error occur? rhevm machine or vdsm?

Did you look at the above exception? It's in Python, I know ;-)

...
  File "/usr/lib64/python2.6/subprocess.py", line 632, in __init__
    errread, errwrite) = self._get_handles(stdin, stdout, stderr)
  File "/usr/lib64/python2.6/subprocess.py", line 1055, in _get_handles
    p2cread, p2cwrite = os.pipe()
OSError: [Errno 24] Too many open files

> (Sorry, your title is a bit confusing).

Fixed.

--- Additional comment from Barak on 2013-03-19 05:20:10 EDT ---

What are the official numbers we need to support here?
How many NFS SDs?

--- Additional comment from Simon Grinberg on 2013-03-19 05:35:31 EDT ---

(In reply to comment #3)
> What are the official numbers we need to support here?
> How many NFS SDs?

Moving the needinfo to Sean.

--- Additional comment from Sean Cohen on 2013-03-19 10:10:37 EDT ---

(In reply to comment #4)
> (In reply to comment #3)
> > What are the official numbers we need to support here?
> > How many NFS SDs?
>
> Moving the needinfo to Sean.

We did not publish an official maximum for NFS Storage Domains.
However, the initial scale target should be 100 NFS Storage Domains.

--- Additional comment from Barak on 2013-03-24 05:38:00 EDT ---

What is the current number of open file descriptors allowed in VDSM?
If we need to support 100 NFS SDs then we must increase it (in case this is the only problem).

--- Additional comment from Yaniv Bronhaim on 2013-03-24 09:34:18 EDT ---

Yes, the limit is the issue here. The vdsm user limit for open files is 4096, and we can increase it.

--- Additional comment from Barak on 2013-03-24 12:58:40 EDT ---

Danken, Saggi, any reason not to increase the ulimit?
It's an easy thing to do, harder to test.

--- Additional comment from Barak on 2013-03-24 13:07:15 EDT ---

Is Bug 922517 a duplicate?

--- Additional comment from Dan Kenigsberg on 2013-03-27 16:30:37 EDT ---

(In reply to comment #8)
> Danken, Saggi, any reason not to increase the ulimit?
> It's an easy thing to do, harder to test.

Maybe it's time to be more prudent about file descriptors, but I do not see a problem with updating vdsm/limits.conf. When you do that, please amend the comment there, to help end users tweak it better to support more than 100 SDs.

--- Additional comment from Eduardo Warszawski on 2013-04-03 09:05:55 EDT ---

Please test the bug scenario with patch: http://gerrit.ovirt.org/12869
This should alleviate this particular scenario (the findDomain() call will not use lvm operations).

In addition, please _don't_ increase the fd limit without a clear understanding of why we reach such big numbers.
A rough calculation shows that 4096 fds for 100 attached domains is ~40 fds per SD attach. Why such a big number of fds for 100 attach ops?

--- Additional comment from Barak on 2013-04-04 06:18:04 EDT ---

*** Bug 922517 has been marked as a duplicate of this bug. ***
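For context on the "[Errno 24] Too many open files" failures above: every child process vdsm spawns (lvm commands, mount helpers, per-domain monitoring helpers) needs pipes, and each pipe costs the main vdsm process file descriptors, so with enough domains the per-process descriptor limit is exhausted and os.pipe() fails exactly as in the traceback. A minimal sketch of how the limit and the current usage can be inspected from a Python 2 process on the host (illustrative only, this is not vdsm code):

    import os
    import resource

    # Soft/hard limit on open file descriptors for this process.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # Descriptors currently held by this process (Linux-specific).
    open_fds = len(os.listdir('/proc/self/fd'))

    print('fds in use: %d, soft limit: %d, hard limit: %d' % (open_fds, soft, hard))

    # Once open_fds reaches the soft limit, os.pipe() raises
    # OSError: [Errno 24] Too many open files, as seen in execCmd above.

For the running vdsm process itself, the same numbers are visible from outside in /proc/<vdsm pid>/limits and /proc/<vdsm pid>/fd.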
Solved with BZ#948210.
Should be a non-issue in 3.6.0 with IOProcess. Moving to ON_QA for the scale team to verify.
The bug was verified on:
rhevm-setup-plugin-ovirt-engine-common-3.6.3-0.1.el6.noarch
ovirt-engine-extension-aaa-jdbc-1.0.5-1.el6ev.noarch
rhevm-setup-plugin-ovirt-engine-3.6.3-0.1.el6.noarch
vdsm-python-4.17.19-0.el7ev.noarch
vdsm-hook-vmfex-dev-4.17.19-0.el7ev.noarch
vdsm-jsonrpc-4.17.19-0.el7ev.noarch
vdsm-yajsonrpc-4.17.19-0.el7ev.noarch
vdsm-xmlrpc-4.17.19-0.el7ev.noarch
vdsm-cli-4.17.19-0.el7ev.noarch
vdsm-4.17.19-0.el7ev.noarch
vdsm-infra-4.17.19-0.el7ev.noarch

It is reproduced 100%.
I have an environment with 1 Host/100 VMs.
Created 50 Storage Domains; on the 51st I got errors:

jsonrpc.Executor/0::ERROR::2016-02-03 10:42:34,435::task::866::Storage.TaskManager.Task::(_setError) Task=`2196c96b-1f23-4105-936b-213708e1b707`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 2701, in createStorageDomain
    domVersion)
  File "/usr/share/vdsm/storage/nfsSD.py", line 80, in create
    version)
  File "/usr/share/vdsm/storage/nfsSD.py", line 49, in _preCreateValidation
    fileSD.validateFileSystemFeatures(sdUUID, domPath)
  File "/usr/share/vdsm/storage/fileSD.py", line 93, in validateFileSystemFeatures
    oop.getProcessPool(sdUUID).directTouch(testFilePath)
  File "/usr/share/vdsm/storage/outOfProcess.py", line 107, in getProcessPool
    return _getIOProcessPool(clientName)
  File "/usr/share/vdsm/storage/outOfProcess.py", line 98, in _getIOProcessPool
    max_queued_requests=MAX_QUEUED))
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 316, in __init__
    self._run()
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 350, in _run
    p = _spawnProc(cmd)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 71, in _spawnProc
    return cpopen.CPopen(cmd)
  File "/usr/lib64/python2.7/site-packages/cpopen/__init__.py", line 52, in __init__
    stderr=stderr)
  File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.7/site-packages/cpopen/__init__.py", line 69, in _execute_child_v276
    errread, errwrite,
  File "/usr/lib64/python2.7/site-packages/cpopen/__init__.py", line 89, in _execute_child_v275
    self._childUmask,
OSError: [Errno 24] Too many open files

jsonrpc.Executor/0::DEBUG::2016-02-03 10:42:34,436::task::885::Storage.TaskManager.Task::(_run) Task=`2196c96b-1f23-4105-936b-213708e1b707`::Task._run: 2196c96b-1f23-4105-936b-213708e1b707 (1, u'0e360f39-07b0-4580-966c-a2fe19d23b78', u'NFS-SD-50', u'ntap-oslab-01-phx1.scale.openstack.engineering.redhat.com:/rhev_real/nfs_sd_50', 1, u'3') {} failed - stopping task
jsonrpc.Executor/0::DEBUG::2016-02-03 10:42:34,436::task::1246::Storage.TaskManager.Task::(stop) Task=`2196c96b-1f23-4105-936b-213708e1b707`::stopping in state preparing (force False)
jsonrpc.Executor/0::DEBUG::2016-02-03 10:42:34,437::task::993::Storage.TaskManager.Task::(_decref) Task=`2196c96b-1f23-4105-936b-213708e1b707`::ref 1 aborting True
jsonrpc.Executor/0::INFO::2016-02-03 10:42:34,437::task::1171::Storage.TaskManager.Task::(prepare) Task=`2196c96b-1f23-4105-936b-213708e1b707`::aborting: Task is aborted: u'[Errno 24] Too many open files' - code 100
jsonrpc.Executor/0::DEBUG::2016-02-03 10:42:34,437::task::1176::Storage.TaskManager.Task::(prepare) Task=`2196c96b-1f23-4105-936b-213708e1b707`::Prepare: aborted: [Errno 24] Too many open files

Actually vdsm.log has more exceptions.
Please see the attached vdsm.log.1.xz file.
Created attachment 1120707 [details] 3.6.3-0.1 vdsm log with exceptions
Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.
Making the rule engine happy.
This should be fixed by:

commit 25f1f537fd83fd0952c51dc1c8d813814a770a97
Author: Fred Rolland <frolland>
Date:   Thu May 26 14:30:33 2016 +0300

    systemd: Specify number of open file limit for vdsmd process

    Specify for vdsmd process the limit for open files.
    We have a specification in limits.conf, but it is setting limit
    for the vdsm user not for the process.
    The ulimit for the process can be checked in /proc/PID/limits
    (where PID is the vdsm pid)

    In commit I26345e 'BZ#769502 Move limit configuration to its proper
    place', the configuration moved from setting the limit from the
    python code using ulimit to limits.conf.

    Change-Id: Ic8b4ddb54e812658e120f7ad10fba81dd8feef87
    Bug-Url: https://bugzilla.redhat.com/1339245
    Signed-off-by: Fred Rolland <frolland>
    Reviewed-on: https://gerrit.ovirt.org/58123
    Continuous-Integration: Jenkins CI
    Reviewed-by: Yaniv Bronhaim <ybronhei>
    Reviewed-by: Piotr Kliczewski <piotr.kliczewski>
    Reviewed-by: Nir Soffer <nsoffer>

Can you test again with 4.0, including this fix?
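For anyone who needs to tune this further, the systemd-level approach is the usual LimitNOFILE override for the service. A minimal sketch of what such a drop-in could look like (the drop-in path and the value are illustrative assumptions, not taken from the actual patch):

    # /etc/systemd/system/vdsmd.service.d/99-nofile.conf  (hypothetical drop-in)
    [Service]
    # Raise the per-process open file limit for vdsmd; 24576 is only an example value.
    LimitNOFILE=24576

After systemctl daemon-reload and a vdsmd restart, the effective value can be confirmed in /proc/<vdsm pid>/limits, as the commit message describes.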
Hi Nir,
No problem, we'll retest it again.
Please approve if it is fixed in:
RHEVM: 4.0.2-0.2.rc1.el7ev
LIBVIRT Version: libvirt-1.2.17-13.el7_2.5
VDSM Version: vdsm-4.18.5.1-1.el7ev
and change the status to ON_QA.
Yuri
Yuri, this fix was included in v4.18.1.
Bug verified on:
RHEVM: 4.0.2-0.2.rc1.el7ev
LIBVIRT Version: libvirt-1.2.17-13.el7_2.5
VDSM Version: vdsm-4.18.5.1-1.el7ev

The problem wasn't reproduced.
Created 100 Storage Domains in an environment which has 1 DC/1 Cluster/1 Host/59 VMs.
See the attached vdsm.log files and the output of the command #vdsClient -s 0 getVdsStats

Several notes:

1. It was not possible to attach more than 100 Storage Domains. I got an error while creating the 101st Storage Domain:

jsonrpc.Executor/0::ERROR::2016-07-26 06:07:22,859::task::868::Storage.TaskManager.Task::(_setError) Task=`196b72a0-a4bc-4a1f-9315-88107640bfad`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 875, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1162, in attachStorageDomain
    pool.attachSD(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 78, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 928, in attachSD
    raise se.TooManyDomainsInStoragePoolError()
TooManyDomainsInStoragePoolError: Too many domains in Storage pool: ()

This is probably OK; it looks like a correct limitation.

2. I still got the following errors while attaching a Storage Domain:

jsonrpc.Executor/7::ERROR::2016-07-26 05:51:57,264::sdc::140::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain 6c5e4924-9762-4d80-bc1a-590c8c9b1d7a
jsonrpc.Executor/7::ERROR::2016-07-26 05:51:57,264::sdc::157::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain 6c5e4924-9762-4d80-bc1a-590c8c9b1d7a
jsonrpc.Executor/7::ERROR::2016-07-26 05:51:58,836::sdc::146::Storage.StorageDomainCache::(_findDomain) domain 6c5e4924-9762-4d80-bc1a-590c8c9b1d7a not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 144, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 174, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'6c5e4924-9762-4d80-bc1a-590c8c9b1d7a',)

As I remember, we discussed this earlier. These are expected errors, but they look bad in the log file. Probably the errors need to be masked in some way.

Yuri
Created attachment 1184201 [details] output of command #vdsClient -s 0 getVdsStats
Created attachment 1184202 [details] vdsm log file
Created attachment 1184203 [details] vdsm 1 log file
Created attachment 1184204 [details] vdsm 2 log file
(In reply to Yuri Obshansky from comment #13)
> Bug verified on:
> RHEVM: 4.0.2-0.2.rc1.el7ev
> LIBVIRT Version: libvirt-1.2.17-13.el7_2.5
> VDSM Version: vdsm-4.18.5.1-1.el7ev
> The problem wasn't reproduced.
> Created 100 Storage Domains in an environment which has
> 1 DC/1 Cluster/1 Host/59 VMs.
> See the attached vdsm.log files
> and the output of the command #vdsClient -s 0 getVdsStats
>
> Several notes:
>
> 1. It was not possible to attach more than 100 Storage Domains.
...
> TooManyDomainsInStoragePoolError: Too many domains in Storage pool: ()
> This is probably OK; it looks like a correct limitation.

This is the default limit in vdsm config; if you increase it, you will
be able to create more domains, but I don't think we support this
amount of storage domains.

> 2. I still got the following errors while attaching a Storage Domain:
> jsonrpc.Executor/7::ERROR::2016-07-26
> 05:51:57,264::sdc::140::Storage.StorageDomainCache::(_findDomain) looking
> for unfetched domain 6c5e4924-9762-4d80-bc1a-590c8c9b1d7a
> jsonrpc.Executor/7::ERROR::2016-07-26
> 05:51:57,264::sdc::157::Storage.StorageDomainCache::(_findUnfetchedDomain)
> looking for domain 6c5e4924-9762-4d80-bc1a-590c8c9b1d7a
>
> jsonrpc.Executor/7::ERROR::2016-07-26
> 05:51:58,836::sdc::146::Storage.StorageDomainCache::(_findDomain) domain
> 6c5e4924-9762-4d80-bc1a-590c8c9b1d7a not found
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/sdc.py", line 144, in _findDomain
>     dom = findMethod(sdUUID)
>   File "/usr/share/vdsm/storage/sdc.py", line 174, in _findUnfetchedDomain
>     raise se.StorageDomainDoesNotExist(sdUUID)
> StorageDomainDoesNotExist: Storage domain does not exist:
> (u'6c5e4924-9762-4d80-bc1a-590c8c9b1d7a',)
>
> As I remember, we discussed this earlier. These are expected errors, but
> they look bad in the log file. Probably the errors need to be masked in
> some way.

True, this is expected with the current code. I suggest opening a bug for
this; the system should not log false errors.
(In reply to Nir Soffer from comment #18)
> (In reply to Yuri Obshansky from comment #13)
> > Bug verified on:
> > RHEVM: 4.0.2-0.2.rc1.el7ev
> > LIBVIRT Version: libvirt-1.2.17-13.el7_2.5
> > VDSM Version: vdsm-4.18.5.1-1.el7ev
> > The problem wasn't reproduced.
> > Created 100 Storage Domains in an environment which has
> > 1 DC/1 Cluster/1 Host/59 VMs.
> > See the attached vdsm.log files
> > and the output of the command #vdsClient -s 0 getVdsStats
> >
> > Several notes:
> >
> > 1. It was not possible to attach more than 100 Storage Domains.
> ...
> > TooManyDomainsInStoragePoolError: Too many domains in Storage pool: ()
> > This is probably OK; it looks like a correct limitation.
>
> This is the default limit in vdsm config; if you increase it, you will
> be able to create more domains, but I don't think we support this
> amount of storage domains.

Accepted, thanks for the clarification.

> > 2. I still got the following errors while attaching a Storage Domain:
> > jsonrpc.Executor/7::ERROR::2016-07-26
> > 05:51:57,264::sdc::140::Storage.StorageDomainCache::(_findDomain) looking
> > for unfetched domain 6c5e4924-9762-4d80-bc1a-590c8c9b1d7a
> > jsonrpc.Executor/7::ERROR::2016-07-26
> > 05:51:57,264::sdc::157::Storage.StorageDomainCache::(_findUnfetchedDomain)
> > looking for domain 6c5e4924-9762-4d80-bc1a-590c8c9b1d7a
> >
> > jsonrpc.Executor/7::ERROR::2016-07-26
> > 05:51:58,836::sdc::146::Storage.StorageDomainCache::(_findDomain) domain
> > 6c5e4924-9762-4d80-bc1a-590c8c9b1d7a not found
> > Traceback (most recent call last):
> >   File "/usr/share/vdsm/storage/sdc.py", line 144, in _findDomain
> >     dom = findMethod(sdUUID)
> >   File "/usr/share/vdsm/storage/sdc.py", line 174, in _findUnfetchedDomain
> >     raise se.StorageDomainDoesNotExist(sdUUID)
> > StorageDomainDoesNotExist: Storage domain does not exist:
> > (u'6c5e4924-9762-4d80-bc1a-590c8c9b1d7a',)
> >
> > As I remember, we discussed this earlier. These are expected errors, but
> > they look bad in the log file. Probably the errors need to be masked in
> > some way.
>
> True, this is expected with the current code. I suggest opening a bug for
> this; the system should not log false errors.

Opened a new bug:
Bug 1360294 - vdsm throws false errors during NFS Storage Domain creation and attaching
https://bugzilla.redhat.com/show_bug.cgi?id=948214

Yuri
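A side note on the TooManyDomainsInStoragePoolError from comments 13 and 18: the cap is a vdsm configuration default rather than a hard architectural limit. Assuming the option involved is irs/maximum_domains_in_pool (the option name is an assumption here, not verified against this exact build, and going above 100 domains is not a supported configuration), raising it would look roughly like:

    # /etc/vdsm/vdsm.conf  (sketch; option name assumed, values above 100 untested at scale)
    [irs]
    maximum_domains_in_pool = 150

followed by a vdsmd restart for the new value to take effect.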