Bug 920532 - [scale] Attaching a large number of NFS Storage Domains fails (too many open files on the VDSM side)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.2.0
Assignee: Yaniv Bronhaim
QA Contact: vvyazmin@redhat.com
URL:
Whiteboard: infra
Duplicates: 922517 (view as bug list)
Depends On:
Blocks: 948210 948214
 
Reported: 2013-03-12 10:19 UTC by Leonid Natapov
Modified: 2022-07-09 05:56 UTC (History)
14 users

Fixed In Version: vdsm-4.10.2-16.0.el6ev
Doc Type: Bug Fix
Doc Text:
Previously, attaching a large number of storage domains could result in failure; some but not all of the storage domains would attach. Now, attaching a large number of storage domains works as expected.
Clone Of:
Clones: 948214 (view as bug list)
Environment:
Last Closed: 2013-06-10 20:45:09 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:
ykaul: needinfo+
scohen: needinfo+


Attachments
vdsm log (690.22 KB, application/x-xz)
2013-03-12 10:19 UTC, Leonid Natapov
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-47055 0 None None None 2022-07-09 05:56:51 UTC
Red Hat Product Errata RHSA-2013:0886 0 normal SHIPPED_LIVE Moderate: rhev 3.2 - vdsm security and bug fix update 2013-06-11 00:25:02 UTC
oVirt gerrit 13566 0 None None None Never

Description Leonid Natapov 2013-03-12 10:19:41 UTC
Created attachment 708891 [details]
vdsm log

Description of problem:

RHEVM - sf9
vdsm-4.10.2-10.0.el6ev.x86_64
libvirt-0.10.2-18.el6.x86_64
sanlock-2.6-2.el6.x86_64 

[rhevm][scale] Attaching a large number of NFS Storage Domains fails.
Scale environment testing the maximum number of NFS storage domains.
We created 100 NFS storage domains; creation was successful. The attach operation failed after 67 domains were successfully attached.

How reproducible:

100%

Steps to Reproduce:
1. Create 100 NFS storage domains
2. Attach them

  
Actual results:

Attach SD operation failed. Not all 100 domains were attached.

Expected results:
All 100 domains successfully attached.

Additional info:
vdsm.log attached.


Thread-121754::ERROR::2013-03-11 01:01:01,695::sdc::150::Storage.StorageDomainCache::(_findDomain) Error while looking for domain `0c69c276-c7c1-4ab9-9fce-e7b57fbce2d5`
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 145, in _findDomain
    return mod.findDomain(sdUUID)
  File "/usr/share/vdsm/storage/blockSD.py", line 1225, in findDomain
    return BlockStorageDomain(BlockStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/blockSD.py", line 1195, in findDomainPath
    vg = lvm.getVG(sdUUID)
  File "/usr/share/vdsm/storage/lvm.py", line 815, in getVG
    vg = _lvminfo.getVg(vgName)  # returns single VG namedtuple
  File "/usr/share/vdsm/storage/lvm.py", line 547, in getVg
    vgs = self._reloadvgs(vgName)
  File "/usr/share/vdsm/storage/lvm.py", line 402, in _reloadvgs
    self._vgs.pop((staleName), None)
  File "/usr/lib64/python2.6/contextlib.py", line 34, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/share/vdsm/storage/misc.py", line 1204, in acquireContext
    yield self
  File "/usr/share/vdsm/storage/lvm.py", line 374, in _reloadvgs
    rc, out, err = self.cmd(cmd)
  File "/usr/share/vdsm/storage/lvm.py", line 310, in cmd
    rc, out, err = misc.execCmd(finalCmd, sudo=True)
  File "/usr/share/vdsm/storage/misc.py", line 198, in execCmd
    p = BetterPopen(command, close_fds=True, cwd=cwd, env=env)
  File "/usr/lib64/python2.6/site-packages/vdsm/betterPopen/__init__.py", line 46, in __init__
    stderr=PIPE)
  File "/usr/lib64/python2.6/subprocess.py", line 632, in __init__
    errread, errwrite) = self._get_handles(stdin, stdout, stderr)
  File "/usr/lib64/python2.6/subprocess.py", line 1055, in _get_handles
    p2cread, p2cwrite = os.pipe()
OSError: [Errno 24] Too many open files
Thread-289095::ERROR::2013-03-11 01:01:01,696::domainMonitor::208::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain cf2146f6-7c86-42f1-b479-9db6a0495686 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 186, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 121, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 152, in _findDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('cf2146f6-7c86-42f1-b479-9db6a0495686',)
Thread-121754::ERROR::2013-03-11 01:01:01,697::domainMonitor::208::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 0c69c276-c7c1-4ab9-9fce-e7b57fbce2d5 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 186, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 121, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 152, in _findDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: ('0c69c276-c7c1-4ab9-9fce-e7b57fbce2d5',)
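The "Too many open files" failure above can be reproduced stand-alone; a minimal sketch (hypothetical repro script, not VDSM code), assuming a Linux host:

```python
import errno
import os
import resource

# Lower the soft open-files limit, then open pipes until the kernel
# refuses; this is the same os.pipe() failure that BetterPopen hit in
# the traceback above.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))

fds = []
caught = None
try:
    while True:
        fds.extend(os.pipe())  # each pipe consumes two descriptors
except OSError as exc:
    caught = exc.errno
finally:
    for fd in fds:
        os.close(fd)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))

print(caught == errno.EMFILE)  # → True
```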

Comment 1 Yair Zaslavsky 2013-03-13 10:22:38 UTC
Any chance of getting engine.log as well?
Where did the "too many open files" error occur? On the RHEVM machine or on VDSM?
(Sorry, your title is a bit confusing.)

Comment 2 Yaniv Kaul 2013-03-13 10:24:27 UTC
(In reply to comment #1)
> Any chance of getting engine.log as well?
> Where did the "too many open files" error occur? On the RHEVM machine or on VDSM?

Did you look at the above exception? It's in Python, I know ;-)

...
File "/usr/lib64/python2.6/subprocess.py", line 632, in __init__
    errread, errwrite) = self._get_handles(stdin, stdout, stderr)
  File "/usr/lib64/python2.6/subprocess.py", line 1055, in _get_handles
    p2cread, p2cwrite = os.pipe()
OSError: [Errno 24] Too many open files

> (Sorry, your title is a bit confusing).

Fixed.

Comment 3 Barak 2013-03-19 09:20:10 UTC
What are the official numbers we need to support here?
How many NFS SDs?

Comment 6 Barak 2013-03-24 09:38:00 UTC
What is the current number of open file descriptors allowed in VDSM?
If we need to support 100 NFS SDs, then we must increase it (in case this is the only problem).
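The current limit and usage can be read from the process itself; a minimal sketch, assuming a Linux host (run inside, or as, the vdsm process to see its numbers; here it reports on the current process):

```python
import os
import resource

# Soft/hard RLIMIT_NOFILE of the current process, plus the number of
# descriptors it already holds (counted via /proc, Linux only).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
in_use = len(os.listdir('/proc/self/fd'))
print('soft=%d hard=%d in_use=%d' % (soft, hard, in_use))
```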

Comment 7 Yaniv Bronhaim 2013-03-24 13:34:18 UTC
Yes, the limit is the issue here. The vdsm user's limit for open files is 4096, and we can increase it.

Comment 8 Barak 2013-03-24 16:58:40 UTC
Danken, Saggi, any reason not to increase the ulimit?
It's an easy thing to do, but harder to test.

Comment 9 Barak 2013-03-24 17:07:15 UTC
is Bug 922517 a Duplicate ?

Comment 10 Dan Kenigsberg 2013-03-27 20:30:37 UTC
(In reply to comment #8)
> Danken, Saggi, any reason not to increase the ulimit?
> It's an easy thing to do, but harder to test.

Maybe it's time to be more prudent about file descriptors, but I do not see a problem with updating vdsm/limits.conf. When you do that, please amend the comment there to help end users tweak it to support more than 100 SDs.
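Such an amendment is a small pam_limits entry; a sketch of what it could look like (the path, file name, and value here are assumptions, and the fix shipped in vdsm-4.10.2-16.0.el6ev may use different ones):

```
# /etc/security/limits.d/99-vdsm.conf (illustrative; actual file and
# value may differ). Raise the open-files limit for the vdsm user;
# at roughly 40 fds per attached SD, 100 SDs need well above 4096.
vdsm - nofile 16384
```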

Comment 11 Eduardo Warszawski 2013-04-03 13:05:55 UTC
Please test the bug scenario with this patch:

http://gerrit.ovirt.org/12869

This should alleviate this particular scenario (the findDomain() call will no longer use LVM operations).


In addition, please _don't_ increase the fd limit without a clear understanding of why we reach such large numbers.

A rough calculation shows that 4096 fds for 100 attached domains is ~40 fds per SD attach.
Why such a large number of fds for 100 attach ops?
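A one-line sanity check of the arithmetic in the comment above:

```python
# A 4096-fd limit exhausted by ~100 domain attaches works out to
# roughly 40 descriptors per attach.
fd_limit = 4096
attached_domains = 100
fds_per_attach = fd_limit // attached_domains
print(fds_per_attach)  # → 40
```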

Comment 12 Barak 2013-04-04 10:18:04 UTC
*** Bug 922517 has been marked as a duplicate of this bug. ***

Comment 13 Barak 2013-04-07 20:03:17 UTC
For 3.2 we only intend to increase ulimit.

For 3.3 I have opened bug 948214.

Comment 14 Daniel Paikov 2013-04-14 13:41:33 UTC
(In reply to comment #11)
> Please test the bug scenario with this patch:
> 
> http://gerrit.ovirt.org/12869
> 
> This should alleviate this particular scenario (the findDomain() call
> will no longer use LVM operations).
> 
> In addition, please _don't_ increase the fd limit without a clear
> understanding of why we reach such large numbers.
> 
> A rough calculation shows that 4096 fds for 100 attached domains is
> ~40 fds per SD attach.
> Why such a large number of fds for 100 attach ops?

Tested, findDomain does not fail anymore.

Comment 16 vvyazmin@redhat.com 2013-05-12 07:44:58 UTC
No issues were found.

Verified on RHEVM 3.2 - SF15 environment:

RHEVM: rhevm-3.2.0-10.21.master.el6ev.noarch
VDSM: vdsm-4.10.2-17.0.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.4.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.3.x86_64
SANLOCK: sanlock-2.6-2.el6.x86_64


Test done with the following actions:
Attach, activate, move to maintenance, detach, and delete 100 NFS SDs.

Comment 18 errata-xmlrpc 2013-06-10 20:45:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0886.html

