Bug 1271575 - [vdsm] low nofile limit severely impacts host - OSError: [Errno 24] Too many open files
Status: CLOSED CURRENTRELEASE
Product: vdsm
Classification: oVirt
Component: General
Version: 4.16.26
Hardware: x86_64 Linux
Priority: unspecified  Severity: high
Target Milestone: ovirt-3.6.3
Target Release: 4.17.18
Assigned To: Piotr Kliczewski
QA Contact: Jiri Belka
Keywords: Reopened
Depends On:
Blocks:
Reported: 2015-10-14 06:11 EDT by Jiri Belka
Modified: 2016-02-18 06:15 EST
CC List: 10 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-18 06:15:27 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt-3.6.z+
mgoldboi: exception+
mgoldboi: planning_ack+
rule-engine: devel_ack+
pstehlik: testing_ack+


Attachments
vdsm fd check output (196.94 KB, text/plain)
2015-12-23 05:49 EST, Jiri Belka


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 51850 None None None 2016-01-14 09:10 EST
oVirt gerrit 52365 ovirt-3.6 MERGED supervdsm: failed validateAccess leaves pipes open 2016-01-19 02:05 EST

Description Jiri Belka 2015-10-14 06:11:15 EDT
Description of problem:

"Low" nofile (12288) in /etc/security/limit.d/99-vdsm.conf caused one of our hosts which was part of hosted-engine setup being 'Unassinged' in engine.

At the same time the hosted-engine VM was running on the host, and I could not migrate it when I discovered the issue...

Please see the log below for this particular issue; maybe we should raise the nofile limit.

----%----
Thread-4155624::DEBUG::2015-10-14 09:01:01,892::iscsi::424::Storage.ISCSI::(rescan) Performing SCSI scan, this will take up to 30 seconds
Thread-4155624::DEBUG::2015-10-14 09:01:01,893::iscsiadm::97::Storage.Misc.excCmd::(_runCmd) /usr/bin/sudo -n /sbin/iscsiadm -m session -R (cwd None)
Thread-4155624::ERROR::2015-10-14 09:01:01,893::domainMonitor::256::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain e469d56b-af8f-4a4a-b0dd-9fe2e8f84e69 monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 232, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 120, in _realProduce
    self.refreshStorage()
  File "/usr/share/vdsm/storage/misc.py", line 760, in helper
    return sm(*args, **kwargs)
  File "/usr/share/vdsm/storage/misc.py", line 745, in __call__
    self.__lastResult = self.__func(*args, **kwargs)
  File "/usr/share/vdsm/storage/sdc.py", line 83, in refreshStorage
    multipath.rescan()
  File "/usr/share/vdsm/storage/multipath.py", line 127, in rescan
    iscsi.rescan()
  File "/usr/share/vdsm/storage/misc.py", line 760, in helper
    return sm(*args, **kwargs)
  File "/usr/share/vdsm/storage/misc.py", line 745, in __call__
    self.__lastResult = self.__func(*args, **kwargs)
  File "/usr/share/vdsm/storage/iscsi.py", line 425, in rescan
    rescanOp = iscsiadm.session_rescan_async()
  File "/usr/share/vdsm/storage/iscsiadm.py", line 318, in session_rescan_async
    proc = _runCmd(["-m", "session", "-R"], sync=False)
  File "/usr/share/vdsm/storage/iscsiadm.py", line 97, in _runCmd
    return misc.execCmd(cmd, printable=printCmd, sudo=True, sync=sync)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 742, in execCmd
    deathSignal=deathSignal, childUmask=childUmask)
  File "/usr/lib64/python2.7/site-packages/cpopen/__init__.py", line 52, in __init__
    stderr=stderr)
  File "/usr/lib64/python2.7/subprocess.py", line 703, in __init__
    errread, errwrite) = self._get_handles(stdin, stdout, stderr)
  File "/usr/lib64/python2.7/subprocess.py", line 1100, in _get_handles
    p2cread, p2cwrite = self.pipe_cloexec()
  File "/usr/lib64/python2.7/subprocess.py", line 1153, in pipe_cloexec
    r, w = os.pipe()
OSError: [Errno 24] Too many open files
----%----
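
For context, errno 24 (EMFILE) is raised as soon as a process reaches its nofile soft limit; os.pipe() needs two free descriptors, so it is often the first call to fail. A minimal sketch that reproduces this error class in isolation (not vdsm code; the limit value is arbitrary):

~~~
import os
import resource

# Lower the soft and hard nofile limits, then exhaust them with pipes.
resource.setrlimit(resource.RLIMIT_NOFILE, (32, 32))

pipes = []
try:
    while True:
        pipes.append(os.pipe())  # each pipe consumes two descriptors
except OSError as e:
    print(e)  # [Errno 24] Too many open files
finally:
    for r, w in pipes:
        os.close(r)
        os.close(w)
~~~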

Version-Release number of selected component (if applicable):
vdsm-4.16.26-1.el7ev.x86_64

How reproducible:
Just once so far, thank God.

Steps to Reproduce:
1. (speculation) decrease the nofile limit for vdsm and generate a lot of activity
2.
3.

Actual results:
vdsm ran into errors because of the low nofile limit

Expected results:
vdsm happily up and running

Additional info:
I increased nofile to 15360 for now and restarted vdsm; after a while the host became 'Up' in the engine.
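
One way to catch a recurrence early would be to poll the daemon's descriptor count via /proc, along these lines (a sketch; the pid is a placeholder, and reading another user's /proc/<pid>/fd requires root):

~~~
import os
import time

VDSM_PID = 1234  # placeholder: use pidof vdsm

# Poll the open-fd count so a steady climb (a leak) can be told
# apart from a short peak under load.
while True:
    nfds = len(os.listdir("/proc/%d/fd" % VDSM_PID))
    print("%s open fds: %d" % (time.strftime("%H:%M:%S"), nfds))
    time.sleep(60)
~~~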
Comment 2 Oved Ourfali 2015-10-15 00:53:37 EDT
Piotr, is that a duplicate?
Comment 3 Piotr Kliczewski 2015-10-15 03:04:30 EDT
Based on the log provided it is hard to tell whether it is a duplicate or not. There are OSErrors from the very beginning of vdsm.log, so I am not sure what happened.

Jiri can you please provide older vdsm logs?
Comment 4 Piotr Kliczewski 2015-10-15 03:05:40 EDT
I can see there:

Thread-122719::ERROR::2015-10-14 11:57:35,754::sampling::488::vm.Vm::(collect) vmId=`f8fd9136-9f4d-492a-bfaa-5b1c76ba5d73`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x2842e70>
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/sampling.py", line 484, in collect
    statsFunction()
  File "/usr/share/vdsm/virt/sampling.py", line 359, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/virt/vm.py", line 297, in _highWrite
    self._vm.extendDrivesIfNeeded()
  File "/usr/share/vdsm/virt/vm.py", line 2550, in extendDrivesIfNeeded
    extend = [x for x in self._getExtendCandidates()
  File "/usr/share/vdsm/virt/vm.py", line 2502, in _getExtendCandidates
    capacity, alloc, physical = self._dom.blockInfo(drive.path, 0)
  File "/usr/share/vdsm/virt/vm.py", line 702, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 646, in blockInfo
    if ret is None: raise libvirtError ('virDomainGetBlockInfo() failed', dom=self)
libvirtError: cannot stat file '/rhev/data-center/3a63d854-bed0-11e0-b671-545200312d04/219bc71f-c5ec-4ace-80f5-f07b2f892163/images/5c8e29ad-f997-438b-9db1-ff0e672d0c99/aa771ceb-e697-480f-bc8a-a7e5555c1876': Bad file descriptor

which suggests that it could be something else.

Francesco can you take a look at this issue?
Comment 5 Francesco Romani 2015-10-16 03:30:29 EDT
(In reply to Piotr Kliczewski from comment #4)
> I can see there:
> 
> Thread-122719::ERROR::2015-10-14
> 11:57:35,754::sampling::488::vm.Vm::(collect)
> vmId=`f8fd9136-9f4d-492a-bfaa-5b1c76ba5d73`::Stats function failed:
> <AdvancedStatsFunction _highWrite at 0x2842e70>
> Traceback (most recent call last):
>   File "/usr/share/vdsm/virt/sampling.py", line 484, in collect
>     statsFunction()
>   File "/usr/share/vdsm/virt/sampling.py", line 359, in __call__
>     retValue = self._function(*args, **kwargs)
>   File "/usr/share/vdsm/virt/vm.py", line 297, in _highWrite
>     self._vm.extendDrivesIfNeeded()
>   File "/usr/share/vdsm/virt/vm.py", line 2550, in extendDrivesIfNeeded
>     extend = [x for x in self._getExtendCandidates()
>   File "/usr/share/vdsm/virt/vm.py", line 2502, in _getExtendCandidates
>     capacity, alloc, physical = self._dom.blockInfo(drive.path, 0)
>   File "/usr/share/vdsm/virt/vm.py", line 702, in f
>     ret = attr(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line
> 111, in wrapper
>     ret = f(*args, **kwargs)
>   File "/usr/lib64/python2.7/site-packages/libvirt.py", line 646, in
> blockInfo
>     if ret is None: raise libvirtError ('virDomainGetBlockInfo() failed',
> dom=self)
> libvirtError: cannot stat file
> '/rhev/data-center/3a63d854-bed0-11e0-b671-545200312d04/219bc71f-c5ec-4ace-
> 80f5-f07b2f892163/images/5c8e29ad-f997-438b-9db1-ff0e672d0c99/aa771ceb-e697-
> 480f-bc8a-a7e5555c1876': Bad file descriptor
> 
> which suggests that it could be something else.
> 
> Francesco can you take a look at this issue?

This could be related to the exhaustion of the file descriptors.
There is quite a strong correlation between the operation attempted, the reported error, and the fact that the application ran out of FDs.
I'd retry after the FD exhaustion is resolved; if it still happens I'll take a deeper look.
Comment 6 Francesco Romani 2015-10-16 03:31:01 EDT
restoring NEEDINFO as per https://bugzilla.redhat.com/show_bug.cgi?id=1271575#c3
Comment 8 Piotr Kliczewski 2015-10-16 11:02:57 EDT
From the provided logs I can see that after vdsm was restarted on 2015-10-14 11:58:17,976 there are no more fd issues. I only see storage-related issues.

I tried to find information about what happened but log files between vdsm.log.37.xz and vdsm.log.56.xz are corrupted.

In 57 I can see:

Thread-4153487::DEBUG::2015-10-13 07:00:52,562::hsm::2457::Storage.HSM::(connectStorageServer) prefetch failed: {u'd894f2d2-18fd-4a51-a650-31d065d49095': <function findDomain at 0x24a1050>, u'0c78b4d6-ba00-4d3e-9f9f-65c7d5899d71': <function findDomain at 0x24a1050>, '219bc71f-c5ec-4ace-80f5-f07b2f892163': <function findDomain at 0x245f500>, u'23c03bb6-9889-4cbf-b7ad-55b9a2c70653': <function findDomain at 0x24a1050>, '11958807-b912-4363-8f3e-6d087608764f': <function findDomain at 0x245f500>, 'e469d56b-af8f-4a4a-b0dd-9fe2e8f84e69': <function findDomain at 0x245f500>}
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2454, in connectStorageServer
    doms = self.__prefetchDomains(domType, conObj)
  File "/usr/share/vdsm/storage/hsm.py", line 2391, in __prefetchDomains
    goop.glob.glob(os.path.join(lPath, uuidPatern)))
  File "/usr/share/vdsm/storage/outOfProcess.py", line 120, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 500, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 391, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 11] Resource temporarily unavailable

It seems like this issue is storage related.

Allon can you please take a look at it?
Comment 9 Piotr Kliczewski 2015-10-16 11:08:01 EDT
Can you please run lsof so we can see what type of fd we are leaking?
Comment 10 Jiri Belka 2015-10-22 03:37:55 EDT
You mean decrease nofile, wait for the issue to appear, and then run lsof?
Comment 11 Piotr Kliczewski 2015-10-22 03:43:45 EDT
Correct
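
For example, the leaked descriptor type could be summarized by grouping lsof's TYPE column (REG, FIFO, IPv4, unix, ...), along these lines (a sketch; the pid is a placeholder):

~~~
import subprocess
from collections import Counter

# lsof columns: COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
out = subprocess.check_output(["lsof", "-p", "1234"],
                              universal_newlines=True)
types = Counter(line.split()[4]
                for line in out.splitlines()[1:]  # skip the header
                if len(line.split()) > 4)
for fd_type, count in types.most_common():
    print(fd_type, count)
~~~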
Comment 12 Oved Ourfali 2015-10-25 07:32:09 EDT
Restoring needinfos.
Comment 13 Jiri Belka 2015-10-26 09:10:17 EDT
FYI, this env is the main env for the whole RHEVM QE team (and not only for it). Thus playing with its settings will need planning, and it will take time.
Comment 14 Piotr Kliczewski 2015-10-28 11:58:01 EDT
Any information on which fds are leaking?
Comment 15 Yaniv Lavi 2015-10-29 08:43:11 EDT
In oVirt, testing is done on a single release by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have the testing resources to handle the 4.0 clone.
Comment 16 Oved Ourfali 2015-11-05 03:14:38 EST
Closing on insufficient data.
Jiri - when you have additional information, please re-open.
Comment 17 Jiri Belka 2015-12-23 05:48:47 EST
Got it again:

~~~
Dec 23 10:15:15 example.com vdsm[6309]: vdsm vds ERROR failed to retrieve hardware info
                                                                Traceback (most recent call last):
                                                                  File "/usr/share/vdsm/API.py", line 1326, in getHardwareInfo
                                                                    hw = supervdsm.getProxy().getHardwareInfo()
                                                                  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
                                                                    return callMethod()
                                                                  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
                                                                    **kwargs)
                                                                  File "<string>", line 2, in getHardwareInfo
                                                                  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 755, in _callmethod
                                                                    self._connect()
                                                                  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 742, in _connect
                                                                    conn = self._Client(self._token.address, authkey=self._authkey)
                                                                  File "/usr/lib64/python2.7/multiprocessing/connection.py", line 179, in Client
                                                                    answer_challenge(c, authkey)
                                                                  File "/usr/lib64/python2.7/multiprocessing/connection.py", line 435, in answer_challenge
                                                                    message = connection.recv_bytes(256)         # reject large message
                                                                EOFError
Dec 23 10:15:21 example.com vdsm[6309]: vdsm vds ERROR failed to retrieve hardware info
                                                                Traceback (most recent call last):
                                                                  File "/usr/share/vdsm/API.py", line 1326, in getHardwareInfo
                                                                    hw = supervdsm.getProxy().getHardwareInfo()
                                                                  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
                                                                    return callMethod()
                                                                  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
                                                                    **kwargs)
                                                                  File "<string>", line 2, in getHardwareInfo
                                                                  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 755, in _callmethod
                                                                    self._connect()
                                                                  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 742, in _connect
                                                                    conn = self._Client(self._token.address, authkey=self._authkey)
                                                                  File "/usr/lib64/python2.7/multiprocessing/connection.py", line 173, in Client
                                                                    c = SocketClient(address)
                                                                  File "/usr/lib64/python2.7/multiprocessing/connection.py", line 302, in SocketClient
                                                                    s = socket.socket( getattr(socket, family) )
                                                                  File "/usr/lib64/python2.7/socket.py", line 187, in __init__
                                                                error: [Errno 24] Too many open files
~~~

[root@slot-5b ~]# cat /proc/6309/limits ; grep vdsm /etc/security/limits.d/*
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             192280               192280               processes 
Max open files            1024                 4096                 files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       192280               192280               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us 
       
/etc/security/limits.d/99-vdsm.conf:# VMs run by "qemu" user, vm processes are not relavent to "vdsm" user limits.
/etc/security/limits.d/99-vdsm.conf:vdsm - nproc 4096
/etc/security/limits.d/99-vdsm.conf:vdsm - nofile 12288

[root@slot-5b ~]# lsof -p 6309 | wc -l
1162
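
Note the discrepancy above: /etc/security/limits.d requests nofile 12288, yet the running daemon has soft 1024 / hard 4096, the usual defaults for a systemd-started service, since pam_limits applies limits.d only to login sessions. A process can check what it actually got with a sketch like this:

~~~
import os
import resource

# The limits the process really runs with, regardless of limits.d.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
nfds = len(os.listdir("/proc/self/fd"))
print("open fds: %d of soft limit %d (hard %d)" % (nfds, soft, hard))
~~~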
Comment 18 Jiri Belka 2015-12-23 05:49 EST
Created attachment 1108935 [details]
vdsm fd check output

vdsm-4.17.13-1.el7ev.noarch (HE env)
Comment 19 Oved Ourfali 2015-12-23 05:54:25 EST
Yaniv, can you take a look?
Comment 20 Piotr Kliczewski 2015-12-23 07:40:26 EST
What type of file descriptor was it? How can we reproduce it?
Comment 23 Piotr Kliczewski 2016-01-13 04:51:58 EST
Following the reproduction steps Jiri described to me, I attempted to reproduce it:
- installed hosted engine
- configured iscsi storage
- left it idle

I monitored fds for 1 hour and saw stable fd usage.

Please provide more information on how to reproduce the issue, because with the above description I was unable to.
Comment 24 Jiri Belka 2016-01-13 05:17:34 EST
I have no more info. What do those FDs refer to? Are they host -> engine connections, or something else?

Our env:

- 45 vms up from total of 124 vms
- 6 hosts
- 16 storageconnections (12 nfs, 4 iscsi)
Comment 25 Piotr Kliczewski 2016-01-13 05:37:40 EST
I need steps to reproduce. What is the configuration of the failing host?

With the information that you provided I am not able to reproduce.
Comment 30 Piotr Kliczewski 2016-01-14 05:49:00 EST
Looking at the logs I can see that the issue is not related to vdsm but to supervdsm this time. Here is what I see happening many times:

MainProcess|Thread-2373::WARNING::2015-12-22 20:45:42,334::fileUtils::84::Storage.fileUtils::(validateAccess) Permission denied for directory: /rhev/data-center/mnt/10.34.63.202:_mnt_export_nfs_lv2___brq-setup with permissions: 7
MainProcess|Thread-2373::ERROR::2015-12-22 20:45:42,335::supervdsmServer::118::SuperVdsm.ServerCallback::(wrapper) Error in validateAccess
Traceback (most recent call last):
  File "/usr/share/vdsm/supervdsmServer", line 116, in wrapper
    res = func(*args, **kwargs)
  File "/usr/share/vdsm/supervdsmServer", line 302, in validateAccess
    kwargs=kwargs)
  File "/usr/share/vdsm/supervdsmServer", line 290, in _runAs
    raise err
OSError: [Errno 13] Permission denied

and as a result we get:

MainProcess|Thread-40::ERROR::2015-12-22 20:45:47,292::supervdsmServer::118::SuperVdsm.ServerCallback::(wrapper) Error in hbaRescan
Traceback (most recent call last):
  File "/usr/share/vdsm/supervdsmServer", line 116, in wrapper
    res = func(*args, **kwargs)
  File "/usr/share/vdsm/supervdsmServer", line 499, in hbaRescan
    return hba._rescan()
  File "/usr/share/vdsm/storage/hba.py", line 72, in _rescan
    execCmdLogger=log)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 672, in execCmd
    deathSignal=deathSignal, childUmask=childUmask)
  File "/usr/lib64/python2.7/site-packages/cpopen/__init__.py", line 63, in __init__
    **kw)
  File "/usr/lib64/python2.7/subprocess.py", line 703, in __init__
    errread, errwrite), to_close = self._get_handles(stdin, stdout, stderr)
  File "/usr/lib64/python2.7/subprocess.py", line 1123, in _get_handles
    c2pread, c2pwrite = self.pipe_cloexec()
  File "/usr/lib64/python2.7/subprocess.py", line 1168, in pipe_cloexec
    r, w = os.pipe()
OSError: [Errno 24] Too many open files
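
The merged fix (oVirt gerrit 52365, "supervdsm: failed validateAccess leaves pipes open") matches this picture: every failed validateAccess left its pipe descriptors open, so repeated permission failures slowly exhausted the limit. A minimal illustration of the leak pattern and the fix (hypothetical helper, not the actual supervdsm code):

~~~
import os

def run_as_leaky(func):
    r, w = os.pipe()                 # created before the fork
    pid = os.fork()
    if pid == 0:
        os._exit(func())             # child
    _, status = os.waitpid(pid, 0)
    if status != 0:
        # the exception propagates and r/w are never closed -> leak
        raise OSError(13, "Permission denied")
    os.close(r)
    os.close(w)

def run_as_fixed(func):
    r, w = os.pipe()
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(func())
        _, status = os.waitpid(pid, 0)
        if status != 0:
            raise OSError(13, "Permission denied")
    finally:
        # close both ends unconditionally, so repeated failures
        # cannot exhaust the nofile limit
        os.close(r)
        os.close(w)
~~~

Calling the leaky variant in a loop with a failing func leaks two descriptors per call; the fixed variant stays flat.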
Comment 31 Jiri Belka 2016-02-05 09:31:45 EST
ok, vdsm-4.17.19-0.el7ev.noarch

can't reproduce on an env where vdsm has problems accessing the HE storage domain
