Bug 1618352
Summary: | [downstream clone - 4.2.6] "OSError: [Errno 24] Too many open files" after ISO domain is non-responsive for 27 hours | ||
---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | RHV bug bot <rhv-bugzilla-bot> |
Component: | vdsm | Assignee: | Nir Soffer <nsoffer> |
Status: | CLOSED ERRATA | QA Contact: | Kevin Alon Goldblatt <kgoldbla> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 4.1.7 | CC: | ebenahar, eheftman, lsurette, ndave, nsoffer, obockows, srevivo, tnisan, ycui, ylavi |
Target Milestone: | ovirt-4.2.6 | Keywords: | TestOnly, ZStream |
Target Release: | 4.2.6 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | 1547679 | Environment: | |
Last Closed: | 2018-09-04 13:43:24 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1547679 | ||
Bug Blocks: |
Description
RHV bug bot
2018-08-16 13:26:06 UTC
(In reply to Nirav Dave from comment #0)
> The _checkDomainStatus is failing, after recursive exception message

What do you mean by "recursive exception message"?

Please attach vdsm logs from the time vdsm was started until the time of the error mentioned in comment 0.

Also please share the output of:

- ls -lht /rhev/data-center/mnt/*/*
- rpm -qa

on the host where this error was seen.

(Originally by Nir Soffer)

(In reply to Nir Soffer from comment #2)
> (In reply to Nirav Dave from comment #0)
> > The _checkDomainStatus is failing, after recursive exception message
>
> What do you mean by "recursive exception message"?

My mistake, I meant a repetitive exception message. It is expected, since the query checking the storage domain runs at a fixed interval, and every failure to reach the expected storage domain throws the exception again.

> Please attach vdsm logs from the time vdsm was started until the time of
> the error mentioned in comment 0.

I will be attaching logs shortly.

> Also please share the output of:
>
> - ls -lht /rhev/data-center/mnt/*/*

I do not have the full logs; let me ask for them.

> - rpm -qa

Do you need the vdsm version, or a list of all the packages on the host?

> On the host where this error was seen.

In the vdsm logs - I hope I am understanding your question correctly.

Thanks & Regards,
Nirav Dave

(Originally by Nirav Dave)

Thanks Nirav, but we need the data I requested attached to the bug. We cannot use external resources that may not be available when someone works on this bug.

(Originally by Nir Soffer)

Next step: reproduce this on 4.2 to see how recent changes in ioprocess affect this issue when a file based domain is stuck for many hours.

(Originally by Nir Soffer)

This is comment 16, with private info filtered from the logs.

Looking at the vdsm logs from the last restart until the "Too many open files" error started:

1. The ISO domain gets stuck at 2018-02-13 11:27:25

$ grep '\[storage.check\]' vdsm.log.13 | head -5
2018-02-13 11:27:25,694+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520/dom_md/metadata' is blocked for 10.00 seconds (check:279)
2018-02-13 11:27:35,693+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520/dom_md/metadata' is blocked for 20.00 seconds (check:279)
2018-02-13 11:27:45,700+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520/dom_md/metadata' is blocked for 30.00 seconds (check:279)
2018-02-13 11:27:55,700+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520/dom_md/metadata' is blocked for 40.00 seconds (check:279)
2018-02-13 11:28:05,701+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520/dom_md/metadata' is blocked for 50.00 seconds (check:279)

2. More than 27 hours later, the ISO domain is still stuck

$ grep '\[storage.check\]' vdsm.log.7 | tail -5
2018-02-14 15:00:15,695+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520/dom_md/metadata' is blocked for 99180.00 seconds (check:279)
2018-02-14 15:00:25,695+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520/dom_md/metadata' is blocked for 99190.00 seconds (check:279)
2018-02-14 15:00:35,693+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520/dom_md/metadata' is blocked for 99200.00 seconds (check:279)
2018-02-14 15:00:45,701+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520/dom_md/metadata' is blocked for 99210.00 seconds (check:279)
2018-02-14 15:00:55,700+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520/dom_md/metadata' is blocked for 99220.00 seconds (check:279)

3. Vdsm times out trying to check the ISO domain

2018-02-14 08:01:20,972+0100 ERROR (monitor/36fd2ae) [storage.Monitor] Error checking domain 36fd2aeb-9476-4dad-ac03-8ed62f4b8520 (monitor:426)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 407, in _checkDomainStatus
    self.domain.selftest()
  ...
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 455, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out

At this point the vdsm call failed, but ioprocess is still stuck on storage, probably in D state.

4. The ISO domain monitor thread starts a new ioprocess client:

2018-02-14 08:01:30,977+0100 INFO (monitor/36fd2ae) [IOProcessClient] Starting client ioprocess-8321 (__init__:325)

This process is terminated at:

2018-02-14 08:02:34,283+0100 INFO (monitor/ff1ce91) [IOProcessClient] Closing client ioprocess-8321 (__init__:593)

by another monitor thread - we terminate idle ioprocesses when starting new instances.

5. Starting a new client again:

2018-02-14 08:02:41,123+0100 INFO (monitor/36fd2ae) [IOProcessClient] Starting client ioprocess-8326 (__init__:325)

Again, stopped after a minute:

2018-02-14 08:03:44,323+0100 INFO (monitor/ff1ce91) [IOProcessClient] Closing client ioprocess-8326 (__init__:593)

$ grep '(monitor/36fd2ae) \[IOProcessClient\] Starting client ioprocess' vdsm.log.8 vdsm.log.7 | wc -l
266

266 ioprocess instances were started. Since the ISO domain is not responsive, every new ioprocess process is probably stuck trying to access the non-responsive ISO domain. When vdsm closes the client, the client starts a new thread that kills and waits for the process, but returns immediately - before the process terminates.

I think what could have happened here is that we started 266 concurrent ioprocesses. Every client uses 8 file descriptors in vdsm, consuming 2128 file descriptors.

Vdsm also refreshes caches every 5 minutes; for file based domains this means scanning the domains every 5 minutes. This is done by starting an ioprocess instance for every directory in /rhev/data-center/mnt, looking for domain mount directories. If the ISO domain is stuck, the ioprocess started on /rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520 will get stuck as well.
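The scan pattern described above can be illustrated with a short sketch. This is a simplified model, not vdsm's actual code: one helper child process per directory under the mount root, each waited on with a timeout. The mount root path, the use of ls as the helper, and the timeout value are assumptions for illustration only.

# Simplified model (not vdsm code) of the per-directory scan described above.
# One child process is started per directory under the mount root; a child
# reading a non-responsive mount never returns, so it - and the pipe file
# descriptors the parent holds for it - is left behind on every refresh.
import os
import subprocess

MNT_ROOT = "/rhev/data-center/mnt"   # assumed vdsm mount root
SCAN_TIMEOUT = 10                    # seconds, illustrative value

def scan_mounts(root=MNT_ROOT, timeout=SCAN_TIMEOUT):
    """Scan every mount directory with its own helper child process."""
    if not os.path.isdir(root):
        return []
    stuck = []
    for name in os.listdir(root):
        path = os.path.join(root, name)
        # The stdout/stderr pipes are the kind of per-child descriptors
        # that accumulate in the parent when children never exit.
        child = subprocess.Popen(["ls", path],
                                 stdout=subprocess.PIPE,
                                 stderr=subprocess.PIPE)
        try:
            child.communicate(timeout=timeout)
        except subprocess.TimeoutExpired:
            # A child blocked in D state on a dead mount cannot be reaped;
            # its pipes stay open in the parent process.
            stuck.append((child.pid, path))
    return stuck

if __name__ == "__main__":
    for pid, path in scan_mounts():
        print("helper %d stuck on %s" % (pid, path))

With one such scan every 5 minutes and one stuck helper per scan, the count of stuck children grows roughly linearly for as long as the mount stays unresponsive.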
We see 547 ioprocess clients started this way:

$ grep '(itmap/.) \[IOProcessClient\] Starting client ioprocess' vdsm.log.8 vdsm.log.7 | wc -l
547

We can see in the log that each time we scan 4 directories:

vdsm.log.7:2018-02-14 13:11:06,043+0100 INFO (itmap/0) [IOProcessClient] Starting client ioprocess-9184 (__init__:325)
vdsm.log.7:2018-02-14 13:11:06,075+0100 INFO (itmap/1) [IOProcessClient] Starting client ioprocess-9185 (__init__:325)
vdsm.log.7:2018-02-14 13:11:06,105+0100 INFO (itmap/2) [IOProcessClient] Starting client ioprocess-9186 (__init__:325)
vdsm.log.7:2018-02-14 13:11:06,134+0100 INFO (itmap/3) [IOProcessClient] Starting client ioprocess-9187 (__init__:325)

Every call, one ioprocess may get stuck, so we may have 136 instances started this way, which uses 1088 file descriptors.

So we may have 402 stuck ioprocess instances, using 3216 file descriptors. This can explain why we reach the 4096 limit, since vdsm uses file descriptors for other tasks.
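A minimal sketch of the descriptor arithmetic above, using the figures from this analysis (8 descriptors per client, 266 monitor-started and 136 scan-started stuck clients). The limit printed is whatever the process running the sketch has; vdsm's limit on the affected host was 4096.

# Back-of-the-envelope check of the descriptor usage estimated above.
# The per-client figure (8 fds) and the stuck-client counts (266 + 136)
# come from this analysis; the limit is read from the current process,
# which may differ from vdsm's 4096 on the affected host.
import resource

FDS_PER_CLIENT = 8           # fds vdsm holds per ioprocess client (from above)
STUCK_MONITOR_CLIENTS = 266  # clients started by the domain monitor thread
STUCK_SCAN_CLIENTS = 136     # clients started by the 5-minute directory scan

stuck = STUCK_MONITOR_CLIENTS + STUCK_SCAN_CLIENTS
leaked_fds = stuck * FDS_PER_CLIENT

soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
print("stuck clients: %d, leaked fds: %d, soft limit here: %d"
      % (stuck, leaked_fds, soft_limit))
# 402 stuck clients leaking 3216 descriptors leave little headroom below a
# 4096 limit once vdsm's other descriptors are counted.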
Summary:

1. The root cause is the stuck ISO domain; the customer must fix this issue. Moving the ISO domain to maintenance and activating it again may help. If it does not help, stopping vdsm, force-unmounting this mount, and killing any stuck ioprocesses should fix it.

2. IOProcess lifetime is not managed properly. We should never start more than *one* instance per file domain in the system. This is a known issue. We have these old patches fixing it: https://gerrit.ovirt.org/#/q/topic:oop-per-mount+is:open but they need a rewrite, since the code has changed a lot since they were posted. This change is quite big, so it is best deferred to 4.3. We can consider a backport to 4.2.z once it is merged in master.

Increasing priority to high, since this issue may break storage domain isolation - trouble with one domain can cause unrelated operations to fail.

(Originally by Nir Soffer)

This should be improved in 4.2.3 in terms of ioprocess. There is another bug on renewing stale NFS mounts; please track bug #1520008.

(Originally by ylavi)

To test if this issue is resolved:

1. Set up a system with several storage domains (NFS, iSCSI or FC, ISO)
2. Block access to the storage providing the ISO domain
3. Wait a day, watching (see the monitoring sketch below):
   - the number of ioprocess child processes
   - the number of open file descriptors in vdsm
4. Make sure we don't reach the "Too many open files" error

(Originally by Nir Soffer)
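A minimal monitoring sketch for step 3 above; it is not part of vdsm. It samples the number of processes whose command line mentions "ioprocess" and the descriptors held by the vdsm process by reading /proc directly. The process names ("ioprocess", "vdsmd"), the one-minute interval, and the need to run as root are assumptions and may need adjusting.

# Minimal watcher for step 3 above (not part of vdsm): once a minute, count
# processes whose command line mentions "ioprocess" and the open file
# descriptors of the vdsm process(es), by reading /proc directly.
import os
import time

def pids_by_name(name):
    """Pids of processes whose command line contains the given string."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/cmdline" % entry, "rb") as f:
                cmdline = f.read()
        except (IOError, OSError):
            continue  # process exited while we were scanning
        if name.encode() in cmdline:
            pids.append(int(entry))
    return pids

def open_fds(pid):
    """Number of open file descriptors of a process (needs root for vdsm)."""
    return len(os.listdir("/proc/%d/fd" % pid))

if __name__ == "__main__":
    while True:
        ioprocess_count = len(pids_by_name("ioprocess"))
        for pid in pids_by_name("vdsmd"):
            print("%s vdsm pid %d: %d open fds, %d ioprocess processes"
                  % (time.strftime("%Y-%m-%d %H:%M:%S"),
                     pid, open_fds(pid), ioprocess_count))
        time.sleep(60)

The same numbers can be taken by hand with ps and /proc, as shown in the verification results below; the loop only makes it easy to leave running for a day.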
Verified with the following code:
--------------------------------------
ovirt-engine-4.2.5.3-0.1.el7ev.noarch
vdsm-4.20.35-1.el7ev.x86_64

Verified with the following scenario:
--------------------------------------
Steps to reproduce:
-------------------------
1. Set up a system with several storage domains (NFS, iSCSI or FC, ISO)
2. Block access to the storage providing the ISO domain
3. Wait a day, watching:
   - the number of ioprocess child processes
   - the number of open file descriptors in vdsm
4. Make sure we don't reach the "Too many open files" error

After 24 hours the results were consistently as follows:
-----------------------------------------------------------
ps -ef |grep ioprocess |wc -l
17
ls -ltr /proc/29965/fd |wc -l
153

ps -ef |grep ioprocess |wc -l
16
ls -ltr /proc/29965/fd |wc -l
147

There were also no "Too many open files" errors reported.

Moving to VERIFIED

(Originally by Kevin Goldblatt)

Tal, since this seems to be fixed in 4.2, why not move the bug to 4.2.6?

(Originally by Nir Soffer)

Created attachment 1476162 [details]
engine and vdsm logs
Added logs
(Originally by Kevin Goldblatt)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2624