Bug 1618352
Summary: | [downstream clone - 4.2.6] "OSError: [Errno 24] Too many open files" after ISO domain is non-responsive for 27 hours | ||
---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | RHV bug bot <rhv-bugzilla-bot> |
Component: | vdsm | Assignee: | Nir Soffer <nsoffer> |
Status: | CLOSED ERRATA | QA Contact: | Kevin Alon Goldblatt <kgoldbla> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 4.1.7 | CC: | ebenahar, eheftman, lsurette, ndave, nsoffer, obockows, srevivo, tnisan, ycui, ylavi |
Target Milestone: | ovirt-4.2.6 | Keywords: | TestOnly, ZStream |
Target Release: | 4.2.6 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | 1547679 | Environment: | |
Last Closed: | 2018-09-04 13:43:24 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1547679 | ||
Bug Blocks: |
Description
RHV bug bot
2018-08-16 13:26:06 UTC
(In reply to Nirav Dave from comment #0)
> The _checkDomainStatus is failing, after recursive exception message

What do you mean by "recursive exception message"?

Please attach vdsm logs from the time vdsm was started until the time of the error mentioned in comment 0.

Also please share the output of:

- ls -lht /rhev/data-center/mnt/*/*
- rpm -qa

on the host where this error was seen.

(Originally by Nir Soffer)

(In reply to Nir Soffer from comment #2)
> (In reply to Nirav Dave from comment #0)
> > The _checkDomainStatus is failing, after recursive exception message
>
> What do you mean by "recursive exception message"?

My mistake, I meant a repetitive exception message. It is expected, since the query checking the storage domain runs at a fixed interval, and every failure to reach the expected storage domain throws the exception again.

> Please attach vdsm logs from the time vdsm was started until the time of
> the error mentioned in comment 0.

I will be attaching logs shortly.

> Also please share the output of:
>
> - ls -lht /rhev/data-center/mnt/*/*

I do not have the full logs; let me ask for them.

> - rpm -qa

Do you need the vdsm version, or a list of all the packages on the host?

> On the host where this error was seen.

In the vdsm logs - I hope I am understanding your question correctly.

Thanks & Regards,
Nirav Dave

(Originally by Nirav Dave)

Thanks Nirav, but we need the data I requested attached to the bug. We cannot use external resources that may not be available when someone works on this bug.

(Originally by Nir Soffer)

Next step: reproduce this on 4.2 to see how recent changes in ioprocess affect this issue when a file based domain is stuck for many hours.

(Originally by Nir Soffer)

This is comment 16, with private info filtered from the logs.

Looking at the vdsm logs from the last restart until the "Too many open files" error started:

1. The ISO domain gets stuck at 2018-02-13 11:27:25

$ grep '\[storage.check\]' vdsm.log.13 | head -5
2018-02-13 11:27:25,694+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520/dom_md/metadata' is blocked for 10.00 seconds (check:279)
2018-02-13 11:27:35,693+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520/dom_md/metadata' is blocked for 20.00 seconds (check:279)
2018-02-13 11:27:45,700+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520/dom_md/metadata' is blocked for 30.00 seconds (check:279)
2018-02-13 11:27:55,700+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520/dom_md/metadata' is blocked for 40.00 seconds (check:279)
2018-02-13 11:28:05,701+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520/dom_md/metadata' is blocked for 50.00 seconds (check:279)

2. More than 27 hours later, the ISO domain is still stuck

$ grep '\[storage.check\]' vdsm.log.7 | tail -5
2018-02-14 15:00:15,695+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520/dom_md/metadata' is blocked for 99180.00 seconds (check:279)
2018-02-14 15:00:25,695+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520/dom_md/metadata' is blocked for 99190.00 seconds (check:279)
2018-02-14 15:00:35,693+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520/dom_md/metadata' is blocked for 99200.00 seconds (check:279)
2018-02-14 15:00:45,701+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520/dom_md/metadata' is blocked for 99210.00 seconds (check:279)
2018-02-14 15:00:55,700+0100 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520/dom_md/metadata' is blocked for 99220.00 seconds (check:279)

3. Vdsm times out trying to check the ISO domain

2018-02-14 08:01:20,972+0100 ERROR (monitor/36fd2ae) [storage.Monitor] Error checking domain 36fd2aeb-9476-4dad-ac03-8ed62f4b8520 (monitor:426)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 407, in _checkDomainStatus
    self.domain.selftest()
  ...
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 455, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out

At this point the vdsm call failed, but ioprocess is still stuck on storage, probably in D state.

4. The ISO domain monitor thread starts a new ioprocess client:

2018-02-14 08:01:30,977+0100 INFO (monitor/36fd2ae) [IOProcessClient] Starting client ioprocess-8321 (__init__:325)

This process is terminated at:

2018-02-14 08:02:34,283+0100 INFO (monitor/ff1ce91) [IOProcessClient] Closing client ioprocess-8321 (__init__:593)

by another monitor thread - we terminate idle ioprocesses when starting new instances.

5. Starting a new client again:

2018-02-14 08:02:41,123+0100 INFO (monitor/36fd2ae) [IOProcessClient] Starting client ioprocess-8326 (__init__:325)

Again, stopped after a minute:

2018-02-14 08:03:44,323+0100 INFO (monitor/ff1ce91) [IOProcessClient] Closing client ioprocess-8326 (__init__:593)

$ grep '(monitor/36fd2ae) \[IOProcessClient\] Starting client ioprocess' vdsm.log.8 vdsm.log.7 | wc -l
266

266 ioprocess instances were started. Since the ISO domain is not responsive, every new ioprocess process is probably stuck trying to access the non-responsive ISO domain. When vdsm closes the client, the client starts a new thread that kills and waits for the process, but returns immediately - before the process terminates.

I think what could have happened here is that we started 266 concurrent ioprocesses. Every client uses 8 file descriptors in vdsm, consuming 2128 file descriptors.

Vdsm also refreshes caches every 5 minutes; for file based domains this means scanning the domains every 5 minutes. This is done by starting an ioprocess instance for every directory in /rhev/data-center/mnt, looking for domain mount directories. If the ISO domain is stuck, the ioprocess started on /rhev/data-center/mnt/storage.1:_iso/36fd2aeb-9476-4dad-ac03-8ed62f4b8520 will get stuck as well.
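The scan pattern described above can be illustrated with a short sketch. This is a simplified model, not vdsm's actual code: one helper child process per directory under the mount root, each waited on with a timeout. The mount root path, the use of ls as the helper, and the timeout value are assumptions for illustration only.

# Simplified model (not vdsm code) of the per-directory scan described above.
# One child process is started per directory under the mount root; a child
# reading a non-responsive mount never returns, so it - and the pipe file
# descriptors the parent holds for it - is left behind on every refresh.
import os
import subprocess

MNT_ROOT = "/rhev/data-center/mnt"   # assumed vdsm mount root
SCAN_TIMEOUT = 10                    # seconds, illustrative value

def scan_mounts(root=MNT_ROOT, timeout=SCAN_TIMEOUT):
    """Scan every mount directory with its own helper child process."""
    if not os.path.isdir(root):
        return []
    stuck = []
    for name in os.listdir(root):
        path = os.path.join(root, name)
        # The stdout/stderr pipes are the kind of per-child descriptors
        # that accumulate in the parent when children never exit.
        child = subprocess.Popen(["ls", path],
                                 stdout=subprocess.PIPE,
                                 stderr=subprocess.PIPE)
        try:
            child.communicate(timeout=timeout)
        except subprocess.TimeoutExpired:
            # A child blocked in D state on a dead mount cannot be reaped;
            # its pipes stay open in the parent process.
            stuck.append((child.pid, path))
    return stuck

if __name__ == "__main__":
    for pid, path in scan_mounts():
        print("helper %d stuck on %s" % (pid, path))

With one such scan every 5 minutes and one stuck helper per scan, the count of stuck children grows roughly linearly for as long as the mount stays unresponsive.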
We see 547 ioprocess clients started this way:

$ grep '(itmap/.) \[IOProcessClient\] Starting client ioprocess' vdsm.log.8 vdsm.log.7 | wc -l
547

We can see in the log that each time we scan 4 directories:

vdsm.log.7:2018-02-14 13:11:06,043+0100 INFO (itmap/0) [IOProcessClient] Starting client ioprocess-9184 (__init__:325)
vdsm.log.7:2018-02-14 13:11:06,075+0100 INFO (itmap/1) [IOProcessClient] Starting client ioprocess-9185 (__init__:325)
vdsm.log.7:2018-02-14 13:11:06,105+0100 INFO (itmap/2) [IOProcessClient] Starting client ioprocess-9186 (__init__:325)
vdsm.log.7:2018-02-14 13:11:06,134+0100 INFO (itmap/3) [IOProcessClient] Starting client ioprocess-9187 (__init__:325)

Every call, one ioprocess may get stuck, so we may have 136 instances started this way, which uses 1088 file descriptors.

So we may have 402 stuck ioprocess instances, using 3216 file descriptors. This can explain why we reach the 4096 limit, since vdsm uses file descriptors for other tasks.
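A minimal sketch of the descriptor arithmetic above, using the figures from this analysis (8 descriptors per client, 266 monitor-started and 136 scan-started stuck clients). The limit printed is whatever the process running the sketch has; vdsm's limit on the affected host was 4096.

# Back-of-the-envelope check of the descriptor usage estimated above.
# The per-client figure (8 fds) and the stuck-client counts (266 + 136)
# come from this analysis; the limit is read from the current process,
# which may differ from vdsm's 4096 on the affected host.
import resource

FDS_PER_CLIENT = 8           # fds vdsm holds per ioprocess client (from above)
STUCK_MONITOR_CLIENTS = 266  # clients started by the domain monitor thread
STUCK_SCAN_CLIENTS = 136     # clients started by the 5-minute directory scan

stuck = STUCK_MONITOR_CLIENTS + STUCK_SCAN_CLIENTS
leaked_fds = stuck * FDS_PER_CLIENT

soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
print("stuck clients: %d, leaked fds: %d, soft limit here: %d"
      % (stuck, leaked_fds, soft_limit))
# 402 stuck clients leaking 3216 descriptors leave little headroom below a
# 4096 limit once vdsm's other descriptors are counted.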
Summary:

1. The root cause is the stuck ISO domain; the customer must fix this issue. Moving the ISO domain to maintenance and activating it again may help. If it does not help, stopping vdsm, force-unmounting this mount, and killing any stuck ioprocesses should fix it.

2. IOProcess lifetime is not managed properly. We should never start more than *one* instance per file domain in the system. This is a known issue. We have these old patches fixing it: https://gerrit.ovirt.org/#/q/topic:oop-per-mount+is:open but they need a rewrite, since the code has changed a lot since they were posted. This change is quite big, so it is best deferred to 4.3. We can consider a backport to 4.2.z once it is merged in master.

Increasing priority to high, since this issue may break storage domain isolation - trouble with one domain can cause unrelated operations to fail.

(Originally by Nir Soffer)

This should be improved in 4.2.3 in terms of ioprocess. There is another bug on renewing stale NFS mounts; please track bug #1520008.

(Originally by ylavi)

To test if this issue is resolved:

1. Set up a system with several storage domains (NFS, iSCSI or FC, ISO)
2. Block access to the storage providing the ISO domain
3. Wait a day, watching (see the monitoring sketch below):
   - the number of ioprocess child processes
   - the number of open file descriptors in vdsm
4. Make sure we don't reach the "Too many open files" error

(Originally by Nir Soffer)
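A minimal monitoring sketch for step 3 above; it is not part of vdsm. It samples the number of processes whose command line mentions "ioprocess" and the descriptors held by the vdsm process by reading /proc directly. The process names ("ioprocess", "vdsmd"), the one-minute interval, and the need to run as root are assumptions and may need adjusting.

# Minimal watcher for step 3 above (not part of vdsm): once a minute, count
# processes whose command line mentions "ioprocess" and the open file
# descriptors of the vdsm process(es), by reading /proc directly.
import os
import time

def pids_by_name(name):
    """Pids of processes whose command line contains the given string."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/cmdline" % entry, "rb") as f:
                cmdline = f.read()
        except (IOError, OSError):
            continue  # process exited while we were scanning
        if name.encode() in cmdline:
            pids.append(int(entry))
    return pids

def open_fds(pid):
    """Number of open file descriptors of a process (needs root for vdsm)."""
    return len(os.listdir("/proc/%d/fd" % pid))

if __name__ == "__main__":
    while True:
        ioprocess_count = len(pids_by_name("ioprocess"))
        for pid in pids_by_name("vdsmd"):
            print("%s vdsm pid %d: %d open fds, %d ioprocess processes"
                  % (time.strftime("%Y-%m-%d %H:%M:%S"),
                     pid, open_fds(pid), ioprocess_count))
        time.sleep(60)

The same numbers can be taken by hand with ps and /proc, as shown in the verification results below; the loop only makes it easy to leave running for a day.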
Verified with the following code:
--------------------------------------
ovirt-engine-4.2.5.3-0.1.el7ev.noarch
vdsm-4.20.35-1.el7ev.x86_64

Verified with the following scenario:
--------------------------------------
Steps to reproduce:
-------------------------
1. Set up a system with several storage domains (NFS, iSCSI or FC, ISO)
2. Block access to the storage providing the ISO domain
3. Wait a day, watching:
   - the number of ioprocess child processes
   - the number of open file descriptors in vdsm
4. Make sure we don't reach the "Too many open files" error

After 24 hours the results were consistently as follows:
-----------------------------------------------------------
ps -ef |grep ioprocess |wc -l
17
ls -ltr /proc/29965/fd |wc -l
153

ps -ef |grep ioprocess |wc -l
16
ls -ltr /proc/29965/fd |wc -l
147

There were also no "Too many open files" errors reported.

Moving to VERIFIED

(Originally by Kevin Goldblatt)

Tal, since this seems to be fixed in 4.2, why not move the bug to 4.2.6?

(Originally by Nir Soffer)

Created attachment 1476162 [details]
engine and vdsm logs
Added logs
(Originally by Kevin Goldblatt)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2624