Bug 1410120 - [scale] - vdsm error due to Resource temporarily unavailable

Product: [oVirt] vdsm
Component: Core
Version: 4.18.17
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: high
Priority: medium
Keywords: Performance
Reporter: Eldad Marciano <emarcian>
Assignee: Liron Aravot <laravot>
QA Contact: Eldad Marciano <emarcian>
CC: bugs, eberman, emarcian, gklein, laravot, nsoffer, rgolan, tnisan, ylavi
Flags: sbonazzo: ovirt-4.1-, sbonazzo: exception-
oVirt Team: Storage
Type: Bug
Last Closed: 2017-03-27 18:21:00 UTC
Description
Eldad Marciano
2017-01-04 14:15:45 UTC
Please re-test with 4.1. If it's not a regression, I don't see it as a blocker. How many VMs are running / trying to run / stopped / etc.? Does it happen with FAR fewer domains? Hosts?

We will upgrade our env and update ASAP. There were around 47 tasks in the background when we faced the issue, which means 47 VMs trying to run, on top of 27 hosts and 12 SDs. 27 hosts / 47 VMs (tasks) is roughly 0.5, which is a reasonable ratio. Currently we are testing it with the following patch set: https://gerrit.ovirt.org/#/q/topic:slow_host_start

(In reply to Yaniv Kaul from comment #1)
> Please re-test with 4.1.
> If it's not a regression, I don't see it as a blocker.
> How many VMs are running / trying to run / stopped / etc.?
>
> Does it happen with FAR fewer domains? Hosts?

The bug also reproduced in 4.1 with the same errors.
engine release: ovirt-engine-4.1.0-0.3.beta2.el7.noarch
vdsm-jsonrpc-java release: vdsm-jsonrpc-java-1.3.5-1.el7ev.noarch
vdsm release: vdsm-4.19.1-1.el7ev.x86_64
Mainly around: OSError: [Errno 11] Resource temporarily unavailable
Adding logs.

Please add dmesg or journalctl from the time the error occurs.

If slow storage is the culprit and we don't have an iSCSI setup to test this with, you can spin up all the VMs without disks (make them boot from PXE) and let's see if the problem occurs.

(In reply to Roy Golan from comment #10)
> If slow storage is the culprit and we don't have an ISCSI setup to test this
> with you can spin all the VMs without disks (make them boot from PXE) and
> lets see it the problems occurs.

Hi Roy, I tested it as you suggested; the problem was NOT reproduced.
Test profile: 1 NFS SD (foreman), 27 hosts, VMs without disks (just network) - not reproduced.

Liron - any progress?

I've looked, and we had a similar issue in the past caused by too many open file descriptors (see BZ 650588). In Eldad's scenario he ran a large number of VMs with disks on NFS domains, which probably means too many open FDs as well, causing those errors. I'm not sure there's something we can do here - Nir, any other insight? Perhaps there are bugs in other components that this one should depend on.

(In reply to Liron Aravot from comment #13)
> I've looked and we had similar issue on the past caused by too many open
> file descriptors (see BZ 650588). In Eldad's scenario he ran a large number
> of vms with disks on NFS domains, which probably means too many open FDs as
> well causing to those errors.

Please sync with Eldad. I did not get the impression there were many VMs (or disks) running at the time when the issue happened. I think the storage is flaky.

> I'm not sure there's something we can do here - Nir, any other insight?
> perhaps there are bugs on other components that this one should depend on.

If it's a scale issue, we can have a Docs bug on the need to increase the number of open files. But I doubt it's the issue.

(In reply to Yaniv Kaul from comment #14)
> Please sync with Eldad. I did not get the impression there were many VMs (or
> disks) running at the time when the issue happened. I think the storage is
> flakey.

Eldad, can you please elaborate on how many VMs were running on the host and how many disks were plugged into each VM? In the reproduction with VMs without disks (see comment #11) the issue didn't reproduce - that led me to go in that direction.

* on each relevant host

I missed that in comment #2 Eldad wrote that he had only 47 running VMs (in the initial description the number is 2000). Will update here.

(In reply to Liron Aravot from comment #17)
> I've missed that in comment #2 Eldad wrote that he had only 47 running vms
> (in the initial description the number is 2000). Will update here.

Hi Liron, that's correct. We have a setup with more than 2K VMs; as mentioned in comment #2, when we reproduced the problem we had only 47 VMs running. Based on the previous assumption regarding the FD amount: does the FD count reflect only the disks of running VMs, or any existing VM in any status?

(In reply to Eldad Marciano from comment #18)
> Hi Liron, thats correct.
> we have a setup with more than 2K vms.
> as mentioned in comment #2 when we reproduced the problem we have only 47
> vms running.
> based on previous assumption, regarding the FD amount, is the FDs amount
> reflects only disks of running vms? or any existing vm with any status?

Eldad, can you reproduce that and monitor the number of FDs so we know the urgency here better? A sosreport should include /proc, so that is essential, and please also collect lsof output while running it if you can.

(In reply to Roy Golan from comment #19)
> Eldad can you reproduce that and monitor the number of FD's so we'll know
> better the urgency here? sosreport should have /proc so that is essential
> and also if you can collect lsof while running it

Yes, we need to prioritize this test.
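For illustration, a minimal sketch of the kind of FD sampling being asked for above. This is not part of any vdsm tooling; it assumes the vdsm process can be located by the name "vdsmd" and that the script runs as root so /proc/<pid>/fd is readable.

# fd_watch.py - print the number of open file descriptors of the vdsm
# process once per second (hypothetical helper, not shipped with vdsm).
import os
import time
import subprocess

def vdsm_pid():
    # pgrep -f matches the full command line; take the first match.
    out = subprocess.check_output(["pgrep", "-f", "vdsmd"]).decode()
    return int(out.split()[0])

def count_fds(pid):
    # Every entry under /proc/<pid>/fd is one open file descriptor.
    return len(os.listdir("/proc/%d/fd" % pid))

if __name__ == "__main__":
    pid = vdsm_pid()
    while True:
        print(time.strftime("%H:%M:%S"), count_fds(pid))
        time.sleep(1)

Correlating this output with the times of the "Resource temporarily unavailable" errors would show whether the FD count is actually approaching the process limit.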
This is not related to the number of FDs.

Jan 4 11:35:58 b02-h27-r620 journal: vdsm IOProcess WARNING (432470) Request queue full

This means that ioprocess for this domain already has irs:process_pool_max_queued_slots_per_domain requests in the queue. All requests received at this point will fail immediately. This probably happens because the test is trying to start too many VMs at the same time with slow storage, and the system does not limit the number of operations as it should. It is also possible that the defaults are too low; maybe we should increase the size of the queue. But I would be careful with this, as bigger queues just hide the problem and do not help anyone. If the storage is slow and the rate of operations is bigger than the rate the storage can provide, the requests will sit in the queue and finally time out after 60 seconds.

Eldad, can you try to tune the number of queued slots? You can create /etc/vdsm/vdsm.conf.d/50-scale.conf:

[irs]
process_pool_max_slots_per_domain = 10
process_pool_max_queued_slots_per_domain = 20

Also, we have important debugging logs in ioprocess that are disabled by default. When you test this issue, please enable IOProcess debug logs in /etc/vdsm/logger.conf:

[logger_IOProcess]
level=DEBUG
handlers=logfile
qualname=IOProcess
propagate=0

You can also change the log level dynamically during the test without changing the config file:

vdsm-client Host setLogLevel level=DEBUG name=IOProcess

And restart vdsm.
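The "fail immediately once the queue is full" behaviour described above is essentially a bounded queue with a non-blocking put. A toy sketch of that pattern, not ioprocess's actual implementation; the names and sizes are only illustrative:

# Toy model of a per-domain request queue that fails fast when full,
# mirroring the process_pool_max_queued_slots_per_domain behaviour.
import queue

MAX_QUEUED_SLOTS = 10  # analogous to process_pool_max_queued_slots_per_domain

requests = queue.Queue(maxsize=MAX_QUEUED_SLOTS)

def submit(request):
    try:
        requests.put_nowait(request)   # never wait for a free slot
    except queue.Full:
        # Fail right away so the caller can react (e.g. retry elsewhere)
        # instead of letting the request sit and time out after 60 seconds.
        raise RuntimeError("Request queue full: %r" % (request,))

# Example: once the 10 slots are taken, the next submission is rejected.
for i in range(MAX_QUEUED_SLOTS):
    submit("op-%d" % i)
try:
    submit("op-overflow")
except RuntimeError as err:
    print(err)

Enlarging MAX_QUEUED_SLOTS makes the rejection less likely but, as noted above, also lets more requests pile up behind slow storage.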
Sure, we should coordinate this test and prioritize it in our queue. Eyal, please consider this bug.

Need to retest.

Roy - can you please elaborate on the retest requirements?

Nir, aren't we failing to handle the error properly here? If VDSM can't handle that create request, then it should fail it and not return 'waitForLaunch'. Then the engine will try to run the VM on a different server. Or - if vdsm isn't performing this request in the synchronous part of the 'create' verb, then vdsm should move those VMs to an error state from waitForLaunch. In both cases the host shouldn't go non-responsive. Eldad, please increase the queue and retest. It's possible that a timeout is good enough for now.

(In reply to Roy Golan from comment #24)
> Nir aren't we failing to handle the error properly here?
>
> If VDSM can't handle that create request, then it should fail it and not
> return 'waitForLaunch'. Then the engine will try to run the vm on a
> differnt server.
>
> Or - if vdsm isn't performing this request on the synchronious part of the
> 'create' verb, then vdsm should move those vms to error state from
> waitForLaunch.
>
> On both cases the host shouldn't go non responsive.
>
> Eldad please increase the queue and retest. Its possible that a timeout is
> good enough for now.

Hi, I used slowfs (https://github.com/nirs/slowfs) with a 2 sec delay for all operations. However, I didn't enlarge the queue size in vdsm, in order to see the bug reproduced with the following profile:
- 1 SD (slow NFS)
- 1 host

Currently I can't see the IOProcess queue issue, but I do see other issues, such as:

2017-02-06 15:39:12,178 ERROR (jsonrpc/4) [storage.Dispatcher] (-202, 'Unable to read resource owners', 'Sanlock exception') (dispatcher:80)
2017-02-06 15:39:12,286 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:10.12.69.13', 60782, 0, 0) at 0x33f9a70>: unexpected eof (betterAsyncore:113)

Under /var/log/messages I see:

Feb 6 15:39:12 b01-h05-r620 sanlock[967]: 2017-02-06 15:39:12+0000 91293 [14615]: s5 renewal error -202 delta_length 10 last_success 91263
Feb 6 15:39:12 b01-h05-r620 journal: vdsm vds.dispatcher ERROR SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:10.12.69.13', 60782, 0, 0) at 0x33f9a70>: unexpected eof

Any insights?

(In reply to Eldad Marciano from comment #25)
> Hi I used the slowfs (https://github.com/nirs/slowfs) with 2 sec delay for
> all operations.
> Yet, I didn't enlarge the qeueue size in vdsm, in order to see the bug
> reproduced with the following profile:
> - 1 SD (slow nfs)
> - 1 Host
>
> currently, I can't see the IOprocess queue issue, besides other issue such
> as:
> 2017-02-06 15:39:12,178 ERROR (jsonrpc/4) [storage.Dispatcher] (-202,
> 'Unable to read resource owners', 'Sanlock exception') (dispatcher:80)

Sanlock cannot read the ids file; your storage is probably misconfigured, or maybe too slow. slowfs is limited by Python: when you delay an operation by 2 seconds, you delay *all* operations. I suggest starting with a slowfs setup without any delay - if this does not work, the configuration should be fixed. Even this configuration is much slower than the original version, because everything is serialized by the Python global lock. When you have a working configuration, try adding a delay of 0.1 seconds; this may be enough to reproduce the issue.
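As a toy illustration of why a large per-operation delay in a serialized handler is so punishing (this is not slowfs's actual code, and the numbers are only illustrative): if requests are processed one at a time, every queued request pays the sum of the delays in front of it.

# Toy model: a serialized handler with a fixed per-operation delay.
# With 10 pending operations and a 2 s delay, the last one completes
# after ~20 s - well past a 10 s sanlock renewal timeout - while a
# 0.1 s delay keeps the backlog around 1 s.
import time

def run_serialized(num_ops, delay):
    start = time.time()
    completion = []
    for _ in range(num_ops):
        time.sleep(delay)                  # the artificial slowdown
        completion.append(time.time() - start)
    return completion

if __name__ == "__main__":
    for delay in (2.0, 0.1):
        last = run_serialized(10, delay)[-1]
        print("delay=%.1fs -> last of 10 ops finishes after %.1fs" % (delay, last))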
(In reply to Roy Golan from comment #24)
> Nir aren't we failing to handle the error properly here?
>
> If VDSM can't handle that create request, then it should fail it and not
> return 'waitForLaunch'. Then the engine will try to run the vm on a
> differnt server.
>
> Or - if vdsm isn't performing this request on the synchronious part of the
> 'create' verb, then vdsm should move those vms to error state from
> waitForLaunch.
>
> On both cases the host shouldn't go non responsive.

Yes, we probably do not handle the error from storage.

(In reply to Eldad Marciano from comment #0)
> in the hosts we can clearly see the following errors
> under /var/log/messages:
>
> Jan 4 11:35:58 b02-h27-r620 journal: vdsm IOProcess WARNING (432470) Request queue full
> Jan 4 11:35:58 b02-h27-r620 journal: vdsm IOProcess WARNING (432471) Request queue full
> Jan 4 11:35:58 b02-h27-r620 journal: vdsm IOProcess WARNING (432472) Request queue full
> Jan 4 11:35:58 b02-h27-r620 journal: vdsm IOProcess WARNING (432473) Request queue full
> Jan 4 11:35:58 b02-h27-r620 journal: vdsm IOProcess WARNING (432474) Request queue full

At the same time we see these errors in sanlock:

2017-01-04 11:30:54+0000 1468312 [17408]: b2acacd5 aio timeout RD 0x7f46340008c0:0x7f46340008d0:0x7f464b5fc000 ioto 10 to_count 42
2017-01-04 11:30:54+0000 1468312 [17408]: s23 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/172.16.249.58:_mnt_b02-h37-zfs01__SAS15k_jails_scale1_scale__17/b2acacd5-ed55-4ca6-bd13-18f5089c9126/dom_md/ids
2017-01-04 11:30:54+0000 1468312 [17408]: s23 renewal error -202 delta_length 10 last_success 1468282
2017-01-04 11:30:55+0000 1468314 [18615]: a4fe7526 aio timeout RD 0x7f463c0008c0:0x7f463c0008d0:0x7f464aff0000 ioto 10 to_count 42
2017-01-04 11:30:55+0000 1468314 [18615]: s29 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/172.16.249.58:_mnt_b02-h37-zfs01__SAS15k_jails_scale1_scale__12/a4fe7526-f47b-42f7-9f20-1060e2a2d1ea/dom_md/ids
2017-01-04 11:30:55+0000 1468314 [18615]: s29 renewal error -202 delta_length 10 last_success 1468283
2017-01-04 11:30:56+0000 1468314 [17408]: b2acacd5 aio collect RD 0x7f46340008c0:0x7f46340008d0:0x7f464b5fc000 result 1048576:0 match reap
2017-01-04 11:30:56+0000 1468314 [18615]: a4fe7526 aio collect RD 0x7f463c0008c0:0x7f463c0008d0:0x7f464aff0000 result 1048576:0 match reap
2017-01-04 12:00:46+0000 1470104 [17227]: c9b5dabd aio timeout RD 0x7f46180008c0:0x7f46180008d0:0x7f464b6fe000 ioto 10 to_count 39
2017-01-04 12:00:46+0000 1470104 [17227]: s22 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/172.16.249.58:_mnt_b02-h37-zfs01__SAS15k_jails_scale1_scale__19/c9b5dabd-8cce-455

If sanlock cannot renew the lease, it is expected that ioprocess will slow down and its queue will fill up. It would be useful to look at the vdsm log from the same time, to understand why the error is not communicated properly to the engine, but the vdsm logs are missing. Eldad, can you attach the missing vdsm logs showing this error:

Jan 4 11:35:58 b02-h27-r620 journal: vdsm IOProcess WARNING (432470) Request queue full

If we don't have the logs, please reproduce this again and attach the logs.

Summarizing: I've reproduced the issue with slow storage using slowfs (https://github.com/nirs/slowfs) with a delay of 0.1 sec, as suggested here: https://bugzilla.redhat.com/show_bug.cgi?id=1410120#c26
Setup profile: 1 SD (NFS), 1 host, 500 VMs.
Test profile: starting VMs in bulks of 50.
When working with the following vdsm configuration we cannot identify any IOProcess errors:

[irs]
process_pool_max_slots_per_domain = 10
process_pool_max_queued_slots_per_domain = 20

Without this configuration, the warning message "vdsm IOProcess Request queue full" was found. In both cases we did not see the host in an error or non-responsive state, as described here: https://bugzilla.redhat.com/show_bug.cgi?id=1410120#c0
Nir, please advise.

Summary: On slow *file* storage only, when the host needs to start a few VMs, it may fail to insert new jobs because the queue fills up quickly. All storage jobs for this domain are affected. The effect is that started VMs may not be able to start (move from WaitForLaunch -> PoweringUp) and would have to be rescheduled by the engine to another host. 3 failures in a row may shift the host to ERROR state. This is by design, and the VMs will be scheduled on other hosts. A larger queue will change the behaviour such that the jobs will get done, but we may experience a big delay until the VM actually starts; we will see a big gap between WaitForLaunch -> PoweringUp. The host is also less likely to go to Error state; it really depends on the amount of incoming requests and the storage speed. I don't think we need to change anything at the moment.

I agree with Roy. We'd better have early failures when queues become full and let the engine handle the error (e.g. try the same operation on another host).
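A hypothetical sketch of the fail-fast-and-retry pattern being advocated here; nothing in it is actual ovirt-engine or vdsm code, and the host names, error type, and retry policy are invented for illustration only.

# Hypothetical illustration of "fail early on the host, let the caller
# retry on another host" - not actual engine scheduling logic.
MAX_QUEUED = 10

class QueueFullError(Exception):
    pass

def start_vm_on(host, queued, vm):
    # Stand-in for the real "create" verb: reject immediately if the
    # host's per-domain queue is already full, instead of hanging in
    # WaitForLaunch until a timeout.
    if queued[host] >= MAX_QUEUED:
        raise QueueFullError("%s: request queue full" % host)
    queued[host] += 1
    return "%s started on %s" % (vm, host)

def start_vm(hosts, queued, vm):
    for host in hosts:
        try:
            return start_vm_on(host, queued, vm)
        except QueueFullError:
            continue          # try the next host instead of waiting
    raise RuntimeError("no host could start %s" % vm)

# Example: host1 is saturated, so the VM lands on host2.
queued = {"host1": 10, "host2": 3}
print(start_vm(["host1", "host2"], queued, "vm-001"))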
(In reply to Nir Soffer from comment #33)
> I agree with Roy. We better have early failures when queues become full and
> let engine handle the error (e.g. try same operation on another host).

I believe that we should separate two possible use cases:

1. A storage usage "spike" that may occur even in a perfectly operational system (storage-wise), when we attempt to perform a large number of storage-related operations concurrently, which may reject other operations from entering the queue.

2. The queue size growing indefinitely because of slow storage and a system under load. That may occur when we have a lot of extend requests and slow storage (Nir - this issue isn't relevant only to file domains, correct me if I'm somehow wrong).

We should definitely increase the queue size (or document how to increase it) to handle case 1. The engine should distribute the operations between the different hosts evenly regardless, but that won't help for case 1 when there are a lot of operations on a specific host or load on all the hosts. As this has yet to happen in a real env, I believe that for now we should be fine with the documentation, while we will always be able to increase the default size. I don't want to remove the size limit, so that we don't end up with a queue that grows indefinitely (as specified in scenario 2). Tal/Nir/Roy - any other opinion?

For now, I'm marking this one as closed - WONTFIX. If you have any other opinion please let me know and we can reopen it.

(In reply to Liron Aravot from comment #34)
> I believe that we should separate between two possible usecases:
>
> 1. Storage usage "spike" that may occur even in a perfectly operational
> system (storage wise) when we attempt to perform a large amount of storage
> related operation concurrently that may reject other operations from
> entering the queue.

We have these threads that may call ioprocess:
- 10 threads in the storage thread pool with 10 workers
- 8 workers in the jsonrpc executor
- 4 workers in the periodic executor
- 2 (3?) temporary event threads when a domain becomes available
- 1 thread per storage domain for monitoring
- 1 temporary thread for scanning domains on domain refresh
- 1 channel listener thread

So in the worst case, all of these threads may call ioprocess, and some will fail when the queue becomes full, since we have 10 slots in the queue and 10 running workers in ioprocess. This is very unlikely - all threads in the system accessing the same storage file domain at the same time - but we can increase the queue to 20 items to avoid this. Until we eliminate the uncontrolled temporary threads, I think we should increase the ioprocess queue size to 20.
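Writing that worst case out explicitly, using only the thread counts listed above: the "2 (3?)" figure is taken as 3 to stay conservative, and the per-domain monitoring count assumes a single domain is involved.

# Back-of-the-envelope worst case for concurrent ioprocess callers on
# one file domain, using the numbers from the comment above.
storage_pool_workers = 10
jsonrpc_workers = 8
periodic_workers = 4
domain_event_threads = 3      # "2 (3?)" - taking the larger value
domain_monitor_threads = 1    # the monitor of the domain in question
refresh_scan_threads = 1
channel_listener_threads = 1

worst_case_callers = (storage_pool_workers + jsonrpc_workers +
                      periodic_workers + domain_event_threads +
                      domain_monitor_threads + refresh_scan_threads +
                      channel_listener_threads)

ioprocess_workers = 10        # running workers per domain
queued_slots = 10             # current queue size discussed above

print("worst case callers:", worst_case_callers)                        # 28
print("capacity (workers + queue):", ioprocess_workers + queued_slots)  # 20
# 28 > 20, so in the worst case some callers hit "Request queue full";
# raising the queue to 20 slots covers it (10 + 20 = 30 >= 28).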