Bug 1410120 - [scale] - vdsm error due to Resource temporarily unavailable
Summary: [scale] - vdsm error due to Resource temporarily unavailable
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.18.17
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ovirt-4.1.2
Assignee: Liron Aravot
QA Contact: Eldad Marciano
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-04 14:15 UTC by Eldad Marciano
Modified: 2017-03-28 09:36 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-27 18:21:00 UTC
oVirt Team: Storage
rule-engine: ovirt-4.1+
rule-engine: exception+


Attachments
vdsm logs (5.58 MB, application/zip)
2017-01-04 14:15 UTC, Eldad Marciano

Description Eldad Marciano 2017-01-04 14:15:45 UTC
Created attachment 1237189 [details]
vdsm logs

Description of problem:
While working with multiple hosts and starting VMs,
some of the hosts get into an error state and become non-responsive.

The issue seems to be mostly storage-related,
and we will probably need to split some of the errors into separate bugs.

This bug is addressed by https://bugzilla.redhat.com/show_bug.cgi?id=1364791

On the hosts we can clearly see the following errors
in /var/log/messages:

Jan  4 11:35:58 b02-h27-r620 journal: vdsm IOProcess WARNING (432470) Request queue full
Jan  4 11:35:58 b02-h27-r620 journal: vdsm IOProcess WARNING (432471) Request queue full
Jan  4 11:35:58 b02-h27-r620 journal: vdsm IOProcess WARNING (432472) Request queue full
Jan  4 11:35:58 b02-h27-r620 journal: vdsm IOProcess WARNING (432473) Request queue full
Jan  4 11:35:58 b02-h27-r620 journal: vdsm IOProcess WARNING (432474) Request queue full

Jan  4 11:40:08 b02-h27-r620 systemd: Stopping MOM instance configured for VDSM purposes...
Jan  4 11:40:14 b02-h27-r620 journal: vdsm vds.dispatcher ERROR SSL error during reading data: unexpected eof
Jan  4 11:40:14 b02-h27-r620 systemd: Stopping Virtual Desktop Server Manager...
Jan  4 11:40:15 b02-h27-r620 journal: End of file while reading data: Input/output error
Jan  4 11:40:15 b02-h27-r620 vdsmd_init_common.sh: vdsm: Running run_final_hooks
Jan  4 11:40:15 b02-h27-r620 systemd: Starting Virtual Desktop Server Manager...
Jan  4 11:40:15 b02-h27-r620 vdsmd_init_common.sh: vdsm: Running mkdirs
Jan  4 11:40:15 b02-h27-r620 vdsmd_init_common.sh: vdsm: Running configure_coredump


This one might be related to the vdsm restart:
jsonrpc.Executor/4::ERROR::2017-01-04 11:40:05,629::__init__::550::jsonrpc.JsonRpcServer::(_handle_request) Internal server error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 545, in _handle_request
    res = method(**params)
  File "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 195, in _dynamicMethod
    result = fn(*methodArgs)
  File "/usr/share/vdsm/API.py", line 1358, in getAllVmIoTunePolicies
    io_tune_policies_dict = self._cif.getAllVmIoTunePolicies()
  File "/usr/share/vdsm/clientIF.py", line 447, in getAllVmIoTunePolicies
    vm_io_tune_policies[v.id] = {'policy': v.getIoTunePolicy(),
  File "/usr/share/vdsm/virt/vm.py", line 2772, in getIoTunePolicy
    qos = self._getVmPolicy()
  File "/usr/share/vdsm/virt/vm.py", line 2746, in _getVmPolicy
    metadata_xml = self._dom.metadata(
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 47, in __getattr__
    % self.vmid)
NotConnectedError: VM u'6bfb9a40-14af-4550-bff7-74ba84395e53' was not started yet or was shut down



This one is generated a lot:
Thread-167::ERROR::2017-01-04 11:48:21,063::monitor::425::Storage.Monitor::(_checkDomainStatus) Error checking domain 4bc1c975-cfe6-4f91-8d4b-ee21b2dd1425
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 406, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 50, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 53, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 125, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 144, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 126, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/fileSD.py", line 332, in __init__
    manifest = self.manifestClass(domainPath)
  File "/usr/share/vdsm/storage/fileSD.py", line 168, in __init__
    sd.StorageDomainManifest.__init__(self, sdUUID, domaindir, metadata)
  File "/usr/share/vdsm/storage/sd.py", line 311, in __init__
    self._domainLock = self._makeDomainLock()
  File "/usr/share/vdsm/storage/sd.py", line 440, in _makeDomainLock
    domVersion = self.getVersion()
  File "/usr/share/vdsm/storage/sd.py", line 360, in getVersion
    return self.getMetaParam(DMDK_VERSION)
  File "/usr/share/vdsm/storage/sd.py", line 357, in getMetaParam
    return self._metadata[key]
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 91, in __getitem__
    return dec(self._dict[key])
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 201, in __getitem__
    with self._accessWrapper():
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 154, in _accessWrapper
    self.refresh()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/persistent.py", line 232, in refresh
    lines = self._metaRW.readlines()
  File "/usr/share/vdsm/storage/fileSD.py", line 127, in readlines
    return stripNewLines(self._oop.directReadLines(self._metafile))
  File "/usr/share/vdsm/storage/outOfProcess.py", line 340, in directReadLines
    fileStr = ioproc.readfile(path, direct=True)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 552, in readfile
    "direct": direct}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 461, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 11] Resource temporarily unavailable



Thread-640825::ERROR::2017-01-04 11:36:15,141::dispatcher::80::Storage.Dispatcher::(wrapper) Connection timed out
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 72, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 106, in wrapper
    return m(self, *a, **kw)
  File "/usr/share/vdsm/storage/task.py", line 1181, in prepare
    raise self.error
Timeout: Connection timed out
Thread-640825::ERROR::2017-01-04 11:36:15,142::vm::773::virt.vm::(_startUnderlyingVm) vmId=`2b4b542d-ed35-43cf-93fe-b7c93927ff2f`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 714, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 1940, in _run
    self._preparePathsForDrives(dev_spec_map[hwclass.DISK])
  File "/usr/share/vdsm/virt/vm.py", line 799, in _preparePathsForDrives
    drive['path'] = self.cif.prepareVolumePath(drive, self.id)
  File "/usr/share/vdsm/clientIF.py", line 320, in prepareVolumePath
    raise vm.VolumeError(drive)


Version-Release number of selected component (if applicable):
vdsm-4.18.17-1.el7ev.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Have a populated environment with 12 NFS SDs, 27 hosts, and 2K VMs.
2. Start or stop VMs.


Actual results:
engine - some hosts become non-responsive and enter ERROR state.
vdsm - hit some storage errors.
vdsm - restarted several times.

Expected results:


Additional info:

Comment 1 Yaniv Kaul 2017-01-04 16:40:45 UTC
Please re-test with 4.1.
If it's not a regression, I don't see it as a blocker.
How many VMs are running / trying to run / stopped / etc.?

Does it happen with FAR fewer domains? Hosts?

Comment 2 Eldad Marciano 2017-01-04 17:15:18 UTC
We will upgrade our environment and update ASAP.
There were around 47 background tasks when we hit the issue, which means 47 VMs trying to run on top of 27 hosts and 12 SDs.


27 hosts / 47 VMs (tasks) ≈ 0.5, which is a pretty sensible ratio.

Currently we are testing it with the following patch set:
https://gerrit.ovirt.org/#/q/topic:slow_host_start

Comment 3 Eldad Marciano 2017-01-09 09:19:59 UTC
(In reply to Yaniv Kaul from comment #1)
> Please re-test with 4.1.
> If it's not a regression, I don't see it as a blocker.
> How many VMs are running / trying to run / stopped / etc.?
> 
> Does it happen with FAR fewer domains? Hosts?

The bug also reproduces in 4.1 with the same errors.
engine release: ovirt-engine-4.1.0-0.3.beta2.el7.noarch
vdsm-jsonrpc-java release: vdsm-jsonrpc-java-1.3.5-1.el7ev.noarch
vdsm release: vdsm-4.19.1-1.el7ev.x86_64

Mainly around OSError: [Errno 11] Resource temporarily unavailable.

Adding logs.

Comment 6 Roy Golan 2017-01-09 10:01:16 UTC
Please add dmesg or journalctl output from the time the error occurs.

Comment 10 Roy Golan 2017-01-11 10:20:14 UTC
If slow storage is the culprit and we don't have an iSCSI setup to test this with, you can spin up all the VMs without disks (make them boot from PXE) and let's see if the problem occurs.

Comment 11 Eldad Marciano 2017-01-11 16:44:42 UTC
(In reply to Roy Golan from comment #10)
> If slow storage is the culprit and we don't have an ISCSI setup to test this
> with you can spin all the VMs without disks (make them boot from PXE) and
> lets see it the problems occurs.

Hi Roy,
I tested it as you suggested; the problem did NOT reproduce.

Test profile:
1 NFS SD (foreman), 27 hosts, VMs without disks (network only) - not reproduced.

Comment 12 Yaniv Kaul 2017-01-18 07:21:39 UTC
Liron - any progress?

Comment 13 Liron Aravot 2017-01-18 12:10:31 UTC
I've looked, and we had a similar issue in the past caused by too many open file descriptors (see BZ 650588). In Eldad's scenario he ran a large number of VMs with disks on NFS domains, which probably means too many open FDs as well, causing these errors.

I'm not sure there's something we can do here - Nir, any other insight? Perhaps there are bugs in other components that this one should depend on.

Comment 14 Yaniv Kaul 2017-01-18 12:18:04 UTC
(In reply to Liron Aravot from comment #13)
> I've looked and we had similar issue on the past caused by too many open
> file descriptors (see BZ 650588). In Eldad's scenario he ran a large number
> of vms with disks on NFS domains, which probably means too many open FDs as
> well causing to those errors.

Please sync with Eldad. I did not get the impression there were many VMs (or disks) running at the time the issue happened. I think the storage is flaky.

> 
> I'm not sure there's something we can do here - Nir, any other insight?
> perhaps there are bugs on other components that this one should depend on.

If it's a scale issue, we can have a docs bug on the need to increase the number of open files. But I doubt that's the issue.

Comment 15 Liron Aravot 2017-01-18 12:27:32 UTC
(In reply to Yaniv Kaul from comment #14)
> (In reply to Liron Aravot from comment #13)
> > I've looked and we had similar issue on the past caused by too many open
> > file descriptors (see BZ 650588). In Eldad's scenario he ran a large number
> > of vms with disks on NFS domains, which probably means too many open FDs as
> > well causing to those errors.
> 
> Please sync with Eldad. I did not get the impression there were many VMs (or
> disks) running at the time when the issue happened. I think the storage is
> flakey.
> 

Eldad, can you please elaborate on how many VMs were running on the host and how many disks were plugged into each VM?
In the reproduction with VMs without disks (see comment #11) the issue didn't reproduce - that led me in that direction.

Comment 16 Liron Aravot 2017-01-18 12:30:27 UTC
*on each relevant host

Comment 17 Liron Aravot 2017-01-18 12:43:44 UTC
I missed that in comment #2 Eldad wrote that he had only 47 running VMs (in the initial description the number is 2000). Will update here.

Comment 18 Eldad Marciano 2017-01-18 15:03:08 UTC
(In reply to Liron Aravot from comment #17)
> I've missed that in comment #2 Eldad wrote that he had only 47 running vms
> (in the initial description the number is 2000). Will update here.

Hi Liron, that's correct.
We have a setup with more than 2K VMs.
As mentioned in comment #2, when we reproduced the problem we had only 47 VMs running.
Regarding the previous assumption about the FD count: does the FD count reflect only disks of running VMs, or any existing VM in any status?

Comment 19 Roy Golan 2017-01-19 13:22:01 UTC
(In reply to Eldad Marciano from comment #18)
> (In reply to Liron Aravot from comment #17)
> > I've missed that in comment #2 Eldad wrote that he had only 47 running vms
> > (in the initial description the number is 2000). Will update here.
> 
> Hi Liron, thats correct.
> we have a setup with more than 2K vms.
> as mentioned in comment #2 when we reproduced the problem we have only 47
> vms running.
> based on previous assumption, regarding the FD amount, is the FDs amount
> reflects only disks of running vms? or any existing vm with any status?

Eldad, can you reproduce that and monitor the number of FDs so we'll have a better idea of the urgency here? sosreport should include /proc, so that is essential; also collect lsof output while it is running, if you can.

Comment 20 Eldad Marciano 2017-01-26 09:50:39 UTC
(In reply to Roy Golan from comment #19)
> (In reply to Eldad Marciano from comment #18)
> > (In reply to Liron Aravot from comment #17)
> > > I've missed that in comment #2 Eldad wrote that he had only 47 running vms
> > > (in the initial description the number is 2000). Will update here.
> > 
> > Hi Liron, thats correct.
> > we have a setup with more than 2K vms.
> > as mentioned in comment #2 when we reproduced the problem we have only 47
> > vms running.
> > based on previous assumption, regarding the FD amount, is the FDs amount
> > reflects only disks of running vms? or any existing vm with any status?
> 
> Eldad can you reproduce that and monitor the number of FD's so we'll know
> better the urgency here? sosreport should have /proc so that is essential
> and also if you can collect lsof while running it

Yes, we need to prioritize this test.

Comment 21 Nir Soffer 2017-01-26 10:08:16 UTC
This is not related to the number of fds.

Jan  4 11:35:58 b02-h27-r620 journal: vdsm IOProcess WARNING (432470) Request queue full

This means that the ioprocess for this domain has
irs:process_pool_max_queued_slots_per_domain requests in the queue. All requests
received at this point will fail immediately.

This probably happens because the test is trying to start too many VMs at the
same time with slow storage, and the system does not limit the number of
operations as it should.

It is also possible that the defaults are too low; maybe we should increase the
size of the queue. But I would be careful with this, as bigger queues just hide
the problem and do not help anyone. If storage is slow and the rate of operations
is higher than the rate storage can provide, the requests will sit in the queue
and finally time out after 60 seconds.
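The fail-fast behavior of a full per-domain queue can be sketched with a bounded queue that rejects new work immediately instead of blocking. This is an illustrative model only - the class and exception names below are hypothetical, not vdsm's actual code:

```python
import queue

class RequestQueueFullError(Exception):
    """Raised immediately when the per-domain queue has no free slot."""

class DomainRequestQueue:
    # max_queued_slots is a stand-in for the
    # irs:process_pool_max_queued_slots_per_domain setting.
    def __init__(self, max_queued_slots=10):
        self._queue = queue.Queue(maxsize=max_queued_slots)

    def submit(self, request):
        try:
            # Non-blocking put: fail fast instead of waiting for a free
            # slot, mirroring the "Request queue full" warning in the logs.
            self._queue.put_nowait(request)
        except queue.Full:
            raise RequestQueueFullError("Request queue full")

q = DomainRequestQueue(max_queued_slots=2)
q.submit("read-metadata")
q.submit("read-metadata")
try:
    q.submit("read-metadata")   # third request is rejected immediately
except RequestQueueFullError as e:
    print(e)
```

Failing fast like this is what lets the caller (and ultimately engine) react right away, rather than piling up requests that would only time out 60 seconds later.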

Eldad, can you try to tune the number of queued slots?

You can create /etc/vdsm/vdsm.conf.d/50-scale.conf

[irs]
process_pool_max_slots_per_domain = 10
process_pool_max_queued_slots_per_domain = 20

Also, we have important debugging logs in ioprocess that are disabled by default.
When you test this issue, please enable IOProcess debug logs in
/etc/vdsm/logger.conf:

[logger_IOProcess]
level=DEBUG
handlers=logfile
qualname=IOProcess
propagate=0

You can also change the log level dynamically during the test without changing
the config file:

vdsm-client Host setLogLevel level=DEBUG name=IOProcess

And restart vdsm.

Comment 22 Eldad Marciano 2017-01-29 10:42:24 UTC
Sure, we should coordinate this test and prioritize it in our queue.
Eyal, please consider this bug.

Comment 23 eberman 2017-02-01 10:29:50 UTC
Need to retest.

Roy, can you please elaborate on the retest requirements?

Comment 24 Roy Golan 2017-02-05 10:22:59 UTC
Nir, aren't we failing to handle the error properly here?

If VDSM can't handle that create request, then it should fail it and not return 'waitForLaunch'. Then the engine will try to run the VM on a different server.

Or - if vdsm isn't performing this request in the synchronous part of the 'create' verb, then vdsm should move those VMs from WaitForLaunch to an error state.

In both cases the host shouldn't go non-responsive.


Eldad, please increase the queue and retest. It's possible that a timeout is good enough for now.

Comment 25 Eldad Marciano 2017-02-06 15:45:15 UTC
(In reply to Roy Golan from comment #24)
> Nir aren't we failing to handle the error properly here? 
> 
> If VDSM can't handle that create request, then it should fail it and not
> return 'waitForLaunch' . Then the engine will try to run the vm on a
> differnt server. 
> 
> Or - if vdsm isn't performing this request on the synchronious part of the
> 'create' verb, then vdsm should move those vms to error state from
> waitForLaunch. 
> 
> On both cases the host shouldn't go non responsive.
> 
> 
> Eldad please increase the queue and retest. Its possible that a timeout is
> good enough for now.

Hi, I used slowfs (https://github.com/nirs/slowfs) with a 2-second delay for all operations.

I didn't enlarge the queue size in vdsm, in order to see whether the bug reproduces with the following profile:
- 1 SD (slow NFS)
- 1 host

Currently I can't see the IOProcess queue issue, but I do see other issues such as:
2017-02-06 15:39:12,178 ERROR (jsonrpc/4) [storage.Dispatcher] (-202, 'Unable to read resource owners', 'Sanlock exception') (dispatcher:80)

2017-02-06 15:39:12,286 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:10.12.69.13', 60782, 0, 0) at 0x33f9a70>: unexpected eof (betterAsyncore:113)


Under /var/log/messages I see:
Feb  6 15:39:12 b01-h05-r620 sanlock[967]: 2017-02-06 15:39:12+0000 91293 [14615]: s5 renewal error -202 delta_length 10 last_success 91263
Feb  6 15:39:12 b01-h05-r620 journal: vdsm vds.dispatcher ERROR SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::ffff:10.12.69.13', 60782, 0, 0) at 0x33f9a70>: unexpected eof

Any insights?

Comment 26 Nir Soffer 2017-02-06 18:31:23 UTC
(In reply to Eldad Marciano from comment #25)
> (In reply to Roy Golan from comment #24)
> Hi I used the slowfs (https://github.com/nirs/slowfs) with 2 sec delay for
> all operations. 
> 
> Yet, I didn't enlarge the qeueue size in vdsm, in order to see the bug
> reproduced with the following profile:
> - 1  SD (slow nfs)
> - 1 Host
> 
> currently, I can't see the IOprocess queue issue, besides other issue such
> as:
> 2017-02-06 15:39:12,178 ERROR (jsonrpc/4) [storage.Dispatcher] (-202,
> 'Unable to read resource owners', 'Sanlock exception') (dispatcher:80)

Sanlock cannot read the ids file; your storage is probably misconfigured, or
maybe too slow. slowfs is limited by Python: when you delay an operation by
2 seconds, you delay *all* operations.

I suggest starting with a slowfs setup without any delay - if this does not
work, the configuration should be fixed. Even this configuration is much slower
than the original storage because everything is serialized by Python's global
lock.

When you have a working configuration, try adding a delay of 0.1 seconds; this
may be enough to reproduce the issue.
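The serialization effect Nir describes can be shown with a toy example: when every operation must pass through one global lock (as with Python's GIL in slowfs), a per-operation delay is paid by every concurrent caller in turn, not overlapped. This is a sketch of the effect, not slowfs's actual code:

```python
import threading
import time

# Stand-in for the single global lock that serializes all operations.
_global_lock = threading.Lock()

def slow_read(delay):
    with _global_lock:       # every caller serializes here
        time.sleep(delay)    # the artificial per-operation delay

start = time.monotonic()
threads = [threading.Thread(target=slow_read, args=(0.2,)) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start

# Three "concurrent" 0.2s operations take roughly 0.6s total, not 0.2s:
# the delays add up instead of overlapping.
print(elapsed)
```

This is why a 2-second delay made the whole mount unusably slow: with N concurrent requests the effective latency of each one approaches N x delay.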

Comment 27 Nir Soffer 2017-02-06 18:34:46 UTC
(In reply to Roy Golan from comment #24)
> Nir aren't we failing to handle the error properly here? 
> 
> If VDSM can't handle that create request, then it should fail it and not
> return 'waitForLaunch' . Then the engine will try to run the vm on a
> differnt server. 
> 
> Or - if vdsm isn't performing this request on the synchronious part of the
> 'create' verb, then vdsm should move those vms to error state from
> waitForLaunch. 
> 
> On both cases the host shouldn't go non responsive.

Yes, we probably do not handle the error from storage.

Comment 28 Nir Soffer 2017-02-06 18:49:32 UTC
(In reply to Eldad Marciano from comment #0)
> in the hosts we can clearly see the following errors
> under /var/log/messages:
> 
> Jan  4 11:35:58 b02-h27-r620 journal: vdsm IOProcess WARNING (432470)
> Request queue full
> Jan  4 11:35:58 b02-h27-r620 journal: vdsm IOProcess WARNING (432471)
> Request queue full
> Jan  4 11:35:58 b02-h27-r620 journal: vdsm IOProcess WARNING (432472)
> Request queue full
> Jan  4 11:35:58 b02-h27-r620 journal: vdsm IOProcess WARNING (432473)
> Request queue full
> Jan  4 11:35:58 b02-h27-r620 journal: vdsm IOProcess WARNING (432474)
> Request queue full

At the same time we see these errors in sanlock:

2017-01-04 11:30:54+0000 1468312 [17408]: b2acacd5 aio timeout RD 0x7f46340008c0:0x7f46340008d0:0x7f464b5fc000 ioto 10 to_count 42
2017-01-04 11:30:54+0000 1468312 [17408]: s23 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/172.16.249.58:_mnt_b02-h37-zfs01__SAS15k_jails_scale1_scale__17/b2acacd5-ed55-4ca
6-bd13-18f5089c9126/dom_md/ids
2017-01-04 11:30:54+0000 1468312 [17408]: s23 renewal error -202 delta_length 10 last_success 1468282
2017-01-04 11:30:55+0000 1468314 [18615]: a4fe7526 aio timeout RD 0x7f463c0008c0:0x7f463c0008d0:0x7f464aff0000 ioto 10 to_count 42
2017-01-04 11:30:55+0000 1468314 [18615]: s29 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/172.16.249.58:_mnt_b02-h37-zfs01__SAS15k_jails_scale1_scale__12/a4fe7526-f47b-42f
7-9f20-1060e2a2d1ea/dom_md/ids
2017-01-04 11:30:55+0000 1468314 [18615]: s29 renewal error -202 delta_length 10 last_success 1468283
2017-01-04 11:30:56+0000 1468314 [17408]: b2acacd5 aio collect RD 0x7f46340008c0:0x7f46340008d0:0x7f464b5fc000 result 1048576:0 match reap
2017-01-04 11:30:56+0000 1468314 [18615]: a4fe7526 aio collect RD 0x7f463c0008c0:0x7f463c0008d0:0x7f464aff0000 result 1048576:0 match reap
2017-01-04 12:00:46+0000 1470104 [17227]: c9b5dabd aio timeout RD 0x7f46180008c0:0x7f46180008d0:0x7f464b6fe000 ioto 10 to_count 39
2017-01-04 12:00:46+0000 1470104 [17227]: s22 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/172.16.249.58:_mnt_b02-h37-zfs01__SAS15k_jails_scale1_scale__19/c9b5dabd-8cce-455

If sanlock cannot renew the lease, it is expected that ioprocess
will slow down and its queue will fill up.

It would be useful to look at the vdsm log from the same time, to understand
why the error is not communicated properly to engine, but the vdsm
logs are missing.

Eldad, can you attach the missing vdsm logs, showing this error:

    Jan  4 11:35:58 b02-h27-r620 journal: vdsm IOProcess
    WARNING (432470) Request queue full

If we don't have the logs, please reproduce this again and attach
the logs.

Comment 31 Eldad Marciano 2017-02-07 14:38:51 UTC
Summarizing:

I've reproduced the issue with slow storage using slowfs
(https://github.com/nirs/slowfs)
with a delay of 0.1 sec, as suggested in https://bugzilla.redhat.com/show_bug.cgi?id=1410120#c26

setup profile:
- 1 SD (NFS), 1 Host, 500 VMs.

Test profile:
Starting VMs in bulks of 50.


When working with the following vdsm configuration we can't identify any IOProcess errors:
[irs]
process_pool_max_slots_per_domain = 10
process_pool_max_queued_slots_per_domain = 20


Without this configuration, the following warning messages were found:
"vdsm IOProcess Request queue full".


In both cases the host did not enter an error or non-responsive state, as described in
https://bugzilla.redhat.com/show_bug.cgi?id=1410120#c0



Nir, please advise

Comment 32 Roy Golan 2017-02-15 13:27:42 UTC
Summary:

On slow *file* storage only, when the host needs to start a few VMs, it may fail to insert new jobs because the queue fills up quickly. All storage jobs for this domain are affected.

The effect is that started VMs may not actually be able to start (move from WaitForLaunch -> PoweringUp) and would have to be rescheduled by the engine to another host. Three failures in a row may shift the host to ERROR state. This is by design, and the VMs will be scheduled on other hosts.

A larger queue will change the behaviour such that the jobs will get done, but we may see a big delay until the VM actually starts - a long gap between WaitForLaunch -> PoweringUp. The host is also less likely to go to ERROR state; it really depends on the amount of incoming requests and the storage speed.

I don't think we need to change anything at the moment.

Comment 33 Nir Soffer 2017-02-15 13:33:29 UTC
I agree with Roy. We are better off failing early when queues become full and
letting the engine handle the error (e.g. try the same operation on another host).

Comment 34 Liron Aravot 2017-03-27 18:17:48 UTC
(In reply to Nir Soffer from comment #33)
> I agree with Roy. We better have early failures when queues become full and
> let engine handle the error (e.g. try same operation on another host).

I believe that we should distinguish between two possible use cases:

1. A storage usage "spike" that may occur even in a perfectly operational system (storage-wise) when we attempt to perform a large number of storage-related operations concurrently, which may reject other operations from entering the queue.

2. The queue growing indefinitely because of slow storage and a system under load. That may occur when we have a lot of extend requests and slow storage (Nir - this issue isn't relevant only to file domains; correct me if I'm somehow wrong).

We should definitely increase the queue size (or document how to increase it) to handle case 1.
The engine should distribute the operations evenly between the different hosts regardless, but in case of many operations on a specific host, or load on all the hosts, that won't suffice for case 1.

As this has yet to happen in a real environment, I believe that for now we should be fine with documentation, and we can always increase the default size later.
I don't want to remove the size limit, to avoid a queue that grows indefinitely (as specified in scenario 2).

Tal/Nir/Roy - any other opinion?

Comment 35 Liron Aravot 2017-03-27 18:21:00 UTC
For now, I'm marking this bug CLOSED WONTFIX. If you have a different opinion, please let me know and we can reopen it.

Comment 36 Nir Soffer 2017-03-28 09:36:02 UTC
(In reply to Liron Aravot from comment #34)
> (In reply to Nir Soffer from comment #33)
> > I agree with Roy. We better have early failures when queues become full and
> > let engine handle the error (e.g. try same operation on another host).
> 
> I believe that we should separate between two possible usecases:
> 
> 1. Storage usage "spike" that may occur even in a perfectly operational
> system (storage wise) when we attempt to perform a large amount of storage
> related operation concurrently that may reject other operations from
> entering the queue.

We have these threads that may call ioprocess:
- 10 workers in the storage thread pool
- 8 workers in the jsonrpc executor
- 4 workers in the periodic executor
- 2 (3?) temporary event threads when a domain becomes available
- 1 thread per storage domain for monitoring
- 1 temporary thread for scanning domains on domain refresh
- 1 channel listener thread

So in the worst case, all of these threads may call ioprocess, and some will
fail when the queue becomes full, since we have 10 slots in the queue and
10 running workers in ioprocess.

This is very unlikely - all threads in the system accessing the same file
storage domain at the same time - but we can increase the queue to 20 items
to avoid it.

Until we eliminate the uncontrolled temporary threads, I think we should increase
the ioprocess queue size to 20.
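The worst case enumerated above can be tallied directly. The counts come from the comment itself; the "2 (3?)" figure for temporary event threads is taken at its upper bound of 3, and the per-domain threads are counted once (a single domain under pressure):

```python
# Threads that may call ioprocess for one file storage domain,
# per the list above.
callers = {
    "storage thread pool workers": 10,
    "jsonrpc executor workers": 8,
    "periodic executor workers": 4,
    "temporary event threads": 3,   # upper bound of "2 (3?)"
    "domain monitor thread": 1,     # one per monitored domain
    "domain refresh scan thread": 1,
    "channel listener thread": 1,
}
worst_case = sum(callers.values())

# Default capacity: 10 running ioprocess workers plus 10 queued slots.
ioprocess_workers = 10
queue_slots = 10
capacity = ioprocess_workers + queue_slots

print(worst_case)            # 28 potential concurrent callers
print(capacity)              # 20 requests absorbed before rejection
print(worst_case > capacity)  # True: the worst case overflows the queue
```

With 28 potential callers against 20 absorbable requests, the default queue can overflow in the worst case, which is why raising the queue to 20 slots (capacity 30) covers it.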

