Bug 1428415 - Improve logging to find out the cause of RPC pool being exhausted
Summary: Improve logging to find out the cause of RPC pool being exhausted
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.19.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ovirt-4.1.2
: 4.19.11
Assignee: Francesco Romani
QA Contact: eberman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-03-02 14:30 UTC by Piotr Kliczewski
Modified: 2017-05-23 08:13 UTC (History)
14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1433380 (view as bug list)
Environment:
Last Closed: 2017-05-23 08:13:15 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.1+
lsvaty: testing_ack+


Attachments
sanlock (2.52 MB, text/plain)
2017-03-02 15:35 UTC, Piotr Kliczewski
no flags Details
vdsm log (442.59 KB, application/x-xz)
2017-03-02 15:35 UTC, Piotr Kliczewski
no flags Details
engine log 1 (2.93 MB, application/x-gzip)
2017-03-09 10:51 UTC, Yaniv Bronhaim
no flags Details
engine log 2 (5.11 MB, application/x-gzip)
2017-03-09 10:52 UTC, Yaniv Bronhaim
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 73556 0 master MERGED storage.asyncutils: Introduce the asyncutils module 2020-11-03 20:41:49 UTC
oVirt gerrit 73592 0 master MERGED check: Warn about blocked checker 2020-11-03 20:41:32 UTC
oVirt gerrit 73836 0 master ABANDONED Changing Calling api log print to INFO level 2020-11-03 20:41:31 UTC
oVirt gerrit 73845 0 master MERGED executor: bump the priority of logging 2020-11-03 20:41:32 UTC
oVirt gerrit 73846 0 master ABANDONED executor: periodically report the executor state 2020-11-03 20:41:32 UTC
oVirt gerrit 73847 0 master ABANDONED executor: use the same format when dumping state 2020-11-03 20:41:31 UTC
oVirt gerrit 73848 0 master MERGED executor: store scheduled call as attribute 2020-11-03 20:41:50 UTC
oVirt gerrit 73849 0 master MERGED executor: add support to report blocked workers 2020-11-03 20:41:33 UTC
oVirt gerrit 75051 0 master MERGED lib: jsonrpc: use periodic warning on stuck worker 2020-11-03 20:41:32 UTC
oVirt gerrit 75059 0 master MERGED lib: executor: handle timeout=None in __repr__ 2020-11-03 20:41:32 UTC
oVirt gerrit 75070 0 ovirt-4.1 MERGED executor: add task duration in __repr__ 2020-11-03 20:41:32 UTC
oVirt gerrit 75071 0 ovirt-4.1 MERGED executor: store scheduled call as attribute 2020-11-03 20:41:33 UTC
oVirt gerrit 75072 0 ovirt-4.1 MERGED executor: Don't start thread in __init__ 2020-11-03 20:41:51 UTC
oVirt gerrit 75073 0 ovirt-4.1 MERGED executor: allow to override the logger 2020-11-03 20:41:34 UTC
oVirt gerrit 75074 0 ovirt-4.1 MERGED executor: add support to report blocked workers 2020-11-03 20:41:34 UTC
oVirt gerrit 75100 0 master MERGED tests: Mark slow tests in storage_asyncutils_test.py 2020-11-03 20:41:34 UTC
oVirt gerrit 75137 0 ovirt-4.1 MERGED lib: executor: handle timeout=None in __repr__ 2020-11-03 20:41:34 UTC
oVirt gerrit 75250 0 ovirt-4.1 MERGED storage.asyncutils: Introduce the asyncutils module 2020-11-03 20:41:34 UTC
oVirt gerrit 75251 0 ovirt-4.1 MERGED tests: Mark slow tests in storage_asyncutils_test.py 2020-11-03 20:41:36 UTC
oVirt gerrit 75252 0 ovirt-4.1 MERGED check: Warn about blocked checker 2020-11-03 20:41:35 UTC
oVirt gerrit 75261 0 ovirt-4.1 MERGED tests: Complete FakeLogger helper 2020-11-03 20:41:35 UTC
oVirt gerrit 75282 0 master MERGED yajsonrpc: replace functools.partial with wrapper 2020-11-03 20:41:35 UTC
oVirt gerrit 75283 0 master MERGED yajsonrpc: add __repr__ for jsonRpcRequest 2020-11-03 20:41:35 UTC
oVirt gerrit 75391 0 ovirt-4.1 MERGED lib: jsonrpc: use periodic warning on stuck worker 2020-11-03 20:41:35 UTC
oVirt gerrit 75392 0 ovirt-4.1 MERGED yajsonrpc: replace functools.partial with wrapper 2020-11-03 20:41:37 UTC
oVirt gerrit 75393 0 ovirt-4.1 MERGED yajsonrpc: add __repr__ for jsonRpcRequest 2020-11-03 20:41:54 UTC
oVirt gerrit 75538 0 ovirt-4.1 MERGED tests: Increase timeouts on time sensitive test 2020-11-03 20:41:54 UTC

Description Piotr Kliczewski 2017-03-02 14:30:44 UTC
In one of our environments we observed the following after a storage timeout occurred:


2017-02-25 03:30:43,723 ERROR (periodic/27) [storage.TaskManager.Task] (Task='75fcbc90-d17f-47a3-80ed-89b5b43f08ca') Unexpected error (task:870)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 877, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3016, in getVolumeSize
    apparentsize = str(dom.getVSize(imgUUID, volUUID))
  File "/usr/share/vdsm/storage/sd.py", line 663, in getVSize
    return self._manifest.getVSize(imgUUID, volUUID)
  File "/usr/share/vdsm/storage/fileSD.py", line 189, in getVSize
    return self.oop.os.stat(volPath).st_size
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 254, in stat
    return self._iop.stat(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 484, in stat
    resdict = self._sendCommand("stat", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 455, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out

after some time we see:

2017-02-25 03:34:24,917 WARN  (jsonrpc/2) [virt.vm] (vmId='5f56d327-8348-4fc0-867c-819348d96847') monitor became unresponsive (command timeout, age=70.5100000007) (vm:4893)
2017-02-25 03:34:24,918 WARN  (jsonrpc/2) [virt.vm] (vmId='094f498a-ec1c-4792-b6b0-ac982e912c36') monitor became unresponsive (command timeout, age=70.5100000007) (vm:4893)
2017-02-25 03:34:24,919 WARN  (jsonrpc/2) [virt.vm] (vmId='24fbd184-3ecd-4031-9cb1-a29bc5b3fa2a') monitor became unresponsive (command timeout, age=70.5100000007) (vm:4893)
2017-02-25 03:34:24,919 WARN  (jsonrpc/2) [virt.vm] (vmId='683fb432-76d1-4087-b6c4-73bfe4de9843') monitor became unresponsive (command timeout, age=70.5100000007) (vm:4893)
2017-02-25 03:34:24,920 WARN  (jsonrpc/2) [virt.vm] (vmId='e77ae01d-6401-4653-aacc-7646f4f8402d') monitor became unresponsive (command timeout, age=70.5100000007) (vm:4893)
2017-02-25 03:34:24,921 WARN  (jsonrpc/2) [virt.vm] (vmId='269d0d0a-7cc3-4661-886f-f6d929cada63') monitor became unresponsive (command timeout, age=70.5100000007) (vm:4893)
2017-02-25 03:34:24,921 WARN  (jsonrpc/2) [virt.vm] (vmId='d40e0bc6-dad5-4ee9-bfa1-09f074882e78') monitor became unresponsive (command timeout, age=70.5100000007) (vm:4893)

in the meantime we see:

2017-02-25 03:34:24,976 ERROR (jsonrpc/3) [virt.vm] (vmId='cd14a081-35aa-4a7e-97fa-442d6d864942') getVmIoTune failed (vm:2773)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 2758, in getIoTuneResponse
    libvirt.VIR_DOMAIN_AFFECT_LIVE)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 77, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)

Over time this ends up with:

2017-02-25 03:59:32,904 ERROR (JsonRpcServer) [jsonrpc.JsonRpcServer] could not allocate request thread (__init__:626)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 624, in _runRequest
    self._threadFactory(partial(self._serveRequest, ctx, request))
  File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 148, in dispatch
    self._tasks.put(Task(callable, timeout))
  File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 361, in put
    raise TooManyTasks()
TooManyTasks

which causes vdsm not to respond to any calls.

Restart of the vdsm solves the issue.

Comment 1 Nir Soffer 2017-03-02 15:06:29 UTC
The issue is not an exhausted pool. The RPC executor has a limit on the number of
tasks in its queue. If you add tasks to the queue faster than vdsm can consume them,
you get the TooManyTasks error.

In older versions of vdsm there was no such limit, and vdsm created a new thread for
each task. This would fail once vdsm had created about 700 threads, and was not
useful to anyone.
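
For illustration, here is a minimal sketch of this bounded-queue behaviour (not the
actual vdsm executor code; only the TooManyTasks name is taken from the traceback above):

# Minimal sketch of a bounded task queue; producers that outpace the consumers
# get TooManyTasks instead of growing the queue (or the thread count) without bound.
try:
    import queue              # Python 3
except ImportError:
    import Queue as queue     # Python 2.7, as used by vdsm at the time


class TooManyTasks(Exception):
    """Raised when the task queue is full and cannot accept more work."""


class BoundedTaskQueue(object):

    def __init__(self, max_tasks):
        self._queue = queue.Queue(maxsize=max_tasks)

    def put(self, task):
        # Refuse new work instead of creating one thread per request,
        # which is what older vdsm did until it hit ~700 threads.
        try:
            self._queue.put_nowait(task)
        except queue.Full:
            raise TooManyTasks()

    def get(self):
        return self._queue.get()


if __name__ == "__main__":
    q = BoundedTaskQueue(max_tasks=2)
    q.put("task-1")
    q.put("task-2")
    try:
        q.put("task-3")   # queue is full -> TooManyTasks
    except TooManyTasks:
        print("queue full, request rejected")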

The jsonrpc layer should pass the TooManyTasks error to engine, so engine can
handle the error properly, maybe by stopping sending messages to this host for
some time.
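
As a purely hypothetical sketch (vdsm has no error code for this exception today,
and the numeric code below is an assumption), the jsonrpc layer could map the
exception to a JSON-RPC error response along these lines:

# Hypothetical sketch: map TooManyTasks to a JSON-RPC error response so the
# engine can back off. The message and the code -32099 (within the JSON-RPC 2.0
# server-defined error range) are assumptions, not existing vdsm error codes.
import json


class TooManyTasks(Exception):
    pass


def error_response(request_id, exc):
    if isinstance(exc, TooManyTasks):
        code, message = -32099, "Host overloaded, too many queued requests"
    else:
        code, message = -32603, "Internal error"
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "error": {"code": code, "message": message},
    })


if __name__ == "__main__":
    print(error_response("req-42", TooManyTasks()))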

We see timeouts in ioprocess operations:

    return self._iop.stat(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 484, in stat
    resdict = self._sendCommand("stat", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 455, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out

And timeouts in libvirt/qemu:

2017-02-25 03:34:24,917 WARN  (jsonrpc/2) [virt.vm] (vmId='5f56d327-8348-4fc0-867c-819348d96847') monitor became unresponsive (command timeout, age=70.5100000007) (vm:4893)

So it seems that the root cause is the underlying storage.

If you enable ioprocess debug logs, we can get more info about why the operations time
out.

Please try to set:

[logger_IOProcess]
level=DEBUG
handlers=logfile
qualname=IOProcess
propagate=0

Or use vdsm-client to change the log level for the current process:

    vdsm-client Host setLogLevel level=DEBUG
    vdsm-client Host setLogLevel level=DEBUG name=IOProcess

If this is an issue with storage, we should also see timeouts in the sanlock log and
in /var/log/messages. Please attach these logs.

Comment 2 Francesco Romani 2017-03-02 15:28:04 UTC
(In reply to Nir Soffer from comment #1)
> The jsonrpc layer should pass the TooManyTasks error to engine, so engine can
> handle the error properly, maybe by stopping sending messages to this host
> for some time.

I agree, Engine should learn about this.


> We see timeouts in ioprocess operations:
> 
>     return self._iop.stat(path)
>   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 484,
> in stat
>     resdict = self._sendCommand("stat", {"path": path}, self.timeout)
>   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 455,
> in _sendCommand
>     raise Timeout(os.strerror(errno.ETIMEDOUT))
> Timeout: Connection timed out
> 
> And timeouts in libvirt/qemu:
> 
> 2017-02-25 03:34:24,917 WARN  (jsonrpc/2) [virt.vm]
> (vmId='5f56d327-8348-4fc0-867c-819348d96847') monitor became unresponsive
> (command timeout, age=70.5100000007) (vm:4893)

Yes, most likely same root cause

CC'ing myself to this BZ.

Comment 3 Piotr Kliczewski 2017-03-02 15:35:24 UTC
Created attachment 1259183 [details]
sanlock

Comment 4 Piotr Kliczewski 2017-03-02 15:35:51 UTC
Created attachment 1259184 [details]
vdsm log

Comment 5 Piotr Kliczewski 2017-03-02 15:38:36 UTC
(In reply to Francesco Romani from comment #2)
> (In reply to Nir Soffer from comment #1)
> > The jsonrpc layer should pass the TooManyTasks error to engine, so engine can
> > handle the error properly, maybe by stopping sending messages to this host
> > for some time.
> 
> I agree, Engine should learn about this.
> 

We do not have an error code for this exception. How do we expect an old engine to work with new code?

The only value of this change is to have a proper message which would allow us to
debug it more easily. We can't expect older engines to react to it.

> 
> > We see timeouts in ioprocess operations:
> > 
> >     return self._iop.stat(path)
> >   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 484,
> > in stat
> >     resdict = self._sendCommand("stat", {"path": path}, self.timeout)
> >   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 455,
> > in _sendCommand
> >     raise Timeout(os.strerror(errno.ETIMEDOUT))
> > Timeout: Connection timed out
> > 
> > And timeouts in libvirt/qemu:
> > 
> > 2017-02-25 03:34:24,917 WARN  (jsonrpc/2) [virt.vm]
> > (vmId='5f56d327-8348-4fc0-867c-819348d96847') monitor became unresponsive
> > (command timeout, age=70.5100000007) (vm:4893)
> 
> Yes, most likely same root cause
> 
> CC'ing myself to this BZ.

Comment 6 Francesco Romani 2017-03-02 15:58:51 UTC
(In reply to Piotr Kliczewski from comment #5)
> (In reply to Francesco Romani from comment #2)
> > (In reply to Nir Soffer from comment #1)
> > > The jsonrpc layer should pass the TooManyTasks error to engine, so engine can
> > > handle the error properly, maybe by stopping sending messages to this host
> > > for some time.
> > 
> > I agree, Engine should learn about this.
> > 
> 
> We do not have an error code for this exception. How do we expect an old engine
> to work with new code?
> 
> The only value of this change is to have a proper message which would allow us
> to debug it more easily. We can't expect older engines to react to it.

Surely an old Engine can't do anything, but a new Engine (4.2?) could. But I don't want to hijack this bug; this is, I believe, a low-priority/low-benefit change which can be discussed separately.

Comment 7 Piotr Kliczewski 2017-03-02 16:41:38 UTC
(In reply to Francesco Romani from comment #6)
> (In reply to Piotr Kliczewski from comment #5)
> > (In reply to Francesco Romani from comment #2)
> > > (In reply to Nir Soffer from comment #1)
> > > > The jsonrpc layer should pass the TooManyTasks error to engine, so engine can
> > > > handle the error properly, maybe by stopping sending messages to this host
> > > > for some time.
> > > 
> > > I agree, Engine should learn about this.
> > > 
> > 
> > We do not have an error code for this exception. How do we expect an old engine
> > to work with new code?
> > 
> > The only value of this change is to have a proper message which would allow us
> > to debug it more easily. We can't expect older engines to react to it.
> 
> Surely old Engine can't do anything, but new Engine (4.2?) could. But I
> don't want to hijack this bug, this is I believe a low priority/benefit
> change which can be discussed separately.

I agree that for newer engines we can improve it. I will push patches to fix it.

Comment 8 Nir Soffer 2017-03-02 22:48:06 UTC
Looking at the sanlock log, we see renewal timeouts at the same time.
So this looks like an actual storage issue.

Do we have info from netapp-vserver-rhev.lab.eng.brq.redhat.com
for the same time?

2017-02-25 01:54:37+0100 2881152 [30909]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 ioto 10 to_count 5
2017-02-25 01:54:37+0100 2881152 [30909]: s58 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 01:54:37+0100 2881152 [30909]: s58 renewal error -202 delta_length 10 last_success 2881122
2017-02-25 01:54:58+0100 2881173 [30909]: 310b5348 aio timeout RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 ioto 10 to_count 6
2017-02-25 01:54:58+0100 2881173 [30909]: s58 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 01:54:58+0100 2881173 [30909]: s58 renewal error -202 delta_length 20 last_success 2881122
2017-02-25 01:55:07+0100 2881182 [982]: s58 check_our_lease warning 60 last_success 2881122
2017-02-25 01:55:08+0100 2881183 [982]: s58 check_our_lease warning 61 last_success 2881122
2017-02-25 01:55:09+0100 2881184 [982]: s58 check_our_lease warning 62 last_success 2881122
2017-02-25 01:55:10+0100 2881185 [982]: s58 check_our_lease warning 63 last_success 2881122
2017-02-25 01:55:11+0100 2881186 [982]: s58 check_our_lease warning 64 last_success 2881122
2017-02-25 01:55:12+0100 2881187 [982]: s58 check_our_lease warning 65 last_success 2881122
2017-02-25 01:55:13+0100 2881188 [30909]: 310b5348 aio collect RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 result 1048576:0 other free
2017-02-25 01:55:13+0100 2881188 [30909]: 310b5348 aio collect RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 result 1048576:0 other free
2017-02-25 01:55:13+0100 2881188 [982]: s58 check_our_lease warning 66 last_success 2881122
2017-02-25 01:56:04+0100 2881239 [30909]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552303000 ioto 10 to_count 7
2017-02-25 01:56:04+0100 2881239 [30909]: s58 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 01:56:04+0100 2881239 [30909]: s58 renewal error -202 delta_length 10 last_success 2881209
2017-02-25 01:56:10+0100 2881245 [30909]: 310b5348 aio collect RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552303000 result 1048576:0 match reap
2017-02-25 03:29:26+0100 2886841 [30909]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552303000 ioto 10 to_count 8
2017-02-25 03:29:26+0100 2886841 [30909]: s58 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 03:29:26+0100 2886841 [30909]: s58 renewal error -202 delta_length 10 last_success 2886811
2017-02-25 03:29:46+0100 2886862 [30909]: 310b5348 aio timeout RD 0x7fe53c000910:0x7fe53c000920:0x7fe556d09000 ioto 10 to_count 9
2017-02-25 03:29:46+0100 2886862 [30909]: s58 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 03:29:46+0100 2886862 [30909]: s58 renewal error -202 delta_length 20 last_success 2886811
2017-02-25 03:29:55+0100 2886871 [982]: s58 check_our_lease warning 60 last_success 2886811
2017-02-25 03:29:56+0100 2886872 [982]: s58 check_our_lease warning 61 last_success 2886811
2017-02-25 03:29:57+0100 2886873 [982]: s58 check_our_lease warning 62 last_success 2886811
2017-02-25 03:29:58+0100 2886874 [982]: s58 check_our_lease warning 63 last_success 2886811
2017-02-25 03:29:59+0100 2886875 [982]: s58 check_our_lease warning 64 last_success 2886811
2017-02-25 03:30:00+0100 2886876 [982]: s58 check_our_lease warning 65 last_success 2886811
2017-02-25 03:30:01+0100 2886877 [982]: s58 check_our_lease warning 66 last_success 2886811
2017-02-25 03:30:02+0100 2886878 [982]: s58 check_our_lease warning 67 last_success 2886811
2017-02-25 03:30:03+0100 2886879 [982]: s58 check_our_lease warning 68 last_success 2886811
2017-02-25 03:30:04+0100 2886880 [982]: s58 check_our_lease warning 69 last_success 2886811
2017-02-25 03:30:05+0100 2886881 [982]: s58 check_our_lease warning 70 last_success 2886811
2017-02-25 03:30:06+0100 2886882 [982]: s58 check_our_lease warning 71 last_success 2886811
2017-02-25 03:30:07+0100 2886882 [30909]: 310b5348 aio timeout RD 0x7fe53c000960:0x7fe53c000970:0x7fe552d07000 ioto 10 to_count 10
2017-02-25 03:30:07+0100 2886882 [30909]: s58 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 03:30:07+0100 2886882 [30909]: s58 renewal error -202 delta_length 20 last_success 2886811
2017-02-25 03:30:07+0100 2886883 [982]: s58 check_our_lease warning 72 last_success 2886811
2017-02-25 03:30:08+0100 2886884 [982]: s58 check_our_lease warning 73 last_success 2886811
2017-02-25 03:30:09+0100 2886885 [982]: s58 check_our_lease warning 74 last_success 2886811
2017-02-25 03:30:10+0100 2886886 [982]: s58 check_our_lease warning 75 last_success 2886811
2017-02-25 03:30:11+0100 2886887 [982]: s58 check_our_lease warning 76 last_success 2886811
2017-02-25 03:30:12+0100 2886888 [982]: s58 check_our_lease warning 77 last_success 2886811
2017-02-25 03:30:13+0100 2886889 [982]: s58 check_our_lease warning 78 last_success 2886811
2017-02-25 03:30:14+0100 2886890 [982]: s58 check_our_lease warning 79 last_success 2886811
2017-02-25 03:30:15+0100 2886891 [982]: s58 check_our_lease failed 80
2017-02-25 03:30:16+0100 2886891 [982]: s58 all pids clear
2017-02-25 03:30:27+0100 2886903 [30909]: 310b5348 aio timeout RD 0x7fe53c0009b0:0x7fe53c0009c0:0x7fe552201000 ioto 10 to_count 11
2017-02-25 03:30:27+0100 2886903 [30909]: s58 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 03:30:27+0100 2886903 [30909]: s58 renewal error -202 delta_length 20 last_success 2886811
2017-02-25 03:30:47+0100 2886922 [30909]: 310b5348 aio collect 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552303000 result 1048576:0 close free
2017-02-25 03:30:47+0100 2886922 [30909]: 310b5348 aio collect 0x7fe53c000910:0x7fe53c000920:0x7fe556d09000 result 1048576:0 close free
2017-02-25 03:30:47+0100 2886922 [30909]: 310b5348 aio collect 0x7fe53c000960:0x7fe53c000970:0x7fe552d07000 result 1048576:0 close free
2017-02-25 03:30:47+0100 2886922 [30909]: 310b5348 aio collect 0x7fe53c0009b0:0x7fe53c0009c0:0x7fe552201000 result 1048576:0 close free
2017-02-25 03:30:58+0100 2886933 [991]: s59 lockspace 310b5348-f034-479d-b208-37e6a72bc21f:3:/rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids:0
2017-02-25 03:31:20+0100 2886955 [982]: s59 host 1 147 2874109 46cab34d-7520-431a-8cdc-d6e9d8ac373e.dell-r420-
2017-02-25 03:31:20+0100 2886955 [982]: s59 host 2 306 2876536 44c5b0d0-8c19-4bd9-8c13-2d7fe413d5cd.dell-r420-
2017-02-25 03:31:20+0100 2886955 [982]: s59 host 3 329 2886933 786be63c-2902-4daf-807e-2f772711bd8f.hp-dl360g8
2017-02-25 03:31:20+0100 2886955 [982]: s59 host 4 234 2883433 d1ce5e77-0d14-49a7-bc13-734bcd836a9f.hp-dl360g8
2017-02-25 03:31:20+0100 2886955 [982]: s59 host 250 1 0 2439df30-8403-4e53-8e2b-08132df86445.hp-dl360g8
2017-02-25 03:33:31+0100 2887087 [30639]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 ioto 10 to_count 1
2017-02-25 03:33:31+0100 2887087 [30639]: s59 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 03:33:31+0100 2887087 [30639]: s59 renewal error -202 delta_length 10 last_success 2887056
2017-02-25 03:33:52+0100 2887107 [30639]: 310b5348 aio timeout RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 ioto 10 to_count 2
2017-02-25 03:33:52+0100 2887107 [30639]: s59 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 03:33:52+0100 2887107 [30639]: s59 renewal error -202 delta_length 20 last_success 2887056
2017-02-25 03:34:01+0100 2887116 [982]: s59 check_our_lease warning 60 last_success 2887056
2017-02-25 03:34:02+0100 2887117 [982]: s59 check_our_lease warning 61 last_success 2887056
2017-02-25 03:34:03+0100 2887118 [982]: s59 check_our_lease warning 62 last_success 2887056
2017-02-25 03:34:04+0100 2887119 [982]: s59 check_our_lease warning 63 last_success 2887056
2017-02-25 03:34:05+0100 2887120 [982]: s59 check_our_lease warning 64 last_success 2887056
2017-02-25 03:34:06+0100 2887121 [982]: s59 check_our_lease warning 65 last_success 2887056
2017-02-25 03:34:07+0100 2887122 [982]: s59 check_our_lease warning 66 last_success 2887056
2017-02-25 03:34:08+0100 2887123 [982]: s59 check_our_lease warning 67 last_success 2887056
2017-02-25 03:34:09+0100 2887124 [982]: s59 check_our_lease warning 68 last_success 2887056
2017-02-25 03:34:10+0100 2887125 [982]: s59 check_our_lease warning 69 last_success 2887056
2017-02-25 03:34:11+0100 2887126 [982]: s59 check_our_lease warning 70 last_success 2887056
2017-02-25 03:34:12+0100 2887127 [982]: s59 check_our_lease warning 71 last_success 2887056
2017-02-25 03:34:12+0100 2887128 [30639]: 310b5348 aio timeout RD 0x7fe53c000960:0x7fe53c000970:0x7fe552303000 ioto 10 to_count 3
2017-02-25 03:34:12+0100 2887128 [30639]: s59 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 03:34:12+0100 2887128 [30639]: s59 renewal error -202 delta_length 20 last_success 2887056
2017-02-25 03:34:13+0100 2887128 [982]: s59 check_our_lease warning 72 last_success 2887056
2017-02-25 03:34:14+0100 2887129 [982]: s59 check_our_lease warning 73 last_success 2887056
2017-02-25 03:34:15+0100 2887130 [982]: s59 check_our_lease warning 74 last_success 2887056
2017-02-25 03:34:16+0100 2887131 [982]: s59 check_our_lease warning 75 last_success 2887056
2017-02-25 03:34:17+0100 2887132 [982]: s59 check_our_lease warning 76 last_success 2887056
2017-02-25 03:34:18+0100 2887133 [982]: s59 check_our_lease warning 77 last_success 2887056
2017-02-25 03:34:19+0100 2887134 [982]: s59 check_our_lease warning 78 last_success 2887056
2017-02-25 03:34:20+0100 2887135 [982]: s59 check_our_lease warning 79 last_success 2887056
2017-02-25 03:34:21+0100 2887136 [982]: s59 check_our_lease failed 80
2017-02-25 03:34:21+0100 2887136 [982]: s59 all pids clear
2017-02-25 03:34:25+0100 2887140 [30639]: 310b5348 aio collect RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 result 1048576:0 other free
2017-02-25 03:34:25+0100 2887140 [30639]: 310b5348 aio collect RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 result 1048576:0 other free
2017-02-25 03:34:25+0100 2887140 [30639]: 310b5348 aio collect RD 0x7fe53c000960:0x7fe53c000970:0x7fe552303000 result 1048576:0 other free
2017-02-25 03:34:35+0100 2887150 [991]: s60 lockspace 310b5348-f034-479d-b208-37e6a72bc21f:3:/rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids:0
2017-02-25 03:34:59+0100 2887174 [982]: s60 host 1 147 2874335 46cab34d-7520-431a-8cdc-d6e9d8ac373e.dell-r420-
2017-02-25 03:34:59+0100 2887174 [982]: s60 host 2 306 2876762 44c5b0d0-8c19-4bd9-8c13-2d7fe413d5cd.dell-r420-
2017-02-25 03:34:59+0100 2887174 [982]: s60 host 3 330 2887150 786be63c-2902-4daf-807e-2f772711bd8f.hp-dl360g8
2017-02-25 03:34:59+0100 2887174 [982]: s60 host 4 234 2883659 d1ce5e77-0d14-49a7-bc13-734bcd836a9f.hp-dl360g8
2017-02-25 03:34:59+0100 2887174 [982]: s60 host 250 1 0 2439df30-8403-4e53-8e2b-08132df86445.hp-dl360g8
2017-02-25 04:31:57+0100 2890592 [36739]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 ioto 10 to_count 1
2017-02-25 04:31:57+0100 2890592 [36739]: s60 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 04:31:57+0100 2890592 [36739]: s60 renewal error -202 delta_length 10 last_success 2890562
2017-02-25 04:32:18+0100 2890613 [36739]: 310b5348 aio timeout RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 ioto 10 to_count 2
2017-02-25 04:32:18+0100 2890613 [36739]: s60 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 04:32:18+0100 2890613 [36739]: s60 renewal error -202 delta_length 20 last_success 2890562
2017-02-25 04:32:27+0100 2890622 [982]: s60 check_our_lease warning 60 last_success 2890562
2017-02-25 04:32:28+0100 2890623 [982]: s60 check_our_lease warning 61 last_success 2890562
2017-02-25 04:32:29+0100 2890624 [982]: s60 check_our_lease warning 62 last_success 2890562
2017-02-25 04:32:30+0100 2890625 [982]: s60 check_our_lease warning 63 last_success 2890562
2017-02-25 04:32:31+0100 2890626 [982]: s60 check_our_lease warning 64 last_success 2890562
2017-02-25 04:32:32+0100 2890627 [982]: s60 check_our_lease warning 65 last_success 2890562
2017-02-25 04:32:33+0100 2890628 [982]: s60 check_our_lease warning 66 last_success 2890562
2017-02-25 04:32:34+0100 2890629 [982]: s60 check_our_lease warning 67 last_success 2890562
2017-02-25 04:32:35+0100 2890630 [982]: s60 check_our_lease warning 68 last_success 2890562
2017-02-25 04:32:36+0100 2890631 [982]: s60 check_our_lease warning 69 last_success 2890562
2017-02-25 04:32:37+0100 2890632 [982]: s60 check_our_lease warning 70 last_success 2890562
2017-02-25 04:32:38+0100 2890633 [982]: s60 check_our_lease warning 71 last_success 2890562
2017-02-25 04:32:38+0100 2890633 [36739]: 310b5348 aio timeout RD 0x7fe53c000960:0x7fe53c000970:0x7fe552303000 ioto 10 to_count 3
2017-02-25 04:32:38+0100 2890633 [36739]: s60 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 04:32:38+0100 2890633 [36739]: s60 renewal error -202 delta_length 20 last_success 2890562
2017-02-25 04:32:39+0100 2890634 [982]: s60 check_our_lease warning 72 last_success 2890562
2017-02-25 04:32:40+0100 2890635 [982]: s60 check_our_lease warning 73 last_success 2890562
2017-02-25 04:32:41+0100 2890636 [982]: s60 check_our_lease warning 74 last_success 2890562
2017-02-25 04:32:42+0100 2890637 [982]: s60 check_our_lease warning 75 last_success 2890562
2017-02-25 04:32:43+0100 2890638 [982]: s60 check_our_lease warning 76 last_success 2890562
2017-02-25 04:32:44+0100 2890639 [982]: s60 check_our_lease warning 77 last_success 2890562
2017-02-25 04:32:45+0100 2890640 [982]: s60 check_our_lease warning 78 last_success 2890562
2017-02-25 04:32:46+0100 2890641 [982]: s60 check_our_lease warning 79 last_success 2890562
2017-02-25 04:32:47+0100 2890642 [982]: s60 check_our_lease failed 80
2017-02-25 04:32:47+0100 2890642 [982]: s60 all pids clear
2017-02-25 04:32:48+0100 2890644 [36739]: 310b5348 aio collect RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 result 1048576:0 other free r
2017-02-25 04:32:48+0100 2890644 [36739]: 310b5348 aio collect RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 result 1048576:0 other free r
2017-02-25 04:32:48+0100 2890644 [36739]: 310b5348 aio collect RD 0x7fe53c000960:0x7fe53c000970:0x7fe552303000 result 1048576:0 match reap
2017-02-25 04:32:58+0100 2890654 [992]: s61 lockspace 310b5348-f034-479d-b208-37e6a72bc21f:3:/rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids:0
2017-02-25 04:33:20+0100 2890675 [982]: s61 host 1 147 2877844 46cab34d-7520-431a-8cdc-d6e9d8ac373e.dell-r420-
2017-02-25 04:33:20+0100 2890675 [982]: s61 host 2 306 2880254 44c5b0d0-8c19-4bd9-8c13-2d7fe413d5cd.dell-r420-
2017-02-25 04:33:20+0100 2890675 [982]: s61 host 3 331 2890654 786be63c-2902-4daf-807e-2f772711bd8f.hp-dl360g8
2017-02-25 04:33:20+0100 2890675 [982]: s61 host 4 234 2887164 d1ce5e77-0d14-49a7-bc13-734bcd836a9f.hp-dl360g8
2017-02-25 04:33:20+0100 2890675 [982]: s61 host 250 1 0 2439df30-8403-4e53-8e2b-08132df86445.hp-dl360g8
2017-02-25 04:33:49+0100 2890705 [13507]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 ioto 10 to_count 1
2017-02-25 04:33:49+0100 2890705 [13507]: s61 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 04:33:49+0100 2890705 [13507]: s61 renewal error -202 delta_length 10 last_success 2890674
2017-02-25 04:34:10+0100 2890725 [13507]: 310b5348 aio timeout RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 ioto 10 to_count 2
2017-02-25 04:34:10+0100 2890725 [13507]: s61 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 04:34:10+0100 2890725 [13507]: s61 renewal error -202 delta_length 20 last_success 2890674
2017-02-25 04:34:19+0100 2890734 [982]: s61 check_our_lease warning 60 last_success 2890674
2017-02-25 04:34:20+0100 2890735 [982]: s61 check_our_lease warning 61 last_success 2890674
2017-02-25 04:34:21+0100 2890736 [982]: s61 check_our_lease warning 62 last_success 2890674
2017-02-25 04:34:22+0100 2890737 [982]: s61 check_our_lease warning 63 last_success 2890674
2017-02-25 04:34:23+0100 2890738 [982]: s61 check_our_lease warning 64 last_success 2890674
2017-02-25 04:34:24+0100 2890739 [982]: s61 check_our_lease warning 65 last_success 2890674
2017-02-25 04:34:25+0100 2890740 [982]: s61 check_our_lease warning 66 last_success 2890674
2017-02-25 04:34:26+0100 2890741 [982]: s61 check_our_lease warning 67 last_success 2890674
2017-02-25 04:34:27+0100 2890742 [982]: s61 check_our_lease warning 68 last_success 2890674
2017-02-25 04:34:28+0100 2890743 [982]: s61 check_our_lease warning 69 last_success 2890674
2017-02-25 04:34:29+0100 2890744 [982]: s61 check_our_lease warning 70 last_success 2890674
2017-02-25 04:34:30+0100 2890745 [982]: s61 check_our_lease warning 71 last_success 2890674
2017-02-25 04:34:30+0100 2890746 [13507]: 310b5348 aio timeout RD 0x7fe53c000960:0x7fe53c000970:0x7fe552303000 ioto 10 to_count 3
2017-02-25 04:34:30+0100 2890746 [13507]: s61 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 04:34:30+0100 2890746 [13507]: s61 renewal error -202 delta_length 20 last_success 2890674
2017-02-25 04:34:31+0100 2890746 [982]: s61 check_our_lease warning 72 last_success 2890674
2017-02-25 04:34:32+0100 2890747 [982]: s61 check_our_lease warning 73 last_success 2890674
2017-02-25 04:34:33+0100 2890748 [982]: s61 check_our_lease warning 74 last_success 2890674
2017-02-25 04:34:34+0100 2890749 [982]: s61 check_our_lease warning 75 last_success 2890674
2017-02-25 04:34:35+0100 2890750 [982]: s61 check_our_lease warning 76 last_success 2890674
2017-02-25 04:34:36+0100 2890751 [982]: s61 check_our_lease warning 77 last_success 2890674
2017-02-25 04:34:37+0100 2890752 [982]: s61 check_our_lease warning 78 last_success 2890674
2017-02-25 04:34:38+0100 2890753 [982]: s61 check_our_lease warning 79 last_success 2890674
2017-02-25 04:34:39+0100 2890754 [982]: s61 check_our_lease failed 80
2017-02-25 04:34:39+0100 2890755 [982]: s61 all pids clear
2017-02-25 04:34:51+0100 2890766 [13507]: 310b5348 aio timeout RD 0x7fe53c0009b0:0x7fe53c0009c0:0x7fe552201000 ioto 10 to_count 4
2017-02-25 04:34:51+0100 2890766 [13507]: s61 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 04:34:51+0100 2890766 [13507]: s61 renewal error -202 delta_length 20 last_success 2890674
2017-02-25 04:35:13+0100 2890788 [13507]: 310b5348 aio collect 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 result 1048576:0 close free
2017-02-25 04:35:13+0100 2890788 [13507]: 310b5348 aio collect 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 result 1048576:0 close free
2017-02-25 04:35:13+0100 2890788 [13507]: 310b5348 aio collect 0x7fe53c000960:0x7fe53c000970:0x7fe552303000 result 1048576:0 close free
2017-02-25 04:35:13+0100 2890788 [13507]: 310b5348 aio collect 0x7fe53c0009b0:0x7fe53c0009c0:0x7fe552201000 result 1048576:0 close free
2017-02-25 04:35:23+0100 2890798 [992]: s62 lockspace 310b5348-f034-479d-b208-37e6a72bc21f:3:/rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids:0
2017-02-25 04:35:44+0100 2890820 [982]: s62 host 1 147 2877987 46cab34d-7520-431a-8cdc-d6e9d8ac373e.dell-r420-
2017-02-25 04:35:44+0100 2890820 [982]: s62 host 2 306 2880398 44c5b0d0-8c19-4bd9-8c13-2d7fe413d5cd.dell-r420-
2017-02-25 04:35:44+0100 2890820 [982]: s62 host 3 332 2890798 786be63c-2902-4daf-807e-2f772711bd8f.hp-dl360g8
2017-02-25 04:35:44+0100 2890820 [982]: s62 host 4 234 2887307 d1ce5e77-0d14-49a7-bc13-734bcd836a9f.hp-dl360g8
2017-02-25 04:35:44+0100 2890820 [982]: s62 host 250 1 0 2439df30-8403-4e53-8e2b-08132df86445.hp-dl360g8
2017-02-25 05:14:51+0100 2893166 [15983]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 ioto 10 to_count 1
2017-02-25 05:14:51+0100 2893166 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 05:14:51+0100 2893166 [15983]: s62 renewal error -202 delta_length 10 last_success 2893136
2017-02-25 05:15:12+0100 2893187 [15983]: 310b5348 aio timeout RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 ioto 10 to_count 2
2017-02-25 05:15:12+0100 2893187 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 05:15:12+0100 2893187 [15983]: s62 renewal error -202 delta_length 20 last_success 2893136
2017-02-25 05:15:21+0100 2893196 [982]: s62 check_our_lease warning 60 last_success 2893136
2017-02-25 05:15:22+0100 2893197 [982]: s62 check_our_lease warning 61 last_success 2893136
2017-02-25 05:15:23+0100 2893198 [982]: s62 check_our_lease warning 62 last_success 2893136
2017-02-25 05:15:24+0100 2893199 [982]: s62 check_our_lease warning 63 last_success 2893136
2017-02-25 05:15:25+0100 2893200 [982]: s62 check_our_lease warning 64 last_success 2893136
2017-02-25 05:15:26+0100 2893201 [982]: s62 check_our_lease warning 65 last_success 2893136
2017-02-25 05:15:27+0100 2893202 [15983]: 310b5348 aio collect RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 result 1048576:0 other free
2017-02-25 05:15:27+0100 2893202 [15983]: 310b5348 aio collect RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 result 1048576:0 other free
2017-02-25 05:16:38+0100 2893273 [15983]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552303000 ioto 10 to_count 3
2017-02-25 05:16:38+0100 2893273 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 05:16:38+0100 2893273 [15983]: s62 renewal error -202 delta_length 10 last_success 2893243
2017-02-25 05:16:41+0100 2893277 [15983]: 310b5348 aio collect RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552303000 result 1048576:0 match reap
2017-02-25 05:18:13+0100 2893369 [15983]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552303000 ioto 10 to_count 4
2017-02-25 05:18:13+0100 2893369 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 05:18:13+0100 2893369 [15983]: s62 renewal error -202 delta_length 10 last_success 2893338
2017-02-25 05:18:34+0100 2893389 [15983]: 310b5348 aio timeout RD 0x7fe53c000910:0x7fe53c000920:0x7fe556d09000 ioto 10 to_count 5
2017-02-25 05:18:34+0100 2893389 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 05:18:34+0100 2893389 [15983]: s62 renewal error -202 delta_length 20 last_success 2893338
2017-02-25 05:18:43+0100 2893398 [982]: s62 check_our_lease warning 60 last_success 2893338
2017-02-25 05:18:44+0100 2893399 [982]: s62 check_our_lease warning 61 last_success 2893338
2017-02-25 05:18:45+0100 2893400 [982]: s62 check_our_lease warning 62 last_success 2893338
2017-02-25 05:18:46+0100 2893401 [982]: s62 check_our_lease warning 63 last_success 2893338
2017-02-25 05:18:47+0100 2893402 [982]: s62 check_our_lease warning 64 last_success 2893338
2017-02-25 05:18:48+0100 2893403 [982]: s62 check_our_lease warning 65 last_success 2893338
2017-02-25 05:18:49+0100 2893404 [982]: s62 check_our_lease warning 66 last_success 2893338
2017-02-25 05:18:50+0100 2893405 [982]: s62 check_our_lease warning 67 last_success 2893338
2017-02-25 05:18:51+0100 2893406 [982]: s62 check_our_lease warning 68 last_success 2893338
2017-02-25 05:18:52+0100 2893407 [982]: s62 check_our_lease warning 69 last_success 2893338
2017-02-25 05:18:53+0100 2893408 [982]: s62 check_our_lease warning 70 last_success 2893338
2017-02-25 05:18:54+0100 2893409 [15983]: 310b5348 aio collect RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552303000 result 1048576:0 other free
2017-02-25 05:18:54+0100 2893409 [982]: s62 check_our_lease warning 71 last_success 2893338
2017-02-25 05:18:54+0100 2893409 [15983]: 310b5348 aio collect RD 0x7fe53c000910:0x7fe53c000920:0x7fe556d09000 result 1048576:0 other free
2017-02-25 05:38:14+0100 2894569 [15983]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552d07000 ioto 10 to_count 6
2017-02-25 05:38:14+0100 2894569 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 05:38:14+0100 2894569 [15983]: s62 renewal error -202 delta_length 10 last_success 2894538
2017-02-25 05:38:34+0100 2894589 [15983]: 310b5348 aio timeout RD 0x7fe53c000910:0x7fe53c000920:0x7fe556d09000 ioto 10 to_count 7
2017-02-25 05:38:34+0100 2894589 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 05:38:34+0100 2894589 [15983]: s62 renewal error -202 delta_length 20 last_success 2894538
2017-02-25 05:38:42+0100 2894598 [982]: s62 check_our_lease warning 60 last_success 2894538
2017-02-25 05:38:43+0100 2894599 [982]: s62 check_our_lease warning 61 last_success 2894538
2017-02-25 05:38:44+0100 2894600 [982]: s62 check_our_lease warning 62 last_success 2894538
2017-02-25 05:38:45+0100 2894601 [982]: s62 check_our_lease warning 63 last_success 2894538
2017-02-25 05:38:46+0100 2894602 [982]: s62 check_our_lease warning 64 last_success 2894538
2017-02-25 05:38:47+0100 2894603 [982]: s62 check_our_lease warning 65 last_success 2894538
2017-02-25 05:38:48+0100 2894604 [982]: s62 check_our_lease warning 66 last_success 2894538
2017-02-25 05:38:49+0100 2894605 [982]: s62 check_our_lease warning 67 last_success 2894538
2017-02-25 05:38:50+0100 2894606 [982]: s62 check_our_lease warning 68 last_success 2894538
2017-02-25 05:38:51+0100 2894607 [982]: s62 check_our_lease warning 69 last_success 2894538
2017-02-25 05:38:52+0100 2894608 [982]: s62 check_our_lease warning 70 last_success 2894538
2017-02-25 05:38:53+0100 2894609 [982]: s62 check_our_lease warning 71 last_success 2894538
2017-02-25 05:38:54+0100 2894610 [982]: s62 check_our_lease warning 72 last_success 2894538
2017-02-25 05:38:55+0100 2894610 [15983]: 310b5348 aio timeout RD 0x7fe53c000960:0x7fe53c000970:0x7fe552303000 ioto 10 to_count 8
2017-02-25 05:38:55+0100 2894610 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 05:38:55+0100 2894610 [15983]: s62 renewal error -202 delta_length 20 last_success 2894538
2017-02-25 05:38:55+0100 2894611 [982]: s62 check_our_lease warning 73 last_success 2894538
2017-02-25 05:38:56+0100 2894612 [982]: s62 check_our_lease warning 74 last_success 2894538
2017-02-25 05:38:57+0100 2894613 [982]: s62 check_our_lease warning 75 last_success 2894538
2017-02-25 05:38:58+0100 2894614 [982]: s62 check_our_lease warning 76 last_success 2894538
2017-02-25 05:38:59+0100 2894615 [982]: s62 check_our_lease warning 77 last_success 2894538
2017-02-25 05:39:00+0100 2894616 [982]: s62 check_our_lease warning 78 last_success 2894538
2017-02-25 05:39:01+0100 2894616 [15983]: 310b5348 aio collect RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552d07000 result 1048576:0 other free r
2017-02-25 05:39:01+0100 2894616 [15983]: 310b5348 aio collect RD 0x7fe53c000910:0x7fe53c000920:0x7fe556d09000 result 1048576:0 other free r
2017-02-25 05:39:01+0100 2894616 [15983]: 310b5348 aio collect RD 0x7fe53c000960:0x7fe53c000970:0x7fe552303000 result 1048576:0 match reap
2017-02-25 06:19:10+0100 2897026 [15983]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552303000 ioto 10 to_count 9
2017-02-25 06:19:10+0100 2897026 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 06:19:10+0100 2897026 [15983]: s62 renewal error -202 delta_length 10 last_success 2896995
2017-02-25 06:19:31+0100 2897046 [15983]: 310b5348 aio timeout RD 0x7fe53c000910:0x7fe53c000920:0x7fe556d09000 ioto 10 to_count 10
2017-02-25 06:19:31+0100 2897046 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 06:19:31+0100 2897046 [15983]: s62 renewal error -202 delta_length 20 last_success 2896995
2017-02-25 06:19:40+0100 2897055 [982]: s62 check_our_lease warning 60 last_success 2896995
2017-02-25 06:19:41+0100 2897056 [982]: s62 check_our_lease warning 61 last_success 2896995
2017-02-25 06:19:42+0100 2897057 [982]: s62 check_our_lease warning 62 last_success 2896995
2017-02-25 06:19:43+0100 2897058 [982]: s62 check_our_lease warning 63 last_success 2896995
2017-02-25 06:19:44+0100 2897059 [982]: s62 check_our_lease warning 64 last_success 2896995
2017-02-25 06:19:45+0100 2897060 [982]: s62 check_our_lease warning 65 last_success 2896995
2017-02-25 06:19:46+0100 2897061 [982]: s62 check_our_lease warning 66 last_success 2896995
2017-02-25 06:19:47+0100 2897062 [982]: s62 check_our_lease warning 67 last_success 2896995
2017-02-25 06:19:48+0100 2897063 [982]: s62 check_our_lease warning 68 last_success 2896995
2017-02-25 06:19:49+0100 2897064 [982]: s62 check_our_lease warning 69 last_success 2896995
2017-02-25 06:19:50+0100 2897065 [982]: s62 check_our_lease warning 70 last_success 2896995
2017-02-25 06:19:51+0100 2897066 [982]: s62 check_our_lease warning 71 last_success 2896995
2017-02-25 06:19:51+0100 2897067 [15983]: 310b5348 aio timeout RD 0x7fe53c000960:0x7fe53c000970:0x7fe552d07000 ioto 10 to_count 11
2017-02-25 06:19:51+0100 2897067 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 06:19:51+0100 2897067 [15983]: s62 renewal error -202 delta_length 20 last_success 2896995
2017-02-25 06:19:52+0100 2897067 [982]: s62 check_our_lease warning 72 last_success 2896995
2017-02-25 06:19:53+0100 2897068 [982]: s62 check_our_lease warning 73 last_success 2896995
2017-02-25 06:19:54+0100 2897069 [982]: s62 check_our_lease warning 74 last_success 2896995
2017-02-25 06:19:55+0100 2897070 [982]: s62 check_our_lease warning 75 last_success 2896995
2017-02-25 06:19:56+0100 2897071 [982]: s62 check_our_lease warning 76 last_success 2896995
2017-02-25 06:19:57+0100 2897072 [982]: s62 check_our_lease warning 77 last_success 2896995
2017-02-25 06:19:58+0100 2897073 [982]: s62 check_our_lease warning 78 last_success 2896995
2017-02-25 06:19:59+0100 2897074 [982]: s62 check_our_lease warning 79 last_success 2896995
2017-02-25 06:20:00+0100 2897075 [982]: s62 check_our_lease failed 80
2017-02-25 06:20:00+0100 2897075 [15983]: 310b5348 aio collect RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552303000 result 1048576:0 other free r
2017-02-25 06:20:00+0100 2897075 [15983]: 310b5348 aio collect RD 0x7fe53c000910:0x7fe53c000920:0x7fe556d09000 result 1048576:0 other free r
2017-02-25 06:20:00+0100 2897075 [15983]: 310b5348 aio collect RD 0x7fe53c000960:0x7fe53c000970:0x7fe552d07000 result 1048576:0 match reap
2017-02-25 06:20:00+0100 2897075 [982]: s62 all pids clear
2017-02-25 06:20:10+0100 2897086 [991]: s63 lockspace 310b5348-f034-479d-b208-37e6a72bc21f:3:/rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids:0
2017-02-25 06:20:32+0100 2897107 [982]: s63 host 1 147 2884267 46cab34d-7520-431a-8cdc-d6e9d8ac373e.dell-r420-
2017-02-25 06:20:32+0100 2897107 [982]: s63 host 2 306 2886699 44c5b0d0-8c19-4bd9-8c13-2d7fe413d5cd.dell-r420-
2017-02-25 06:20:32+0100 2897107 [982]: s63 host 3 333 2897086 786be63c-2902-4daf-807e-2f772711bd8f.hp-dl360g8
2017-02-25 06:20:32+0100 2897107 [982]: s63 host 4 234 2893587 d1ce5e77-0d14-49a7-bc13-734bcd836a9f.hp-dl360g8
2017-02-25 06:20:32+0100 2897107 [982]: s63 host 250 1 0 2439df30-8403-4e53-8e2b-08132df86445.hp-dl360g8
2017-02-25 08:48:20+0100 2905976 [11649]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 ioto 10 to_count 1
2017-02-25 08:48:20+0100 2905976 [11649]: s63 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 08:48:20+0100 2905976 [11649]: s63 renewal error -202 delta_length 10 last_success 2905945
2017-02-25 08:48:41+0100 2905996 [11649]: 310b5348 aio timeout RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 ioto 10 to_count 2
2017-02-25 08:48:41+0100 2905996 [11649]: s63 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 08:48:41+0100 2905996 [11649]: s63 renewal error -202 delta_length 20 last_success 2905945
2017-02-25 08:48:50+0100 2906005 [982]: s63 check_our_lease warning 60 last_success 2905945
2017-02-25 08:48:51+0100 2906006 [982]: s63 check_our_lease warning 61 last_success 2905945
2017-02-25 08:48:52+0100 2906007 [982]: s63 check_our_lease warning 62 last_success 2905945
2017-02-25 08:48:53+0100 2906008 [982]: s63 check_our_lease warning 63 last_success 2905945
2017-02-25 08:48:54+0100 2906009 [982]: s63 check_our_lease warning 64 last_success 2905945
2017-02-25 08:48:55+0100 2906010 [982]: s63 check_our_lease warning 65 last_success 2905945
2017-02-25 08:48:56+0100 2906011 [982]: s63 check_our_lease warning 66 last_success 2905945
2017-02-25 08:48:57+0100 2906012 [982]: s63 check_our_lease warning 67 last_success 2905945
2017-02-25 08:48:58+0100 2906013 [982]: s63 check_our_lease warning 68 last_success 2905945
2017-02-25 08:48:59+0100 2906014 [982]: s63 check_our_lease warning 69 last_success 2905945
2017-02-25 08:49:00+0100 2906015 [982]: s63 check_our_lease warning 70 last_success 2905945
2017-02-25 08:49:01+0100 2906016 [982]: s63 check_our_lease warning 71 last_success 2905945
2017-02-25 08:49:01+0100 2906017 [11649]: 310b5348 aio timeout RD 0x7fe53c000960:0x7fe53c000970:0x7fe552303000 ioto 10 to_count 3
2017-02-25 08:49:01+0100 2906017 [11649]: s63 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 08:49:01+0100 2906017 [11649]: s63 renewal error -202 delta_length 20 last_success 2905945
2017-02-25 08:49:02+0100 2906017 [982]: s63 check_our_lease warning 72 last_success 2905945
2017-02-25 08:49:03+0100 2906018 [982]: s63 check_our_lease warning 73 last_success 2905945
2017-02-25 08:49:04+0100 2906019 [982]: s63 check_our_lease warning 74 last_success 2905945
2017-02-25 08:49:05+0100 2906020 [982]: s63 check_our_lease warning 75 last_success 2905945
2017-02-25 08:49:06+0100 2906021 [982]: s63 check_our_lease warning 76 last_success 2905945
2017-02-25 08:49:06+0100 2906021 [11649]: 310b5348 aio collect RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 result 1048576:0 other free r
2017-02-25 08:49:06+0100 2906021 [11649]: 310b5348 aio collect RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 result 1048576:0 other free r
2017-02-25 08:49:06+0100 2906021 [11649]: 310b5348 aio collect RD 0x7fe53c000960:0x7fe53c000970:0x7fe552303000 result 1048576:0 match reap

Comment 9 Yaniv Bronhaim 2017-03-07 12:23:53 UTC
Can you please elaborate on what happens when this storage issue occurs? You wrote in the bug description "in this case which causes vdsm not to respond to any calls. Restart of the vdsm solves the issue."

Is the bug about the storage issue getting stuck, or about oVirt's handling when such a storage issue occurs? Probably, if the storage is stuck or unresponsive, we need to trigger our fencing logic.
A timeout is something that ends. I guess that after the timeout expires vdsm should be responsive again. How does restarting solve it if the storage is still not responsive? Maybe it could also be solved by just waiting for the timeout and triggering the operation again?

Comment 10 Yaniv Bronhaim 2017-03-09 10:51:11 UTC
After a long discussion over email, the resolution should be:
1. Add logs for incoming requests with a source description (IP or service name).
2. Wait for reproduction - Denis, please raise the vdsm log level (it will help) and ping us (Piotr or me) when this happens, before restarting vdsmd.
3. Denis - check if you can also provide what Nir requested; maybe it will help to reproduce it offline.

Comment 11 Yaniv Bronhaim 2017-03-09 10:51:52 UTC
Created attachment 1261508 [details]
engine log 1

Comment 12 Yaniv Bronhaim 2017-03-09 10:52:44 UTC
Created attachment 1261509 [details]
engine log 2

Comment 13 Nir Soffer 2017-03-09 12:28:53 UTC
Basically this bug is about storage issues causing queues to become full.

The noise this bug made shows that our logs make it too hard to understand
unresponsive storage issues.

Adding a log when starting a request that may block is a good idea; we usually do this
in storage, but it is not enough. What we need is an active system that reports
at runtime if a request has been blocked for a long time.

I think the best place to implement this is in the executor - we already have
a mechanism to discard tasks, we can extend it to warn about stuck workers.

Possible design:

    executor.dispatch(callback, warn_timeout=None, discard_timeout=None)

Every warn_timeout seconds, the executor can check the worker and warn about a
blocked worker. This log can use the worker's repr() to show the function, arguments and context of the worker.

If a worker is blocked for a long time, we will see multiple warnings about it; this is good, as it makes it easy to detect issues by grepping the logs.

If discard_timeout is set, the worker will be discarded, same logic as we have
today.
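
A rough sketch of how this could look (illustrative only; the Worker class, the wiring
inside dispatch(), and the use of threading.Timer are assumptions, not the actual
vdsm executor internals):

# Illustrative sketch of the warn/discard idea above; names and scheduling
# details are assumptions, not the actual vdsm executor implementation.
import logging
import threading
import time

log = logging.getLogger("executor-sketch")


class Worker(object):

    def __init__(self, callback):
        self.callback = callback
        self.started = time.time()
        self.done = threading.Event()

    def __repr__(self):
        # repr() shows the callable and how long it has been running, so
        # repeated warnings about the same stuck worker are easy to grep.
        return "<Worker %r running %.1fs>" % (
            self.callback, time.time() - self.started)

    def run(self):
        try:
            self.callback()
        finally:
            self.done.set()


def dispatch(callback, warn_timeout=None, discard_timeout=None):
    worker = Worker(callback)

    def check():
        if worker.done.is_set():
            return
        age = time.time() - worker.started
        if discard_timeout is not None and age >= discard_timeout:
            log.warning("discarding blocked worker: %r", worker)
            # a real executor would replace the worker here
            return
        log.warning("worker blocked: %r", worker)
        threading.Timer(warn_timeout, check).start()   # warn again later

    if warn_timeout is not None:
        threading.Timer(warn_timeout, check).start()
    threading.Thread(target=worker.run).start()
    return worker

With something like this, a worker stuck on a slow storage call keeps producing a
warning every warn_timeout seconds until it completes or is discarded, matching the
repeated, grep-friendly signal described above.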

Eventually everything in vdsm should be done in an executor, so this will solve
this issue globally in the application.

One area that cannot use executors is storage monitoring, which is done in a dedicated thread;
this thread currently does not log a warning when storage checkers are stuck. This
is handled in these patches:
https://gerrit.ovirt.org/#/q/topic:monitor-alerts

I'm adding this bug to the patches, as this is a good example of a case that would
be easier to handle with these logs.

Comment 14 Nir Soffer 2017-03-09 12:36:22 UTC
Francesco, can you check comment 13?

Comment 15 Francesco Romani 2017-03-09 12:50:06 UTC
(In reply to Nir Soffer from comment #13)
> Basically this bug is about storage issues causing queues to become full.
> 
> The noise this bug made shows that our logs make it too hard to understand
> unresponsive storage issues.
> 
> Adding a log when starting a request that may block is a good idea; we usually
> do this in storage, but it is not enough. What we need is an active system
> that reports at runtime if a request has been blocked for a long time.

I agree with this assessment

> I think the best place to implement this is in the executor - we already have
> a mechanism to discard tasks, we can extend it to warn about stuck workers.

Agreed, executor should be in charge

> 
> Possible design:
> 
>     executor.dispatch(callback, warn_timeout=None, discard_timeout=None)
> 
> Every warn_timeout seconds, the executor can check the worker and warn about
> a blocked worker. This log can use the worker's repr() to show the function,
> arguments and context of the worker.

Why two timeouts? The more knobs we have, the harder it gets to have good defaults.
Another option could be:

  executor.dispatch(callback, timeout=None, discard_on_timeout=True)

With the defaults, same behaviour as today.
With discard_on_timeout=False, if the timeout expires:
  emit a warning in the logs about the stuck task, using repr() and the infra
  you just described, then
  reschedule the check after <timeout> secs.

For discarded workers, I think it is enough if we periodically dump the executor state (again, no new infra needed, we can use repr()) if there are any stuck workers. Otherwise it is better to be silent.

Summary: I agree with the concept, but I think we should have only one timeout if possible, to reduce the number of knobs.

> If a worker is blocked for a long time, we will see multiple warnings about
> it; this is good, as it makes it easy to detect issues by grepping the logs.

Agreed; also, having a reliable executor state dump is a good thing when things are blocked. Today we do the dump only if we discarded a worker; if the executor pool is exhausted, we don't get dumps anymore.

> Eventually everything in vdsm should be done in an executor, so this will
> solve this issue globally in the application.

Agreed. If we don't need the full power of the executor, it is also easy to make it behave like a standard thread pool.

Comment 16 Francesco Romani 2017-03-09 12:57:36 UTC
restoring NEEDINFO cleared by mistake

Comment 17 Nir Soffer 2017-03-09 13:04:48 UTC
(In reply to Francesco Romani from comment #15)
I agree with the suggested simplification; this will make it easier to implement
and backport to 4.1.z with less risk.

Not sure what the timeout value for warnings should be for the jsonrpc executor. At this level
we don't have any context, so we must use a large timeout that will not make unwanted
noise for slow operations (storage sync operations, network setup, etc.).

Comment 18 Francesco Romani 2017-03-09 14:22:00 UTC
Patches https://gerrit.ovirt.org/73845 - https://gerrit.ovirt.org/73849 are attached to this BZ, but they will not solve it - they cannot, as the above comments explain. We are just aiming to make the troubleshooting easier.

Comment 19 Roy Golan 2017-03-14 13:38:59 UTC
Can you add this error to vdsmfake so we could more easily write an integration test? At least it would be easy to simulate the error and test the engine behaviour. If you need help with it, let me know.

Comment 20 Roy Golan 2017-03-14 13:42:54 UTC
Guy, Eyal, has anyone seen the 'TooManyTasks' error in vdsm during scale-up tests?

Comment 21 Piotr Kliczewski 2017-03-14 16:10:50 UTC
@Roy what do you want to test specifically? This error would trigger soft fencing, which doesn't make sense with fake vdsm.

Comment 22 Martin Perina 2017-03-17 14:02:28 UTC
@Francesco, do we have all logging patches attached to this bug or are there some more?

Comment 23 Francesco Romani 2017-03-20 08:49:09 UTC
(In reply to Martin Perina from comment #22)
> @Francesco, do we have all logging patches attached to this bug or are there
> some more?

TL;DR: from my side, yes.

The executor (aka the enhanced thread pool) gained all the logs we thought we could need; however, to fully make use of the new facilities the client code needs to be reviewed.

Comment 24 Yaniv Bronhaim 2017-03-31 06:35:41 UTC
Nir, Francesco - the bug is still assigned to infra, which I'm not sure whether to change or not, but all the patches are yours. This is a 4.1 bug - please backport your changes ASAP.
Thanks

Comment 25 Francesco Romani 2017-04-03 10:44:51 UTC
(In reply to Yaniv Bronhaim from comment #24)
> Nir, Francesco - the bug is still infra which I'm not sure if to change it
> or not, but all the patches are yours. this is 4.1 bug - please backport
> your changes ASAP.
> Thanks

From my side, besides the backporting itself, we just lack the final wiring.
I did this in https://gerrit.ovirt.org/#/c/75051/

For periodic operations, I think we already have good enough reporting.

Comment 26 Francesco Romani 2017-04-03 15:19:16 UTC
Taking the bug since I posted most of the patches, but keeping it under "infra" because the patches are in the shared part.

Comment 27 Nir Soffer 2017-04-03 21:49:42 UTC
I have two patches to backport, and one additional patch is needed to fix flaky 
new tests on travis.

Comment 28 Francesco Romani 2017-04-12 11:18:47 UTC
jsonrpc patches landed in 4.1. Waiting for the storage patches before moving to MODIFIED.

Comment 29 Francesco Romani 2017-04-18 06:56:31 UTC
storage patches (https://gerrit.ovirt.org/#/q/topic:backport/4.1/monitor-alerts) merged

Comment 30 Francesco Romani 2017-04-19 08:22:33 UTC
no doc_text required: we are adding logs about something which should never happen under normal operation.

Comment 31 Francesco Romani 2017-04-26 14:06:48 UTC
I think we found the root cause of RPC pool being exhausted thanks to the patches added for this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1443654#c5

Comment 32 Lukas Svaty 2017-05-03 08:00:36 UTC
Hi Francesco,

any hints how to reproduce these?

Comment 33 Francesco Romani 2017-05-03 08:14:12 UTC
(In reply to Lukas Svaty from comment #32)
> Hi Francesco,
> 
> any hints how to reproduce these?

Yep. Grab a scapegoat host running a few (at least 4) VMs. The storage domain type should be irrelevant; NFS should be fine (simpler to set up).

Either unplug the network cable the VMs are using, or kill the storage connection (iptables rules are fine).

After >= 20 minutes you should start to see TooManyTasks.

You may want to read https://bugzilla.redhat.com/show_bug.cgi?id=1443654 to see another facet of this issue.

Comment 34 Petr Matyáš 2017-05-10 12:39:19 UTC
So, I had 5 VMs running on a host, then I killed the host's connection to storage (with iptables), which resulted in the host, and the VMs running on it, going non-responsive.
After a while (2 hours) I checked the logs for TooManyTasks; none were found.
After unblocking the connection all the VMs and the host came back up. However, those VMs had some problems with the guest agent, which didn't come up right away, but after some time even the guest agent launched and is now working correctly.

This in my opinion should be enough to verify this bug, but I'd rather ask first.

Comment 35 Francesco Romani 2017-05-10 12:52:34 UTC
(In reply to Petr Matyáš from comment #34)
> So, I had 5 VMs running on a host, then I killed connection of the host to
> storage (with iptables) which resulted in host, and VMs running on it, going
> non responsive.
> After a while (2 hours) I checked the logs for TooManyTasks, none were found.
> After unblocking the connection all the VMs and the host became up. However
> those VMs had some problems with guest agent which didn't come up right
> away, but after some time even the guest agent launched and is now working
> correctly.
> 
> This in my opinion should be enough to verify this bug, but I'll rather ask
> first.

This is partially true. The verification steps are good, but the fact that you did *not* see TooManyTasks is inconclusive *for this bug*.

We had a few bugs related to this issue (resource pool exhaustion -> TooManyTasks triggered).

This verification is actually a good verification for https://bugzilla.redhat.com/show_bug.cgi?id=1443654 - still related to this area, but different.

In *this* bug, we want to make sure that *when* we get TooManyTasks, *then* we have a better error message in the log pointing to the bad operation.

https://bugzilla.redhat.com/show_bug.cgi?id=1443654 is also relevant because it should remove one big cause of TooManyTasks, but it is not yet merged in Vdsm 4.19.12

To increase the chance of seeing TooManyTasks, you may want to aggressively call getAllVmIoTunePolicies using jsonrpc

Also, a 2 hour waiting time is more than enough; ~30 minutes should be sufficient to start seeing TooManyTasks.
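
For example, a minimal sketch of such a load generator (the script itself is hypothetical, not part of vdsm; it assumes vdsm-client is installed on the host and Python 3, or the futures backport on Python 2):

    from concurrent.futures import ThreadPoolExecutor
    import subprocess

    def call_verb():
        # each call goes through jsonrpc and keeps an rpc worker busy on vdsm
        subprocess.call(["vdsm-client", "Host", "getAllVmIoTunePolicies"])

    # keep many calls in flight to stress the rpc thread pool
    with ThreadPoolExecutor(max_workers=50) as pool:
        for _ in range(1000):
            pool.submit(call_verb)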

Comment 36 Petr Matyáš 2017-05-10 15:41:23 UTC
Well, after trying again with 28 VMs and 33 infinite loops running 'vdsm-client Host getAllVmIoTunePolicies', again no TooManyTasks error was to be found. Maybe this can be tested by the scale team.

Comment 37 Martin Perina 2017-05-10 18:28:55 UTC
(In reply to Petr Matyáš from comment #36)
> Well, after trying again with 28 VMs and 33 infinite cycles running
> 'vdsm-client Host getAllVmIoTunePolicies', again none TooManyTasks error to
> be found. Maybe this can be tested by scale team.

This bug is not about fixing the TooManyTasks issue, it just adds logging so we can understand why the VDSM thread pool is being exhausted.

Comment 38 Martin Perina 2017-05-10 18:31:11 UTC
(In reply to Martin Perina from comment #37)
> (In reply to Petr Matyáš from comment #36)
> > Well, after trying again with 28 VMs and 33 infinite cycles running
> > 'vdsm-client Host getAllVmIoTunePolicies', again none TooManyTasks error to
> > be found. Maybe this can be tested by scale team.
> 
> This bug is not about fixing the TooManyTasks issue, it just adds logging so
> we can understand what's the reason that VDSM thread pool is exhausted

Sorry Petr, I misunderstood your reply; now I understand that you were not able to simulate the TooManyTasks issue. Francesco, any idea how to simulate that?

Comment 39 Francesco Romani 2017-05-11 07:06:42 UTC
(In reply to Martin Perina from comment #38)
> (In reply to Martin Perina from comment #37)
> > (In reply to Petr Matyáš from comment #36)
> > > Well, after trying again with 28 VMs and 33 infinite cycles running
> > > 'vdsm-client Host getAllVmIoTunePolicies', again none TooManyTasks error to
> > > be found. Maybe this can be tested by scale team.
> > 
> > This bug is not about fixing the TooManyTasks issue, it just adds logging so
> > we can understand what's the reason that VDSM thread pool is exhausted
> 
> Sorry Petr, I misunderstood your reply, now I understood that you were not
> able to simulate TooManyTasks issue. Francesco, any idea how to simulate
> that?

I concur that the scale team probably has a better chance and better tooling to see this.

I was able to reproduce TooManyTasks (and so investigate and deliver a patch for
https://bugzilla.redhat.com/show_bug.cgi?id=1443654) using 4 VMs with mixed SDs.

The VMs had a couple of disks each, one disk on NFS storage and one on iSCSI storage. Then I followed the steps in https://bugzilla.redhat.com/show_bug.cgi?id=1428415#c35

Comment 40 Eldad Marciano 2017-05-22 14:26:27 UTC
Francesco,
we've reproduced this bug via multiple verb executions [1].

As suggested, we used the getAllVmIoTunePolicies verb to reproduce the issue.

I found that the max verb concurrency we can use is around 32 tasks (e.g. vdsm-client Host getAllVmIoTunePolicies); are you familiar with such a limit?
I guess they all run and are published to the queue, but are just pushed to the end of the queue.

test results:
50 concurrent tasks - passed.
100 concurrent tasks - passed.
200 concurrent tasks - failed - we started to see the TooManyTasks errors.

I used the following script to correlate the concurrency vs. errors in vdsm [2].

lab setup:
1 DC, 1 Cl, 1 Host, 1 iSCSI SD, 500 VMs (1 disk each).

[1]
https://gist.github.com/mrsiano/0a1185285f18b48a3497065df1dd76f4

[2]
# Every second, log how many vdsm-client verbs are running vs. how many
# TooManyTasks errors have appeared so far in the vdsm log.
# ([v]dsm-client keeps the grep process itself out of the count.)
while true; do
echo "`date |awk '{print $1"-"$2"-"$3"-"$4}'` - verbs: `ps -ef |grep -c [v]dsm-client` error: `grep -c TooManyTasks /var/log/vdsm/vdsm.log`" >> /tmp/verberror.log
sleep 1
done


output (just a small piece):
Mon-May-22-17:15:28 - verbs: 32 error: 0
Mon-May-22-17:15:29 - verbs: 33 error: 0
Mon-May-22-17:15:31 - verbs: 33 error: 2
Mon-May-22-17:15:33 - verbs: 32 error: 2


BTW, we still see VMs in a non-responsive state.

Comment 41 Francesco Romani 2017-05-22 14:53:03 UTC
(In reply to Eldad Marciano from comment #40)
> Francesco,
> we've reproduced this bug via multiple verb execution [1].
> 
> as suggested we use the getAllVmIoTunePolicies verb to reproduce the issue.
> 
> I found the max verb concurrency we can used is around 32 tasks(e.g
> vdsm-client Host getAllVmIoTunePolicies), do you familiar with such number
> of limitation ?
> I guess, they run and published to the queue but just pushed to end of the
> queue.

Well, there must be a limit somewhere, but we didn't do any specific tuning yet, so I can't say where this limit is. 32 is as valid a number as anything else; I guess it makes sense.

> test results:
> 50 concurrent tasks - passed.
> 100 concurrent tasks - passed.
> 200 concurrent tasks - failed - we started to see the TooManyTasks errors.

OK, can you share some Vdsm logs? If you see some extra information like the example in https://gerrit.ovirt.org/#/c/75074/ we can mark *this* bug as verified.

Comment 42 Eldad Marciano 2017-05-22 15:36:54 UTC
Seems like yes, please approve:

2017-05-22 18:34:57,759+0300 WARN  (vdsm.Scheduler) [virt.periodic.Operation] could not run <VmDispatcher operation=<class 'vdsm.virt.periodic.DriveWatermarkMonitor'> at 0x25d1c50>, executor queue full (periodic:235)

2017-05-22 18:35:56,105+0300 INFO  (vdsm.Scheduler) [Executor] Worker discarded: <Worker name=periodic/10770 running <Task discardable <Operation action=<VmDispatcher operation=<class 'vdsm.virt.periodic.DriveWatermarkMonitor'> at 0x25d1c50> at 0x25d1c90> timeout=1.0, duration=1 at 0x7f0a35542990> discarded task#=264 at 0x28acfd0> (executor:339)

Comment 43 Michal Skrivanek 2017-05-23 07:48:01 UTC
As per your comment #37 this bug is about additional logging. Feel free to move to VERIFIED.

Comment 44 Martin Perina 2017-05-23 07:57:36 UTC
Moving to verified per Comment 42

