Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1433380

Summary: Execute fencing on hosts whose RPC pool is exhausted
Product: [oVirt] ovirt-engine
Reporter: Martin Perina <mperina>
Component: BLL.Infra
Assignee: Piotr Kliczewski <pkliczew>
Status: CLOSED CURRENTRELEASE
QA Contact: Petr Matyáš <pmatyas>
Severity: high
Docs Contact:
Priority: high
Version: 4.1.0.3
CC: bugs, dprezhev, eberman, fromani, lsvaty, mperina, nsoffer, oourfali, pkliczew, ratamir, rgolan, ybronhei
Target Milestone: ovirt-4.2.0
Flags: rule-engine: ovirt-4.2+
       lsvaty: testing_ack+
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1428415
Environment:
Last Closed: 2017-12-20 11:09:08 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Martin Perina 2017-03-17 13:55:45 UTC
+++ This bug was initially created as a clone of Bug #1428415 +++

In one of our environments we observed the following storage timeout occurring:

2017-02-25 03:30:43,723 ERROR (periodic/27) [storage.TaskManager.Task] (Task='75fcbc90-d17f-47a3-80ed-89b5b43f08ca') Unexpected error (task:870)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 877, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3016, in getVolumeSize
    apparentsize = str(dom.getVSize(imgUUID, volUUID))
  File "/usr/share/vdsm/storage/sd.py", line 663, in getVSize
    return self._manifest.getVSize(imgUUID, volUUID)
  File "/usr/share/vdsm/storage/fileSD.py", line 189, in getVSize
    return self.oop.os.stat(volPath).st_size
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 254, in stat
    return self._iop.stat(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 484, in stat
    resdict = self._sendCommand("stat", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 455, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out

After some time we see:

2017-02-25 03:34:24,917 WARN  (jsonrpc/2) [virt.vm] (vmId='5f56d327-8348-4fc0-867c-819348d96847') monitor became unresponsive (command timeout, age=70.5100000007) (vm:4893)
2017-02-25 03:34:24,918 WARN  (jsonrpc/2) [virt.vm] (vmId='094f498a-ec1c-4792-b6b0-ac982e912c36') monitor became unresponsive (command timeout, age=70.5100000007) (vm:4893)
2017-02-25 03:34:24,919 WARN  (jsonrpc/2) [virt.vm] (vmId='24fbd184-3ecd-4031-9cb1-a29bc5b3fa2a') monitor became unresponsive (command timeout, age=70.5100000007) (vm:4893)
2017-02-25 03:34:24,919 WARN  (jsonrpc/2) [virt.vm] (vmId='683fb432-76d1-4087-b6c4-73bfe4de9843') monitor became unresponsive (command timeout, age=70.5100000007) (vm:4893)
2017-02-25 03:34:24,920 WARN  (jsonrpc/2) [virt.vm] (vmId='e77ae01d-6401-4653-aacc-7646f4f8402d') monitor became unresponsive (command timeout, age=70.5100000007) (vm:4893)
2017-02-25 03:34:24,921 WARN  (jsonrpc/2) [virt.vm] (vmId='269d0d0a-7cc3-4661-886f-f6d929cada63') monitor became unresponsive (command timeout, age=70.5100000007) (vm:4893)
2017-02-25 03:34:24,921 WARN  (jsonrpc/2) [virt.vm] (vmId='d40e0bc6-dad5-4ee9-bfa1-09f074882e78') monitor became unresponsive (command timeout, age=70.5100000007) (vm:4893)

In the meantime we see:

2017-02-25 03:34:24,976 ERROR (jsonrpc/3) [virt.vm] (vmId='cd14a081-35aa-4a7e-97fa-442d6d864942') getVmIoTune failed (vm:2773)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 2758, in getIoTuneResponse
    libvirt.VIR_DOMAIN_AFFECT_LIVE)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 77, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainGetBlockIoTune)

Over time this ends up with:

2017-02-25 03:59:32,904 ERROR (JsonRpcServer) [jsonrpc.JsonRpcServer] could not allocate request thread (__init__:626)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 624, in _runRequest
    self._threadFactory(partial(self._serveRequest, ctx, request))
  File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 148, in dispatch
    self._tasks.put(Task(callable, timeout))
  File "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 361, in put
    raise TooManyTasks()
TooManyTasks

which causes vdsm not to respond to any calls.

Restarting vdsm solves the issue.

--- Additional comment from Nir Soffer on 2017-03-02 16:06:29 CET ---

The issue is not an exhausted pool. The RPC executor has a limit on the number of
tasks in the queue. If you add tasks to the queue faster than vdsm can consume
them, you get the TooManyTasks error.
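
For illustration, the mechanism boils down to a bounded queue - a minimal
sketch (simplified, not the actual vdsm code):

    import Queue  # Python 2 stdlib, matching the python2.7 paths above

    class TooManyTasks(Exception):
        pass

    class TaskQueue(object):
        def __init__(self, max_tasks):
            # The bound is what protects vdsm from unbounded growth.
            self._queue = Queue.Queue(maxsize=max_tasks)

        def put(self, task):
            try:
                # Fail fast instead of blocking when workers cannot keep up.
                self._queue.put_nowait(task)
            except Queue.Full:
                raise TooManyTasks()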

In older versions of vdsm there was no limit: vdsm created a new thread for
each task. This would fail when vdsm had created about 700 threads, and was
not useful to anyone.

The jsonrpc layer should pass the TooManyTasks error to the engine, so the
engine can handle the error properly, maybe by not sending messages to this
host for some time.

We see timeouts in ioprocess operations:

    return self._iop.stat(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 484, in stat
    resdict = self._sendCommand("stat", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 455, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out

And timeouts in libvirt/qemu:

2017-02-25 03:34:24,917 WARN  (jsonrpc/2) [virt.vm] (vmId='5f56d327-8348-4fc0-867c-819348d96847') monitor became unresponsive (command timeout, age=70.5100000007) (vm:4893)

So it seems that the root cause is the underlying storage.

If you enable ioprocess debug logs, we can get more info on why the operations
time out.

Please try to set:

[logger_IOProcess]
level=DEBUG
handlers=logfile
qualname=IOProcess
propagate=0

Or use vdsm-client to change the log level for the current process:

    vdsm-client Host setLogLevel level=DEBUG
    vdsm-client Host setLogLevel level=DEBUG name=IOProcess

If this is an issue with storage, we should also see timeouts in sanlock log and
in /var/log/messages. Please attach these logs.

--- Additional comment from Francesco Romani on 2017-03-02 16:28:04 CET ---

(In reply to Nir Soffer from comment #1)
> The jsonrpc layer should pass the TooManyTasks error to engine, so engine can
> handle the error properly, maybe by stopping sending messages to this host
> for some time.

I agree, Engine should learn about this.


> We see timeouts in ioprocess operations:
> 
>     return self._iop.stat(path)
>   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 484,
> in stat
>     resdict = self._sendCommand("stat", {"path": path}, self.timeout)
>   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 455,
> in _sendCommand
>     raise Timeout(os.strerror(errno.ETIMEDOUT))
> Timeout: Connection timed out
> 
> And timeouts in libvirt/qemu:
> 
> 2017-02-25 03:34:24,917 WARN  (jsonrpc/2) [virt.vm]
> (vmId='5f56d327-8348-4fc0-867c-819348d96847') monitor became unresponsive
> (command timeout, age=70.5100000007) (vm:4893)

Yes, most likely the same root cause.

CC'ing myself to this BZ.

--- Additional comment from Piotr Kliczewski on 2017-03-02 16:35 CET ---



--- Additional comment from Piotr Kliczewski on 2017-03-02 16:38:36 CET ---

(In reply to Francesco Romani from comment #2)
> (In reply to Nir Soffer from comment #1)
> > The jsonrpc layer should pass the TooManyTasks error to engine, so engine can
> > handle the error properly, maybe by stopping sending messages to this host
> > for some time.
> 
> I agree, Engine should learn about this.
> 

We do not have an error code for this exception. How do we expect an old engine to work with new code?

The only value of this change is to have a proper message which would allow us to
debug it more easily. We can't expect older engines to react to it.

> 
> > We see timeouts in ioprocess operations:
> > 
> >     return self._iop.stat(path)
> >   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 484,
> > in stat
> >     resdict = self._sendCommand("stat", {"path": path}, self.timeout)
> >   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 455,
> > in _sendCommand
> >     raise Timeout(os.strerror(errno.ETIMEDOUT))
> > Timeout: Connection timed out
> > 
> > And timeouts in libvirt/qemu:
> > 
> > 2017-02-25 03:34:24,917 WARN  (jsonrpc/2) [virt.vm]
> > (vmId='5f56d327-8348-4fc0-867c-819348d96847') monitor became unresponsive
> > (command timeout, age=70.5100000007) (vm:4893)
> 
> Yes, most likely same root cause
> 
> CC'ing myself to this BZ.

--- Additional comment from Francesco Romani on 2017-03-02 16:58:51 CET ---

(In reply to Piotr Kliczewski from comment #5)
> (In reply to Francesco Romani from comment #2)
> > (In reply to Nir Soffer from comment #1)
> > > The jsonrpc layer should pass the TooManyTasks error to engine, so engine can
> > > handle the error properly, maybe by stopping sending messages to this host
> > > for some time.
> > 
> > I agree, Engine should learn about this.
> > 
> 
> We do not have an error code for this exception. How do we expect old engine
> work with new code?
> 
> The only value of this change is to have proper message which would allow us
> to
> debug it easier. We can't expect older engines to react on it.

Surely an old Engine can't do anything, but a new Engine (4.2?) could. But I don't want to hijack this bug; this is, I believe, a low-priority/low-benefit change which can be discussed separately.

--- Additional comment from Piotr Kliczewski on 2017-03-02 17:41:38 CET ---

(In reply to Francesco Romani from comment #6)
> (In reply to Piotr Kliczewski from comment #5)
> > (In reply to Francesco Romani from comment #2)
> > > (In reply to Nir Soffer from comment #1)
> > > > The jsonrpc layer should pass the TooManyTasks error to engine, so engine can
> > > > handle the error properly, maybe by stopping sending messages to this host
> > > > for some time.
> > > 
> > > I agree, Engine should learn about this.
> > > 
> > 
> > We do not have an error code for this exception. How do we expect old engine
> > work with new code?
> > 
> > The only value of this change is to have proper message which would allow us
> > to
> > debug it easier. We can't expect older engines to react on it.
> 
> Surely old Engine can't do anything, but new Engine (4.2?) could. But I
> don't want to hijack this bug, this is I believe a low priority/benefit
> change which can be discussed separately.

I agree that for newer engines we can improve it. I will push patches to fix it.

--- Additional comment from Nir Soffer on 2017-03-02 23:48:06 CET ---

Looking at the sanlock log, we see renewal timeouts at the same time,
so this looks like an actual storage issue.

Do we have info from netapp-vserver-rhev.lab.eng.brq.redhat.com
for the same time period?

2017-02-25 01:54:37+0100 2881152 [30909]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 ioto 10 to_count 5
2017-02-25 01:54:37+0100 2881152 [30909]: s58 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 01:54:37+0100 2881152 [30909]: s58 renewal error -202 delta_length 10 last_success 2881122
2017-02-25 01:54:58+0100 2881173 [30909]: 310b5348 aio timeout RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 ioto 10 to_count 6
2017-02-25 01:54:58+0100 2881173 [30909]: s58 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 01:54:58+0100 2881173 [30909]: s58 renewal error -202 delta_length 20 last_success 2881122
2017-02-25 01:55:07+0100 2881182 [982]: s58 check_our_lease warning 60 last_success 2881122
2017-02-25 01:55:08+0100 2881183 [982]: s58 check_our_lease warning 61 last_success 2881122
2017-02-25 01:55:09+0100 2881184 [982]: s58 check_our_lease warning 62 last_success 2881122
2017-02-25 01:55:10+0100 2881185 [982]: s58 check_our_lease warning 63 last_success 2881122
2017-02-25 01:55:11+0100 2881186 [982]: s58 check_our_lease warning 64 last_success 2881122
2017-02-25 01:55:12+0100 2881187 [982]: s58 check_our_lease warning 65 last_success 2881122
2017-02-25 01:55:13+0100 2881188 [30909]: 310b5348 aio collect RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 result 1048576:0 other free
2017-02-25 01:55:13+0100 2881188 [30909]: 310b5348 aio collect RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 result 1048576:0 other free
2017-02-25 01:55:13+0100 2881188 [982]: s58 check_our_lease warning 66 last_success 2881122
2017-02-25 01:56:04+0100 2881239 [30909]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552303000 ioto 10 to_count 7
2017-02-25 01:56:04+0100 2881239 [30909]: s58 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 01:56:04+0100 2881239 [30909]: s58 renewal error -202 delta_length 10 last_success 2881209
2017-02-25 01:56:10+0100 2881245 [30909]: 310b5348 aio collect RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552303000 result 1048576:0 match reap
2017-02-25 03:29:26+0100 2886841 [30909]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552303000 ioto 10 to_count 8
2017-02-25 03:29:26+0100 2886841 [30909]: s58 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 03:29:26+0100 2886841 [30909]: s58 renewal error -202 delta_length 10 last_success 2886811
2017-02-25 03:29:46+0100 2886862 [30909]: 310b5348 aio timeout RD 0x7fe53c000910:0x7fe53c000920:0x7fe556d09000 ioto 10 to_count 9
2017-02-25 03:29:46+0100 2886862 [30909]: s58 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 03:29:46+0100 2886862 [30909]: s58 renewal error -202 delta_length 20 last_success 2886811
2017-02-25 03:29:55+0100 2886871 [982]: s58 check_our_lease warning 60 last_success 2886811
2017-02-25 03:29:56+0100 2886872 [982]: s58 check_our_lease warning 61 last_success 2886811
2017-02-25 03:29:57+0100 2886873 [982]: s58 check_our_lease warning 62 last_success 2886811
2017-02-25 03:29:58+0100 2886874 [982]: s58 check_our_lease warning 63 last_success 2886811
2017-02-25 03:29:59+0100 2886875 [982]: s58 check_our_lease warning 64 last_success 2886811
2017-02-25 03:30:00+0100 2886876 [982]: s58 check_our_lease warning 65 last_success 2886811
2017-02-25 03:30:01+0100 2886877 [982]: s58 check_our_lease warning 66 last_success 2886811
2017-02-25 03:30:02+0100 2886878 [982]: s58 check_our_lease warning 67 last_success 2886811
2017-02-25 03:30:03+0100 2886879 [982]: s58 check_our_lease warning 68 last_success 2886811
2017-02-25 03:30:04+0100 2886880 [982]: s58 check_our_lease warning 69 last_success 2886811
2017-02-25 03:30:05+0100 2886881 [982]: s58 check_our_lease warning 70 last_success 2886811
2017-02-25 03:30:06+0100 2886882 [982]: s58 check_our_lease warning 71 last_success 2886811
2017-02-25 03:30:07+0100 2886882 [30909]: 310b5348 aio timeout RD 0x7fe53c000960:0x7fe53c000970:0x7fe552d07000 ioto 10 to_count 10
2017-02-25 03:30:07+0100 2886882 [30909]: s58 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 03:30:07+0100 2886882 [30909]: s58 renewal error -202 delta_length 20 last_success 2886811
2017-02-25 03:30:07+0100 2886883 [982]: s58 check_our_lease warning 72 last_success 2886811
2017-02-25 03:30:08+0100 2886884 [982]: s58 check_our_lease warning 73 last_success 2886811
2017-02-25 03:30:09+0100 2886885 [982]: s58 check_our_lease warning 74 last_success 2886811
2017-02-25 03:30:10+0100 2886886 [982]: s58 check_our_lease warning 75 last_success 2886811
2017-02-25 03:30:11+0100 2886887 [982]: s58 check_our_lease warning 76 last_success 2886811
2017-02-25 03:30:12+0100 2886888 [982]: s58 check_our_lease warning 77 last_success 2886811
2017-02-25 03:30:13+0100 2886889 [982]: s58 check_our_lease warning 78 last_success 2886811
2017-02-25 03:30:14+0100 2886890 [982]: s58 check_our_lease warning 79 last_success 2886811
2017-02-25 03:30:15+0100 2886891 [982]: s58 check_our_lease failed 80
2017-02-25 03:30:16+0100 2886891 [982]: s58 all pids clear
2017-02-25 03:30:27+0100 2886903 [30909]: 310b5348 aio timeout RD 0x7fe53c0009b0:0x7fe53c0009c0:0x7fe552201000 ioto 10 to_count 11
2017-02-25 03:30:27+0100 2886903 [30909]: s58 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 03:30:27+0100 2886903 [30909]: s58 renewal error -202 delta_length 20 last_success 2886811
2017-02-25 03:30:47+0100 2886922 [30909]: 310b5348 aio collect 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552303000 result 1048576:0 close free
2017-02-25 03:30:47+0100 2886922 [30909]: 310b5348 aio collect 0x7fe53c000910:0x7fe53c000920:0x7fe556d09000 result 1048576:0 close free
2017-02-25 03:30:47+0100 2886922 [30909]: 310b5348 aio collect 0x7fe53c000960:0x7fe53c000970:0x7fe552d07000 result 1048576:0 close free
2017-02-25 03:30:47+0100 2886922 [30909]: 310b5348 aio collect 0x7fe53c0009b0:0x7fe53c0009c0:0x7fe552201000 result 1048576:0 close free
2017-02-25 03:30:58+0100 2886933 [991]: s59 lockspace 310b5348-f034-479d-b208-37e6a72bc21f:3:/rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids:0
2017-02-25 03:31:20+0100 2886955 [982]: s59 host 1 147 2874109 46cab34d-7520-431a-8cdc-d6e9d8ac373e.dell-r420-
2017-02-25 03:31:20+0100 2886955 [982]: s59 host 2 306 2876536 44c5b0d0-8c19-4bd9-8c13-2d7fe413d5cd.dell-r420-
2017-02-25 03:31:20+0100 2886955 [982]: s59 host 3 329 2886933 786be63c-2902-4daf-807e-2f772711bd8f.hp-dl360g8
2017-02-25 03:31:20+0100 2886955 [982]: s59 host 4 234 2883433 d1ce5e77-0d14-49a7-bc13-734bcd836a9f.hp-dl360g8
2017-02-25 03:31:20+0100 2886955 [982]: s59 host 250 1 0 2439df30-8403-4e53-8e2b-08132df86445.hp-dl360g8
2017-02-25 03:33:31+0100 2887087 [30639]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 ioto 10 to_count 1
2017-02-25 03:33:31+0100 2887087 [30639]: s59 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 03:33:31+0100 2887087 [30639]: s59 renewal error -202 delta_length 10 last_success 2887056
2017-02-25 03:33:52+0100 2887107 [30639]: 310b5348 aio timeout RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 ioto 10 to_count 2
2017-02-25 03:33:52+0100 2887107 [30639]: s59 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 03:33:52+0100 2887107 [30639]: s59 renewal error -202 delta_length 20 last_success 2887056
2017-02-25 03:34:01+0100 2887116 [982]: s59 check_our_lease warning 60 last_success 2887056
2017-02-25 03:34:02+0100 2887117 [982]: s59 check_our_lease warning 61 last_success 2887056
2017-02-25 03:34:03+0100 2887118 [982]: s59 check_our_lease warning 62 last_success 2887056
2017-02-25 03:34:04+0100 2887119 [982]: s59 check_our_lease warning 63 last_success 2887056
2017-02-25 03:34:05+0100 2887120 [982]: s59 check_our_lease warning 64 last_success 2887056
2017-02-25 03:34:06+0100 2887121 [982]: s59 check_our_lease warning 65 last_success 2887056
2017-02-25 03:34:07+0100 2887122 [982]: s59 check_our_lease warning 66 last_success 2887056
2017-02-25 03:34:08+0100 2887123 [982]: s59 check_our_lease warning 67 last_success 2887056
2017-02-25 03:34:09+0100 2887124 [982]: s59 check_our_lease warning 68 last_success 2887056
2017-02-25 03:34:10+0100 2887125 [982]: s59 check_our_lease warning 69 last_success 2887056
2017-02-25 03:34:11+0100 2887126 [982]: s59 check_our_lease warning 70 last_success 2887056
2017-02-25 03:34:12+0100 2887127 [982]: s59 check_our_lease warning 71 last_success 2887056
2017-02-25 03:34:12+0100 2887128 [30639]: 310b5348 aio timeout RD 0x7fe53c000960:0x7fe53c000970:0x7fe552303000 ioto 10 to_count 3
2017-02-25 03:34:12+0100 2887128 [30639]: s59 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 03:34:12+0100 2887128 [30639]: s59 renewal error -202 delta_length 20 last_success 2887056
2017-02-25 03:34:13+0100 2887128 [982]: s59 check_our_lease warning 72 last_success 2887056
2017-02-25 03:34:14+0100 2887129 [982]: s59 check_our_lease warning 73 last_success 2887056
2017-02-25 03:34:15+0100 2887130 [982]: s59 check_our_lease warning 74 last_success 2887056
2017-02-25 03:34:16+0100 2887131 [982]: s59 check_our_lease warning 75 last_success 2887056
2017-02-25 03:34:17+0100 2887132 [982]: s59 check_our_lease warning 76 last_success 2887056
2017-02-25 03:34:18+0100 2887133 [982]: s59 check_our_lease warning 77 last_success 2887056
2017-02-25 03:34:19+0100 2887134 [982]: s59 check_our_lease warning 78 last_success 2887056
2017-02-25 03:34:20+0100 2887135 [982]: s59 check_our_lease warning 79 last_success 2887056
2017-02-25 03:34:21+0100 2887136 [982]: s59 check_our_lease failed 80
2017-02-25 03:34:21+0100 2887136 [982]: s59 all pids clear
2017-02-25 03:34:25+0100 2887140 [30639]: 310b5348 aio collect RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 result 1048576:0 other free
2017-02-25 03:34:25+0100 2887140 [30639]: 310b5348 aio collect RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 result 1048576:0 other free
2017-02-25 03:34:25+0100 2887140 [30639]: 310b5348 aio collect RD 0x7fe53c000960:0x7fe53c000970:0x7fe552303000 result 1048576:0 other free
2017-02-25 03:34:35+0100 2887150 [991]: s60 lockspace 310b5348-f034-479d-b208-37e6a72bc21f:3:/rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids:0
2017-02-25 03:34:59+0100 2887174 [982]: s60 host 1 147 2874335 46cab34d-7520-431a-8cdc-d6e9d8ac373e.dell-r420-
2017-02-25 03:34:59+0100 2887174 [982]: s60 host 2 306 2876762 44c5b0d0-8c19-4bd9-8c13-2d7fe413d5cd.dell-r420-
2017-02-25 03:34:59+0100 2887174 [982]: s60 host 3 330 2887150 786be63c-2902-4daf-807e-2f772711bd8f.hp-dl360g8
2017-02-25 03:34:59+0100 2887174 [982]: s60 host 4 234 2883659 d1ce5e77-0d14-49a7-bc13-734bcd836a9f.hp-dl360g8
2017-02-25 03:34:59+0100 2887174 [982]: s60 host 250 1 0 2439df30-8403-4e53-8e2b-08132df86445.hp-dl360g8
2017-02-25 04:31:57+0100 2890592 [36739]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 ioto 10 to_count 1
2017-02-25 04:31:57+0100 2890592 [36739]: s60 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 04:31:57+0100 2890592 [36739]: s60 renewal error -202 delta_length 10 last_success 2890562
2017-02-25 04:32:18+0100 2890613 [36739]: 310b5348 aio timeout RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 ioto 10 to_count 2
2017-02-25 04:32:18+0100 2890613 [36739]: s60 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 04:32:18+0100 2890613 [36739]: s60 renewal error -202 delta_length 20 last_success 2890562
2017-02-25 04:32:27+0100 2890622 [982]: s60 check_our_lease warning 60 last_success 2890562
2017-02-25 04:32:28+0100 2890623 [982]: s60 check_our_lease warning 61 last_success 2890562
2017-02-25 04:32:29+0100 2890624 [982]: s60 check_our_lease warning 62 last_success 2890562
2017-02-25 04:32:30+0100 2890625 [982]: s60 check_our_lease warning 63 last_success 2890562
2017-02-25 04:32:31+0100 2890626 [982]: s60 check_our_lease warning 64 last_success 2890562
2017-02-25 04:32:32+0100 2890627 [982]: s60 check_our_lease warning 65 last_success 2890562
2017-02-25 04:32:33+0100 2890628 [982]: s60 check_our_lease warning 66 last_success 2890562
2017-02-25 04:32:34+0100 2890629 [982]: s60 check_our_lease warning 67 last_success 2890562
2017-02-25 04:32:35+0100 2890630 [982]: s60 check_our_lease warning 68 last_success 2890562
2017-02-25 04:32:36+0100 2890631 [982]: s60 check_our_lease warning 69 last_success 2890562
2017-02-25 04:32:37+0100 2890632 [982]: s60 check_our_lease warning 70 last_success 2890562
2017-02-25 04:32:38+0100 2890633 [982]: s60 check_our_lease warning 71 last_success 2890562
2017-02-25 04:32:38+0100 2890633 [36739]: 310b5348 aio timeout RD 0x7fe53c000960:0x7fe53c000970:0x7fe552303000 ioto 10 to_count 3
2017-02-25 04:32:38+0100 2890633 [36739]: s60 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 04:32:38+0100 2890633 [36739]: s60 renewal error -202 delta_length 20 last_success 2890562
2017-02-25 04:32:39+0100 2890634 [982]: s60 check_our_lease warning 72 last_success 2890562
2017-02-25 04:32:40+0100 2890635 [982]: s60 check_our_lease warning 73 last_success 2890562
2017-02-25 04:32:41+0100 2890636 [982]: s60 check_our_lease warning 74 last_success 2890562
2017-02-25 04:32:42+0100 2890637 [982]: s60 check_our_lease warning 75 last_success 2890562
2017-02-25 04:32:43+0100 2890638 [982]: s60 check_our_lease warning 76 last_success 2890562
2017-02-25 04:32:44+0100 2890639 [982]: s60 check_our_lease warning 77 last_success 2890562
2017-02-25 04:32:45+0100 2890640 [982]: s60 check_our_lease warning 78 last_success 2890562
2017-02-25 04:32:46+0100 2890641 [982]: s60 check_our_lease warning 79 last_success 2890562
2017-02-25 04:32:47+0100 2890642 [982]: s60 check_our_lease failed 80
2017-02-25 04:32:47+0100 2890642 [982]: s60 all pids clear
2017-02-25 04:32:48+0100 2890644 [36739]: 310b5348 aio collect RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 result 1048576:0 other free r
2017-02-25 04:32:48+0100 2890644 [36739]: 310b5348 aio collect RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 result 1048576:0 other free r
2017-02-25 04:32:48+0100 2890644 [36739]: 310b5348 aio collect RD 0x7fe53c000960:0x7fe53c000970:0x7fe552303000 result 1048576:0 match reap
2017-02-25 04:32:58+0100 2890654 [992]: s61 lockspace 310b5348-f034-479d-b208-37e6a72bc21f:3:/rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids:0
2017-02-25 04:33:20+0100 2890675 [982]: s61 host 1 147 2877844 46cab34d-7520-431a-8cdc-d6e9d8ac373e.dell-r420-
2017-02-25 04:33:20+0100 2890675 [982]: s61 host 2 306 2880254 44c5b0d0-8c19-4bd9-8c13-2d7fe413d5cd.dell-r420-
2017-02-25 04:33:20+0100 2890675 [982]: s61 host 3 331 2890654 786be63c-2902-4daf-807e-2f772711bd8f.hp-dl360g8
2017-02-25 04:33:20+0100 2890675 [982]: s61 host 4 234 2887164 d1ce5e77-0d14-49a7-bc13-734bcd836a9f.hp-dl360g8
2017-02-25 04:33:20+0100 2890675 [982]: s61 host 250 1 0 2439df30-8403-4e53-8e2b-08132df86445.hp-dl360g8
2017-02-25 04:33:49+0100 2890705 [13507]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 ioto 10 to_count 1
2017-02-25 04:33:49+0100 2890705 [13507]: s61 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 04:33:49+0100 2890705 [13507]: s61 renewal error -202 delta_length 10 last_success 2890674
2017-02-25 04:34:10+0100 2890725 [13507]: 310b5348 aio timeout RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 ioto 10 to_count 2
2017-02-25 04:34:10+0100 2890725 [13507]: s61 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 04:34:10+0100 2890725 [13507]: s61 renewal error -202 delta_length 20 last_success 2890674
2017-02-25 04:34:19+0100 2890734 [982]: s61 check_our_lease warning 60 last_success 2890674
2017-02-25 04:34:20+0100 2890735 [982]: s61 check_our_lease warning 61 last_success 2890674
2017-02-25 04:34:21+0100 2890736 [982]: s61 check_our_lease warning 62 last_success 2890674
2017-02-25 04:34:22+0100 2890737 [982]: s61 check_our_lease warning 63 last_success 2890674
2017-02-25 04:34:23+0100 2890738 [982]: s61 check_our_lease warning 64 last_success 2890674
2017-02-25 04:34:24+0100 2890739 [982]: s61 check_our_lease warning 65 last_success 2890674
2017-02-25 04:34:25+0100 2890740 [982]: s61 check_our_lease warning 66 last_success 2890674
2017-02-25 04:34:26+0100 2890741 [982]: s61 check_our_lease warning 67 last_success 2890674
2017-02-25 04:34:27+0100 2890742 [982]: s61 check_our_lease warning 68 last_success 2890674
2017-02-25 04:34:28+0100 2890743 [982]: s61 check_our_lease warning 69 last_success 2890674
2017-02-25 04:34:29+0100 2890744 [982]: s61 check_our_lease warning 70 last_success 2890674
2017-02-25 04:34:30+0100 2890745 [982]: s61 check_our_lease warning 71 last_success 2890674
2017-02-25 04:34:30+0100 2890746 [13507]: 310b5348 aio timeout RD 0x7fe53c000960:0x7fe53c000970:0x7fe552303000 ioto 10 to_count 3
2017-02-25 04:34:30+0100 2890746 [13507]: s61 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 04:34:30+0100 2890746 [13507]: s61 renewal error -202 delta_length 20 last_success 2890674
2017-02-25 04:34:31+0100 2890746 [982]: s61 check_our_lease warning 72 last_success 2890674
2017-02-25 04:34:32+0100 2890747 [982]: s61 check_our_lease warning 73 last_success 2890674
2017-02-25 04:34:33+0100 2890748 [982]: s61 check_our_lease warning 74 last_success 2890674
2017-02-25 04:34:34+0100 2890749 [982]: s61 check_our_lease warning 75 last_success 2890674
2017-02-25 04:34:35+0100 2890750 [982]: s61 check_our_lease warning 76 last_success 2890674
2017-02-25 04:34:36+0100 2890751 [982]: s61 check_our_lease warning 77 last_success 2890674
2017-02-25 04:34:37+0100 2890752 [982]: s61 check_our_lease warning 78 last_success 2890674
2017-02-25 04:34:38+0100 2890753 [982]: s61 check_our_lease warning 79 last_success 2890674
2017-02-25 04:34:39+0100 2890754 [982]: s61 check_our_lease failed 80
2017-02-25 04:34:39+0100 2890755 [982]: s61 all pids clear
2017-02-25 04:34:51+0100 2890766 [13507]: 310b5348 aio timeout RD 0x7fe53c0009b0:0x7fe53c0009c0:0x7fe552201000 ioto 10 to_count 4
2017-02-25 04:34:51+0100 2890766 [13507]: s61 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 04:34:51+0100 2890766 [13507]: s61 renewal error -202 delta_length 20 last_success 2890674
2017-02-25 04:35:13+0100 2890788 [13507]: 310b5348 aio collect 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 result 1048576:0 close free
2017-02-25 04:35:13+0100 2890788 [13507]: 310b5348 aio collect 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 result 1048576:0 close free
2017-02-25 04:35:13+0100 2890788 [13507]: 310b5348 aio collect 0x7fe53c000960:0x7fe53c000970:0x7fe552303000 result 1048576:0 close free
2017-02-25 04:35:13+0100 2890788 [13507]: 310b5348 aio collect 0x7fe53c0009b0:0x7fe53c0009c0:0x7fe552201000 result 1048576:0 close free
2017-02-25 04:35:23+0100 2890798 [992]: s62 lockspace 310b5348-f034-479d-b208-37e6a72bc21f:3:/rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids:0
2017-02-25 04:35:44+0100 2890820 [982]: s62 host 1 147 2877987 46cab34d-7520-431a-8cdc-d6e9d8ac373e.dell-r420-
2017-02-25 04:35:44+0100 2890820 [982]: s62 host 2 306 2880398 44c5b0d0-8c19-4bd9-8c13-2d7fe413d5cd.dell-r420-
2017-02-25 04:35:44+0100 2890820 [982]: s62 host 3 332 2890798 786be63c-2902-4daf-807e-2f772711bd8f.hp-dl360g8
2017-02-25 04:35:44+0100 2890820 [982]: s62 host 4 234 2887307 d1ce5e77-0d14-49a7-bc13-734bcd836a9f.hp-dl360g8
2017-02-25 04:35:44+0100 2890820 [982]: s62 host 250 1 0 2439df30-8403-4e53-8e2b-08132df86445.hp-dl360g8
2017-02-25 05:14:51+0100 2893166 [15983]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 ioto 10 to_count 1
2017-02-25 05:14:51+0100 2893166 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 05:14:51+0100 2893166 [15983]: s62 renewal error -202 delta_length 10 last_success 2893136
2017-02-25 05:15:12+0100 2893187 [15983]: 310b5348 aio timeout RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 ioto 10 to_count 2
2017-02-25 05:15:12+0100 2893187 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 05:15:12+0100 2893187 [15983]: s62 renewal error -202 delta_length 20 last_success 2893136
2017-02-25 05:15:21+0100 2893196 [982]: s62 check_our_lease warning 60 last_success 2893136
2017-02-25 05:15:22+0100 2893197 [982]: s62 check_our_lease warning 61 last_success 2893136
2017-02-25 05:15:23+0100 2893198 [982]: s62 check_our_lease warning 62 last_success 2893136
2017-02-25 05:15:24+0100 2893199 [982]: s62 check_our_lease warning 63 last_success 2893136
2017-02-25 05:15:25+0100 2893200 [982]: s62 check_our_lease warning 64 last_success 2893136
2017-02-25 05:15:26+0100 2893201 [982]: s62 check_our_lease warning 65 last_success 2893136
2017-02-25 05:15:27+0100 2893202 [15983]: 310b5348 aio collect RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 result 1048576:0 other free
2017-02-25 05:15:27+0100 2893202 [15983]: 310b5348 aio collect RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 result 1048576:0 other free
2017-02-25 05:16:38+0100 2893273 [15983]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552303000 ioto 10 to_count 3
2017-02-25 05:16:38+0100 2893273 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 05:16:38+0100 2893273 [15983]: s62 renewal error -202 delta_length 10 last_success 2893243
2017-02-25 05:16:41+0100 2893277 [15983]: 310b5348 aio collect RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552303000 result 1048576:0 match reap
2017-02-25 05:18:13+0100 2893369 [15983]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552303000 ioto 10 to_count 4
2017-02-25 05:18:13+0100 2893369 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 05:18:13+0100 2893369 [15983]: s62 renewal error -202 delta_length 10 last_success 2893338
2017-02-25 05:18:34+0100 2893389 [15983]: 310b5348 aio timeout RD 0x7fe53c000910:0x7fe53c000920:0x7fe556d09000 ioto 10 to_count 5
2017-02-25 05:18:34+0100 2893389 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 05:18:34+0100 2893389 [15983]: s62 renewal error -202 delta_length 20 last_success 2893338
2017-02-25 05:18:43+0100 2893398 [982]: s62 check_our_lease warning 60 last_success 2893338
2017-02-25 05:18:44+0100 2893399 [982]: s62 check_our_lease warning 61 last_success 2893338
2017-02-25 05:18:45+0100 2893400 [982]: s62 check_our_lease warning 62 last_success 2893338
2017-02-25 05:18:46+0100 2893401 [982]: s62 check_our_lease warning 63 last_success 2893338
2017-02-25 05:18:47+0100 2893402 [982]: s62 check_our_lease warning 64 last_success 2893338
2017-02-25 05:18:48+0100 2893403 [982]: s62 check_our_lease warning 65 last_success 2893338
2017-02-25 05:18:49+0100 2893404 [982]: s62 check_our_lease warning 66 last_success 2893338
2017-02-25 05:18:50+0100 2893405 [982]: s62 check_our_lease warning 67 last_success 2893338
2017-02-25 05:18:51+0100 2893406 [982]: s62 check_our_lease warning 68 last_success 2893338
2017-02-25 05:18:52+0100 2893407 [982]: s62 check_our_lease warning 69 last_success 2893338
2017-02-25 05:18:53+0100 2893408 [982]: s62 check_our_lease warning 70 last_success 2893338
2017-02-25 05:18:54+0100 2893409 [15983]: 310b5348 aio collect RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552303000 result 1048576:0 other free
2017-02-25 05:18:54+0100 2893409 [982]: s62 check_our_lease warning 71 last_success 2893338
2017-02-25 05:18:54+0100 2893409 [15983]: 310b5348 aio collect RD 0x7fe53c000910:0x7fe53c000920:0x7fe556d09000 result 1048576:0 other free
2017-02-25 05:38:14+0100 2894569 [15983]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552d07000 ioto 10 to_count 6
2017-02-25 05:38:14+0100 2894569 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 05:38:14+0100 2894569 [15983]: s62 renewal error -202 delta_length 10 last_success 2894538
2017-02-25 05:38:34+0100 2894589 [15983]: 310b5348 aio timeout RD 0x7fe53c000910:0x7fe53c000920:0x7fe556d09000 ioto 10 to_count 7
2017-02-25 05:38:34+0100 2894589 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 05:38:34+0100 2894589 [15983]: s62 renewal error -202 delta_length 20 last_success 2894538
2017-02-25 05:38:42+0100 2894598 [982]: s62 check_our_lease warning 60 last_success 2894538
2017-02-25 05:38:43+0100 2894599 [982]: s62 check_our_lease warning 61 last_success 2894538
2017-02-25 05:38:44+0100 2894600 [982]: s62 check_our_lease warning 62 last_success 2894538
2017-02-25 05:38:45+0100 2894601 [982]: s62 check_our_lease warning 63 last_success 2894538
2017-02-25 05:38:46+0100 2894602 [982]: s62 check_our_lease warning 64 last_success 2894538
2017-02-25 05:38:47+0100 2894603 [982]: s62 check_our_lease warning 65 last_success 2894538
2017-02-25 05:38:48+0100 2894604 [982]: s62 check_our_lease warning 66 last_success 2894538
2017-02-25 05:38:49+0100 2894605 [982]: s62 check_our_lease warning 67 last_success 2894538
2017-02-25 05:38:50+0100 2894606 [982]: s62 check_our_lease warning 68 last_success 2894538
2017-02-25 05:38:51+0100 2894607 [982]: s62 check_our_lease warning 69 last_success 2894538
2017-02-25 05:38:52+0100 2894608 [982]: s62 check_our_lease warning 70 last_success 2894538
2017-02-25 05:38:53+0100 2894609 [982]: s62 check_our_lease warning 71 last_success 2894538
2017-02-25 05:38:54+0100 2894610 [982]: s62 check_our_lease warning 72 last_success 2894538
2017-02-25 05:38:55+0100 2894610 [15983]: 310b5348 aio timeout RD 0x7fe53c000960:0x7fe53c000970:0x7fe552303000 ioto 10 to_count 8
2017-02-25 05:38:55+0100 2894610 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 05:38:55+0100 2894610 [15983]: s62 renewal error -202 delta_length 20 last_success 2894538
2017-02-25 05:38:55+0100 2894611 [982]: s62 check_our_lease warning 73 last_success 2894538
2017-02-25 05:38:56+0100 2894612 [982]: s62 check_our_lease warning 74 last_success 2894538
2017-02-25 05:38:57+0100 2894613 [982]: s62 check_our_lease warning 75 last_success 2894538
2017-02-25 05:38:58+0100 2894614 [982]: s62 check_our_lease warning 76 last_success 2894538
2017-02-25 05:38:59+0100 2894615 [982]: s62 check_our_lease warning 77 last_success 2894538
2017-02-25 05:39:00+0100 2894616 [982]: s62 check_our_lease warning 78 last_success 2894538
2017-02-25 05:39:01+0100 2894616 [15983]: 310b5348 aio collect RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552d07000 result 1048576:0 other free r
2017-02-25 05:39:01+0100 2894616 [15983]: 310b5348 aio collect RD 0x7fe53c000910:0x7fe53c000920:0x7fe556d09000 result 1048576:0 other free r
2017-02-25 05:39:01+0100 2894616 [15983]: 310b5348 aio collect RD 0x7fe53c000960:0x7fe53c000970:0x7fe552303000 result 1048576:0 match reap
2017-02-25 06:19:10+0100 2897026 [15983]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552303000 ioto 10 to_count 9
2017-02-25 06:19:10+0100 2897026 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 06:19:10+0100 2897026 [15983]: s62 renewal error -202 delta_length 10 last_success 2896995
2017-02-25 06:19:31+0100 2897046 [15983]: 310b5348 aio timeout RD 0x7fe53c000910:0x7fe53c000920:0x7fe556d09000 ioto 10 to_count 10
2017-02-25 06:19:31+0100 2897046 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 06:19:31+0100 2897046 [15983]: s62 renewal error -202 delta_length 20 last_success 2896995
2017-02-25 06:19:40+0100 2897055 [982]: s62 check_our_lease warning 60 last_success 2896995
2017-02-25 06:19:41+0100 2897056 [982]: s62 check_our_lease warning 61 last_success 2896995
2017-02-25 06:19:42+0100 2897057 [982]: s62 check_our_lease warning 62 last_success 2896995
2017-02-25 06:19:43+0100 2897058 [982]: s62 check_our_lease warning 63 last_success 2896995
2017-02-25 06:19:44+0100 2897059 [982]: s62 check_our_lease warning 64 last_success 2896995
2017-02-25 06:19:45+0100 2897060 [982]: s62 check_our_lease warning 65 last_success 2896995
2017-02-25 06:19:46+0100 2897061 [982]: s62 check_our_lease warning 66 last_success 2896995
2017-02-25 06:19:47+0100 2897062 [982]: s62 check_our_lease warning 67 last_success 2896995
2017-02-25 06:19:48+0100 2897063 [982]: s62 check_our_lease warning 68 last_success 2896995
2017-02-25 06:19:49+0100 2897064 [982]: s62 check_our_lease warning 69 last_success 2896995
2017-02-25 06:19:50+0100 2897065 [982]: s62 check_our_lease warning 70 last_success 2896995
2017-02-25 06:19:51+0100 2897066 [982]: s62 check_our_lease warning 71 last_success 2896995
2017-02-25 06:19:51+0100 2897067 [15983]: 310b5348 aio timeout RD 0x7fe53c000960:0x7fe53c000970:0x7fe552d07000 ioto 10 to_count 11
2017-02-25 06:19:51+0100 2897067 [15983]: s62 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 06:19:51+0100 2897067 [15983]: s62 renewal error -202 delta_length 20 last_success 2896995
2017-02-25 06:19:52+0100 2897067 [982]: s62 check_our_lease warning 72 last_success 2896995
2017-02-25 06:19:53+0100 2897068 [982]: s62 check_our_lease warning 73 last_success 2896995
2017-02-25 06:19:54+0100 2897069 [982]: s62 check_our_lease warning 74 last_success 2896995
2017-02-25 06:19:55+0100 2897070 [982]: s62 check_our_lease warning 75 last_success 2896995
2017-02-25 06:19:56+0100 2897071 [982]: s62 check_our_lease warning 76 last_success 2896995
2017-02-25 06:19:57+0100 2897072 [982]: s62 check_our_lease warning 77 last_success 2896995
2017-02-25 06:19:58+0100 2897073 [982]: s62 check_our_lease warning 78 last_success 2896995
2017-02-25 06:19:59+0100 2897074 [982]: s62 check_our_lease warning 79 last_success 2896995
2017-02-25 06:20:00+0100 2897075 [982]: s62 check_our_lease failed 80
2017-02-25 06:20:00+0100 2897075 [15983]: 310b5348 aio collect RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe552303000 result 1048576:0 other free r
2017-02-25 06:20:00+0100 2897075 [15983]: 310b5348 aio collect RD 0x7fe53c000910:0x7fe53c000920:0x7fe556d09000 result 1048576:0 other free r
2017-02-25 06:20:00+0100 2897075 [15983]: 310b5348 aio collect RD 0x7fe53c000960:0x7fe53c000970:0x7fe552d07000 result 1048576:0 match reap
2017-02-25 06:20:00+0100 2897075 [982]: s62 all pids clear
2017-02-25 06:20:10+0100 2897086 [991]: s63 lockspace 310b5348-f034-479d-b208-37e6a72bc21f:3:/rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids:0
2017-02-25 06:20:32+0100 2897107 [982]: s63 host 1 147 2884267 46cab34d-7520-431a-8cdc-d6e9d8ac373e.dell-r420-
2017-02-25 06:20:32+0100 2897107 [982]: s63 host 2 306 2886699 44c5b0d0-8c19-4bd9-8c13-2d7fe413d5cd.dell-r420-
2017-02-25 06:20:32+0100 2897107 [982]: s63 host 3 333 2897086 786be63c-2902-4daf-807e-2f772711bd8f.hp-dl360g8
2017-02-25 06:20:32+0100 2897107 [982]: s63 host 4 234 2893587 d1ce5e77-0d14-49a7-bc13-734bcd836a9f.hp-dl360g8
2017-02-25 06:20:32+0100 2897107 [982]: s63 host 250 1 0 2439df30-8403-4e53-8e2b-08132df86445.hp-dl360g8
2017-02-25 08:48:20+0100 2905976 [11649]: 310b5348 aio timeout RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 ioto 10 to_count 1
2017-02-25 08:48:20+0100 2905976 [11649]: s63 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 08:48:20+0100 2905976 [11649]: s63 renewal error -202 delta_length 10 last_success 2905945
2017-02-25 08:48:41+0100 2905996 [11649]: 310b5348 aio timeout RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 ioto 10 to_count 2
2017-02-25 08:48:41+0100 2905996 [11649]: s63 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 08:48:41+0100 2905996 [11649]: s63 renewal error -202 delta_length 20 last_success 2905945
2017-02-25 08:48:50+0100 2906005 [982]: s63 check_our_lease warning 60 last_success 2905945
2017-02-25 08:48:51+0100 2906006 [982]: s63 check_our_lease warning 61 last_success 2905945
2017-02-25 08:48:52+0100 2906007 [982]: s63 check_our_lease warning 62 last_success 2905945
2017-02-25 08:48:53+0100 2906008 [982]: s63 check_our_lease warning 63 last_success 2905945
2017-02-25 08:48:54+0100 2906009 [982]: s63 check_our_lease warning 64 last_success 2905945
2017-02-25 08:48:55+0100 2906010 [982]: s63 check_our_lease warning 65 last_success 2905945
2017-02-25 08:48:56+0100 2906011 [982]: s63 check_our_lease warning 66 last_success 2905945
2017-02-25 08:48:57+0100 2906012 [982]: s63 check_our_lease warning 67 last_success 2905945
2017-02-25 08:48:58+0100 2906013 [982]: s63 check_our_lease warning 68 last_success 2905945
2017-02-25 08:48:59+0100 2906014 [982]: s63 check_our_lease warning 69 last_success 2905945
2017-02-25 08:49:00+0100 2906015 [982]: s63 check_our_lease warning 70 last_success 2905945
2017-02-25 08:49:01+0100 2906016 [982]: s63 check_our_lease warning 71 last_success 2905945
2017-02-25 08:49:01+0100 2906017 [11649]: 310b5348 aio timeout RD 0x7fe53c000960:0x7fe53c000970:0x7fe552303000 ioto 10 to_count 3
2017-02-25 08:49:01+0100 2906017 [11649]: s63 delta_renew read timeout 10 sec offset 0 /rhev/data-center/mnt/netapp-vserver-rhev.lab.eng.brq.redhat.com:_rhev0/310b5348-f034-479d-b208-37e6a72bc21f/dom_md/ids
2017-02-25 08:49:01+0100 2906017 [11649]: s63 renewal error -202 delta_length 20 last_success 2905945
2017-02-25 08:49:02+0100 2906017 [982]: s63 check_our_lease warning 72 last_success 2905945
2017-02-25 08:49:03+0100 2906018 [982]: s63 check_our_lease warning 73 last_success 2905945
2017-02-25 08:49:04+0100 2906019 [982]: s63 check_our_lease warning 74 last_success 2905945
2017-02-25 08:49:05+0100 2906020 [982]: s63 check_our_lease warning 75 last_success 2905945
2017-02-25 08:49:06+0100 2906021 [982]: s63 check_our_lease warning 76 last_success 2905945
2017-02-25 08:49:06+0100 2906021 [11649]: 310b5348 aio collect RD 0x7fe53c0008c0:0x7fe53c0008d0:0x7fe556d09000 result 1048576:0 other free r
2017-02-25 08:49:06+0100 2906021 [11649]: 310b5348 aio collect RD 0x7fe53c000910:0x7fe53c000920:0x7fe552d07000 result 1048576:0 other free r
2017-02-25 08:49:06+0100 2906021 [11649]: 310b5348 aio collect RD 0x7fe53c000960:0x7fe53c000970:0x7fe552303000 result 1048576:0 match reap

--- Additional comment from Yaniv Bronhaim on 2017-03-07 13:23:53 CET ---

Can you please elaborate on what happens when this storage issue occurs? You wrote in the bug description: "in this case which causes vdsm not to respond to any calls. Restart of the vdsm solves the issue."

Is the bug about the storage issue getting stuck, or about oVirt's treatment when such a storage issue occurs? Probably if the storage is stuck or non-responsive we need to trigger our fencing logic.
A timeout is something that ends; I guess that after the timeout expires, vdsm should be responsive again. How does the restart solve it if the storage is still not responsive? Maybe it could also be solved by just waiting for the timeout and triggering the operation again?

--- Additional comment from Yaniv Bronhaim on 2017-03-09 11:51:11 CET ---

After a long discussion over email, the resolution should be:
1. Add logging for incoming requests with a source description (IP or service name).
2. Wait for reproduction - Denis, please change the vdsm log level to DEBUG - it will help - and ping us (Piotr or me) when this happens, before restarting vdsmd.
3. Denis - check if you can also provide what Nir requested; maybe it will help to reproduce the issue offline.

--- Additional comment from Yaniv Bronhaim on 2017-03-09 11:51 CET ---



--- Additional comment from Yaniv Bronhaim on 2017-03-09 11:52 CET ---



--- Additional comment from Nir Soffer on 2017-03-09 13:28:53 CET ---

Basically, this bug is about storage issues causing queues to become full.

The noise this bug made shows that our logs make it too hard to understand
unresponsive storage issues.

Adding a log when starting a request that may block is a good idea - we usually
do this in storage - but it is not enough. What we need is an active system that
reports at runtime if a request is blocked for a long time.

I think the best place to implement this is in the executor - we already have
a mechanism to discard tasks, and we can extend it to warn about stuck workers.

Possible design:

    executor.dispatch(callback, warn_timeout=None, discard_timeout=None)

Every warn_timeout seconds, the executor can check the workers and warn about
blocked ones. This log can use the worker's repr() to show the function,
arguments and context of the worker.

If a worker is blocked for a long time, we will see multiple warnings about it; this is good, as it makes it easy to detect issues by grepping the logs.

If discard_timeout is set, the worker will be discarded, with the same logic as
we have today.
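
A rough sketch of such a check loop (illustrative only; the Worker bookkeeping
here is an assumption, not vdsm code):

    import logging
    import time

    log = logging.getLogger("Executor")

    class Worker(object):
        def __init__(self, task):
            self.task = task
            self.start_time = time.time()

        def __repr__(self):
            # Shows the function, arguments and context of this worker.
            return "<Worker task=%r age=%.1fs>" % (
                self.task, time.time() - self.start_time)

    def watchdog(workers, warn_timeout, discard_timeout, discard):
        # Periodically inspect the running workers; warn about blocked
        # ones, and discard those stuck past discard_timeout.
        while True:
            time.sleep(warn_timeout)
            now = time.time()
            for worker in list(workers):
                age = now - worker.start_time
                if age >= warn_timeout:
                    log.warning("Worker blocked for %.1f seconds: %r",
                                age, worker)
                if discard_timeout is not None and age >= discard_timeout:
                    discard(worker)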

Eventually everything in vdsm should be done in an executor, so this will solve
this issue globally in the application.

One area that cannot use executors is storage monitoring, which is done in a
dedicated thread; this thread currently does not log a warning when storage
checkers are stuck. This is handled in these patches:
https://gerrit.ovirt.org/#/q/topic:monitor-alerts

I'm adding this bug to the patches, as this is a good example of a case that
would be easier to handle with these logs.

--- Additional comment from Nir Soffer on 2017-03-09 13:36:22 CET ---

Francesco, can you check comment 13?

--- Additional comment from Francesco Romani on 2017-03-09 13:50:06 CET ---

(In reply to Nir Soffer from comment #13)
> Basically this is bug is about storage issues causing queues to become full.
> 
> The noise this bug made show that our logs make it too hard to understand 
> unresponsive storage issues.
> 
> Adding log when starting a request that may block is good idea, we usually
> do this
> in storage, but it is not enough. What we need is an active system that
> report
> during runtime if a request is blocked for long time.

I agree with this assessment

> I think the best place to implement this is in the executor - we already have
> a mechanism to discard tasks, we can extend it to warn about stuck workers.

Agreed, executor should be in charge

> 
> Possible design:
> 
>     executor.dispatch(callback, warn_timeout=None, discard_timeout=None)
> 
> Each warn_timeout seconds, the executor can check the the worker and warn
> about
> blocker worker. This log can use the worker repr() to show the function,
> arguments and context of this worker.

Why two timeouts? The more knobs we have, the harder it gets to have good defaults.
Another option could be:

  executor.dispatch(callback, timeout=None, discard_on_timeout=True)

with defaults: same behaviour as today.
with discard_on_timeout=False:
  if the timeout expires:
    emit a warning in the logs about the stuck task, using repr() and the
    infra you just described
    reschedule the check after <timeout> secs

For discarded workers, I think it is enough if we periodically dump the executor state (again, no new infra needed - we can use repr()) if there are any stuck workers. Otherwise, better to be silent.

Summary: I agree with the concept, but I think we should have only one timeout if possible, to reduce the number of knobs.
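
Reusing the Worker sketch from comment 13, this variant could look roughly like
(the _schedule and _discard helpers are assumed):

    def dispatch(self, callback, timeout=None, discard_on_timeout=True):
        worker = Worker(callback)
        self._workers.append(worker)
        if timeout is not None:
            self._schedule(timeout, self._check,
                           worker, timeout, discard_on_timeout)

    def _check(self, worker, timeout, discard_on_timeout):
        if worker in self._workers:  # still running
            if discard_on_timeout:
                self._discard(worker)  # same behaviour as today
            else:
                # Warn using repr() and reschedule the check.
                log.warning("Task still blocked: %r", worker)
                self._schedule(timeout, self._check,
                               worker, timeout, discard_on_timeout)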

> If a worker is blocked for long time, we will see multiple warnings about
> it, this is good, makes it easy to detect issues by greping the logs.

Agreed; also, having a reliable executor state dump is a good thing when things are blocked. Today we do the dump only if we discarded a worker; if the executor pool is exhausted, we don't have dumps anymore.

> Eventually everything in vdsm should be done in an executor, so this will
> solve this issue globally in the application.

Agreed. If we don't need the full power of the executor, it is also easy to make it behave like a standard thread pool.

--- Additional comment from Francesco Romani on 2017-03-09 13:57:36 CET ---

restoring NEEDINFO cleared by mistake

--- Additional comment from Nir Soffer on 2017-03-09 14:04:48 CET ---

(In reply to Francesco Romani from comment #15)
I agree with the suggested simplification; this will make it easier to implement
and backport to 4.1.z with less risk.

Not sure what the warning timeout value should be for the jsonrpc executor. At
this level we don't have any context, so we must use a large timeout that will
not make unwanted noise for slow operations (storage sync operations, network
setup, etc.).

--- Additional comment from Francesco Romani on 2017-03-09 15:22:00 CET ---

Patches https://gerrit.ovirt.org/73845 - https://gerrit.ovirt.org/73849 are attached to this BZ, but they will not solve it - they cannot, as the comments above explain. We are just aiming to make troubleshooting easier.

--- Additional comment from Roy Golan on 2017-03-14 14:38:59 CET ---

Can you add this error to vdsmfake so we could more easily run integration tests? At least it would make it easy to simulate the error and test the engine behaviour. If you need help with it, let me know.

--- Additional comment from Roy Golan on 2017-03-14 14:42:54 CET ---

Guy, Eyal, has anyone seen the 'TooManyTasks' error in vdsm during scale-up tests?

--- Additional comment from Piotr Kliczewski on 2017-03-14 17:10:50 CET ---

@Roy, what do you want to test specifically? This error would trigger soft fencing, which doesn't make sense with a fake vdsm.

Comment 1 Martin Perina 2017-03-17 13:59:55 UTC
Let's use this bug for fencing-related changes. For now it's targeted to 4.2, but once the patches are finalized we should backport them to 4.1.z.

Comment 3 Petr Matyáš 2017-09-26 17:07:27 UTC
When testing this I used 20 (and once 50) infinite loops of `vdsm-client Host getAllVmIoTunePolicies`, but this only caused the host to become non-responsive and freeze; however, only soft fencing was executed, which seems like the scenario you sought.

What confuses me is that this did not produce any ResourceExhausted errors in the log, which I would expect since the host obviously had all its resources exhausted.

Comment 4 Piotr Kliczewski 2017-09-27 07:06:09 UTC
It is not about the host having no resources; rather, the thread pool which is used has no more threads when invoking the verbs. You can try to emulate it by changing the values (worker_threads, tasks_per_worker) in config.py, rpc section.
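
For example, a minimal pool (assuming the override goes in /etc/vdsm/vdsm.conf,
followed by a vdsmd restart, rather than editing config.py directly):

    [rpc]
    worker_threads = 1
    tasks_per_worker = 1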

Comment 5 Petr Matyáš 2017-10-19 13:49:15 UTC
Verified on ovirt-engine-4.2.0-0.0.master.20171013142622.git15e767c.el7.centos.noarch

When the host is throwing TooManyTasks errors (I set 1 worker with 1 thread), only soft fencing is used; also, VMs are migrated away from this host.

Comment 6 Sandro Bonazzola 2017-12-20 11:09:08 UTC
This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be
resolved in oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.