Bug 1522878 - GetCapabilitiesVDS failed: Not enough resources
Summary: GetCapabilitiesVDS failed: Not enough resources
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.20.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.2.0
Target Release: 4.20.9.1
Assignee: Milan Zamazal
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-12-06 16:29 UTC by Evgheni Dereveanchin
Modified: 2018-02-12 10:09 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-12 10:09:41 UTC
oVirt Team: Virt
Embargoed:
rule-engine: ovirt-4.2+
rule-engine: blocker+


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 85199 0 master ABANDONED virt: Remove locking from attribute read access in storage.Drive 2020-12-31 22:52:36 UTC
oVirt gerrit 85277 0 master ABANDONED vm: drive: find diskType before monitoring starts 2020-12-31 22:52:34 UTC
oVirt gerrit 85281 0 ovirt-4.2.0 MERGED storage: Fix deadlock in Drive.diskType 2020-12-31 22:52:36 UTC

Description Evgheni Dereveanchin 2017-12-06 16:29:32 UTC
Description of problem:
After upgrading a host from vdsm-4.19.31 to vdsm-4.20.9 and migrating several VMs to it, I started hitting issues: first the VMs changed to Unknown status, then the host went non-responsive. After that I got several errors like this in the event log:

VDSM ovirt-srv01 command GetCapabilitiesVDS failed: Not enough resources:
 {'reason': 'Too many tasks', 'resource': 'jsonrpc', 'current_tasks': 80}

Later VDSM restarted and the host went back online.
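
For context on the error: vdsm dispatches JSON-RPC calls through a bounded task queue, and once that queue is full new requests are rejected with a "Too many tasks" error instead of piling up. A rough sketch of the pattern (illustrative only; the class and the limit of 80 are assumptions, not vdsm's actual code):

import queue

MAX_TASKS = 80  # assumed limit, matching 'current_tasks': 80 above

class TooManyTasks(Exception):
    pass

class BoundedDispatcher:
    """Illustrative stand-in for a bounded jsonrpc executor."""

    def __init__(self, max_tasks=MAX_TASKS):
        self._tasks = queue.Queue(maxsize=max_tasks)

    def dispatch(self, task):
        try:
            self._tasks.put_nowait(task)
        except queue.Full:
            # Mirrors the rejection seen in the event log above.
            raise TooManyTasks({'reason': 'Too many tasks',
                                'resource': 'jsonrpc',
                                'current_tasks': self._tasks.qsize()})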

Version-Release number of selected component (if applicable):
vdsm-4.20.9-1.el7.centos.x86_64

How reproducible:
I hit this after live-migrating 7 VMs residing on an iSCSI storage domain.

Steps to Reproduce:
1. upgrade host to vdsm-4.20.9
2. live-migrate VMs from vdsm-4.19.31 host

Actual results:
VDSM goes unresponsive, supposedly due to an unexpectedly high number of tasks

Expected results:
live migration finishes successfully

Additional info:
will attach VDSM log in a bit

Comment 2 Evgheni Dereveanchin 2017-12-06 17:12:28 UTC
I can reproduce this by migrating a few more VMs to the host so it does seem related to live migration.

On a side note, attempts to migrate VMs off vdsm-4.20.9 back to vdsm-4.19.31 cause VMs to crash, but that's probably material for another BZ.

Comment 3 Piotr Kliczewski 2017-12-07 09:15:03 UTC
It seems to be related to the well-known MOM issue:

2017-12-06 16:07:42,853+0000 WARN  (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/4 running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0', 'method': u'Host.getAllVmIoTunePolicies', 'id': u'55585710-3120-4d94-86c8-3383db23f794'} at 0x3f17990> timeout=60, duration=60 at 0x3f17290> task#=930 at 0x3233290> (executor:358)
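
This warning apparently comes from a periodic check that flags any worker whose current task has run past its timeout. Roughly (hypothetical names and structures, not vdsm's code):

import time

def check_workers(workers, timeout=60.0):
    # Each worker is assumed to be a dict like
    # {'name': 'jsonrpc/4', 'task': '...', 'started': <monotonic time>}.
    now = time.monotonic()
    for w in workers:
        started = w.get('started')
        if started is not None and now - started >= timeout:
            print('WARN Worker blocked: <Worker name=%s running %s '
                  'timeout=%d, duration=%d>'
                  % (w['name'], w['task'], timeout, now - started))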

Comment 4 Michal Skrivanek 2017-12-07 10:33:13 UTC
yes, but supposedly that's happening due to the worker queue being full of DriveWatermarkMonitors

<Executor periodic workers=4 max_workers=30 <TaskQueue periodic max_tasks=400 tasks(400)

Could be a regression in the high watermark events; it might also be related to bug 1522901 as a trigger. Evgheni, you can try to rule out or confirm the watermark changes by flipping the enable_block_threshold_event option in vdsm.conf to false.
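
For reference, the suggested change amounts to the following lines in vdsm.conf (in the [vars] section, as applied in comment 6 below; vdsm reads the file only at startup, so a restart is needed for it to take effect):

[vars]
enable_block_threshold_event = false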

Comment 5 Yaniv Kaul 2017-12-07 12:03:18 UTC
(In reply to Michal Skrivanek from comment #4)
> yes, but supposedly that's happening due to the worker queue being full of
> DriveWatermarkMonitors

I remember asking for a feature to dump all workers when the Q is full...
No idea where it is.

> 
> <Executor periodic workers=4 max_workers=30 <TaskQueue periodic
> max_tasks=400 tasks(400)
> 
> Could be a regression in high watermark events, might be also related to bug
> 1522901 as a trigger. Evgheni, you can try to rule out or confirm the
> watermark changes by flipping the enable_block_threshold_event option in
> vdsm.conf to false

Comment 6 Evgheni Dereveanchin 2017-12-07 12:25:38 UTC
(In reply to Michal Skrivanek from comment #4)
> Could be a regression in high watermark events, might be also related to bug
> 1522901 as a trigger. Evgheni, you can try to rule out or confirm the
> watermark changes by flipping the enable_block_threshold_event option in
> vdsm.conf to false

I set enable_block_threshold_event = false in [vars] of vdsm.conf and sent an inbound live migration to force a restart of VDSM on the host; I assume at this point it would re-read the config file. In any case, nothing seems to have changed: after the host came back up, incoming migrations still cause VDSM restarts.

Comment 9 Evgheni Dereveanchin 2017-12-07 22:45:36 UTC
Patch applied to affected host. Incoming migration no longer causes VDSM to restart.
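
For the record, the merged fix is gerrit 85281 ("storage: Fix deadlock in Drive.diskType", see the links above). The general shape of such a deadlock, as an illustrative reconstruction rather than the actual vdsm code: a property that takes a plain (non-reentrant) lock hangs when it is reached from a path that already holds the same lock.

import threading

class Drive:
    # Hypothetical reconstruction -- not vdsm's Drive implementation.
    def __init__(self):
        self._lock = threading.Lock()  # non-reentrant
        self._disk_type = 'block'

    @property
    def diskType(self):
        with self._lock:          # second acquisition never returns
            return self._disk_type

    def monitor(self):
        with self._lock:          # the lock is already held here...
            return self.diskType  # ...so this call deadlocks

Calling Drive().monitor() blocks forever. The companion patches in the links above (removing locking from attribute reads, or resolving diskType before monitoring starts) each break this cycle in a different way.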

Comment 10 Michal Skrivanek 2017-12-08 15:38:47 UTC
(In reply to Yaniv Kaul from comment #5)
> (In reply to Michal Skrivanek from comment #4)
> > yes, but supposedly that's happening due to the worker queue being full of
> > DriveWatermarkMonitors
> 
> I remember asking for a feature to dump all workers when the Q is full...
> No idea where it is.

So that's there; in the log you can see it's full of DriveWatermarkMonitor threads.
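
The diagnostic boils down to logging the queue's contents at the moment an overflow happens, e.g. (assumed names; the actual change is gerrit 81624, linked in the next comment):

import logging
import queue

log = logging.getLogger('executor')

class DumpingQueue(queue.Queue):
    # Sketch: report what occupies the queue when it overflows.
    def put_task(self, task):
        try:
            self.put_nowait(task)
        except queue.Full:
            with self.mutex:  # guards the internal deque
                pending = [type(t).__name__ for t in self.queue]
            log.warning('queue full (%d tasks): %s', len(pending), pending)
            raise

Here such a dump shows the queue packed with DriveWatermarkMonitor entries, which is what points at the watermark monitoring as the culprit.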

Comment 11 Francesco Romani 2017-12-11 09:05:48 UTC
(In reply to Michal Skrivanek from comment #10)
> (In reply to Yaniv Kaul from comment #5)
> > (In reply to Michal Skrivanek from comment #4)
> > > yes, but supposedly that's happening due to the worker queue being full of
> > > DriveWatermarkMonitors
> > 
> > I remember asking for a feature to dump all workers when the Q is full...
> > No idea where it is.
> 
> So that's there, in the log you can see it's full of DriveWatermarkMonitor
> threads

Yep: https://gerrit.ovirt.org/#/c/81624/

Comment 12 Israel Pinto 2018-02-06 14:21:32 UTC
Verified with:
Engine: 4.2.1.3-0.1.el7
Host 4.1:
OS Version: RHEL - 7.4 - 18.el7
Kernel Version: 3.10.0-693.el7.x86_64
KVM Version: 2.9.0-16.el7_4.14
LIBVIRT Version: libvirt-3.2.0-14.el7_4.9
VDSM Version: vdsm-4.19.45-1.el7ev
Host 4.2:
OS Version: RHEL - 7.4 - 18.el7
Kernel Version: 3.10.0-693.17.1.el7.x86_64
KVM Version: 2.9.0-16.el7_4.13.1
LIBVIRT Version: libvirt-3.2.0-14.el7_4.7
VDSM Version: vdsm-4.20.17-1.el7ev


Steps:
Migrate VM from 4.1 host to 4.2 host.

Note:
I found https://bugzilla.redhat.com/show_bug.cgi?id=1542117
while migrating the VM.

Comment 13 Sandro Bonazzola 2018-02-12 10:09:41 UTC
This bugzilla is included in the oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

