Description of problem:
After upgrading a host from vdsm-4.19.31 to vdsm-4.20.9 and migrating several VMs to it, I started hitting issues: first the VMs changed to Unknown status, then the host went non-responsive. After that I got several errors like this in the event log:

VDSM ovirt-srv01 command GetCapabilitiesVDS failed: Not enough resources: {'reason': 'Too many tasks', 'resource': 'jsonrpc', 'current_tasks': 80}

Later VDSM restarted and the host went back online.

Version-Release number of selected component (if applicable):
vdsm-4.20.9-1.el7.centos.x86_64

How reproducible:
I hit this after live-migrating 7 VMs residing on an iSCSI storage domain.

Steps to Reproduce:
1. Upgrade a host to vdsm-4.20.9
2. Live-migrate VMs from a vdsm-4.19.31 host

Actual results:
VDSM goes unresponsive, supposedly due to an unexpectedly high number of tasks.

Expected results:
Live migration finishes successfully.

Additional info:
Will attach the VDSM log in a bit.
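To make the failure mode concrete, here is a minimal sketch of the kind of admission control that produces an error like the one above, assuming a hard cap on concurrent jsonrpc tasks. The names (Dispatcher, Throttled, TASK_LIMIT) are illustrative, not vdsm's actual API:

import threading

TASK_LIMIT = 80  # the cap hit above: 'current_tasks': 80

class Throttled(Exception):
    pass

class Dispatcher:
    """Tracks in-flight tasks and rejects new ones past a cap."""

    def __init__(self, limit=TASK_LIMIT):
        self._limit = limit
        self._current = 0
        self._lock = threading.Lock()

    def dispatch(self, task):
        with self._lock:
            if self._current >= self._limit:
                # Surfaces to the caller much like the event-log error:
                # {'reason': 'Too many tasks', 'resource': 'jsonrpc',
                #  'current_tasks': 80}
                raise Throttled("Too many tasks: %d" % self._current)
            self._current += 1
        try:
            task()
        finally:
            with self._lock:
                self._current -= 1

Once workers stop draining tasks, the in-flight count never comes back down, so every subsequent request (including GetCapabilitiesVDS from the engine) is rejected.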
I can reproduce this by migrating a few more VMs to the host, so it does seem related to live migration. On a side note, attempts to migrate VMs off vdsm-4.20.9 back to vdsm-4.19.31 cause the VMs to crash, but that's probably material for another BZ.
It seems to be related to the well-known MOM issue:

2017-12-06 16:07:42,853+0000 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/4 running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0', 'method': u'Host.getAllVmIoTunePolicies', 'id': u'55585710-3120-4d94-86c8-3383db23f794'} at 0x3f17990> timeout=60, duration=60 at 0x3f17290> task#=930 at 0x3233290> (executor:358)
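For illustration, this is roughly how a scheduler detects a blocked worker like the one in the WARN above: each worker records when its current task started, and a watchdog warns once the task has run past its timeout. A minimal sketch with illustrative names (Worker, watchdog), not vdsm's actual Executor code:

import logging
import time

class Worker:
    """Runs one task at a time and records when it started."""

    def __init__(self, name, timeout=60):
        self.name = name
        self.timeout = timeout
        self.task = None
        self.started = None

    def run(self, task):
        self.task, self.started = task, time.monotonic()
        try:
            task()
        finally:
            self.task = self.started = None

def watchdog(workers, interval=10):
    # Scan workers periodically; warn about any task that has been
    # running longer than its timeout, i.e. a blocked worker.
    while True:
        for w in workers:
            started = w.started
            if started is not None:
                duration = time.monotonic() - started
                if duration >= w.timeout:
                    logging.warning(
                        "Worker blocked: <Worker name=%s running %r "
                        "timeout=%d, duration=%d>",
                        w.name, w.task, w.timeout, duration)
        time.sleep(interval)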
Yes, but supposedly that's happening due to the worker queue being full of DriveWatermarkMonitors:

<Executor periodic workers=4 max_workers=30 <TaskQueue periodic max_tasks=400 tasks(400)

Could be a regression in high watermark events; it might also be related to bug 1522901 as a trigger. Evgheni, you can try to rule out or confirm the watermark changes by flipping the enable_block_threshold_event option in vdsm.conf to false.
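To spell out the mechanism: the periodic executor has a bounded queue (max_tasks=400 above), and every monitoring cycle enqueues a watermark check per running VM. If the 4 workers are blocked, a few cycles over many VMs are enough to hit the cap, after which all submissions fail. A minimal sketch, assuming illustrative names (PeriodicQueue, TooManyTasks, check_drive_watermark); not vdsm's actual implementation:

import queue

class TooManyTasks(Exception):
    pass

class PeriodicQueue:
    """Bounded queue for periodic operations, like max_tasks=400 above."""

    def __init__(self, max_tasks=400):
        self._tasks = queue.Queue(maxsize=max_tasks)

    def submit(self, task):
        try:
            self._tasks.put_nowait(task)
        except queue.Full:
            raise TooManyTasks("periodic queue full (%d tasks)"
                               % self._tasks.qsize())

def check_drive_watermark(vm):
    pass  # stand-in for the real per-drive high-watermark check

def monitor_cycle(q, vms):
    # One periodic cycle enqueues one watermark check per running VM.
    # With blocked workers, the queue only ever grows.
    for vm in vms:
        q.submit(lambda vm=vm: check_drive_watermark(vm))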
(In reply to Michal Skrivanek from comment #4)
> yes, but supposedly that's happening due to the worker queue being full of
> DriveWatermarkMonitors

I remember asking for a feature to dump all workers when the Q is full... No idea where it is.

> <Executor periodic workers=4 max_workers=30 <TaskQueue periodic
> max_tasks=400 tasks(400)
>
> Could be a regression in high watermark events, might be also related to bug
> 1522901 as a trigger. Evgheni, you can try to rule out or confirm the
> watermark changes by flipping the enable_block_threshold_event option in
> vdsm.conf to false
(In reply to Michal Skrivanek from comment #4)
> Could be a regression in high watermark events, might be also related to bug
> 1522901 as a trigger. Evgheni, you can try to rule out or confirm the
> watermark changes by flipping the enable_block_threshold_event option in
> vdsm.conf to false

I set enable_block_threshold_event = false in the [vars] section of vdsm.conf and sent an inbound live migration to force a restart of VDSM on the host; I assume at that point it re-reads the config file. In any case, nothing seems to have changed: after the host came back up, incoming migrations still cause VDSM restarts.
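For reference, the change described above amounts to this fragment of /etc/vdsm/vdsm.conf (picked up when vdsmd restarts):

[vars]
enable_block_threshold_event = false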
Patch applied to affected host. Incoming migration no longer causes VDSM to restart.
(In reply to Yaniv Kaul from comment #5)
> (In reply to Michal Skrivanek from comment #4)
> > yes, but supposedly that's happening due to the worker queue being full of
> > DriveWatermarkMonitors
>
> I remember asking for a feature to dump all workers when the Q is full...
> No idea where it is.

So that's there; in the log you can see it's full of DriveWatermarkMonitor threads.
(In reply to Michal Skrivanek from comment #10)
> (In reply to Yaniv Kaul from comment #5)
> > (In reply to Michal Skrivanek from comment #4)
> > > yes, but supposedly that's happening due to the worker queue being full of
> > > DriveWatermarkMonitors
> >
> > I remember asking for a feature to dump all workers when the Q is full...
> > No idea where it is.
>
> So that's there, in the log you can see it's full of DriveWatermarkMonitor
> threads

Yep: https://gerrit.ovirt.org/#/c/81624/
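Sketching the idea behind that change (illustrative names, not the actual patch): when the queue is full, log a snapshot of the queued task types before rejecting, so the log shows directly what filled the queue, e.g. hundreds of DriveWatermarkMonitor entries.

import logging
import queue

class TooManyTasks(Exception):
    pass

class DumpingQueue:
    """Bounded queue that logs its contents before rejecting a task."""

    def __init__(self, name, max_tasks=400):
        self.name = name
        self._tasks = queue.Queue(maxsize=max_tasks)

    def submit(self, task):
        try:
            self._tasks.put_nowait(task)
        except queue.Full:
            # queue.Queue keeps its items in a deque guarded by a mutex
            # (a CPython implementation detail); take the mutex so the
            # snapshot is consistent while we log it.
            with self._tasks.mutex:
                pending = [type(t).__name__ for t in self._tasks.queue]
            logging.warning("Queue %s is full (%d tasks): %s",
                            self.name, len(pending), pending)
            raise TooManyTasks(self.name)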
Verified with:

Engine: 4.2.1.3-0.1.el7

Host 4.1:
OS Version: RHEL - 7.4 - 18.el7
Kernel Version: 3.10.0 - 693.el7.x86_64
KVM Version: 2.9.0 - 16.el7_4.14
LIBVIRT Version: libvirt-3.2.0-14.el7_4.9
VDSM Version: vdsm-4.19.45-1.el7ev

Host 4.2:
OS Version: RHEL - 7.4 - 18.el7
Kernel Version: 3.10.0 - 693.17.1.el7.x86_64
KVM Version: 2.9.0 - 16.el7_4.13.1
LIBVIRT Version: libvirt-3.2.0-14.el7_4.7
VDSM Version: vdsm-4.20.17-1.el7ev

(The host labels were swapped in the original comment; vdsm-4.19.x is the 4.1 host and vdsm-4.20.x is the 4.2 host.)

Steps:
Migrate a VM from the 4.1 host to the 4.2 host.

Note: I found https://bugzilla.redhat.com/show_bug.cgi?id=1542117 while migrating the VM.
This bug is included in the oVirt 4.2.0 release, published on Dec 20th 2017. Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.