Bug 1522878
| Summary: | GetCapabilitiesVDS failed: Not enough resources | | |
|---|---|---|---|
| Product: | [oVirt] vdsm | Reporter: | Evgheni Dereveanchin <ederevea> |
| Component: | General | Assignee: | Milan Zamazal <mzamazal> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Israel Pinto <ipinto> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.20.9 | CC: | bugs, ederevea, fromani, michal.skrivanek, pkliczew |
| Target Milestone: | ovirt-4.2.0 | Flags: | rule-engine: ovirt-4.2+, rule-engine: blocker+ |
| Target Release: | 4.20.9.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-02-12 10:09:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Evgheni Dereveanchin, 2017-12-06 16:29:32 UTC)
I can reproduce this by migrating a few more VMs to the host, so it does seem related to live migration. On a side note, attempts to migrate VMs off vdsm-4.20.9 back to vdsm-4.19.31 cause VMs to crash, but that's probably material for another BZ.

It seems to be related to the well-known mom issue:

    2017-12-06 16:07:42,853+0000 WARN (vdsm.Scheduler) [Executor] Worker blocked: <Worker name=jsonrpc/4 running <Task <JsonRpcTask {'params': {}, 'jsonrpc': '2.0', 'method': u'Host.getAllVmIoTunePolicies', 'id': u'55585710-3120-4d94-86c8-3383db23f794'} at 0x3f17990> timeout=60, duration=60 at 0x3f17290> task#=930 at 0x3233290> (executor:358)

Comment 4 (Michal Skrivanek):

yes, but supposedly that's happening due to the worker queue being full of DriveWatermarkMonitors:

    <Executor periodic workers=4 max_workers=30 <TaskQueue periodic max_tasks=400 tasks(400)

Could be a regression in high watermark events, might be also related to bug 1522901 as a trigger. Evgheni, you can try to rule out or confirm the watermark changes by flipping the enable_block_threshold_event option in vdsm.conf to false.

Comment 5 (Yaniv Kaul):

(In reply to Michal Skrivanek from comment #4)
> yes, but supposedly that's happening due to the worker queue being full of
> DriveWatermarkMonitors

I remember asking for a feature to dump all workers when the Q is full... No idea where it is.

> <Executor periodic workers=4 max_workers=30 <TaskQueue periodic
> max_tasks=400 tasks(400)
>
> Could be a regression in high watermark events, might be also related to bug
> 1522901 as a trigger. Evgheni, you can try to rule out or confirm the
> watermark changes by flipping the enable_block_threshold_event option in
> vdsm.conf to false

(In reply to Michal Skrivanek from comment #4)
> Could be a regression in high watermark events, might be also related to bug
> 1522901 as a trigger.
> Evgheni, you can try to rule out or confirm the watermark changes by flipping
> the enable_block_threshold_event option in vdsm.conf to false

I set enable_block_threshold_event = false in [vars] of vdsm.conf and sent an inbound live migration to force a restart of VDSM on the host. I assume at this point it would re-read the config file. In any case, nothing seems to have changed: after the host came back up, incoming migrations still cause VDSM restarts.

Patch applied to the affected host. Incoming migration no longer causes VDSM to restart.

Comment 10 (Michal Skrivanek):

(In reply to Yaniv Kaul from comment #5)
> (In reply to Michal Skrivanek from comment #4)
> > yes, but supposedly that's happening due to the worker queue being full of
> > DriveWatermarkMonitors
>
> I remember asking for a feature to dump all workers when the Q is full...
> No idea where it is.

So that's there; in the log you can see it's full of DriveWatermarkMonitor threads.

(In reply to Michal Skrivanek from comment #10)
> So that's there, in the log you can see it's full of DriveWatermarkMonitor
> threads

Yep: https://gerrit.ovirt.org/#/c/81624/

Verified with:

Engine: 4.2.1.3-0.1.el7

Host 4.1:
OS Version: RHEL - 7.4 - 18.el7
Kernel Version: 3.10.0-693.17.1.el7.x86_64
KVM Version: 2.9.0-16.el7_4.13.1
LIBVIRT Version: libvirt-3.2.0-14.el7_4.7
VDSM Version: vdsm-4.20.17-1.el7ev

Host 4.2:
OS Version: RHEL - 7.4 - 18.el7
Kernel Version: 3.10.0-693.el7.x86_64
KVM Version: 2.9.0-16.el7_4.14
LIBVIRT Version: libvirt-3.2.0-14.el7_4.9
VDSM Version: vdsm-4.19.45-1.el7ev

Steps: migrate a VM from the 4.1 host to the 4.2 host.

Note: I found https://bugzilla.redhat.com/show_bug.cgi?id=1542117 while migrating a VM.
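For reference, a sketch of the diagnostic config change discussed above, assuming the option belongs in the `[vars]` section as Evgheni's comment states (verify against the `vdsm.conf` shipped with your vdsm version before applying; note that in this bug the flip did not help, and the real fix was the patch):

```ini
# /etc/vdsm/vdsm.conf -- section placement per the comment above
[vars]
# Disable block-threshold (high watermark) events to rule them out
# as the trigger; VDSM must be restarted to pick this up.
enable_block_threshold_event = false
```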
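The `<TaskQueue periodic max_tasks=400 tasks(400)` line in the log says the periodic task queue is at capacity. A minimal Python sketch of that failure mode, not vdsm's actual executor code (the `TaskQueue` class and `put` method here are illustrative assumptions that merely mirror the names in the log line):

```python
import queue

class TaskQueue:
    """Bounded task queue: once full, every new task is rejected."""

    def __init__(self, max_tasks):
        self._queue = queue.Queue(maxsize=max_tasks)

    def put(self, task):
        try:
            self._queue.put_nowait(task)
            return True
        except queue.Full:
            # The queue is saturated, so unrelated calls submitted
            # behind the flood never get to run and eventually time out.
            return False

    def __len__(self):
        return self._queue.qsize()

periodic = TaskQueue(max_tasks=400)

# A burst of watermark-monitor tasks fills the queue completely...
accepted = sum(periodic.put(("DriveWatermarkMonitor", i)) for i in range(500))
assert accepted == 400

# ...and a later, unrelated request is rejected.
assert periodic.put("Host.getAllVmIoTunePolicies") is False
```

This mirrors the reported symptom: a flood of DriveWatermarkMonitor tasks crowds out everything else, so calls like `Host.getAllVmIoTunePolicies` block past their 60-second timeout.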
This bug is included in the oVirt 4.2.0 release, published on Dec 20th 2017. Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.