Bug 1266579

Summary: DestroyVDSCommand times out when hypervisor is under load
Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Version: 3.4.5
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Hardware: x86_64
OS: Linux
Whiteboard: virt
Reporter: Tim Speetjens <tspeetje>
Assignee: Dan Kenigsberg <danken>
QA Contact: Aharon Canan <acanan>
CC: bazulay, ecohen, gklein, lsurette, tspeetje, ycui, yeylon
Target Milestone: ---
Target Release: ---
Doc Type: Bug Fix
Story Points: ---
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Last Closed: 2015-11-05 11:24:37 UTC

Description Tim Speetjens 2015-09-25 17:08:27 UTC
Description of problem:
Under load, the DestroyVDSCommand takes significantly longer to complete, which
may exceed vdsTimeout; as a result, the HV/SPM is set to non-responsive
(an illustrative timeout sketch follows at the end of this description).

Version-Release number of selected component (if applicable):
vdsm-4.14.18-4.el6ev.x86_64

How reproducible:
Difficult to reproduce; it happens in an environment that is always busy.

Steps to Reproduce:
Not clear yet

Actual results:
TimeoutException in engine.log
Hypervisor / SPM is marked as non-operational

Expected results:
DestroyVDSCommand must return in a timely fashion, even under load
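
For illustration only, here is a minimal Python sketch (not the engine's actual Java code path) of how a slow destroy call trips a client-side timeout. The host name, port, VM UUID, the 180-second value used for vdsTimeout, and the plain-HTTP transport are assumptions; a real vdsm endpoint uses TLS.

# Hypothetical illustration of the failure mode: a destroy call that does not
# return within the caller's timeout window raises a timeout on the caller.
import socket
import xmlrpc.client

VDS_TIMEOUT = 180          # assumed engine-side vdsTimeout default, in seconds
HOST = "hv.example.com"    # hypothetical hypervisor address
PORT = 54321               # assumed vdsm XML-RPC port

class TimeoutTransport(xmlrpc.client.Transport):
    """Transport whose HTTP connection times out after VDS_TIMEOUT seconds."""
    def make_connection(self, host):
        conn = super().make_connection(host)
        conn.timeout = VDS_TIMEOUT
        return conn

server = xmlrpc.client.ServerProxy(
    "http://%s:%d" % (HOST, PORT), transport=TimeoutTransport())

try:
    # 'destroy' is the vdsm verb that tears down a VM; under heavy host load
    # it can take longer than VDS_TIMEOUT to answer.
    server.destroy("11111111-2222-3333-4444-555555555555")
except socket.timeout:
    # The engine reacts to the analogous condition with a TimeoutException
    # in engine.log and marks the host non-responsive.
    print("destroy did not return within %d seconds" % VDS_TIMEOUT)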

Comment 2 Dan Kenigsberg 2015-09-26 13:54:52 UTC
Can you specify the nature of the load? How many VMs? What is the CPU consumption? How many host CPUs? How loaded is the management network and its underlying NIC?

Would the customer be willing to test the fix for bug 1247075? Task-setting Vdsm to a single CPU is reported to improve Vdsm responsiveness.
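
For reference, a minimal sketch of the idea behind that fix, assuming it amounts to restricting the vdsm process to a single CPU; the actual patch may use taskset or an internal config option instead, and the script name is hypothetical.

# Sketch: pin an already-running process (e.g. vdsm) to a single CPU core.
# Linux-only; requires sufficient privileges over the target process.
import os
import sys

def pin_to_cpu(pid, cpu):
    """Restrict the scheduler affinity of `pid` to the single CPU `cpu`."""
    os.sched_setaffinity(pid, {cpu})
    print("pid %d now runs on CPUs %s" % (pid, sorted(os.sched_getaffinity(pid))))

if __name__ == "__main__":
    # Usage: python3 pin_cpu.py <pid> [cpu-index]   (cpu-index defaults to 1)
    target_pid = int(sys.argv[1])
    cpu_index = int(sys.argv[2]) if len(sys.argv) > 2 else 1
    pin_to_cpu(target_pid, cpu_index)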

Comment 3 Tim Speetjens 2015-09-29 12:14:07 UTC
To give an idea of the load:

This environment is API-driven, with many templates/VMs being created. Between the start and finish of the job in VDSM, many vmGetStats calls were seen, as well as multiple disk creation activities.

The load on the hypervisors is not alarming for a dual-CPU 8-core/16-thread system. The networks are in use, but to my knowledge the timeouts are not network-related.
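
As a rough way to quantify that API pressure, something along these lines could count verb occurrences in vdsm.log around the window of the slow destroy; the log path, the verb names listed, and the plain-substring matching are assumptions about the log contents.

# Rough sketch: count how often selected vdsm verbs show up in vdsm.log,
# to gauge API pressure during the window of the slow DestroyVDSCommand.
from collections import Counter

VERBS = ("vmGetStats", "getAllVmStats", "createVolume", "destroy")
counts = Counter()

with open("/var/log/vdsm/vdsm.log", errors="replace") as log:
    for line in log:
        for verb in VERBS:
            if verb in line:
                counts[verb] += 1

for verb, n in counts.most_common():
    print("%-15s %d" % (verb, n))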

I am unsure whether they can test the patch easily.

Comment 4 Tim Speetjens 2015-11-05 11:24:37 UTC

*** This bug has been marked as a duplicate of bug 1270220 ***