Bug 1167229 - [scale] VM balloon query performance impact in sampling
Summary: [scale] VM balloon query performance impact in sampling
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.6.0
Assignee: Francesco Romani
QA Contact: meital avital
URL:
Whiteboard: virt
Depends On:
Blocks:
 
Reported: 2014-11-24 09:23 UTC by Francesco Romani
Modified: 2015-08-19 07:22 UTC (History)
24 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 1102147
Environment:
Last Closed: 2015-08-19 07:22:18 UTC
oVirt Team: ---
Target Upstream Version:
Embargoed:



Description Francesco Romani 2014-11-24 09:23:20 UTC
During the verification of Bug #1102147, which was about preventing timeouts and network exceptions due to the system being slow to respond to ballooning queries,
concerns were raised about the performance side.

This bug is to track and investigate these concerns.

Relevant history from bug 1102147:

+++ This bug was initially created as a clone of Bug #1102147 +++


--- Additional comment from Yuri Obshansky on 2014-11-10 02:13:30 EST ---

Bug verification failed.
RHEVM - 3.5.0-0.17.beta.el6ev
VDSM - vdsm-4.16.7.1-1.el6ev

I ran a simple script which performs the getAllVmStats command
and measures the response time.
See results:

getAllVmStats response time w/o patch
min     37.319 sec
average 46.398 sec
max     52.005 sec

getAllVmStats response time with patch
min     33.058 sec
average 48.268 sec
max     56.701 sec
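For reference, a minimal timing harness in the spirit of such a script might look like the sketch below. The real test called VDSM's getAllVmStats over its RPC channel; the stand-in function here is purely illustrative.

```python
import time
import statistics

def measure(call, rounds=10):
    """Invoke `call` repeatedly and report min/average/max latency in seconds."""
    samples = []
    for _ in range(rounds):
        start = time.perf_counter()
        call()  # in the real test: a client stub issuing getAllVmStats
        samples.append(time.perf_counter() - start)
    return min(samples), statistics.mean(samples), max(samples)

# Stand-in for the real VDSM call, used here only for illustration.
def fake_get_all_vm_stats():
    time.sleep(0.01)

lo, avg, hi = measure(fake_get_all_vm_stats, rounds=5)
print(f"min {lo:.3f} sec  average {avg:.3f} sec  max {hi:.3f} sec")
```

Running such a harness with and without a patch, against the same set of VMs, yields numbers directly comparable to the tables above.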

--- Additional comment from Yaniv Bronhaim on 2014-11-11 06:30:02 EST ---

I suggest opening a new bug. The fix here refers to virt: it polls the VMs' status and decreases the penalty of doing so - this was merged to 3.5, as the bug is targeted there. Performance issues when running VMs at large scale can lead to many bottlenecks;
this particular bottleneck is solved in this bugzilla (fromani - please change the bug title according to the fix)

Yuri, a 5-second improvement is also meaningful, but we need to understand more deeply what takes so long and where to investigate. Please open a new bug with the latest logs and we'll try to investigate further, and close this one.

--- Additional comment from Francesco Romani on 2014-11-11 06:40:26 EST ---

(In reply to Yaniv Bronhaim from comment #59)
> I suggest opening a new bug. The fix here refers to virt: it polls the VMs'
> status and decreases the penalty of doing so - this was merged to 3.5, as
> the bug is targeted there. Performance issues when running VMs at large
> scale can lead to many bottlenecks; this particular bottleneck is solved in
> this bugzilla (fromani - please change the bug title according to the fix)

Done

--- Additional comment from Yuri Obshansky on 2014-11-16 09:00:41 EST ---

I ran the script with getAllVmStats to test Francesco's patch,
which is the fix for bug 1110835 (a clone of the current bug).
Anyway, I reran the test using the original flow,
with only one difference - I repeated the flow (creating 200 VMs) twice:
1. host created without the jsonrpc protocol (using the old xmlrpc)
2. host created with the jsonrpc protocol.

Executive summary:
- The test with the jsonrpc protocol failed.
- The VM creation process became slower and slower after we had created more than 80 VMs.
- The DC and Storage Domain crashed every half hour of running.
- The last 111 VMs' creation process took 7 minutes.
- The environment was not functional from my point of view.
Exceptions raised in the engine log:
2014-11-16 14:22:45,120 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-15) [69a1a5c] Correlation ID: 69a1a5c, Job ID: b1d0760c-c346-4037-84c5-30bc8b73c58e, Call Stack: null, Custom Event ID: -1, Message: Failed to Reconstruct Master Domain for Data Center DC-REAL.
2014-11-16 14:25:42,260 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-60) [409d1fa0] ERROR, org.ovirt.engine.core.vdsbroker.irsbroker.GetStoragePoolInfoVDSCommand, exception: VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues', log id: 7d4be17c: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues'
2014-11-16 14:25:42,275 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-60) [409d1fa0] IRS failover failed - cant allocate vds server
2014-11-16 14:25:42,520 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (ajp-/127.0.0.1:8702-7) Operation Failed: [Cannot run VM. The relevant Storage Domain's status is Inactive.]
Errors raised in the vdsm log:
GuestMonitor-scale-107::DEBUG::2014-11-16 14:27:08,414::vm::486::vm.Vm::(_getUserCpuTuneInfo) vmId=`535fb4ef-a64b-4754-b700-87a8e32d31b5`::Domain Metadata is not set
Thread-140622::DEBUG::2014-11-16 14:27:08,419::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 80 edom: 20 level: 2 message: metadata not found: Requested metadata element is not present

Host details:
vdsm-4.16.7.1-1.el6ev
libvirt-0.10.2-46.el6_6.1
32 CPU Cores
400 G RAM


--- Additional comment from Francesco Romani on 2014-11-24 04:07:14 EST ---

This bug is becoming too complex to follow.

I don't think the issues found in the ReconstructMasterDomainCommand flow with jsonrpc, although of course important, belong in this bug. I suggest opening a new one to track them better.

I also believe it is best to file a new bug addressing the performance characteristics of this patch, as concerns were raised in https://bugzilla.redhat.com/show_bug.cgi?id=1102147#c58

This bug was about avoiding timeouts and exceptions, and verification *did not* prove it faulty (although of course more verification is welcome).

Barak, Michal, what do you think about the above?

Comment 1 Francesco Romani 2015-08-19 07:22:18 UTC
This is not relevant anymore due to the complete overhaul of the sampling subsystem in VDSM, implemented in 3.6.0.

Now all libvirt calls happen on a thread pool, so they are always asynchronous with respect to the server threads - the ones that answer queries from the Engine/clients.

The queries, be they event notifications or poll requests, always fetch cached data.
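The decoupling described here can be sketched as follows. This is a minimal illustration of the pattern, not VDSM's actual code: a background worker periodically runs the (potentially slow) sampling call, while query threads only ever read the last cached value and never block on libvirt.

```python
import threading
import time

class SamplingCache:
    """Background sampler: a worker thread refreshes the cache; query
    threads read the last stored value without waiting on sampling."""

    def __init__(self, sample_fn, interval=0.05):
        self._sample_fn = sample_fn
        self._interval = interval
        self._lock = threading.Lock()
        self._last = None
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._worker.start()

    def stop(self):
        self._stop.set()
        self._worker.join()

    def _run(self):
        while not self._stop.is_set():
            value = self._sample_fn()   # potentially slow call (e.g. libvirt)
            with self._lock:
                self._last = value      # publish the fresh sample
            self._stop.wait(self._interval)

    def query(self):
        # Server threads never block on sampling: they return cached data.
        with self._lock:
            return self._last

# Illustrative use: the sampling function stands in for a libvirt query.
cache = SamplingCache(lambda: {"balloon": 1024, "ts": time.time()})
cache.start()
time.sleep(0.1)            # let at least one sample land
stats = cache.query()      # fast, non-blocking read
cache.stop()
```

With this shape, a slow hypervisor call can delay only the freshness of the data, never the response time of getAllVmStats itself.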

