1430876 – [RFE] Increase supported per-manager host limit

Summary: [RFE] Increase supported per-manager host limit

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	unspecified
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	ovirt-4.2.2
Target Release:	---
Assignee:	Daniel Gur
QA Contact:	Daniel Gur
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1698310
Depends On:
Blocks:	1520566

Reported:	2017-03-09 18:44 UTC by Ashton Davis
Modified:	2021-09-09 12:11 UTC
CC List:	22 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:	undefined
Clone Of:
Environment:
Last Closed:	2018-05-15 17:41:09 UTC
oVirt Team:	Infra
Target Upstream Version:
Embargoed:
Flags:	lsvaty: testing_plan_complete-

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	3448491	None	None	None	2018-05-17 05:52:52 UTC
Red Hat Knowledge Base (Solution)	4056901	Troubleshoot	None	Cannot start Virtual Machines in RHV	2019-04-12 00:09:06 UTC
Red Hat Product Errata	RHEA-2018:1488	None	None	None	2018-05-15 17:42:35 UTC
oVirt gerrit	77969	master	ABANDONED	engine: Improve HostMonitoring by using non blocking threads	2020-05-19 16:58:01 UTC
oVirt gerrit	84184	master	MERGED	engine : Refactor HostMonitoring and VdsManager code	2020-05-19 16:58:01 UTC
oVirt gerrit	84400	master	MERGED	engine: Make additional changes needed to use non blocking threads	2020-05-19 16:58:02 UTC
oVirt gerrit	84401	master	MERGED	engine : Make Async getCapabilities calls in HostMonitoring	2020-05-19 16:58:03 UTC
oVirt gerrit	84402	master	MERGED	engine : Make Async getStats calls in HostMonitoring	2020-05-19 16:58:03 UTC
oVirt gerrit	84403	master	MERGED	engine : Refactor getHardwareInfo code to make easier async calls	2020-05-19 16:58:03 UTC
oVirt gerrit	84404	master	MERGED	engine : Make Async getHardwareInfo calls in HostMonitoring	2020-05-19 16:58:03 UTC
oVirt gerrit	87269	ovirt-engine-4.2	MERGED	engine : Refactor HostMonitoring and VdsManager code	2020-05-19 16:58:03 UTC
oVirt gerrit	87270	ovirt-engine-4.2	MERGED	engine: Make additional changes needed to use non blocking threads	2020-05-19 16:58:03 UTC
oVirt gerrit	87271	ovirt-engine-4.2	MERGED	engine : Make Async getCapabilities calls in HostMonitoring	2020-05-19 16:58:04 UTC
oVirt gerrit	87272	ovirt-engine-4.2	MERGED	engine : Make Async getStats calls in HostMonitoring	2020-05-19 16:58:04 UTC
oVirt gerrit	87273	ovirt-engine-4.2	MERGED	engine : Refactor getHardwareInfo code to make easier async calls	2020-05-19 16:58:04 UTC
oVirt gerrit	87274	ovirt-engine-4.2	MERGED	engine : Make Async getHardwareInfo calls in HostMonitoring	2020-05-19 16:58:04 UTC

Description Ashton Davis 2017-03-09 18:44:17 UTC

Description of problem:
According to our documentation [1] we only support 200 hypervisors on one RHV-M. In today's infrastructure sizes, that's a low number and should be higher. I have at least one real-world example where 200 is too low for a production deployment.

[1] https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.0/html-single/technical_reference/#Data_center_limitations

Additional info:

A limit of 400 or 500 is more reasonable (this would better match our competition - vCenter can do up to 500 hosts in a cluster).

Comment 3 Yaniv Kaul 2017-06-07 04:00:37 UTC

Roy, did we understand what bottlenecks QE saw, or were they specific to their environment?

Comment 4 Roy Golan 2017-07-05 07:25:26 UTC

After continuous profiling sessions it is clear that most of the engine's effort goes into collecting host and VM statistics. Even though my environment consists mostly of fake VMs and hosts, it simulates the load on the engine without a problem. It looks like we can increase the number of hosts with no real problem; with the help of PostgreSQL 9.5, which is coming in 4.2, and of course a decent drive for it, it is doable.

What I still want to do is reduce the polling frequency of the statistics, to once every 30 seconds instead of the current 15s. This is essentially a config option change. With just a tiny effort we can leave the VM monitoring polling the VM list every 15s, just to cover gaps and to keep the system behaving as is (ahadas's advice).
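
As a rough illustration of what the interval change means in call volume, the sketch below counts statistics calls per minute, assuming one call per host per interval. The function name and the 400-host figure (the lower limit proposed in the description) are illustrative assumptions; the numbers are call counts only, not measured engine load.

```python
def polls_per_minute(hosts: int, interval_seconds: int) -> float:
    """Statistics calls issued per minute, assuming one call per host per interval."""
    return hosts * 60 / interval_seconds


hosts = 400  # illustrative host count, taken from the limit proposed in the description
current = polls_per_minute(hosts, 15)   # current 15s interval
proposed = polls_per_minute(hosts, 30)  # proposed 30s interval

print(f"15s interval: {current:.0f} calls/min")
print(f"30s interval: {proposed:.0f} calls/min")
print(f"reduction: {1 - proposed / current:.0%}")
```

Under these assumptions, doubling the interval halves the number of statistics calls, whatever the host count.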

Overall, the CPU consumption of the engine on this big setup didn't surpass 40% and was mostly at 15%.
Overall memory consumption fluctuated between 200 and 1200 MB, with frequent GC cycles (every ~30s); most of the garbage is young objects created by the monitoring code.

Supporting a large number of hosts should also take VM density into consideration. High density usually means a VDI deployment with thin VMs, which means more effort on the VDSM side to monitor the disk watermark; this should also improve in 4.2 thanks to libvirt events. Nothing prevents deploying lots of hosts with high density, but we usually don't see this (correct me if I'm wrong here).

Comment 5 Yaniv Kaul 2017-07-05 13:14:20 UTC

I'm fine with decreasing the polling frequency. I wonder if we should do it by default or only for large environments. Please send an email to the devel mailing list asking about the pros and cons.

Comment 6 Yaniv Lavi 2017-07-06 12:24:37 UTC

Can we please estimate the improvement from this change, so we can decide on its benefit on that basis?

Comment 9 Yaniv Kaul 2017-08-21 12:01:31 UTC

This bug should be moved to MODIFIED or ON_QA as soon as:
1. PG 9.5 is in. (https://gerrit.ovirt.org/#/q/status:open+project:ovirt-engine+branch:master+topic:postgres9.5 )
2. Ravi's native threads work is in (https://gerrit.ovirt.org/#/q/status:merged+project:ovirt-engine+branch:master+topic:threading ) - it already is.
3. Ravi's 2nd part of the threads series is in (https://gerrit.ovirt.org/#/q/status:open+project:ovirt-engine+branch:master+topic:threading )
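
The linked gerrit patches describe making the host monitoring calls (getCapabilities, getStats, getHardwareInfo) asynchronous. The sketch below only illustrates the general idea of non-blocking polling and is not the engine's code: the engine is written in Java, while this uses Python's asyncio, and `get_stats`, `poll_all`, `run_poll` and the simulated latency are all hypothetical stand-ins.

```python
import asyncio
import time


async def get_stats(host: str, latency: float) -> str:
    # Hypothetical stand-in for a host statistics call:
    # sleeps instead of doing real network I/O.
    await asyncio.sleep(latency)
    return f"{host}: ok"


async def poll_all(hosts, latency):
    # Issue every call at once; nothing blocks while waiting on a single host.
    return await asyncio.gather(*(get_stats(h, latency) for h in hosts))


def run_poll(n_hosts: int = 50, latency: float = 0.05):
    hosts = [f"host-{i}" for i in range(n_hosts)]
    start = time.monotonic()
    results = asyncio.run(poll_all(hosts, latency))
    elapsed = time.monotonic() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = run_poll()
    print(f"polled {len(results)} hosts in {elapsed:.2f}s")
```

Because the simulated calls overlap, the total time stays close to a single call's latency rather than growing with the number of hosts, which is the property that matters when one manager monitors hundreds of hosts.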

Comment 20 Martin Perina 2018-02-08 13:54:23 UTC

Reverting changes done by automatic bots

Comment 22 Yaniv Kaul 2018-02-28 07:33:49 UTC

We've completed all the work that we've intended to perform for RHV 4.2 in this RFE. We've already seen QE running with 400 hosts and we believe we can get to higher numbers with additional improvements we've had in 4.2.2. Moving to ON_QA for QE to verify.

Comment 25 Daniel Gur 2018-04-25 09:25:48 UTC

Removing Need Info, as this bug is already closed and the info has been provided.

Comment 28 errata-xmlrpc 2018-05-15 17:41:09 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 29 Germano Veit Michel 2019-04-11 23:10:12 UTC

*** Bug 1698310 has been marked as a duplicate of this bug. ***

Comment 30 Franta Kust 2019-05-16 13:06:31 UTC

BZ<2>Jira Resync
