Bug 1301587

Summary: [scale] - hosts initialization taking too long (with 500 fake hosts and 10K vms)
Product: [oVirt] ovirt-engine Reporter: Eldad Marciano <emarcian>
Component: Backend.Core Assignee: Martin Perina <mperina>
Status: CLOSED CURRENTRELEASE QA Contact: eberman
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.6.2 CC: bugs, mperina
Target Milestone: ovirt-4.2.0 Keywords: Performance
Target Release: 4.2.0 Flags: rule-engine: ovirt-4.2+
rule-engine: planning_ack+
rule-engine: devel_ack+
eberman: testing_ack+
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-01-12 12:57:15 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1364791    
Bug Blocks:    
Attachments:
Description Flags
engine thread dumps none

Description Eldad Marciano 2016-01-25 12:56:51 UTC
Description of problem:
Host initialization takes too long once the engine is restarted.
We used a loaded engine with 500 fake hosts and 10K VMs.

The fake hosts are powered by ovirt-vdsmfake.

Their latency is low (1 sec), and they support multiple requests, so they might not be the bottleneck.


Version-Release number of selected component (if applicable):
rhevm 3.6.2.0-1

How reproducible:
100%

Steps to Reproduce:
1. Restart the loaded engine.


Actual results:
Host initialization takes too long (more than 30 min).

Expected results:
Faster host initialization.

Additional info:

Comment 1 Eldad Marciano 2016-01-25 13:00:43 UTC
Created attachment 1117980 [details]
engine thread dumps

Comment 2 Yaniv Kaul 2016-01-25 13:12:12 UTC
Eldad,
- Can you provide engine logs?
- What is 'long', in the sense that how many are initialized per minute? (is it linear, is it slowing down?). Is it still same for 100 hosts, for example?
- Does the number change depending on the number of VMs? Is it the same without the VMs?
- Have you seen any difference between fake and real hosts?

Lastly, rhevm 3.6.2.0-1 is a bit old. While I don't think there were critical changes in this area, the latest is rhevm-3.6.2.6-0.1

Comment 3 Eldad Marciano 2016-01-27 11:57:27 UTC
(In reply to Yaniv Kaul from comment #2)
> Eldad,
> - Can you provide engine logs?
Yes, I will.

> - What is 'long', in the sense that how many are initialized per minute? (is
> it linear, is it slowing down?). Is it still same for 100 hosts, for example?
I didn't test that. I noticed the problem when I restarted the engine, and initialization seems to be serial.

> - Does the number change depending on the number of VMs? Is it the same
> without the VMs?
I didn't test it.
> - Have you seen any difference between fake and real hosts?
Yes, we have 37 real hosts vs. 500 fake hosts.

> 
> Lastly, rhevm 3.6.2.0-1 is a bit old. While I don't think there were
> critical changes in this area, the latest is rhevm-3.6.2.6-0.1
We'll upgrade it ASAP.

Comment 4 Oved Ourfali 2016-01-27 14:01:25 UTC
(In reply to Eldad Marciano from comment #3)
> (In reply to Yaniv Kaul from comment #2)
> > Eldad,
> > - Can you provide engine logs?
> Yes i'll.
> 
> > - What is 'long', in the sense that how many are initialized per minute? (is
> > it linear, is it slowing down?). Is it still same for 100 hosts, for example?
> didn't test, i notice that problem when i restart the engine. and seems like
> it's serial.
> 

I'd be interested in knowing exactly how long it takes.

Comment 5 Oved Ourfali 2016-01-27 14:02:25 UTC
Currently targeting 3.6.4, but at this scale we might address it only in 4.0.

Comment 6 Red Hat Bugzilla Rules Engine 2016-01-27 14:11:51 UTC
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 7 Sandro Bonazzola 2016-05-02 09:59:46 UTC
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has been already released and bug is not ON_QA.

Comment 8 Yaniv Lavi 2016-05-23 13:16:12 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 9 Yaniv Lavi 2016-05-23 13:20:00 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 11 Red Hat Bugzilla Rules Engine 2016-05-25 14:04:01 UTC
Bug tickets must have version flags set prior to targeting them to a release. Please ask maintainer to set the correct version flags and only then set the target milestone.

Comment 12 Eldad Marciano 2016-08-16 14:29:35 UTC
(In reply to Oved Ourfali from comment #4)
> (In reply to Eldad Marciano from comment #3)
> > (In reply to Yaniv Kaul from comment #2)
> > > Eldad,
> > > - Can you provide engine logs?
> > Yes i'll.
> > 
> > > - What is 'long', in the sense that how many are initialized per minute? (is
> > > it linear, is it slowing down?). Is it still same for 100 hosts, for example?
> > didn't test, i notice that problem when i restart the engine. and seems like
> > it's serial.
> > 
> 
> I'd be interested in knowing exactly how long it takes.

Currently we don't have capacity at this scale; I'll update once we have it.

Comment 13 Eldad Marciano 2017-05-28 14:35:40 UTC
Is this bug still relevant in terms of topology?
We would like to reproduce it with 500 hosts and 10K VMs.

Comment 14 Martin Perina 2017-05-29 07:33:09 UTC
(In reply to Eldad Marciano from comment #13)
> Is this bug still relevant in terms of topology ?!
> we would like to reproduce it with 500 hosts and 10K vms ?!

We haven't made any improvements to engine startup time, but I think the fixes for BZ1438497 might also help here. Please test with 500 hosts and 10K VMs; if there is still an issue, we will try to optimize.

Comment 15 eberman 2017-12-31 09:24:26 UTC
(In reply to Martin Perina from comment #14)
> (In reply to Eldad Marciano from comment #13)
> > Is this bug still relevant in terms of topology ?!
> > we would like to reproduce it with 500 hosts and 10K vms ?!
> 
> We haven't done any improvements for engine startup time, but I think that
> fixes for BZ1438497 might help also here. Please test 500 hosts and 10K VMs,
> if there is still issue we will try to optimize.


Tested with:
Version: 4.2.0-0.0.master.20171121184703.git173fe83.el7.centos
3 DC
3 Clusters
406 Hosts
Hera : 6 Hosts
leopard : 2 Hosts
UCS : 1
Nested hosts : 400 VMS
3  SD

1700 VMS up (from 1800)

Hera : 1005 VMS
leopard : 515 VMS
UCS : 180


Scenario matrix

Test steps run against 1700 VMs and 400 nested hosts:

From a UI perspective, all response times were very reasonable, and everything responded much better.

Tested with Chrome and Firefox; didn't notice any latency issues from a UX/UI standpoint.

Relevant scenarios performed:

Sent maintenance to 10 hosts	0:00:08
Reboot 10 nested hosts	0:05:21
Reboot 50 nested hosts	0:06:00
Reboot 80 nested hosts	0:10:14
Reboot 100 nested hosts	0:11:15
Engine restart	0:01:08
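
To relate these timings to the earlier question of how many hosts are handled per minute, the reboot rows can be converted to a rough throughput figure. This is a minimal sketch, not part of the original test run: the RESULTS list is copied by hand from the timings above, and the helper names (parse_elapsed, hosts_per_minute, summarize) are hypothetical.

```python
from datetime import timedelta

# Timings copied by hand from the reboot rows above:
# (scenario, number of hosts, elapsed time as h:mm:ss).
RESULTS = [
    ("Reboot 10 nested hosts", 10, "0:05:21"),
    ("Reboot 50 nested hosts", 50, "0:06:00"),
    ("Reboot 80 nested hosts", 80, "0:10:14"),
    ("Reboot 100 nested hosts", 100, "0:11:15"),
]


def parse_elapsed(text):
    """Parse an 'h:mm:ss' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def hosts_per_minute(hosts, elapsed):
    """Average number of hosts completed per minute of elapsed time."""
    return hosts / (elapsed.total_seconds() / 60.0)


def summarize(results=RESULTS):
    """Return (scenario, hosts/min rounded to 2 decimals) for each row."""
    return [
        (name, round(hosts_per_minute(count, parse_elapsed(elapsed)), 2))
        for name, count, elapsed in results
    ]


if __name__ == "__main__":
    for name, rate in summarize():
        print(f"{name}: {rate} hosts/min")
```

Elapsed time does not grow in proportion to the host count (10 hosts take 0:05:21, 100 hosts take 0:11:15), so the per-minute rate is higher for the larger batches than for the 10-host run.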