Bug 1445483

Summary: [starter][starter-us-east-1] cluster horizontal load elapsed time unexpectedly slow - monitoring tools/data for investigation needed
Product: OpenShift Online
Reporter: Mike Fiedler <mifiedle>
Component: Unknown
Assignee: Abhishek Gupta <abhgupta>
Status: NEW
QA Contact: Mike Fiedler <mifiedle>
Severity: medium
Priority: unspecified
Version: 3.x
CC: aos-bugs, jeder
Keywords: OnlineStarter
Hardware: x86_64
OS: Linux
Type: Bug

Description Mike Fiedler 2017-04-25 18:30:30 UTC
Description of problem:

Loading starter-us-east-1 with 1000 projects, 3000 deployments, and 4000 running pods, along with secrets, routes, services, etc., as defined here: https://github.com/openshift/svt/blob/master/openshift_scalability/config/pyconfigMasterVirtScalePause.yaml

Running this workload on starter-us-east-1, the load took 1h 38m to complete.
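For anyone re-running the measurement: the load is driven by the svt cluster-loader against the config above. A rough sketch follows (the entry point and flags vary across svt revisions, so treat this as pseudocode rather than an exact command line):

```
# Run from a host with a logged-in `oc` session and cluster-admin on the target cluster.
cd svt/openshift_scalability
time python cluster-loader.py -f config/pyconfigMasterVirtScalePause.yaml
```

Wrapping the run in `time` captures the elapsed load time being compared across clusters.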
 
Running on an internal OpenStack cluster with less powerful compute nodes takes 1h 12m. That cluster has comparable master/etcd/compute node sizes and also has ~300 nodes.

Running on a CNCF OpenStack cluster with equivalent or slightly more powerful compute nodes takes 52m. It has comparable node sizes but more nodes (~2,000).

Part of the issue is that the monitoring data needed for a deeper investigation is not available in this environment, whereas we have detailed data from the other environments for comparison. There are efforts underway to start collecting this data, but it is not in place yet.

For this particular issue, the master/etcd hosts looked under-utilized from a CPU and disk perspective: vmstat showed an average of 7% CPU busy (40 cores), and IOPS on the etcd storage volume peaked around 450, vs. a peak of 900 IOPS during comparable cluster load-ups.
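For reference, the "busy" figure vmstat reports is user plus system time, and the same number can be derived directly from /proc/stat (which vmstat itself reads). A minimal sketch, assuming a Linux host and a one-second sampling window (both arbitrary choices here):

```shell
# Approximate overall CPU busy % from two /proc/stat snapshots one second apart.
# First line of /proc/stat: "cpu user nice system idle ..." in clock ticks.
read -r _ u1 n1 s1 i1 rest < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 rest < /proc/stat
busy=$(( (u2 + n2 + s2) - (u1 + n1 + s1) ))   # user+nice+system delta
idle=$(( i2 - i1 ))                            # idle delta
echo "cpu busy: $(( 100 * busy / (busy + idle) ))%"
```

Sampling like this on the masters during a load-up would give the CPU side of the comparison; the IOPS side comes from the etcd storage volume's device statistics.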

The cluster load-up itself was successful; no functional issues were observed.

Version-Release number of selected component (if applicable):  3.5.5.9