Description of problem:
Initial refresh of a large (10,000 virtual machine) provider hangs for 90 minutes between:

[----] I, [2015-07-16T09:36:56.898159 #32890:9b9eac] INFO -- : MIQ(EmsRefresh.update_relats) Updating Folders To Vms relationships.
[----] I, [2015-07-16T11:07:44.545745 #32890:9b9eac] INFO -- : MIQ(EmsRefresh.update_relats) Updating Clusters To Resource Pools relationships.

Version-Release number of selected component (if applicable):
5.4.0.5

How reproducible:
Always with a large provider

Steps to Reproduce:
1. Add the provider and tailf evm.log | grep "#(pid of refresh worker)"
2. Watch the process hang on the above log line for 90 minutes

Actual results:
Inventory hangs at this point. A larger provider will most likely time out the message, since the refresh timeout is 7200 s. A timeout was actually witnessed on an appliance refreshing two large providers concurrently.

Expected results:
Refresh and inventory should complete much faster.

Additional info:
Timings per refresh worker:

Refreshing targets for EMS...Complete - Timings: {:get_ems_data=>23.835465669631958, :get_vc_data=>72.03464484214783, :get_vc_data_ems_customization_specs=>0.028870820999145508, :filter_vc_data=>0.0004057884216308594, :get_vc_data_host_scsi=>17.657812118530273, :get_vc_data_total=>113.55852603912354, :parse_vc_data=>131.7532970905304, :db_save_inventory=>6145.719205617905, :post_refresh_ems=>104.97212195396423, :total_time=>6496.003431558609}

The time spent between:
  Updating Folders To Vms relationships.
  Updating Clusters To Resource Pools relationships.
is 89% of the time spent in db_save_inventory and 83.9% of the total time spent during this refresh. Optimizing what happens here will provide tremendous gains.
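The percentages quoted above can be reproduced from the timings hash and the two log timestamps; a minimal sketch (only :db_save_inventory and :total_time from the hash are needed):

```ruby
require "time"

# Timings reported by the refresh worker, in seconds (from the log line above)
timings = {
  :db_save_inventory => 6145.719205617905,
  :total_time        => 6496.003431558609,
}

# The window between the two "Updating ... relationships" log lines
window = Time.parse("2015-07-16T11:07:44.545745") -
         Time.parse("2015-07-16T09:36:56.898159")

puts "window: #{window.round} s (~#{(window / 60).round} min)"
puts "share of db_save_inventory: #{(window / timings[:db_save_inventory] * 100).round(1)}%"
puts "share of total refresh:     #{(window / timings[:total_time] * 100).round(1)}%"
```

This gives a ~5448 s window, roughly 89% of db_save_inventory and 83.9% of the total refresh, matching the figures in the description.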
Created attachment 1052742 [details] evm.log from refresh worker while refreshing large scale vmware provider
Created attachment 1052743 [details] Appliance CPU usage
Shows one vCPU pegged with nice utilization during refresh.
Created attachment 1052784 [details] top_output log from appliance during refresh
Since almost all of the time is spent in the database, my guess is that we are going to have to do refresh in a totally different way to make this scalable; something like skeletal refresh may be the way we have to go.
Modified EmsRefresh::LinkInventory to pass the array of object ids to be added/linked to set_child (the low-level method doing the work), rather than making one call per object id.
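The shape of the change can be sketched as follows; Parent and this set_child signature are simplified stand-ins for illustration, not the actual EmsRefresh::LinkInventory internals:

```ruby
# Stand-in for the parent side of a relationship (e.g. a Folder or Cluster).
class Parent
  attr_reader :children

  def initialize
    @children = []
  end

  # Accepts a single object id or a whole array of ids, so a caller can
  # link an entire batch in one call instead of N calls.
  def set_child(ids)
    @children |= Array(ids)  # union: appends new ids, skips duplicates
  end
end

parent = Parent.new
parent.set_child([101, 102, 103])  # one call links the whole batch
parent.set_child(102)              # single ids still work; duplicate is skipped
```

The win is that the per-call overhead (lookups, queries, bookkeeping) is paid once per batch rather than once per virtual machine, which matters at 10,000 VMs.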
Verified fixed in version 5.5.0.10.

Changes:
1. Appliance memory bump from 6 GiB to 10 GiB
2. Memory thresholds:
   - Connection Broker: 4 GB
   - Refresh: 6 GB
   - ems_refresh_core_worker: 1.gigabytes

Added the provider; all VMs appeared after 85 minutes, but what we are interested in is how long an EmsRefresh.refresh takes on a large-scale provider. The time spent between "Updating Folders To Vms relationships" and "Updating Clusters To Resource Pools relationships" was significantly reduced. On rhos6 it took just over a minute:

[----] I, [2015-11-12T11:50:59.754105 #14570:b3798c] INFO -- : MIQ(EmsRefresh.update_relats) Updating Folders To Vms relationships.
[----] I, [2015-11-12T11:52:11.051469 #14570:b3798c] INFO -- : MIQ(EmsRefresh.update_relats) Updating Clusters To Resource Pools relationships.

The whole refresh process took around 4 minutes.
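The improvement in that window can be computed directly from the two pairs of log timestamps quoted in this report:

```ruby
require "time"

# Duration between two log timestamps, in seconds
def window(t1, t2)
  Time.parse(t2) - Time.parse(t1)
end

# Before the fix (comment 0) and after the fix (this verification)
before = window("2015-07-16T09:36:56.898159", "2015-07-16T11:07:44.545745")
after  = window("2015-11-12T11:50:59.754105", "2015-11-12T11:52:11.051469")

puts "before fix: #{before.round} s"
puts "after fix:  #{after.round} s"
puts "speedup:    ~#{(before / after).round}x"
```

Roughly 5448 s down to 71 s for this step, about a 76x improvement (on different appliance sizing, so the numbers are indicative rather than a controlled comparison).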
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2015:2551