Bug 1243938 - [Scale] - Inventory of 10k vm provider, 90 minutes spent between Updating Folders To Vms relationships and Updating Clusters To Resource Pools relationships
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Performance
Version: 5.4.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: GA
Target Release: 5.5.0
Assignee: dmetzger
QA Contact: Taras Lehinevych
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-07-16 15:44 UTC by Alex Krzos
Modified: 2019-10-10 09:58 UTC (History)
6 users (show)

Fixed In Version: 5.5.0.8
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-12-08 13:22:46 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:


Attachments
evm.log from refresh worker while refreshing large scale vmware provider (557.52 KB, application/zip)
2015-07-16 15:50 UTC, Alex Krzos
Appliance CPU Usage (78.13 KB, image/png)
2015-07-16 16:00 UTC, Alex Krzos
top_output log from appliance during refresh (159.23 KB, application/zip)
2015-07-16 17:38 UTC, Alex Krzos


Links
System: Red Hat Product Errata
ID: RHSA-2015:2551
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: CFME 5.5.0 bug fixes and enhancement update
Last Updated: 2015-12-08 17:58:09 UTC

Description Alex Krzos 2015-07-16 15:44:11 UTC
Description of problem:
Initial refresh of a large 10,000 virtual machine provider hangs for 90 minutes between:

[----] I, [2015-07-16T09:36:56.898159 #32890:9b9eac]  INFO -- : MIQ(EmsRefresh.update_relats) Updating Folders To Vms relationships.
[----] I, [2015-07-16T11:07:44.545745 #32890:9b9eac]  INFO -- : MIQ(EmsRefresh.update_relats) Updating Clusters To Resource Pools relationships.


Version-Release number of selected component (if applicable):
5.4.0.5

How reproducible:
Always with a large provider

Steps to Reproduce:
1. Add the provider and run tailf evm.log | grep "#(pid of refresh worker)"
2. Observe the refresh worker process hang on the above log line for 90 minutes

Actual results:
Inventory hangs at this point in time.  A larger provider will most likely cause the refresh message to time out, since the refresh timeout is 7200s.

A timeout was actually witnessed on an appliance refreshing two large providers concurrently.

Expected results:
Refresh and inventory should complete much faster.


Additional info:

Timings per refresh worker:
Refreshing targets for EMS...Complete - Timings: {:get_ems_data=>23.835465669631958, :get_vc_data=>72.03464484214783, :get_vc_data_ems_customization_specs=>0.028870820999145508, :filter_vc_data=>0.0004057884216308594, :get_vc_data_host_scsi=>17.657812118530273, :get_vc_data_total=>113.55852603912354, :parse_vc_data=>131.7532970905304, :db_save_inventory=>6145.719205617905, :post_refresh_ems=>104.97212195396423, :total_time=>6496.003431558609}

The time spent between "Updating Folders To Vms relationships" and "Updating Clusters To Resource Pools relationships" is 89% of the total time spent in db_save_inventory and 83.9% of the total time spent during this refresh.  Optimizing what occurs here will provide tremendous gains.
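
For reference, a quick sanity check of those percentages from the two timestamps and the Timings hash above (illustrative Ruby only, not appliance code; all values are copied from this report):

  require 'time'

  # Gap between the two update_relats log lines quoted in the description.
  gap = Time.parse("2015-07-16T11:07:44.545745") -
        Time.parse("2015-07-16T09:36:56.898159")    # => ~5447.6 seconds

  db_save_inventory = 6145.719205617905
  total_time        = 6496.003431558609

  (gap / db_save_inventory * 100).round(1)           # => 88.6  (~89% of db_save_inventory)
  (gap / total_time * 100).round(1)                  # => 83.9  (83.9% of the whole refresh)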

Comment 1 Alex Krzos 2015-07-16 15:50:14 UTC
Created attachment 1052742 [details]
evm.log from refresh worker while refreshing large scale vmware provider

Comment 3 Alex Krzos 2015-07-16 16:00:33 UTC
Created attachment 1052743 [details]
Appliance CPU Usage

Shows one vCPU pegged with nice utilization during refresh.

Comment 4 Alex Krzos 2015-07-16 17:38:37 UTC
Created attachment 1052784 [details]
top_output log from appliance during refresh

Comment 5 Jason Frey 2015-07-24 17:28:06 UTC
Since almost all of the time is spent in the database, my guess is that we are going to have to do refresh in a totally different way to make this scalable... something like skeletal refresh might be the way we have to go.

Comment 6 dmetzger 2015-08-27 13:01:30 UTC
Modified EmsRefresh::LinkInventory to pass the array of object ids to be added/linked to set_child (the low-level method doing the work) rather than making one call per object id.
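
To illustrate the shape of that change (a rough sketch only; parent and child_ids are placeholders, and the real EmsRefresh::LinkInventory / set_child signatures differ):

  # Before: one set_child call, with its associated queries, per object id.
  child_ids.each { |child_id| parent.set_child(child_id) }

  # After (per comment 6): the whole array of ids is handed to the
  # low-level method in one call so the relationships can be linked in bulk.
  parent.set_child(child_ids)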

Comment 7 Taras Lehinevych 2015-11-13 14:07:37 UTC
Verified fixed in version 5.5.0.10

Change:
1. Appliance memory bumped from 6 GiB to 10 GiB
2. Memory thresholds:
   - Connection Broker - 4 GB
   - Refresh - 6 GB
   - ems_refresh_core_worker: 1.gigabytes

Added the provider; all VMs appeared after 85 minutes. However, we are interested in how long an EmsRefresh.refresh takes on a large-scale provider. The time spent between "Updating Folders To Vms relationships" and "Updating Clusters To Resource Pools relationships" was significantly reduced; on rhos6 it took around 2 minutes:

[----] I, [2015-11-12T11:50:59.754105 #14570:b3798c]  INFO -- : MIQ(EmsRefresh.update_relats) Updating Folders To Vms relationships.
[----] I, [2015-11-12T11:52:11.051469 #14570:b3798c]  INFO -- : MIQ(EmsRefresh.update_relats) Updating Clusters To Resource Pools relationships.

The whole refresh process took around 4 minutes.

Comment 9 errata-xmlrpc 2015-12-08 13:22:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2551

