Bug 1243938

Summary: [Scale] - Inventory of 10k VM provider, 90 minutes spent between "Updating Folders To Vms relationships" and "Updating Clusters To Resource Pools relationships"
Product: Red Hat CloudForms Management Engine
Reporter: Alex Krzos <akrzos>
Component: Performance
Assignee: dmetzger
Status: CLOSED ERRATA
QA Contact: Taras Lehinevych <tlehinev>
Severity: medium
Docs Contact:
Priority: high
Version: 5.4.0
CC: apatters, dajohnso, jhardy, obarenbo, perfbz, tlehinev
Target Milestone: GA
Target Release: 5.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: 5.5.0.8
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-12-08 13:22:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description: evm.log from refresh worker while refreshing large scale vmware provider; Flags: none
Description: Appliance CPU Usage; Flags: none
Description: top_output log from appliance during refresh; Flags: none

Description Alex Krzos 2015-07-16 15:44:11 UTC
Description of problem:
The initial refresh of a large provider (10,000 virtual machines) stalls for about 90 minutes between these two log lines:

[----] I, [2015-07-16T09:36:56.898159 #32890:9b9eac]  INFO -- : MIQ(EmsRefresh.update_relats) Updating Folders To Vms relationships.
[----] I, [2015-07-16T11:07:44.545745 #32890:9b9eac]  INFO -- : MIQ(EmsRefresh.update_relats) Updating Clusters To Resource Pools relationships.


Version-Release number of selected component (if applicable):
5.4.0.5

How reproducible:
Always with large provider

Steps to Reproduce:
1. Add provider and tailf evm.log | grep "#(pid of refresh worker)"
2. Witness process hanging on above log line for 90 minutes
3.
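
The stall can also be measured from a saved copy of evm.log instead of watching it live. The following is a minimal sketch, assuming the log line layout shown in this report; `LINE_RE` and `gap_between` are hypothetical helper names, and the timestamps are treated as UTC only so that the difference is independent of the local time zone.

```ruby
require "time"

# Matches the evm.log layout seen in this report:
# [----] I, [<timestamp> #<pid>:<thread id>]  INFO -- : <message>
LINE_RE = /\[(?<ts>\d{4}-\d{2}-\d{2}T[\d:.]+) #(?<pid>\d+):\h+\]\s+\w+ -- : (?<msg>.*)/

# Returns the seconds between the first line containing start_marker and the
# first line containing end_marker for the given worker pid, or nil if either
# marker is missing.
def gap_between(lines, pid, start_marker, end_marker)
  stamps = {}
  lines.each do |line|
    m = LINE_RE.match(line)
    next unless m && m[:pid] == pid.to_s
    stamps[:start] ||= m[:ts] if m[:msg].include?(start_marker)
    stamps[:end]   ||= m[:ts] if m[:msg].include?(end_marker)
  end
  return nil unless stamps[:start] && stamps[:end]
  # Appending "Z" is an assumption; only the difference matters here.
  Time.iso8601("#{stamps[:end]}Z") - Time.iso8601("#{stamps[:start]}Z")
end

# The two log lines quoted in the description.
sample = [
  '[----] I, [2015-07-16T09:36:56.898159 #32890:9b9eac]  INFO -- : MIQ(EmsRefresh.update_relats) Updating Folders To Vms relationships.',
  '[----] I, [2015-07-16T11:07:44.545745 #32890:9b9eac]  INFO -- : MIQ(EmsRefresh.update_relats) Updating Clusters To Resource Pools relationships.'
]

gap = gap_between(sample, 32890,
                  "Updating Folders To Vms",
                  "Updating Clusters To Resource Pools")
puts format("%.1f minutes", gap / 60.0)
```

For a real log, `File.foreach("evm.log")` could be passed in place of `sample`.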

Actual results:
Inventory hangs at this point.  A larger provider will most likely time out the refresh message, since the refresh timeout is 7200 s.

A timeout was actually witnessed on an appliance refreshing two large providers concurrently.

Expected results:
Refresh and inventory should complete much faster.


Additional info:

Timings per refresh worker:
Refreshing targets for EMS...Complete - Timings: {:get_ems_data=>23.835465669631958, :get_vc_data=>72.03464484214783, :get_vc_data_ems_customization_specs=>0.028870820999145508, :filter_vc_data=>0.0004057884216308594, :get_vc_data_host_scsi=>17.657812118530273, :get_vc_data_total=>113.55852603912354, :parse_vc_data=>131.7532970905304, :db_save_inventory=>6145.719205617905, :post_refresh_ems=>104.97212195396423, :total_time=>6496.003431558609}

The time spent between:
Updating Folders To Vms relationships.
Updating Clusters To Resource Pools relationships.
is 89% of the total time spent in db_save_inventory and 83.9% of the total time spent during this refresh.  Optimizing this step should therefore yield the largest gains.
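
The percentages above can be reproduced from the two log timestamps and the Timings hash. This is a small worked calculation, not part of the product; the variable names are illustrative, and the timestamps are treated as UTC only so that the subtraction does not depend on the local time zone.

```ruby
require "time"

# Timestamps copied from the two log lines in the description.
folders_to_vms  = Time.iso8601("2015-07-16T09:36:56.898159Z")
clusters_to_rps = Time.iso8601("2015-07-16T11:07:44.545745Z")

# Values copied from the Timings hash above.
db_save_inventory = 6145.719205617905
total_time        = 6496.003431558609

# Time#- returns the difference in seconds as a Float.
gap_seconds    = clusters_to_rps - folders_to_vms
pct_of_db_save = (gap_seconds / db_save_inventory * 100).round(1)
pct_of_total   = (gap_seconds / total_time * 100).round(1)

puts format("gap: %.1f s", gap_seconds)
puts "share of db_save_inventory: #{pct_of_db_save}%"
puts "share of total_time: #{pct_of_total}%"
```

The gap works out to roughly 5448 seconds, which is about 88.6% of db_save_inventory and 83.9% of total_time.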

Comment 1 Alex Krzos 2015-07-16 15:50:14 UTC
Created attachment 1052742 [details]
evm.log from refresh worker while refreshing large scale vmware provider

Comment 3 Alex Krzos 2015-07-16 16:00:33 UTC
Created attachment 1052743 [details]
Appliance CPU Usage

Shows one vCPU pegged with nice utilization during refresh.

Comment 4 Alex Krzos 2015-07-16 17:38:37 UTC
Created attachment 1052784 [details]
top_output log from appliance during refresh

Comment 5 Jason Frey 2015-07-24 17:28:06 UTC
Since almost all of the time is spent in the database, my guess is that we are going to have to do a refresh in a totally different way to make this scalable. Something like a skeletal refresh might be the way we have to go.

Comment 6 dmetzger 2015-08-27 13:01:30 UTC
Modified EmsRefresh::LinkInventory to pass the array of object IDs to be added / linked to set_child (the low-level method doing the work) in a single call, rather than making one call per object ID.
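
The effect of that change can be illustrated with a toy model. This is a sketch only: `FakeParent` and its call counter are hypothetical, and the real set_child may have a different signature and does actual relationship and database work. The point is just that a per-call cost is paid once per call, so passing 10,000 IDs in one call pays it once instead of 10,000 times.

```ruby
# Hypothetical stand-in for a parent record; NOT the product's implementation.
class FakeParent
  attr_reader :children, :write_calls

  def initialize
    @children = []
    @write_calls = 0
  end

  # Accepts a single id or an array of ids. Each invocation is counted as one
  # round of relationship bookkeeping (for example, one set of DB statements).
  def set_child(ids)
    @write_calls += 1
    @children.concat(Array(ids))
  end
end

vm_ids = (1..10_000).to_a

# Before: one call per object id.
per_id = FakeParent.new
vm_ids.each { |id| per_id.set_child(id) }

# After: the whole array in a single call.
batched = FakeParent.new
batched.set_child(vm_ids)

puts "per-id calls:  #{per_id.write_calls}"
puts "batched calls: #{batched.write_calls}"
```

Both variants end with the same children linked; only the number of calls differs.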

Comment 7 Taras Lehinevych 2015-11-13 14:07:37 UTC
Verified fixed in version 5.5.0.10

Changes:
1. Appliance memory bumped from 6 GiB to 10 GiB
2. Memory thresholds:
   - Connection Broker: 4 GB
   - Refresh: 6 GB
   - ems_refresh_core_worker: 1.gigabytes

Added the provider; all VMs appeared after 85 minutes. However, what we are interested in is how long an EmsRefresh.refresh takes on a large scale provider. The time spent between "Updating Folders To Vms relationships" and "Updating Clusters To Resource Pools relationships" was significantly reduced. On rhos6 it took under 2 minutes (about 71 seconds between the two log lines below):

[----] I, [2015-11-12T11:50:59.754105 #14570:b3798c]  INFO -- : MIQ(EmsRefresh.update_relats) Updating Folders To Vms relationships.
[----] I, [2015-11-12T11:52:11.051469 #14570:b3798c]  INFO -- : MIQ(EmsRefresh.update_relats) Updating Clusters To Resource Pools relationships.

The whole refresh process took around 4 minutes.

Comment 9 errata-xmlrpc 2015-12-08 13:22:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2551