Bug 1243938

Summary:

[Scale] - Inventory of 10k VM provider: 90 minutes spent between "Updating Folders To Vms relationships" and "Updating Clusters To Resource Pools relationships"

Product:

Red Hat CloudForms Management Engine

Reporter:

Alex Krzos <akrzos>

Component:

Performance

Assignee:

dmetzger

Status:

CLOSED ERRATA

QA Contact:

Taras Lehinevych <tlehinev>

Severity:

medium

Priority:

high

Version:

5.4.0

CC:

apatters, dajohnso, jhardy, obarenbo, perfbz, tlehinev

Target Release:

5.5.0

Hardware:

Unspecified

OS:

Unspecified

Fixed In Version:

5.5.0.8

Doc Type:

Bug Fix

Last Closed:

2015-12-08 13:22:46 UTC

Type:

Bug

Attachments:

Description	Flags
evm.log from refresh worker while refreshing large scale vmware provider	none
Appliance CPU Usage	none
top_output log from appliance during refresh	none

Description Alex Krzos 2015-07-16 15:44:11 UTC

Description of problem:
Initial refresh of a large provider with 10,000 virtual machines hangs for 90 minutes between these two log lines:

[----] I, [2015-07-16T09:36:56.898159 #32890:9b9eac]  INFO -- : MIQ(EmsRefresh.update_relats) Updating Folders To Vms relationships.
[----] I, [2015-07-16T11:07:44.545745 #32890:9b9eac]  INFO -- : MIQ(EmsRefresh.update_relats) Updating Clusters To Resource Pools relationships.
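
The gap between the two log lines above can be measured directly from their timestamps. The following is a minimal sketch, assuming the line layout seen in these two samples; `LOG_LINE` and `parse_line` are hypothetical names introduced for illustration and are not part of the product. The log carries no time zone, so the timestamps are treated as UTC here, which does not affect the difference.

```ruby
require "time"

# Assumed layout, inferred from the two quoted lines:
#   [----] I, [<timestamp> #<pid>:<thread id>]  INFO -- : <message>
LOG_LINE = /\[(?<ts>\d{4}-\d{2}-\d{2}T[\d:.]+) #(?<pid>\d+):(?<tid>\h+)\]\s+(?<level>\w+) -- : (?<msg>.*)/

# Hypothetical helper: returns nil for lines that do not match the layout.
def parse_line(line)
  m = LOG_LINE.match(line)
  return nil unless m

  # Appending "Z" is an assumption (UTC); only the difference is used below.
  { time: Time.iso8601("#{m[:ts]}Z"), pid: m[:pid].to_i, message: m[:msg] }
end

lines = [
  "[----] I, [2015-07-16T09:36:56.898159 #32890:9b9eac]  INFO -- : MIQ(EmsRefresh.update_relats) Updating Folders To Vms relationships.",
  "[----] I, [2015-07-16T11:07:44.545745 #32890:9b9eac]  INFO -- : MIQ(EmsRefresh.update_relats) Updating Clusters To Resource Pools relationships."
]

first, second = lines.map { |l| parse_line(l) }
gap_seconds = second[:time] - first[:time]

puts format("pid %d: %.1f s (%.1f min) between the two messages",
            first[:pid], gap_seconds, gap_seconds / 60)
```

For these two lines the gap comes out to roughly 5447.6 seconds, or just under 91 minutes.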


Version-Release number of selected component (if applicable):
5.4.0.5

How reproducible:
Always with large provider

Steps to Reproduce:
1. Add the provider and run: tailf evm.log | grep "#(pid of refresh worker)"
2. Watch the process sit on the first log line above for 90 minutes

Actual results:
The inventory refresh hangs at this point. A larger provider will most likely time out the message, since the refresh timeout is 7200 s.

A timeout was actually witnessed on an appliance refreshing two large providers concurrently.

Expected results:
Refresh and inventory should complete much faster.


Additional info:

Timings per refresh worker:
Refreshing targets for EMS...Complete - Timings: {:get_ems_data=>23.835465669631958, :get_vc_data=>72.03464484214783, :get_vc_data_ems_customization_specs=>0.028870820999145508, :filter_vc_data=>0.0004057884216308594, :get_vc_data_host_scsi=>17.657812118530273, :get_vc_data_total=>113.55852603912354, :parse_vc_data=>131.7532970905304, :db_save_inventory=>6145.719205617905, :post_refresh_ems=>104.97212195396423, :total_time=>6496.003431558609}

The time spent between:
Updating Folders To Vms relationships.
Updating Clusters To Resource Pools relationships.
is 89% of the total time spent in db_save_inventory and 83.9% of the total time spent during this refresh. Optimizing this step should yield large gains.
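
The percentages above can be reproduced from the timings hash and the two log timestamps in the description. This is a small sketch of the arithmetic only; the timing values are rounded copies of the figures in this report, and `percent` is a hypothetical helper name.

```ruby
# Timings copied from the refresh worker line in this report (seconds, rounded).
timings = {
  get_vc_data_total: 113.56,
  parse_vc_data:     131.75,
  db_save_inventory: 6145.72,
  post_refresh_ems:  104.97,
  total_time:        6496.00
}

# Gap between the two update_relats timestamps in the description:
# 11:07:44.545745 - 09:36:56.898159 = 1 h 30 min 47.647586 s
relats_gap = 1 * 3600 + 30 * 60 + 47.647586

# Hypothetical helper: part as a percentage of whole, one decimal place.
def percent(part, whole)
  (100.0 * part / whole).round(1)
end

refresh_timeout = 7200 # seconds, as stated under "Actual results"

puts "share of db_save_inventory: #{percent(relats_gap, timings[:db_save_inventory])}%"
puts "share of total_time:        #{percent(relats_gap, timings[:total_time])}%"
puts "headroom before timeout:    #{(refresh_timeout - timings[:total_time]).round} s"
```

The shares work out to about 88.6% and 83.9%, and the refresh finished with roughly 700 seconds to spare before the 7200 s timeout, which is consistent with the concern that a larger provider would time out.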

Comment 1 Alex Krzos 2015-07-16 15:50:14 UTC

Created attachment 1052742 [details]
evm.log from refresh worker while refreshing large scale vmware provider

Comment 3 Alex Krzos 2015-07-16 16:00:33 UTC

Created attachment 1052743 [details]
Appliance CPU Usage

Shows one vCPU pegged with nice utilization during refresh.

Comment 4 Alex Krzos 2015-07-16 17:38:37 UTC

Created attachment 1052784 [details]
top_output log from appliance during refresh

Comment 5 Jason Frey 2015-07-24 17:28:06 UTC

Since almost all of the time is spent in the database, my guess is that we will have to do the refresh in a totally different way to make this scalable. Something like a skeletal refresh might be the way to go.

Comment 6 dmetzger 2015-08-27 13:01:30 UTC

Modified EmsRefresh::LinkInventory to pass the array of object ids to be added / linked to set_child (the low-level method doing the work) in a single call, rather than making one call per object id.
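
The shape of that change can be illustrated with a toy model. The sketch below is not the product code: `FakeParent` is a hypothetical stand-in that only counts how often the low-level method is entered, and the real `set_child` signature is not shown in this report, so accepting either one id or an array is an assumption.

```ruby
# Hypothetical stand-in for the object that owns the relationships.
# It records children in memory and counts calls; no database is involved.
class FakeParent
  attr_reader :children, :calls

  def initialize
    @children = []
    @calls = 0
  end

  # Assumed behaviour: accepts a single id or an array of ids.
  def set_child(ids)
    @calls += 1
    @children.concat(Array(ids))
  end
end

vm_ids = (1..10_000).to_a

# Before: one call per object id.
per_id = FakeParent.new
vm_ids.each { |id| per_id.set_child(id) }

# After: a single call with the whole array.
batched = FakeParent.new
batched.set_child(vm_ids)

puts "per-id calls:  #{per_id.calls}"
puts "batched calls: #{batched.calls}"
puts "same result:   #{per_id.children == batched.children}"
```

In the toy model both paths end with the same children, but the per-id path enters the low-level method 10,000 times. If each entry carries fixed overhead such as queries or relationship lookups, that overhead is paid once instead of once per VM.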

Comment 7 Taras Lehinevych 2015-11-13 14:07:37 UTC

Verified fixed in version 5.5.0.10

Changes:
1. Appliance memory bumped from 6 GiB to 10 GiB
2. Memory threshold on:
3. Connection Broker - 4 GB
4. Refresh - 6 GB
5. ems_refresh_core_worker: 1.gigabytes

Added the provider; all VMs appeared after 85 minutes. However, we are interested in how long an EmsRefresh.refresh takes on a large-scale provider. The time spent between "Updating Folders To Vms relationships" and "Updating Clusters To Resource Pools relationships" was significantly reduced. On rhos6 it took under 2 minutes (about 71 seconds per the log lines below):

[----] I, [2015-11-12T11:50:59.754105 #14570:b3798c]  INFO -- : MIQ(EmsRefresh.update_relats) Updating Folders To Vms relationships.
[----] I, [2015-11-12T11:52:11.051469 #14570:b3798c]  INFO -- : MIQ(EmsRefresh.update_relats) Updating Clusters To Resource Pools relationships.

The whole refresh process took around 4 minutes.

Comment 9 errata-xmlrpc 2015-12-08 13:22:46 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2551