Description of problem:
Initial refresh of a large (10,000 virtual machine) provider hangs for 90 minutes between:

[----] I, [2015-07-16T09:36:56.898159 #32890:9b9eac] INFO -- : MIQ(EmsRefresh.update_relats) Updating Folders To Vms relationships.
[----] I, [2015-07-16T11:07:44.545745 #32890:9b9eac] INFO -- : MIQ(EmsRefresh.update_relats) Updating Clusters To Resource Pools relationships.

Version-Release number of selected component (if applicable):
5.4.0.5

How reproducible:
Always with a large provider

Steps to Reproduce:
1. Add the provider and tailf evm.log | grep "#(pid of refresh worker)"
2. Watch the process hang on the above log line for 90 minutes

Actual results:
Inventory hangs at this point. A larger provider will most likely time out the message, since the refresh timeout is 7200 s. A timeout was actually witnessed on an appliance refreshing two large providers concurrently.

Expected results:
Refresh and inventory should complete much faster.

Additional info:
Timings per refresh worker:

Refreshing targets for EMS...Complete - Timings: {:get_ems_data=>23.835465669631958, :get_vc_data=>72.03464484214783, :get_vc_data_ems_customization_specs=>0.028870820999145508, :filter_vc_data=>0.0004057884216308594, :get_vc_data_host_scsi=>17.657812118530273, :get_vc_data_total=>113.55852603912354, :parse_vc_data=>131.7532970905304, :db_save_inventory=>6145.719205617905, :post_refresh_ems=>104.97212195396423, :total_time=>6496.003431558609}

The time spent between:
  Updating Folders To Vms relationships.
  Updating Clusters To Resource Pools relationships.
is 89% of the time spent in db_save_inventory and 83.9% of the total time spent during this refresh. Optimizing what happens here will provide tremendous gains.
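The percentages quoted above can be reproduced from the timings hash and the two log timestamps; a minimal sketch (only :db_save_inventory and :total_time from the hash are needed):

```ruby
require "time"

# Timings reported by the refresh worker, in seconds (from the log line above)
timings = {
  :db_save_inventory => 6145.719205617905,
  :total_time        => 6496.003431558609,
}

# The window between the two "Updating ... relationships" log lines
window = Time.parse("2015-07-16T11:07:44.545745") -
         Time.parse("2015-07-16T09:36:56.898159")

puts "window: #{window.round} s (~#{(window / 60).round} min)"
puts "share of db_save_inventory: #{(window / timings[:db_save_inventory] * 100).round(1)}%"
puts "share of total refresh:     #{(window / timings[:total_time] * 100).round(1)}%"
```

This gives a ~5448 s window, roughly 89% of db_save_inventory and 83.9% of the total refresh, matching the figures in the description.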
Created attachment 1052742 [details] evm.log from refresh worker while refreshing large scale vmware provider
Created attachment 1052743 [details] Appliance CPU usage
Shows one vCPU pegged with nice utilization during refresh.
Created attachment 1052784 [details] top_output log from appliance during refresh
Since almost all of the time is spent in the database, my guess is that we are going to have to do refresh in a totally different way to make this scalable; something like skeletal refresh may be the way we have to go.
Modified EmsRefresh::LinkInventory to pass the array of object ids to be added/linked to set_child (the low-level method doing the work), rather than making one call per object id.
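The shape of the change can be sketched as follows; Parent and this set_child signature are simplified stand-ins for illustration, not the actual EmsRefresh::LinkInventory internals:

```ruby
# Stand-in for the parent side of a relationship (e.g. a Folder or Cluster).
class Parent
  attr_reader :children

  def initialize
    @children = []
  end

  # Accepts a single object id or a whole array of ids, so a caller can
  # link an entire batch in one call instead of N calls.
  def set_child(ids)
    @children |= Array(ids)  # union: appends new ids, skips duplicates
  end
end

parent = Parent.new
parent.set_child([101, 102, 103])  # one call links the whole batch
parent.set_child(102)              # single ids still work; duplicate is skipped
```

The win is that the per-call overhead (lookups, queries, bookkeeping) is paid once per batch rather than once per virtual machine, which matters at 10,000 VMs.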
Verified fixed in version 5.5.0.10.

Changes:
1. Appliance memory bump from 6 GiB to 10 GiB
2. Memory thresholds:
   - Connection Broker: 4 GB
   - Refresh: 6 GB
   - ems_refresh_core_worker: 1.gigabytes

Added the provider; all VMs appeared after 85 minutes, but what we are interested in is how long an EmsRefresh.refresh takes on a large-scale provider. The time spent between "Updating Folders To Vms relationships" and "Updating Clusters To Resource Pools relationships" was significantly reduced. On rhos6 it took just over a minute:

[----] I, [2015-11-12T11:50:59.754105 #14570:b3798c] INFO -- : MIQ(EmsRefresh.update_relats) Updating Folders To Vms relationships.
[----] I, [2015-11-12T11:52:11.051469 #14570:b3798c] INFO -- : MIQ(EmsRefresh.update_relats) Updating Clusters To Resource Pools relationships.

The whole refresh process took around 4 minutes.
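The improvement in that window can be computed directly from the two pairs of log timestamps quoted in this report:

```ruby
require "time"

# Duration between two log timestamps, in seconds
def window(t1, t2)
  Time.parse(t2) - Time.parse(t1)
end

# Before the fix (comment 0) and after the fix (this verification)
before = window("2015-07-16T09:36:56.898159", "2015-07-16T11:07:44.545745")
after  = window("2015-11-12T11:50:59.754105", "2015-11-12T11:52:11.051469")

puts "before fix: #{before.round} s"
puts "after fix:  #{after.round} s"
puts "speedup:    ~#{(before / after).round}x"
```

Roughly 5448 s down to 71 s for this step, about a 76x improvement (on different appliance sizing, so the numbers are indicative rather than a controlled comparison).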
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2015:2551