Description of problem:
When Cloud Forms is inventorying a RHEVM provider it must query multiple API collections in order to retrieve a full inventory. This is fine for small-scale environments, but as environments scale up, the amount of time spent querying the API for detailed information grows considerably. If RHEVM offered a bulk request that could provide all of this information in fewer requests, we could significantly reduce the amount of time spent querying for data.

Version-Release number of selected component (if applicable):
RHEV 3.4
CFME 5.4
ManageIQ Master

How reproducible:
Always

Steps to Reproduce:
1. Create a large-scale RHEVM environment (I have a 3,000 simulated virtual machine environment to test against)
2. Add the provider to Cloud Forms and observe the amount of time spent querying the RHEVM API

Actual results:
11 minutes spent obtaining inventory, plus 3 minutes spent afterwards wrapping up the refresh on Cloud Forms. ssl_access_log/ssl_request_log show many requests against RHEVM, proportional to the number of VMs and other queried inventory components (>3,000 requests for 3,000 VMs).

Expected results:
Far fewer requests to obtain the entire datacenter's detailed inventory.

Additional info:
Additionally, there is a spike in CPU usage on the RHEVM machine as CFME/ManageIQ requests this data.
Created attachment 1073726 [details]
Network utilization from Cloud Forms perspective while obtaining inventory of RHEVM environments

Attached are three graphs showing the network utilization from a Cloud Forms appliance to 3 separate RHEVM environments. RHEVM environment sizes: top - 100 VMs, middle - 1,000 VMs, bottom - 3,000 VMs. Note the growth in the amount of time spent retrieving inventory as the environment size grows. A bulk transfer of the information would help alleviate the first portion of that growth as environments scale.
So, to understand what we are doing on the CloudForms side: when collecting inventory, we cannot request detailed information for an entity without creating a new request for that entity. Taking Hosts->Vms as an example: when we query for hosts, we can see the VMs and their ids, but cannot get detailed information for each VM in that initial query. Instead we need 1 request per VM, so 1,000 VMs means 1,000 requests. Even worse, since we also need the disks, snapshots, and nics on the VMs, each of those is a separate query as well, turning that into 4,000 requests for 1,000 VMs. To make this at all feasible, we create a pool of threads (1 per CPU, so an 8-CPU machine would have a pool of 8 threads) and farm the requests out to the pool to parallelize them. Without parallelization it takes far too long to collect all the inventory. Here is the inventory collection code, if you want an idea of how it works: https://github.com/ManageIQ/ovirt/blob/master/lib/ovirt/inventory.rb
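To illustrate the fan-out pattern described above, here is a minimal sketch of a fixed-size thread pool issuing one detail request per VM. This is not the actual collector code (see the linked inventory.rb for that); `fetch_vm_details` is a hypothetical stand-in for the per-VM API call and its sub-collection requests.

```ruby
require "etc"

# Hypothetical stand-in for the real API call: in the actual collector this
# would be an HTTPS GET for the VM, plus follow-up requests for its disks,
# snapshots, and nics.
def fetch_vm_details(vm_id)
  { id: vm_id, disks: [], snapshots: [], nics: [] }
end

# Farm one request per VM out to a pool of worker threads,
# one thread per CPU by default, as described in the comment above.
def collect_inventory(vm_ids, pool_size: Etc.nprocessors)
  work    = Queue.new
  vm_ids.each { |id| work << id }
  results = Queue.new

  workers = Array.new(pool_size) do
    Thread.new do
      loop do
        id = begin
          work.pop(true) # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break
        end
        results << fetch_vm_details(id)
      end
    end
  end

  workers.each(&:join)
  Array.new(results.size) { results.pop }
end
```

The point of the sketch is only the shape of the problem: the number of API round trips is proportional to the VM count, and the pool merely overlaps their latency rather than reducing the request count.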
Juan, while I don't think we've implemented this, I believe the issue is far less problematic now with pipelining, multiplexing, compression, and other improvements. Can we close this for the time being?
Yes, with those improvements the inventory collection will be much faster. See the following example script: https://github.com/oVirt/ovirt-engine-sdk-ruby/blob/master/sdk/examples/asynchronous_inventory.rb In an environment with approximately 4,000 virtual machines, 10,000 disks, and 150 ms of latency, that completes in less than 3 minutes, without reducing the number of requests or using multiple threads or workers. Note however that these changes haven't been applied to ManageIQ yet. We are also currently working on what we call "link following", which is a mechanism to retrieve related objects with a single request. We will, for example, be able to retrieve a virtual machine with its disks and NICs in only one request. That should further improve the inventory collection time. So I am closing the bug, as the work is tracked in other bugs.
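As a rough sketch of what "link following" looks like from the client side: recent oVirt engine API versions expose it as a `follow` query parameter, so a single request can return each VM together with the named sub-collections. The host name below is a placeholder, and the exact sub-collection names accepted by `follow` vary by API version, so treat this as an assumption rather than a reference.

```ruby
require "uri"

# Build a single "link following" inventory request: one GET to /vms that
# asks the engine to embed each VM's nics and disks in the response,
# instead of one extra request per VM per sub-collection.
# The base URL is a placeholder; the follow values are illustrative.
def inventory_url(base = "https://engine.example.com/ovirt-engine/api")
  uri = URI.parse("#{base}/vms")
  uri.query = URI.encode_www_form(follow: "nics,disk_attachments.disk")
  uri
end
```

Compared with the per-VM fan-out, this trades many small requests for one larger response, which is exactly the bulk-transfer behavior the original report asked for.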