Description of problem:
Running Capacity and Utilization benchmarks and comparing RHEVM vs VMware performance capture scheduling with exactly the same sized environments, scheduling all realtime captures on RHEVM takes more than 3x longer than on an environment consisting of VMware managed objects.

Environment sizes:
vmware - 3k vms, 1.5k online, 100 hosts
rhevm  - 3k vms, 1.5k online, 100 hosts

Version-Release number of selected component (if applicable):
5.5.0.1-alpha1 - I will present the same scenario to a 5.4 appliance to see if the issue is with the alpha

How reproducible:
on this build

Steps to Reproduce:
1.
2.
3.

Actual results:
provider, min,   avg,   median, max,   stddev, 90th,  99th
vmware,   36.4,  37.9,  37.7,   39.7,  1.3,    39.4,  39.7
rhevm,    131.4, 134.6, 134.1,  138.7, 2.8,    137.8, 138.7
*Compare 99th percentile timings for the benchmark.

Expected results:
My understanding is that there is no interaction with the provider on mere scheduling of captures, so I would expect scheduling 1,500 captures on VMs and 100 hosts to take the same time for each provider.

Additional info:
I can/will profile the benchmark as well to see if anything sticks out like a sore thumb consuming the additional time during scheduling.
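For reference, the statistics in the results table (min/avg/median/max/stddev/percentiles) can be derived from raw per-run timing samples with a small helper. This is a sketch, not the benchmark harness's actual code; the method name `summarize` and the nearest-rank percentile definition are my assumptions.

```ruby
# Hypothetical helper for summarizing benchmark timing samples.
# Uses the nearest-rank percentile and population stddev; the real
# harness may define these differently.
def summarize(samples)
  sorted = samples.sort
  n      = sorted.size
  avg    = sorted.sum / n.to_f
  stddev = Math.sqrt(sorted.sum { |s| (s - avg)**2 } / n.to_f)
  pct    = ->(p) { sorted[((p / 100.0) * n).ceil - 1] }
  {
    min: sorted.first, avg: avg.round(1), median: pct.call(50),
    max: sorted.last, stddev: stddev.round(1),
    p90: pct.call(90), p99: pct.call(99)
  }
end

summarize([36.4, 37.7, 37.9, 39.4, 39.7])
```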
Re-ran the benchmarks and captured evm.log while the benchmark was running. RHEVM appears to take far more time queuing Storage.perf_capture. Also noted is a slight difference in the number of "clusters" between the two providers; however, that does not appear to be the source of the large discrepancy between the two nearly identically sized providers. Attached are the evm.log files for the respective providers.
Created attachment 1076270 [details] evm.log files for respective providers
Measuring the memory and GC.count on the node handling RHEVM:

irb(main):002:0> mem_rss_start = MiqProcess.processInfo()[:memory_usage]
=> 122482688
irb(main):003:0> the_zone = Zone.find_by_name('default')
=> #<Zone id: 1, name: "default", description: "Default Zone", created_on: "2015-09-22 20:01:50", updated_on: "2015-09-22 20:01:50", settings: {}, log_file_depot_id: nil>
irb(main):004:0> Benchmark.realtime{Metric::Capture.perf_capture_timer(the_zone)}
=> 99.270481548
irb(main):005:0> mem_rss_end = MiqProcess.processInfo()[:memory_usage]
=> 227196928
irb(main):006:0> mem_rss_change = mem_rss_end - mem_rss_start
=> 104714240
irb(main):007:0> GC.stat
=> {:count=>493, :heap_used=>2912, :heap_length=>4177, :heap_increment=>1265, :heap_live_num=>890940, :heap_free_num=>426151, :heap_final_num=>0, :total_allocated_object=>84540091, :total_freed_object=>83649151}

VMware:

irb(main):002:0> mem_rss_start = MiqProcess.processInfo()[:memory_usage]
=> 122499072
irb(main):003:0> the_zone = Zone.find_by_name('default')
=> #<Zone id: 1, name: "default", description: "Default Zone", created_on: "2015-09-22 20:02:01", updated_on: "2015-09-22 20:02:01", settings: {}, log_file_depot_id: nil>
irb(main):004:0> Benchmark.realtime{Metric::Capture.perf_capture_timer(the_zone)}
=> 34.97297929
irb(main):005:0> mem_rss_end = MiqProcess.processInfo()[:memory_usage]
=> 209465344
irb(main):006:0> mem_rss_change = mem_rss_end - mem_rss_start
=> 86966272
irb(main):007:0> GC.stat
=> {:count=>188, :heap_used=>2396, :heap_length=>4311, :heap_increment=>1915, :heap_live_num=>899762, :heap_free_num=>116896, :heap_final_num=>0, :total_allocated_object=>17272673, :total_freed_object=>16372911}

For VMware I see far fewer GC runs and slightly less memory consumed (ending RSS of 216 MiB for RHEVM vs 199 MiB for VMware).
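The irb session above can be wrapped in a small helper to make repeated runs less error-prone. `MiqProcess.processInfo` is appliance-specific, so this sketch (the helper name `measure` is mine, not appliance code) tracks only wall time and GC deltas, which plain Ruby exposes; note the GC.stat keys in the output above are Ruby 2.0's singular forms, while newer Rubies use `:total_allocated_objects`.

```ruby
require 'benchmark'

# Sketch of the measurement done in the irb sessions above.
# Reports wall-clock time, GC runs triggered, and objects allocated
# while the block ran (RSS tracking would need OS-specific code).
def measure
  gc_before    = GC.count
  alloc_before = GC.stat[:total_allocated_objects]
  elapsed      = Benchmark.realtime { yield }
  {
    seconds:           elapsed,
    gc_runs:           GC.count - gc_before,
    objects_allocated: GC.stat[:total_allocated_objects] - alloc_before
  }
end

# On an appliance this would be, e.g.:
#   measure { Metric::Capture.perf_capture_timer(the_zone) }
measure { 100_000.times.map { |i| i.to_s } }
```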
FYI: the initial BZ was opened against 5.5.0.1-alpha. Comment #3 and Comment #5 are on CFME 5.4.2.0, thus this issue exists on both versions.

I attempted to get perftools.rb running on 5.4.2.0 appliances in order to profile what is consuming a large percentage of the time, but ran into multiple issues getting it to run. I did have an older 5.4.0.2 appliance that already had perftools.rb installed and running, so I reset/cleaned the database and re-ran the benchmarks on that appliance. I found several issues.

First, let's look at the profiling. Methodology (code executed in a rails console) was:

require 'perftools'
the_zone = Zone.find_by_name('default')
PerfTools::CpuProfiler.start("/tmp/profile-perf_capture_timer-rhevm-1") do
  Metric::Capture.perf_capture_timer(the_zone)
end

RHEVM profile (top 20):

[root@dhcp23-113 tmp]# /opt/rh/cfme-gemset/bin/pprof.rb --text profile-perf_capture_timer-rhevm-1 | head -n 21
Using local file /opt/rh/ruby200/root/usr/bin/ruby.
Using local file profile-perf_capture_timer-rhevm-1.
Total: 10415 samples
    5798  55.7%  55.7%     5798  55.7% garbage_collector
     683   6.6%  62.2%      685   6.6% Hash.[]
     371   3.6%  65.8%      371   3.6% PGresult#values
     361   3.5%  69.3%      361   3.5% Array#zip
     183   1.8%  71.0%      183   1.8% Regexp#match
     173   1.7%  72.7%     1962  18.8% ActiveRecord::Associations::Preloader::Association#associated_records_by_owner
     170   1.6%  74.3%      170   1.6% PGconn#async_exec
     137   1.3%  75.6%      193   1.9% Array#map!
      93   0.9%  76.5%       93   0.9% Marshal.dump
      89   0.9%  77.4%       89   0.9% Module#===
      83   0.8%  78.2%       83   0.8% Marshal.load_without_autoload_missing_constants
      80   0.8%  78.9%      151   1.4% #<Module:0x0000000507c218>#__temp__
      60   0.6%  79.5%     1274  12.2% Array#map
      48   0.5%  80.0%      151   1.4% Class#value_to_integer
      46   0.4%  80.4%       46   0.4% Class#logger
      42   0.4%  80.8%      109   1.0% #<Module:0x00000001b78aa8>#__temp__
      42   0.4%  81.2%      132   1.3% Arel::Visitors::Visitor#visit
      39   0.4%  81.6%      208   2.0% ActiveRecord::Base#==
      35   0.3%  81.9%       35   0.3% String#intern
      34   0.3%  82.3%       34   0.3% String#to_i

I performed the same profiling twice more, deleting all queued perf_captures beforehand to confirm scheduling would occur:

MiqQueue.where('method_name = \'perf_capture\'').delete_all

was used to delete all queued perf_captures. Across the 3 profiles, 53-55% of the time was spent in the garbage collector and 6.6-7.1% in Hash.[].

VMware profiling (top 20):

[root@dhcp23-113 tmp]# /opt/rh/cfme-gemset/bin/pprof.rb --text profile-perf_capture_timer-vmware-2 | head -n 21
Using local file /opt/rh/ruby200/root/usr/bin/ruby.
Using local file profile-perf_capture_timer-vmware-2.
Total: 2824 samples
    1084  38.4%  38.4%     1084  38.4% garbage_collector
     183   6.5%  44.9%      183   6.5% Regexp#match
      90   3.2%  48.1%       92   3.3% Marshal.dump
      66   2.3%  50.4%       66   2.3% Marshal.load_without_autoload_missing_constants
      40   1.4%  51.8%       41   1.5% Hash.[]
      37   1.3%  53.1%       37   1.3% PGconn#async_exec
      27   1.0%  54.1%       27   1.0% Module#===
      25   0.9%  55.0%       25   0.9% Class#logger
      23   0.8%  55.8%       23   0.8% PGresult#values
      22   0.8%  56.6%      194   6.9% Array#map
      21   0.7%  57.3%       77   2.7% Arel::Visitors::Visitor#visit
      20   0.7%  58.0%       21   0.7% Kernel#respond_to?
      19   0.7%  58.7%       19   0.7% Regexp#===
      19   0.7%  59.3%       19   0.7% Symbol#to_s
      18   0.6%  60.0%      104   3.7% ActiveRecord::AttributeMethods#respond_to?
      18   0.6%  60.6%       43   1.5% ActiveRecord::Relation#initialize
      18   0.6%  61.3%       18   0.6% String#=~
      17   0.6%  61.9%       34   1.2% ActiveSupport::Callbacks::ClassMethods#__callback_runner_name
      17   0.6%  62.5%       17   0.6% Module#module_eval
      16   0.6%  63.0%       16   0.6% Hash#[]=

I also performed the same profiling for VMware 2 more times, deleting all queued perf_captures before each run. As with RHEVM, scheduling perf_captures against VMware also spent the most time in the garbage collector, at 38.4-40.4% of the time. Besides the smaller percentage of time spent in the garbage collector, the order of the subsequent methods differs between the two providers.

While getting the profiler to work I had added the large RHEVM provider first, performed the profiling, and then created a second zone on the appliance, since C&U scheduling is zone-based. I moved the RHEVM provider to the "Unmanaged Zone", a zone without any appliances. I then added the large VMware provider into the default zone and attempted to reproduce the faster C&U scheduling. Although I can clearly see that perf_capture_timer only schedules C&U captures for the default zone while running, the VMware provider now suffers the same slow scheduling timings. This means that if a provider is merely in the same region, it can adversely affect the scheduling of C&U in another zone. This could possibly be considered a separate BZ.

Since the evm.log output above suggested something slow with Storage.perf_captures, I investigated datastores on the appliance and noticed that even rendering the datastore UI page is slow for any of the RHEVM datastores. That is when I determined that the RHEVM datastores have a host relationship to every single host, meaning 100 hosts x 120 datastores = 12,000 entries in hosts_storages, whereas the VMware provider's relationships are structured differently: only 300 entries in the hosts_storages table.
Lastly, I cleaned out the RHEVM provider and was then able to profile the VMware provider scheduling C&U captures.

In summary, the issues currently are:
1. Scale and performance of perf_capture_timer can vary based on the number of host-to-datastore relationships. (This might be the actual BZ rather than "RHEVM slower than VMware".)
2. Scheduling of C&U captures can be affected by providers that are not being captured on. (Zone issues)
3. The Datastores Web UI page, and likely other components (frontend/backend), are subject to performance/scale issues as the number of host relationships grows in the VMDB.
Breakdown of why the host-to-datastore relationships differ between the two providers:

RHEVM (NFS datastores)
Each host has all 120 datastores: 100 (hosts) x 120 (datastores) = 12,000

VMware (VMFS datastores)
Each host has a single datastore: 100 (hosts) x 1 (datastore) = 100
Each cluster (20 total, 5 hosts each) has a datastore: 20 (clusters) x 5 (hosts) x 1 (datastore) = 100
One global datastore that every host has: 100 (hosts) x 1 (datastore) = 100

Sum those up and you have only 300 host-to-datastore relationships with the VMware provider. Hence 300 relationships vs 12,000, although each provider has an identical number of Hosts/VMs and almost the same number of datastores (120 vs 121*).

*Technically RHEVM has 121, but the ISODOMAIN isn't activated.
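The arithmetic above can be restated in a few lines of Ruby; these are simply the counts from this comment, nothing new.

```ruby
# Host-to-datastore relationship counts for the two environments.
rhevm_rels  = 100 * 120           # every host sees every NFS datastore
# => 12000

vmware_rels = 100 * 1 +           # one local VMFS datastore per host
              20 * 5 * 1 +        # one shared datastore per 5-host cluster
              100 * 1             # one global datastore seen by all hosts
# => 300

ratio = rhevm_rels / vmware_rels  # => 40x more rows in hosts_storages
```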
Storages specifically had an inefficient way of looking up the ems and were brought back in large numbers. I have fixed the ems lookup issue in particular, and with the reworking of the capture timer code, the host-to-storages data pulled back should be reduced. Since our timer is much quicker now, I am marking this as closed; if this comes up again, please reopen or open another BZ.