Created attachment 1033419 [details]
Metric::Capture.perf_capture_timer timeout causing worker exit

Description of problem:
In a large scale environment, Generic and Priority Workers can end up exiting and restarting due to Metric::Capture.perf_capture_timer message timeouts. The default timeout for these messages is 600 seconds. While the message is queueing perf_captures, it will be interrupted and told to exit. The worker will then restart.

Version-Release number of selected component (if applicable):
5.3.4.2

How reproducible:
Environment size specific. This is occurring in zones attempting to capture on 5,000 online VMware VMs.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Attached is grepped output showing a worker executing several Metric::Capture.perf_capture_timer messages until it is unable to complete the work within 10 minutes on one of the timers.
This message is currently a bottleneck for scheduling perf_captures in large scale environments. To determine where the majority of time is spent, I instrumented the time spent on each target. In the attached log you can see the time spent per target, as well as large gaps between scheduling some messages. For example, line 654:

[----] I, [2015-06-15T17:03:37.418877 #23072:4fb884] INFO -- : MIQ(MiqQueue.put) Message id: [21478351], id: [], Zone: [VC0], Role: [ems_metrics_collector], Server: [], Ident: [vmware], Target id: [], Instance id: [202], Task id: [], Command: [Storage.perf_capture], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: ["hourly"]

is not followed by another MiqQueue.put until line 1225:

[----] I, [2015-06-15T17:04:32.153286 #23072:4fb884] INFO -- : MIQ(MiqQueue.put) Message id: [21478501], id: [], Zone: [VC0], Role: [ems_metrics_collector], Server: [], Ident: [vmware], Target id: [], Instance id: [5559], Task id: [], Command: [VmVmware.perf_capture], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: ["realtime"]

Note that the timing instrumentation added some overhead to the total processing time for this message, since writing a log line per item is not free; the large gaps in time are apparent regardless. As an untested workaround, I assume we could break a provider down with separate admin user accounts, zoning each part of the provider so that each perf_capture_timer processes fewer targets, thereby decreasing the delay between perf_capture scheduling.
Created attachment 1039203 [details] Log of perf_capture_timer scheduling perf_captures and displaying time spent per target.
I profiled perf_capture_timer in the scale environment on CFME 5.3 (Ruby 1.9.3, Rails 3.2.17) with perftools.rb to understand where the majority of time is spent. To schedule an entire zone of 5,000 VMs, 20 clusters and 200 hosts, 35.3% of the time was spent in the garbage collector.

Using local file perftools-perf-capture_timer-nosandbox-with5k-load-runner-VC0.
Total: 45135 samples
   15933  35.3%  35.3%    15933  35.3%  garbage_collector
    1279   2.8%  38.1%     1279   2.8%  PGconn#async_exec
     671   1.5%  39.6%      673   1.5%  Marshal.dump
     567   1.3%  40.9%      567   1.3%  Class#logger
     520   1.2%  42.0%     1642   3.6%  Arel::Visitors::Visitor#visit
     469   1.0%  43.1%      469   1.0%  Marshal.load_without_autoload_missing_constants
     455   1.0%  44.1%      489   1.1%  Hash.[]
     441   1.0%  45.1%      441   1.0%  Symbol#to_s
     421   0.9%  46.0%      894   2.0%  ActiveRecord::AttributeMethods#respond_to?
     413   0.9%  46.9%      418   0.9%  Kernel#respond_to?
     404   0.9%  47.8%      988   2.2%  ActiveRecord::Relation#initialize
     399   0.9%  48.7%     2142   4.7%  ActiveRecord::Base#init_with
     390   0.9%  49.5%      390   0.9%  String#intern
     328   0.7%  50.3%      328   0.7%  Regexp#match
     319   0.7%  51.0%     2834   6.3%  ActiveRecord::Inheritance::ClassMethods#instantiate
     308   0.7%  51.7%     2868   6.4%  ActiveSupport::Callbacks#run_callbacks
     307   0.7%  52.3%      586   1.3%  ActiveRecord::SpawnMethods#merge
     284   0.6%  53.0%     3190   7.1%  Array#map
     258   0.6%  53.5%      258   0.6%  Module#===

Completing the scheduling of captures in this instance took 747s with profiling on.
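As a gem-free cross-check of that 35.3% GC share, the same kind of signal can be read by diffing GC.stat around the profiled call. This is a sketch only: the stand-in workload below replaces the real Metric::Capture.perf_capture_timer call, and the stat key names shown are for a modern Ruby (they differ on the appliance's Ruby 1.9.3).

```ruby
# Measure GC activity across a block by diffing GC.stat counters.
# Hypothetical helper; the workload is a stand-in, not the real scheduler.
def gc_delta
  GC.start                 # start from a clean slate
  before = GC.stat
  yield
  after = GC.stat
  { :gc_runs           => after[:count] - before[:count],
    :objects_allocated => after[:total_allocated_objects] - before[:total_allocated_objects] }
end

# Stand-in workload: allocate a pile of short-lived objects
stats = gc_delta { 200_000.times.map { |i| "obj-#{i}" } }
```

A high objects_allocated count for a scheduling pass would corroborate that GC pressure, not raw CPU work, dominates the runtime.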
I am now able to reproduce this behavior on 5.4.3.1 as well. I had to bump the number of targets we are collecting C&U on in a single zone in order to reproduce it. In comment 0, I had reproduced this behavior with 5.3 on Sandy Bridge hardware; my 5.4 testbed is housed on Haswell. (Thus the greater number of targets required could be a result of both the CFME version change and the more powerful hardware.) I bumped the number of targets to 9k online virtual machines.

I would expect that with a greater number of targets, the workers handling this message would also require a larger memory threshold. I bumped the threshold from the default of 400MiB to 1.1GiB and Generic/Priority Workers are still exceeding the memory threshold. Thus, if they are not terminated for exceeding the message timeout, they eventually exceed memory.

In the future I plan to turn the number of VMs down to 5k again (so that a perf_capture_timer can complete) and then raise the memory limits on Generic/Priority Workers until they are no longer cycled. If they continue to cycle or approach an unreasonable amount of memory usage, then we may have to investigate a memory leak.

In summary, there are several issues here:
1. Memory performance of Generic/Priority Workers - they are growing in memory usage beyond reasonable amounts on an appliance with the C&U coordinator role turned on and connected to an extra-large scale environment (10k VMs total, 5k or more VMs online).
2. C&U data collection performance - the scheduling of C&U collections becomes an issue when a single zone has >5k-9k VMs, as the scheduling can now add almost 10 minutes before a target is even queued for a C&U metrics collection. This will result in missing usage metrics.
3. Scalability of C&U scheduling - the concurrency of scheduling perf_captures is limited to a single process per zone.
If you wish to collect beyond 5k-9k VMs, you must create additional administrator accounts for the same provider and create more zones in CloudForms.
Updates: I have turned the number of VMs (and thus targets) down to 5k online VMs and raised the memory threshold for Priority/Generic Workers to 5GiB in the CFME advanced configuration, and PriorityWorkers are still exceeding the memory threshold:

[----] I, [2015-11-10T10:27:34.819561 #2351:381e90] INFO -- : MIQ(MiqQueue.put) Message id: [1000004242072], id: [], Zone: [default], Role: [], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqEvent.raise_evm_event], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: [["MiqServer", 1000000000003], "evm_worker_memory_exceeded", {:event_details=>"Worker [MiqPriorityWorker] with ID: [1000000001198], PID: [46465], GUID: [70c5e7a2-873f-11e5-af05-0001a4ae710a] process memory usage [5377486848] exceeded limit [5368709120], requesting worker to exit", :type=>"MiqPriorityWorker"}]

Since the time for workers to grow in memory and eventually be recycled has increased with the raised memory threshold, the time between these events has increased as well. The above-referenced worker was alive for ~15hrs before it hit 5GiB RSS memory usage. It is unknown at this time whether the GenericWorkers will eventually hit 5GiB as well, but currently in the testbed they have an RSS above the limit shown on the Workers configuration tab (1.5GiB).

Attached is a screenshot of a memory usage graph for the entire appliance, showing the worker being recycled at ~10:30.
Created attachment 1092394 [details] Memory usage of appliance in which a 5GiB MiqPriorityWorker is recycled
Created attachment 1095683 [details]
Single Priority Worker evm.log file with rss/virt memory per log line

This log file contains a single PriorityWorker's log with instrumented RSS/Virt memory usage on each log line. This should give us a pretty good idea of what work is making the PriorityWorker jump the most in memory usage. With the appliance tuned to 1 PriorityWorker and 0 GenericWorkers, I have not seen the same RSS memory usage as I have in the past. As of right now the priority worker has consumed 3.3GiB of memory rather than the 6GiB I have seen previously.
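For reference, per-line RSS instrumentation like this can be sketched as follows. The helper names are hypothetical (the actual logger patch on the appliance is not shown in this BZ); on Linux the value comes from VmRSS in /proc/self/status, with a ps fallback elsewhere.

```ruby
# Hypothetical sketch, not the actual CFME logger patch: read the current
# process RSS and prefix it onto each log line.
def current_rss_kib
  # On Linux, VmRSS is reported in kB in /proc/self/status
  status = File.read("/proc/self/status") rescue nil
  if status && status =~ /^VmRSS:\s+(\d+)\s+kB/
    Regexp.last_match(1).to_i
  else
    # Portable fallback: ask ps for this pid's RSS (KiB)
    `ps -o rss= -p #{Process.pid}`.to_i
  end
end

def with_memory(line)
  "[RSS: #{current_rss_kib} KiB] #{line}"
end
```

Grepping lines prefixed this way makes it easy to see which queued work coincides with the largest RSS jumps.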
This has re-emerged as a bigger issue in 5.5. I have perf_capture_timer messages that require greater than 1 hour to complete and only queue 1-2 VM perf_captures per second in this environment. The environment is 10,000 VMs with 5,000 online. There are 221 datastores and 200 hosts, which results in a total of 5,421 captures on the queue at the completion of a perf_capture_timer if there are no captures queued at the start.

I have instrumented the environment with stackprof to isolate what is consuming all of the time while queueing perf_captures here. To do so, I turned off the coordinator role so that no new captures would be queued by existing workers. I then turned off the collectors on each appliance as well, so there would be limited if any database interaction from the other appliances that could interrupt a profiled perf_capture_timer. I then profiled using the following code in the rails console:

StackProf.run(mode: :cpu, out: '/root/stackprof-perf_capture_timer-0') { timing = Benchmark.realtime { Metric::Capture.perf_capture_timer(the_zone) } }

The results:

# stackprof stackprof-perf_capture_timer-0 --limit 50
==================================
  Mode: cpu(1000)
  Samples: 2436845 (1.72% miss rate)
  GC: 443507 (18.20%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
    349146  (14.3%)      349146  (14.3%)     ActiveRecord::Attribute#initialize
     92010   (3.8%)       92010   (3.8%)     block in ActiveRecord::Result#hash_rows
     83768   (3.4%)       83768   (3.4%)     ActiveRecord::LazyAttributeHash#key?
     83689   (3.4%)       83689   (3.4%)     ActiveRecord::Type::Integer#type_cast_from_database
     77107   (3.2%)       77098   (3.2%)     block in ActiveRecord::Reflection::AssociationReflection#inverse_name
     73869   (3.0%)       73869   (3.0%)     ActiveRecord::Core#init_internals
    426312  (17.5%)       72538   (3.0%)     ActiveRecord::LazyAttributeHash#assign_default_value
     62292   (2.6%)       62292   (2.6%)     block (4 levels) in Class#class_attribute
     61987   (2.5%)       61987   (2.5%)     #<Module:0x00000006ad3710>.convert
    103199   (4.2%)       58171   (2.4%)     block (2 levels) in Class#class_attribute
     58136   (2.4%)       58136   (2.4%)     ActiveRecord::Reflection::AssociationReflection#foreign_key
     70279   (2.9%)       57580   (2.4%)     ActiveRecord::Core::ClassMethods#allocate
     52903   (2.2%)       52903   (2.2%)     ActiveRecord::Associations::Association#reset
     50972   (2.1%)       50972   (2.1%)     block (2 levels) in <class:Numeric>
     97896   (4.0%)       49353   (2.0%)     ActiveRecord::ModelSchema::ClassMethods#inheritance_column
     42101   (1.7%)       42101   (1.7%)     ActiveRecord::Associations#association_instance_set
     42037   (1.7%)       42037   (1.7%)     ActiveRecord::Reflection::AssociationReflection#polymorphic?
    468220  (19.2%)       41908   (1.7%)     ActiveRecord::LazyAttributeHash#[]
     39784   (1.6%)       39784   (1.6%)     ActiveRecord::Type::String#cast_value
     37738   (1.5%)       37738   (1.5%)     ActiveRecord::LazyAttributeHash#initialize
    198515   (8.1%)       37031   (1.5%)     ActiveRecord::Attribute#value
     71562   (2.9%)       31228   (1.3%)     ActiveRecord::Type::Value#type_cast
     27833   (1.1%)       27833   (1.1%)     NilClass#blank?
    552101  (22.7%)       27636   (1.1%)     block in ActiveRecord::Result#each
     25791   (1.1%)       25180   (1.0%)     ActiveRecord::AttributeMethods::ClassMethods#define_attribute_methods
     21767   (0.9%)       21767   (0.9%)     ActiveSupport::Callbacks::CallbackChain#empty?
     21732   (0.9%)       21732   (0.9%)     ActiveRecord::Associations::Association#reset_scope
     20810   (0.9%)       20810   (0.9%)     ActiveRecord::Associations#association_instance_get
     73145   (3.0%)       18285   (0.8%)     ActiveRecord::AttributeSet::Builder#build_from_database
     17122   (0.7%)       17122   (0.7%)     ActiveRecord::AttributeSet#initialize
     31593   (1.3%)       15233   (0.6%)     ActiveRecord::Reflection::ClassMethods#_reflect_on_association
     15007   (0.6%)       14688   (0.6%)     block in ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#exec_cache
     10729   (0.4%)       10729   (0.4%)     ActiveRecord::Result#initialize
      8882   (0.4%)        8879   (0.4%)     ActiveRecord::ModelSchema::ClassMethods#attributes_builder
      8520   (0.3%)        8520   (0.3%)     ActiveRecord::Reflection::BelongsToReflection#macro
     65078   (2.7%)        8220   (0.3%)     ActiveRecord::Reflection::AbstractReflection#inverse_of
      7783   (0.3%)        7783   (0.3%)     ActiveRecord::Persistence::ClassMethods#discriminate_class_for_record
      5466   (0.2%)        5466   (0.2%)     ActiveRecord::Attribute#initialized?
     39996   (1.6%)        5414   (0.2%)     ActiveSupport::Callbacks#__run_callbacks__
      5120   (0.2%)        5120   (0.2%)     ActiveSupport::PerThreadRegistry#instance
    353749  (14.5%)        4700   (0.2%)     ActiveRecord::Attribute.from_database
    435440  (17.9%)        4479   (0.2%)     block (2 levels) in AssignmentMixin::ClassMethods#get_assigned_for_target
    761805  (31.3%)        4325   (0.2%)     ActiveRecord::Associations::Association#set_inverse_instance
    472523  (19.4%)        4303   (0.2%)     ActiveRecord::AttributeSet#[]
      3547   (0.1%)        3547   (0.1%)     #<Module:0x000000022d6b60>.load_with_autoloading
     80620   (3.3%)        3513   (0.1%)     ActiveRecord::Reflection::AssociationReflection#inverse_name
      3848   (0.2%)        3447   (0.1%)     Date._parse
      3435   (0.1%)        3435   (0.1%)     ThreadSafe::NonConcurrentCacheBackend#[]
    181279   (7.4%)        3347   (0.1%)     ActiveRecord::Associations::Association#loaded!
      2727   (0.1%)        2727   (0.1%)     ActiveRecord::Base.logger

The majority of the time is spent within ActiveRecord, as shown by the profiler.
https://github.com/ManageIQ/manageiq/pull/6599
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/7b674c4c042cba0c8dc3665f98985bcb99328bd1

commit 7b674c4c042cba0c8dc3665f98985bcb99328bd1
Author:     Keenan Brock <kbrock>
AuthorDate: Wed Feb 10 13:20:41 2016 -0500
Commit:     Keenan Brock <kbrock>
CommitDate: Wed Feb 10 16:41:05 2016 -0500

    don't load miq_server every config param read

    https://bugzilla.redhat.com/show_bug.cgi?id=1227008

    | VmOrTemplate | Storage | ExtManagementSystem | MiqRegion | Zone |
    | ---:| ---:| ---:| ---:| ---:|
    | 100 | 13 | 1 | 1 | 1 |

    | ms | sql | sql ms | comments
    | ---:| ---:| ---:| ---
    | 3344.1 | 1263 | 270.5 | a100-master-1
    | 818.6 | 449 | 97.1 | a100-master-2
    | 847.6 | 449 | 90.8 | a100-master-3
    | 968.1 | 449 | 115.6 | a100-master-4
    | 3351.0 | 1175 | 262.9 | a100-mtime-1
    | 730.4 | 361 | 82.0 | a100-mtime-2
    | 789.5 | 361 | 79.6 | a100-mtime-3
    | 746.8 | 361 | 87.5 | a100-mtime-4

 lib/vmdb/config.rb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
https://github.com/ManageIQ/manageiq/pull/6622
https://github.com/ManageIQ/manageiq/pull/6656
https://github.com/ManageIQ/manageiq/pull/6577
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/38f95c8add65beb02f2c5cc274f2e30736f13fd5

commit 38f95c8add65beb02f2c5cc274f2e30736f13fd5
Author:     Keenan Brock <kbrock>
AuthorDate: Mon Feb 1 10:30:02 2016 -0500
Commit:     Keenan Brock <kbrock>
CommitDate: Sat Feb 13 19:32:00 2016 -0500

    Don't create a region object

    my_region does not change. Just use a cached version of it.

    https://bugzilla.redhat.com/show_bug.cgi?id=1227008

    | VmOrTemplate | Storage | ExtManagementSystem | MiqRegion | Zone |
    | ---:| ---:| ---:| ---:| ---:|
    | 100 | 13 | 1 | 1 | 1 |

    | ms | sql | sql ms | comments
    | ---:| ---:| ---:| ---
    | 3344.1 | 1263 | 270.5 | a100-master-1
    | 818.6 | 449 | 97.1 | a100-master-2
    | 847.6 | 449 | 90.8 | a100-master-3
    | 968.1 | 449 | 115.6 | a100-master-4

    | ms | sql | sql ms | comments
    | ---:| ---:| ---:| ---
    | 3296.4 | 1250 | 268.5 | a100-region-cache-1
    | 777.0 | 436 | 92.3 | a100-region-cache-2
    | 837.0 | 436 | 91.2 | a100-region-cache-3
    | 825.3 | 436 | 102.7 | a100-region-cache-4

 app/models/miq_region.rb | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)
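The caching pattern this commit describes can be illustrated with a minimal sketch (hypothetical class, not the actual MiqRegion code): memoize the expensive lookup at the class level so every caller reuses one object instead of loading a new one per read. The commit's justification is that my_region does not change, so the cache never needs invalidation.

```ruby
class Region
  # Stand-in for the expensive my_region database lookup; the counter just
  # makes the saved work observable in this sketch.
  def self.lookup_count
    @lookup_count ||= 0
  end

  def self.expensive_lookup
    @lookup_count = lookup_count + 1
    Object.new  # pretend this is a freshly loaded region row
  end

  # Cached accessor: only the first call pays for the lookup
  def self.my_region
    @my_region ||= expensive_lookup
  end
end

a = Region.my_region
b = Region.my_region
```

Note that this trade-off only works for data that is immutable for the life of the process, as the commit message points out.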
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/65339fdc0f37bf587b8b910a63d598b2ee5e27b6

commit 65339fdc0f37bf587b8b910a63d598b2ee5e27b6
Author:     Keenan Brock <kbrock>
AuthorDate: Wed Feb 10 17:38:05 2016 -0500
Commit:     Keenan Brock <kbrock>
CommitDate: Sat Feb 13 19:32:00 2016 -0500

    Fetch all ems hosts instead of going through cluster

    https://bugzilla.redhat.com/show_bug.cgi?id=1227008

    This will reuse ems, and no longer lazy load all hosts and tags twice -
    resulting in almost half the queries (read that many fewer objects in memory)

    | VmOrTemplate | Host | Storage | ExtManagementSystem | MiqRegion | Zone |
    |-----------:|---:|------:|------------------:|--------:|---:|
    | 10000 | 200 | 221 | 1 | 1 | 1 |

    before: has region fix
    | ms | sql | sql ms | comments
    | ---:| ---:| ---:| ---
    | 30,429.2 | 11,772 | 13,952.0 | before-1
    | 28,832.5 | 11,745 | 13,796.6 | before-2
    | 28,794.5 | 11,745 | 13,651.3 | before-3
    | 28,695.7 | 11,745 | 13,749.3 | before-4

    after: has region fix and this pr
    | ms | sql | sql ms | comments
    | ---:| ---:| ---:| ---
    | 26,707.0 | 6,371 | 13,600.0 | after-1
    | 24,813.1 | 6,344 | 13,165.1 | after-2
    | 24,959.9 | 6,344 | 13,215.7 | after-3
    | 25,540.2 | 6,344 | 13,464.6 | after-4

 app/models/metric/targets.rb | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)
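The shape of this change can be sketched without ActiveRecord (hypothetical structs; the real code is in app/models/metric/targets.rb): fetch the hosts once through the EMS rather than once per cluster, so the query and its object instantiation cost is paid a single time.

```ruby
# Hypothetical stand-ins for the real EMS/cluster/host associations
Cluster = Struct.new(:hosts)
Ems     = Struct.new(:clusters, :hosts)

hosts = %w[h1 h2 h3 h4 h5 h6]
ems   = Ems.new([Cluster.new(hosts[0, 3]), Cluster.new(hosts[3, 3])], hosts)

# before: walk every cluster, issuing one hosts fetch per cluster
before_queries = ems.clusters.size
before_hosts   = ems.clusters.flat_map(&:hosts)

# after: one fetch of all hosts directly from the EMS
after_queries = 1
after_hosts   = ems.hosts
```

The result set is identical; only the number of fetches (and duplicated in-memory objects) drops, which matches the roughly halved SQL counts in the commit's tables.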
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/d9f1c2ffc511d742ea3b0b232a356ea3884e0b44

commit d9f1c2ffc511d742ea3b0b232a356ea3884e0b44
Author:     Keenan Brock <kbrock>
AuthorDate: Mon Feb 8 17:04:05 2016 -0500
Commit:     Keenan Brock <kbrock>
CommitDate: Sat Feb 13 18:31:02 2016 -0500

    pass zone into queue_captures

    It is expensive to determine the zone, especially for Storages that
    need to look up a number of objects. Since it is known, just pass it
    into enqueue methods

    https://bugzilla.redhat.com/show_bug.cgi?id=1227008

    | VmOrTemplate | Storage | ExtManagementSystem | MiqRegion | Zone |
    | ---:| ---:| ---:| ---:| ---:|
    | 100 | 13 | 1 | 1 | 1 |

    | ms | sql | sql ms | comments
    | ---:| ---:| ---:| ---
    | 3344.1 | 1263 | 270.5 | a100-master-1
    | 818.6 | 449 | 97.1 | a100-master-2
    | 847.6 | 449 | 90.8 | a100-master-3
    | 968.1 | 449 | 115.6 | a100-master-4

    | ms | sql | sql ms | comments
    | ---:| ---:| ---:| ---
    | 3195.0 | 1177 | 261.6 | a100-zone-1
    | 742.8 | 363 | 89.8 | a100-zone-2
    | 799.5 | 363 | 97.9 | a100-zone-3
    | 708.0 | 363 | 84.7 | a100-zone-4

 app/models/metric/capture.rb          | 8 ++++----
 app/models/metric/ci_mixin/capture.rb | 5 +++--
 2 files changed, 7 insertions(+), 6 deletions(-)
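The idea in this commit can be sketched as follows (method names and hash keys are hypothetical, not the real capture.rb signatures): the caller already knows the zone, so it passes it down instead of letting each target re-derive it through several object lookups.

```ruby
# Counter to make the avoided derivation cost observable in this sketch
ZONE_LOOKUPS = Hash.new(0)

# Stand-in for the expensive per-target zone derivation (for Storages this
# walked a number of associated objects)
def derive_zone(target)
  ZONE_LOOKUPS[target] += 1
  "default"
end

# Enqueue sketch: only pay the derivation cost when the zone is not supplied
def queue_capture(target, interval, zone: nil)
  zone ||= derive_zone(target)
  { :instance => target, :method_name => "perf_capture", :zone => zone, :args => [interval] }
end

msg = queue_capture("Storage:202", "hourly", zone: "VC0")
```

With thousands of targets per perf_capture_timer, skipping one derivation per target adds up, which is what the SQL counts in the commit's tables reflect.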
https://github.com/ManageIQ/manageiq/pull/7338
A bunch of performance enhancements have been merged into master. Let me know if this is still an issue.
I have since tested this with 5.6.0.11, attempting to schedule perf_captures on 5,000 VM targets again on Haswell hardware with ample memory for the appliance (16GiB), and I currently cannot get the scheduling to complete without the message timing out. I will add another appliance to the mix to move the scheduling off the database appliance, or vice versa, to see if this can complete in time on an appliance with less work occurring on it.
Assigning back to me. This is still an issue.
Tests for #1349062 should be sufficient.
PR: https://github.com/ManageIQ/manageiq/pull/9448
Single DB appliance, 4 worker appliances, 0 recycled workers. 5k online VMs from the VMware provider.

=======================================================================
showing output from log file: ~/perf_capture_timer_last_avg.log:
This reading was recorded on: 2016-12-12 08:12:08,268
...Reading logs from /var/www/miq/vmdb/log/evm.log and extracting last 100 events.
Begin time: 2016-12-12T07:34  End time: 2016-12-12T08:04
10 IDs found..
 # | Timestamp                    | IDs   | Metric::Capture.perf_capture_timer
----------------------------------------------------------------------
 0 | Mon Dec 12 07:34:52 EST 2016 | 8667  | 20.486461867
 1 | Mon Dec 12 07:38:00 EST 2016 | 10880 | 27.291034255
 2 | Mon Dec 12 07:41:09 EST 2016 | 29407 | 32.840495136
 3 | Mon Dec 12 07:44:04 EST 2016 | 37936 | 31.357679297
 4 | Mon Dec 12 07:47:00 EST 2016 | 45778 | 27.009139387
 5 | Mon Dec 12 07:49:56 EST 2016 | 53909 | 23.166915821
 6 | Mon Dec 12 07:52:54 EST 2016 | 55273 | 19.152661771
 7 | Mon Dec 12 07:55:53 EST 2016 | 56148 | 21.572165016
 8 | Mon Dec 12 07:58:56 EST 2016 | 57016 | 21.522156136
 9 | Mon Dec 12 08:01:54 EST 2016 | 57898 | 22.586555143
----------------------------------------------------------------------
▁▅█▇▅▃▁▂▂▂
Average: 24.69850000000000000000
----------------------------------------------------------------------
10 events were queued, out of which 10 events were delivered with [ok] status...
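For reference, a minimal sketch of how timings like these can be pulled from evm.log. The delivered-message line format below is an assumption for illustration, not copied from a real appliance; adjust the regex to the actual log format.

```ruby
# Hypothetical evm.log delivery line; real appliances may format this differently
LINE = '[----] I, [2016-12-12T07:34:52] INFO -- : MIQ(MiqQueue#delivered) ' \
       'Command: [Metric::Capture.perf_capture_timer], Delivered in [20.486461867] seconds'

TIMER_RE = /Metric::Capture\.perf_capture_timer.*Delivered in \[([\d.]+)\] seconds/

# Returns the delivery time in seconds, or nil if the line is not a
# perf_capture_timer delivery
def timer_seconds(line, re = TIMER_RE)
  m = re.match(line)
  m && m[1].to_f
end
```

Averaging the extracted values over a window of events yields a summary like the table above.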