Bug 1227008

Summary: [Scale] perf_capture_timer message timeout, cycles Generic/Priority Workers
Product: Red Hat CloudForms Management Engine Reporter: Alex Krzos <akrzos>
Component: Performance    Assignee: Keenan Brock <kbrock>
Status: CLOSED CURRENTRELEASE QA Contact: Pradeep Kumar Surisetty <psuriset>
Severity: medium Docs Contact:
Priority: high    
Version: 5.3.0    CC: arcsharm, cpelland, jeder, jhardy, obarenbo, perfbz, simaishi
Target Milestone: GA    Keywords: TestOnly, ZStream
Target Release: 5.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: perf:c&u:worker
Fixed In Version: 5.7.0.0 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1349062 (view as bug list) Environment:
Last Closed: 2016-12-12 13:31:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1349062    
Attachments:
  Metric::Capture.perf_capture_timer timeout causing worker exit (flags: none)
  Log of perf_capture_timer scheduling perf_captures and displaying time spent per target (flags: none)
  Memory usage of appliance in which a 5GiB MiqPriorityWorker is recycled (flags: none)
  Single Priority Worker evm.log file with rss/virt memory per log line (flags: none)

Description Alex Krzos 2015-06-01 16:33:30 UTC
Created attachment 1033419 [details]
Metric::Capture.perf_capture_timer timeout causing worker exit

Description of problem:
In a large scale environment, Generic and Priority Workers can end up exiting and restarting due to Metric::Capture.perf_capture_timer message timeouts.  The default timeout for these messages is 600 seconds.  While the message is still queueing perf_captures, the worker delivering it is interrupted and told to exit, and the worker then restarts.

Version-Release number of selected component (if applicable):
5.3.4.2

How reproducible:
Environment-size specific.  This occurs in Zones attempting to capture metrics on 5,000 online VMware VMs.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Attached is grepped output showing a worker executing several Metric::Capture.perf_capture_timer messages until it is unable to complete the work within 10 minutes on one of the timers.
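
For orientation, here is a minimal sketch of how the single perf_capture_timer message fans out into per-target perf_capture messages and why its own 600-second timeout matters. This is not the actual Metric::Capture code; the control flow is simplified and capture_targets_for is a hypothetical helper:

    # Simplified sketch only -- not the real ManageIQ implementation.
    # perf_capture_timer itself is delivered as one queued message with a 600s timeout;
    # within that window it must enqueue one perf_capture message per target.
    def perf_capture_timer_sketch(zone)
      capture_targets_for(zone).each do |target|     # hypothetical helper: VMs, hosts, storages
        MiqQueue.put(
          :class_name  => target.class.name,
          :instance_id => target.id,
          :method_name => "perf_capture",
          :role        => "ems_metrics_collector",
          :zone        => zone.name,
          :msg_timeout => 600                        # matches Timeout: [600] in the log lines
        )
      end
      # If this loop runs longer than the parent message's own 600s timeout, the
      # Generic/Priority worker delivering perf_capture_timer is told to exit and restarts.
    end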

Comment 2 Alex Krzos 2015-06-15 19:27:53 UTC
This message is currently a bottleneck for scheduling perf_captures in large scale environments.  To determine where the majority of the time is spent, I instrumented the time spent on each target.  In the attached log you can see the time spent per target, as well as large gaps between the scheduling of some messages; for example, the put on line 654:

[----] I, [2015-06-15T17:03:37.418877 #23072:4fb884]  INFO -- : MIQ(MiqQueue.put)        Message id: [21478351],  id: [], Zone: [VC0], Role: [ems_metrics_collector], Server: [], Ident: [vmware], Target id: [], Instance id: [202], Task id: [], Command: [Storage.perf_capture], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: ["hourly"]

is not followed up with another MiqQueue put until line 1225:

[----] I, [2015-06-15T17:04:32.153286 #23072:4fb884]  INFO -- : MIQ(MiqQueue.put)        Message id: [21478501],  id: [], Zone: [VC0], Role: [ems_metrics_collector], Server: [], Ident: [vmware], Target id: [], Instance id: [5559], Task id: [], Command: [VmVmware.perf_capture], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: ["realtime"]

Note that the timing instrumentation added some overhead to the total processing time for this message, since writing a log line per item is not free; even so, the large gaps in time stand out clearly.
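
The gaps can be measured directly from evm.log; a small, self-contained Ruby helper along these lines (an assumption about the analysis approach, not a tool shipped with CFME) prints the delay between consecutive perf_capture MiqQueue.put lines:

    # Standalone sketch: report gaps larger than 5 seconds between consecutive
    # perf_capture MiqQueue.put lines in evm.log (log path is an assumption).
    require 'time'

    prev = nil
    File.foreach("/var/www/miq/vmdb/log/evm.log") do |line|
      next unless line.include?("MiqQueue.put") && line.include?("perf_capture")
      ts = line[/\[(\d{4}-\d{2}-\d{2}T[\d:.]+)/, 1] or next
      t  = Time.parse(ts)
      puts format("%.1fs gap before put at %s", t - prev, ts) if prev && (t - prev) > 5
      prev = t
    end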

As an untested workaround, I assume we could break a provider down using separate admin user accounts, zoning each part of the provider so that each perf_capture_timer processes fewer targets, thereby decreasing the delay between perf_capture scheduling runs.

Comment 3 Alex Krzos 2015-06-15 19:30:29 UTC
Created attachment 1039203 [details]
Log of perf_capture_timer scheduling perf_captures and displaying time spent per target.

Comment 4 Alex Krzos 2015-06-30 19:52:14 UTC
I profiled perf_capture_timer in the scale environment on CFME 5.3 (Ruby 1.9.3, Rails 3.2.17) with perftools.rb to understand where the majority of the time is spent.  Scheduling an entire zone of 5,000 VMs, 20 clusters, and 200 hosts spent 35.3% of its time in the garbage collector.

Using local file perftools-perf-capture_timer-nosandbox-with5k-load-runner-VC0.
Total: 45135 samples
   15933  35.3%  35.3%    15933  35.3% garbage_collector
    1279   2.8%  38.1%     1279   2.8% PGconn#async_exec
     671   1.5%  39.6%      673   1.5% Marshal.dump
     567   1.3%  40.9%      567   1.3% Class#logger
     520   1.2%  42.0%     1642   3.6% Arel::Visitors::Visitor#visit
     469   1.0%  43.1%      469   1.0% Marshal.load_without_autoload_missing_constants
     455   1.0%  44.1%      489   1.1% Hash.[]
     441   1.0%  45.1%      441   1.0% Symbol#to_s
     421   0.9%  46.0%      894   2.0% ActiveRecord::AttributeMethods#respond_to?
     413   0.9%  46.9%      418   0.9% Kernel#respond_to?
     404   0.9%  47.8%      988   2.2% ActiveRecord::Relation#initialize
     399   0.9%  48.7%     2142   4.7% ActiveRecord::Base#init_with
     390   0.9%  49.5%      390   0.9% String#intern
     328   0.7%  50.3%      328   0.7% Regexp#match
     319   0.7%  51.0%     2834   6.3% ActiveRecord::Inheritance::ClassMethods#instantiate
     308   0.7%  51.7%     2868   6.4% ActiveSupport::Callbacks#run_callbacks
     307   0.7%  52.3%      586   1.3% ActiveRecord::SpawnMethods#merge
     284   0.6%  53.0%     3190   7.1% Array#map
     258   0.6%  53.5%      258   0.6% Module#===

The time required to complete the scheduling of captures in this instance took 747s with profiling on.
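
For reference, the profile above was collected with perftools.rb; the invocation from the rails console would have looked roughly like the following (output path and zone lookup are assumptions, and the perf_capture_timer argument is shown as in comment 9 -- the exact signature may differ by release):

    # Approximate perftools.rb invocation (sketch; paths and arguments are assumptions).
    require 'perftools'

    the_zone = Zone.find_by_name("VC0")
    PerfTools::CpuProfiler.start("/root/perftools-perf_capture_timer") do
      Metric::Capture.perf_capture_timer(the_zone)
    end
    # The report above was then generated with pprof.rb, e.g.:
    #   pprof.rb --text /root/perftools-perf_capture_timer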

Comment 5 Alex Krzos 2015-11-06 19:17:33 UTC
I am now able to reproduce this behavior on 5.4.3.1 as well.

I had to bump the number of targets we are collecting C&U on in a single zone in order to reproduce this behavior.  In comment 0, I reproduced this behavior with 5.3 on Sandy Bridge hardware; my 5.4 testbed is housed on Haswell.  (Thus the greater number of targets required could be a result of both the CFME version change and the more powerful hardware.)

I bumped the number of targets to 9k online virtual machines.  With a greater number of targets I would expect the workers handling this message to also require a larger memory threshold.  I raised the threshold from the default of 400MiB to 1.1GiB, and Generic/Priority Workers are still exceeding it.  Thus, if they are not terminated for exceeding the message timeout, they eventually exceed the memory limit instead.  Next I plan to turn the number of VMs back down to 5k (so a perf_capture_timer can complete) and then raise the memory limits on Generic/Priority Workers until they are no longer cycled.  If they continue to cycle or approach an unreasonable amount of memory usage, we may have to investigate a memory leak.


In summary, there are several issues here:

1. Memory performance of Generic/Priority Workers - they are growing in memory usage beyond reasonable amounts on an appliance with the C&U coordinator role turned on and connected to an extra-large scale environment (10k VMs total, 5k or more VMs online).

2. C&U data collection performance - the scheduling of C&U collections becomes an issue when a single zone has >5k-9k VMs, as the scheduling can now add almost 10 minutes before a target is even queued for a C&U metrics collection.  This will result in missing usage metrics.

3. Scalability of C&U scheduling - the concurrency of scheduling perf_captures is limited to a single process per zone.  If you wish to collect beyond 5k-9k VMs, you must create additional administrator accounts for the same provider and create more zones in CloudForms.

Comment 6 Alex Krzos 2015-11-10 18:49:51 UTC
Updates:

I have turned the number of VMs (and thus targets) down to 5k online VMs and raised the memory threshold for Priority/Generic Workers to 5GiB in the CFME advanced configuration, and PriorityWorkers are still exceeding the memory threshold:


[----] I, [2015-11-10T10:27:34.819561 #2351:381e90]  INFO -- : MIQ(MiqQueue.put)        Message id: [1000004242072],  id: [], Zone: [default], Role: [], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqEvent.raise_evm_event], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: [["MiqServer", 1000000000003], "evm_worker_memory_exceeded", {:event_details=>"Worker [MiqPriorityWorker] with ID: [1000000001198], PID: [46465], GUID: [70c5e7a2-873f-11e5-af05-0001a4ae710a] process memory usage [5377486848] exceeded limit [5368709120], requesting worker to exit", :type=>"MiqPriorityWorker"}]

Since raising the memory threshold lengthens the time it takes a worker to grow and eventually be recycled, the interval between these events has increased.  The worker referenced above was alive for ~15 hours before it hit 5GiB of RSS memory usage.  It is unknown at this time whether the GenericWorkers will eventually hit 5GiB as well, but currently in the testbed their RSS is above the limit shown on the Workers configuration tab (1.5GiB).

Attached is a screenshot of a memory usage graph for the entire appliance, showing when the worker was recycled at ~10:30.

Comment 7 Alex Krzos 2015-11-10 18:50:54 UTC
Created attachment 1092394 [details]
Memory usage of appliance in which a 5GiB MiqPriorityWorker is recycled

Comment 8 Alex Krzos 2015-11-17 21:14:04 UTC
Created attachment 1095683 [details]
Single Priority Worker evm.log file with rss/virt memory per log line

This log file contains a single PriorityWorker's output with instrumented RSS/virtual memory usage on each log line.  This should give us a pretty good idea of which work makes the PriorityWorker's memory usage jump the most.

With the appliance tuned to 1 PriorityWorker and 0 GenericWorkers, I have not seen the same RSS memory usage as in the past.  As of right now the PriorityWorker has consumed 3.3GiB of memory rather than the 6GiB I have seen previously.
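
The per-line RSS/virt figures come from added instrumentation; a minimal sketch of how such numbers can be sampled for the current worker process (an assumption about the approach, not the actual patch used here):

    # Minimal sketch: sample RSS/virtual size (in KiB) for the current process from /proc.
    # Illustrates the kind of per-log-line instrumentation described; not the actual patch.
    def memory_usage_kb
      status = File.read("/proc/#{Process.pid}/status")
      {
        :rss_kb  => status[/^VmRSS:\s+(\d+)/, 1].to_i,
        :virt_kb => status[/^VmSize:\s+(\d+)/, 1].to_i
      }
    end

    # e.g. prefix each log line with:
    #   "rss=#{memory_usage_kb[:rss_kb]}KiB virt=#{memory_usage_kb[:virt_kb]}KiB ..."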

Comment 9 Alex Krzos 2016-01-19 18:14:07 UTC
This has re-emerged as a bigger issue in 5.5.  I have perf_capture_timer messages that require more than 1 hour to complete and only queue 1-2 VM perf_captures per second in this environment.

The environment is 10,000 VMs with 5,000 online.  There are 221 datastores and 200 hosts, which results in a total of 5,421 captures on the queue at the completion of a perf_capture_timer if no captures are queued at the start.  I have instrumented the environment with stackprof to try to isolate what is consuming all of the time while queueing perf_captures.  To do so, I turned off the coordinator role so that no new captures would be queued by existing workers, and then turned off the collectors on each appliance as well, so there would be little if any database interaction from the other appliances that could interrupt a profiled perf_capture_timer.

I then profiled using the following code in the rails console:

StackProf.run(mode: :cpu, out: '/root/stackprof-perf_capture_timer-0') { timing = Benchmark.realtime { Metric::Capture.perf_capture_timer(the_zone) } }

The results:

# stackprof stackprof-perf_capture_timer-0 --limit 50
==================================
  Mode: cpu(1000)
  Samples: 2436845 (1.72% miss rate)
  GC: 443507 (18.20%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
    349146  (14.3%)      349146  (14.3%)     ActiveRecord::Attribute#initialize
     92010   (3.8%)       92010   (3.8%)     block in ActiveRecord::Result#hash_rows
     83768   (3.4%)       83768   (3.4%)     ActiveRecord::LazyAttributeHash#key?
     83689   (3.4%)       83689   (3.4%)     ActiveRecord::Type::Integer#type_cast_from_database
     77107   (3.2%)       77098   (3.2%)     block in ActiveRecord::Reflection::AssociationReflection#inverse_name
     73869   (3.0%)       73869   (3.0%)     ActiveRecord::Core#init_internals
    426312  (17.5%)       72538   (3.0%)     ActiveRecord::LazyAttributeHash#assign_default_value
     62292   (2.6%)       62292   (2.6%)     block (4 levels) in Class#class_attribute
     61987   (2.5%)       61987   (2.5%)     #<Module:0x00000006ad3710>.convert
    103199   (4.2%)       58171   (2.4%)     block (2 levels) in Class#class_attribute
     58136   (2.4%)       58136   (2.4%)     ActiveRecord::Reflection::AssociationReflection#foreign_key
     70279   (2.9%)       57580   (2.4%)     ActiveRecord::Core::ClassMethods#allocate
     52903   (2.2%)       52903   (2.2%)     ActiveRecord::Associations::Association#reset
     50972   (2.1%)       50972   (2.1%)     block (2 levels) in <class:Numeric>
     97896   (4.0%)       49353   (2.0%)     ActiveRecord::ModelSchema::ClassMethods#inheritance_column
     42101   (1.7%)       42101   (1.7%)     ActiveRecord::Associations#association_instance_set
     42037   (1.7%)       42037   (1.7%)     ActiveRecord::Reflection::AssociationReflection#polymorphic?
    468220  (19.2%)       41908   (1.7%)     ActiveRecord::LazyAttributeHash#[]
     39784   (1.6%)       39784   (1.6%)     ActiveRecord::Type::String#cast_value
     37738   (1.5%)       37738   (1.5%)     ActiveRecord::LazyAttributeHash#initialize
    198515   (8.1%)       37031   (1.5%)     ActiveRecord::Attribute#value
     71562   (2.9%)       31228   (1.3%)     ActiveRecord::Type::Value#type_cast
     27833   (1.1%)       27833   (1.1%)     NilClass#blank?
    552101  (22.7%)       27636   (1.1%)     block in ActiveRecord::Result#each
     25791   (1.1%)       25180   (1.0%)     ActiveRecord::AttributeMethods::ClassMethods#define_attribute_methods
     21767   (0.9%)       21767   (0.9%)     ActiveSupport::Callbacks::CallbackChain#empty?
     21732   (0.9%)       21732   (0.9%)     ActiveRecord::Associations::Association#reset_scope
     20810   (0.9%)       20810   (0.9%)     ActiveRecord::Associations#association_instance_get
     73145   (3.0%)       18285   (0.8%)     ActiveRecord::AttributeSet::Builder#build_from_database
     17122   (0.7%)       17122   (0.7%)     ActiveRecord::AttributeSet#initialize
     31593   (1.3%)       15233   (0.6%)     ActiveRecord::Reflection::ClassMethods#_reflect_on_association
     15007   (0.6%)       14688   (0.6%)     block in ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#exec_cache
     10729   (0.4%)       10729   (0.4%)     ActiveRecord::Result#initialize
      8882   (0.4%)        8879   (0.4%)     ActiveRecord::ModelSchema::ClassMethods#attributes_builder
      8520   (0.3%)        8520   (0.3%)     ActiveRecord::Reflection::BelongsToReflection#macro
     65078   (2.7%)        8220   (0.3%)     ActiveRecord::Reflection::AbstractReflection#inverse_of
      7783   (0.3%)        7783   (0.3%)     ActiveRecord::Persistence::ClassMethods#discriminate_class_for_record
      5466   (0.2%)        5466   (0.2%)     ActiveRecord::Attribute#initialized?
     39996   (1.6%)        5414   (0.2%)     ActiveSupport::Callbacks#__run_callbacks__
      5120   (0.2%)        5120   (0.2%)     ActiveSupport::PerThreadRegistry#instance
    353749  (14.5%)        4700   (0.2%)     ActiveRecord::Attribute.from_database
    435440  (17.9%)        4479   (0.2%)     block (2 levels) in AssignmentMixin::ClassMethods#get_assigned_for_target
    761805  (31.3%)        4325   (0.2%)     ActiveRecord::Associations::Association#set_inverse_instance
    472523  (19.4%)        4303   (0.2%)     ActiveRecord::AttributeSet#[]
      3547   (0.1%)        3547   (0.1%)     #<Module:0x000000022d6b60>.load_with_autoloading
     80620   (3.3%)        3513   (0.1%)     ActiveRecord::Reflection::AssociationReflection#inverse_name
      3848   (0.2%)        3447   (0.1%)     Date._parse
      3435   (0.1%)        3435   (0.1%)     ThreadSafe::NonConcurrentCacheBackend#[]
    181279   (7.4%)        3347   (0.1%)     ActiveRecord::Associations::Association#loaded!
      2727   (0.1%)        2727   (0.1%)     ActiveRecord::Base.logger

The majority of the time is spent within ActiveRecord as shown by the profiler.
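
To make that concrete: the hot frames above are the per-row cost of materializing full ActiveRecord objects (attribute sets, type casting, association bookkeeping) for thousands of targets on every perf_capture_timer run.  A generic illustration of that cost follows; it is not the specific fix applied for this bug, and `ems` is assumed to be an ExtManagementSystem record:

    # Generic illustration only (not the fix in this BZ): loading full ActiveRecord
    # objects pays ActiveRecord::Attribute/LazyAttributeHash costs per row, which is
    # what dominates the stackprof output above; plucking plain values skips all of it.
    vms = Vm.where(:ems_id => ems.id).to_a          # builds thousands of AR objects
    ids = Vm.where(:ems_id => ems.id).pluck(:id)    # returns plain integers instead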

Comment 11 CFME Bot 2016-02-11 15:21:20 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/7b674c4c042cba0c8dc3665f98985bcb99328bd1

commit 7b674c4c042cba0c8dc3665f98985bcb99328bd1
Author:     Keenan Brock <kbrock>
AuthorDate: Wed Feb 10 13:20:41 2016 -0500
Commit:     Keenan Brock <kbrock>
CommitDate: Wed Feb 10 16:41:05 2016 -0500

    don't load miq_server every config param read
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1227008
    
    |VmOrTemplate|Storage|ExtManagementSystem|MiqRegion|Zone|
    |        ---:|   ---:|               ---:|     ---:|---:|
    |        100 |    13 |                 1 |       1 |  1 |
    
    |     ms |  sql | sqlms | comments
    |    ---:|  ---:|   ---:| ---
    | 3344.1 | 1263 | 270.5 | a100-master-1
    |  818.6 |  449 |  97.1 | a100-master-2
    |  847.6 |  449 |  90.8 | a100-master-3
    |  968.1 |  449 | 115.6 | a100-master-4
    
    |      ms |  sql |  sqlms | comments
    |     ---:|  ---:|    ---:| ---
    |  3351.0 | 1175 |  262.9 | a100-mtime-1
    |   730.4 |  361 |   82.0 | a100-mtime-2
    |   789.5 |  361 |   79.6 | a100-mtime-3
    |   746.8 |  361 |   87.5 | a100-mtime-4

 lib/vmdb/config.rb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
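
A schematic reading of the change described in this commit (names and surrounding code are illustrative, not the actual one-line diff in lib/vmdb/config.rb): stop forcing a fresh MiqServer load on every configuration read and reuse the cached lookup instead.

    # Schematic sketch only -- not the actual diff.
    # Before: every config read reloaded the MiqServer row from the database.
    server = MiqServer.my_server(true)    # force reload on each read
    # After: reuse the process-cached server object for config reads.
    server = MiqServer.my_server          # cached lookup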

Comment 15 CFME Bot 2016-02-15 16:56:03 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/38f95c8add65beb02f2c5cc274f2e30736f13fd5

commit 38f95c8add65beb02f2c5cc274f2e30736f13fd5
Author:     Keenan Brock <kbrock>
AuthorDate: Mon Feb 1 10:30:02 2016 -0500
Commit:     Keenan Brock <kbrock>
CommitDate: Sat Feb 13 19:32:00 2016 -0500

    Don't create a region object
    
    my_region does not change. Just use a cached
    version of it.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1227008
    
    |VmOrTemplate|Storage|ExtManagementSystem|MiqRegion|Zone|
    |        ---:|   ---:|               ---:|     ---:|---:|
    |        100 |    13 |                 1 |       1 |  1 |
    
    |     ms |  sql | sqlms | comments
    |    ---:|  ---:|   ---:| ---
    | 3344.1 | 1263 | 270.5 | a100-master-1
    |  818.6 |  449 |  97.1 | a100-master-2
    |  847.6 |  449 |  90.8 | a100-master-3
    |  968.1 |  449 | 115.6 | a100-master-4
    
    |     ms |  sql |  sqlms | comments
    |    ---:|  ---:|    ---:| ---
    | 3296.4 | 1250 |  268.5 | a100-region-cache-1
    |  777.0 |  436 |   92.3 | a100-region-cache-2
    |  837.0 |  436 |   91.2 | a100-region-cache-3
    |  825.3 |  436 |  102.7 | a100-region-cache-4

 app/models/miq_region.rb | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)
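
A schematic sketch of the caching described in this commit (illustrative, not the actual diff in app/models/miq_region.rb):

    # Schematic sketch only.  Since the region row does not change at runtime,
    # memoize it once per process instead of querying/building it on every call.
    class MiqRegion < ActiveRecord::Base
      def self.my_region
        @my_region ||= find_by(:region => my_region_number)
      end
    end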

Comment 16 CFME Bot 2016-02-15 16:56:07 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/65339fdc0f37bf587b8b910a63d598b2ee5e27b6

commit 65339fdc0f37bf587b8b910a63d598b2ee5e27b6
Author:     Keenan Brock <kbrock>
AuthorDate: Wed Feb 10 17:38:05 2016 -0500
Commit:     Keenan Brock <kbrock>
CommitDate: Sat Feb 13 19:32:00 2016 -0500

    Fetch all ems hosts instead of going through cluster
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1227008
    
    This will reuse ems, and no longer lazy load all hosts and
    tags twice - resulting in almost half the queries
    (read that many fewer objects in memory)
    
    |VmOrTemplate|Host|Storage|ExtManagementSystem|MiqRegion|Zone|
    |-----------:|---:|------:|------------------:|--------:|---:|
    |      10000 |200 |   221 |                 1 |       1 |  1 |
    
    before:
    has region fix
    
    |        ms |    sql |    sqlms | comments
    |       ---:|    ---:|      ---:| ---
    |  30,429.2 | 11,772 | 13,952.0 | before-1
    |  28,832.5 | 11,745 | 13,796.6 | before-2
    |  28,794.5 | 11,745 | 13,651.3 | before-3
    |  28,695.7 | 11,745 | 13,749.3 | before-4
    
    after
    has region fix and this pr
    
    |        ms |   sql |   sql ms | comments
    |       ---:|   ---:|      ---:| ---
    |  26,707.0 | 6,371 | 13,600.0 | after-1
    |  24,813.1 | 6,344 | 13,165.1 | after-2
    |  24,959.9 | 6,344 | 13,215.7 | after-3
    |  25,540.2 | 6,344 | 13,464.6 | after-4

 app/models/metric/targets.rb | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)
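
A schematic sketch of the access-pattern change described in this commit (association usage is illustrative, not the actual diff in app/models/metric/targets.rb):

    # Schematic sketch only.
    # Before: walking clusters lazy-loads each cluster's hosts (and their tags) separately.
    hosts = ems.ems_clusters.flat_map(&:hosts)
    # After: fetch the hosts once through the ems association, reusing the ems object
    # and loading hosts/tags a single time -- roughly half the queries, per the table above.
    hosts = ems.hosts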

Comment 17 CFME Bot 2016-03-16 16:06:03 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/d9f1c2ffc511d742ea3b0b232a356ea3884e0b44

commit d9f1c2ffc511d742ea3b0b232a356ea3884e0b44
Author:     Keenan Brock <kbrock>
AuthorDate: Mon Feb 8 17:04:05 2016 -0500
Commit:     Keenan Brock <kbrock>
CommitDate: Sat Feb 13 18:31:02 2016 -0500

    pass zone into queue_captures
    
    It is expensive to determine the zone, especially for
    Storages that need to look up a number of objects.
    Since it is known, just pass it into enqueue methods
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1227008
    
    |VmOrTemplate|Storage|ExtManagementSystem|MiqRegion|Zone|
    |        ---:|   ---:|               ---:|     ---:|---:|
    |        100 |    13 |                 1 |       1 |  1 |
    
    |     ms |  sql | sqlms | comments
    |    ---:|  ---:|   ---:| ---
    | 3344.1 | 1263 | 270.5 | a100-master-1
    |  818.6 |  449 |  97.1 | a100-master-2
    |  847.6 |  449 |  90.8 | a100-master-3
    |  968.1 |  449 | 115.6 | a100-master-4
    
    |      ms | sql |  sqlms | comments
    |     ---:| ---:|    ---:| ---
    |  3195.0 |1177 |  261.6 | a100-zone-1
    |   742.8 | 363 |   89.8 | a100-zone-2
    |   799.5 | 363 |   97.9 | a100-zone-3
    |   708.0 | 363 |   84.7 | a100-zone-4

 app/models/metric/capture.rb          | 8 ++++----
 app/models/metric/ci_mixin/capture.rb | 5 +++--
 2 files changed, 7 insertions(+), 6 deletions(-)
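
A schematic sketch of the parameter change described in this commit (method and argument forms are illustrative, not the actual diff):

    # Schematic sketch only.
    # Before: each target re-derived its zone inside the enqueue path; for a Storage
    # that means loading several associated objects just to find the zone.
    targets.each { |t| t.perf_capture_queue("realtime") }
    # After: perf_capture_timer already knows the zone, so pass it down to the enqueue call.
    targets.each { |t| t.perf_capture_queue("realtime", :zone => zone.name) }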

Comment 19 Keenan Brock 2016-04-27 13:27:40 UTC
A bunch of performance enhancements have been merged into master. Let me know if this is still an issue.

Comment 20 Alex Krzos 2016-06-20 23:28:58 UTC
I have since tested this with 5.6.0.11, attempting to schedule perf_captures on 5,000 VM targets again on Haswell hardware with ample memory for the appliance (16GiB), and I currently cannot get the scheduling to complete without the message timing out.  I will add another appliance to the mix to move the scheduling off the database appliance (or vice versa) to see if this can complete in time on an appliance with less work occurring on it.

Comment 21 Keenan Brock 2016-06-22 01:04:25 UTC
Assigning back to me; this is still an issue.

Comment 24 Keenan Brock 2016-08-16 16:15:43 UTC
Tests for bug 1349062 should be sufficient.

Comment 25 Satoe Imaishi 2016-08-29 12:10:24 UTC
PR: https://github.com/ManageIQ/manageiq/pull/9448

Comment 26 Archit Sharma 2016-12-12 13:17:20 UTC
Single DB appliance, 4 worker appliances, 0 recycled workers; 5k online VMs from a VMware provider.
=======================================================================

showing output from log file: ~/perf_capture_timer_last_avg.log:

This reading was recorded on: 2016-12-12 08:12:08,268
...Reading logs from /var/www/miq/vmdb/log/evm.log
      and extracting last 100 events.
      Begin time: 2016-12-12T07:34
      End time: 2016-12-12T08:04
10 IDs found..

#	 | Timestamp			 | IDs | Metric::Capture.perf_capture_timer
----------------------------------------------------------------------
0	 | Mon Dec 12 07:34:52 EST 2016	 | 8667 | 20.486461867
1	 | Mon Dec 12 07:38:00 EST 2016	 | 10880 | 27.291034255
2	 | Mon Dec 12 07:41:09 EST 2016	 | 29407 | 32.840495136
3	 | Mon Dec 12 07:44:04 EST 2016	 | 37936 | 31.357679297
4	 | Mon Dec 12 07:47:00 EST 2016	 | 45778 | 27.009139387
5	 | Mon Dec 12 07:49:56 EST 2016	 | 53909 | 23.166915821
6	 | Mon Dec 12 07:52:54 EST 2016	 | 55273 | 19.152661771
7	 | Mon Dec 12 07:55:53 EST 2016	 | 56148 | 21.572165016
8	 | Mon Dec 12 07:58:56 EST 2016	 | 57016 | 21.522156136
9	 | Mon Dec 12 08:01:54 EST 2016	 | 57898 | 22.586555143
----------------------------------------------------------------------
▁▅█▇▅▃▁▂▂▂
Average: 24.69850000000000000000
----------------------------------------------------------------------
10 events were queued, out of which 10 events were delivered with [ok] status...