Bug 1251259 - [Scale] MiqWidget.generate_content at large scale consumes tremendous amount of memory and times out
Summary: [Scale] MiqWidget.generate_content at large scale consumes tremendous amount of memory and times out
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Performance
Version: 5.4.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: GA
Target Release: 5.8.0
Assignee: Keenan Brock
QA Contact: Pradeep Kumar Surisetty
URL:
Whiteboard: perf:widget:dashboard
Depends On:
Blocks: 1365254 1434150
 
Reported: 2015-08-06 20:28 UTC by Alex Krzos
Modified: 2019-08-06 20:04 UTC
CC List: 10 users

Fixed In Version: 5.8.0.7
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1365254 1434150
Environment:
Last Closed: 2017-06-12 16:12:32 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:


Attachments
evm.log snippets and graphs of memory and cpu usage (534.43 KB, application/zip)
2015-08-06 20:28 UTC, Alex Krzos
Multiple workers spawning due to memory threshold exceeding evm.log (103.95 KB, application/zip)
2015-08-17 20:25 UTC, Alex Krzos
memory usage due to reporting workers exceeding memory and spawning new workers (90.73 KB, image/png)
2015-08-17 20:28 UTC, Alex Krzos
cpu usage due to reporting workers exceeding memory and spawning new workers (70.28 KB, image/png)
2015-08-17 20:28 UTC, Alex Krzos
Graph of Memory usage of collector workers over time displaying overlap of 3 workers when the desired threshold is 2 workers. (83.89 KB, image/png)
2015-11-02 19:54 UTC, Alex Krzos

Description Alex Krzos 2015-08-06 20:28:36 UTC
Created attachment 1060072 [details]
evm.log snippets and graphs of memory and cpu usage

Description of problem:
In a large-scale environment (100k VMware VMs), turning on the reporting role schedules dashboard widget generation, which consumes a tremendous amount of memory (>2GiB per worker) and 100% nice CPU per worker for the duration of an hour; eventually the work times out because the dashboard widget cannot be completed within 3600s.

Widgets that appear affected are:
Top Storage Consumers
Vendor and Guest OS Chart
Guest OS Information

Version-Release number of selected component (if applicable):
CFME 5.4

How reproducible:
Always in this environment

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Three examples of evm.log lines demonstrating the reports being processed, along with graphs of appliance memory and CPU usage, are attached.

Comment 2 Alex Krzos 2015-08-17 18:38:51 UTC
Another side effect of this issue is that additional reporting workers are spawned and can consume the remaining available CPU and memory on the appliance.

When a worker exceeds its memory threshold and the stopping process for that worker begins, a new worker is spawned in its place. During that time frame the new reporting worker can pick up another queued message, which could be another large report that occupies a reporting worker for 1 hour (the message timeout) plus CPU and a lot of memory. The cycle continues (sketched after the list below) until either:

No more messages remain on the queue for reporting workers to get
Messages on the queue don't cause the new spawned reporting workers to break memory threshold
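
To make the cycle concrete, here is a minimal, self-contained Ruby sketch of the behavior described above. The Worker struct and every name in it are illustrative only, not ManageIQ's actual server-monitor code; the memory threshold value is taken from the worker settings quoted later in comment 5.

# Illustrative sketch only -- not ManageIQ code.
Worker = Struct.new(:id, :memory, :stopping)

MEMORY_THRESHOLD = 419_430_400                    # bytes, per :memory_threshold
queue   = Array.new(6) { :large_widget_report }   # the 6 queued widget messages
workers = [Worker.new(1, 0, false), Worker.new(2, 0, false)]
next_id = 3

until queue.empty?
  workers.reject(&:stopping).each do |w|          # active workers dequeue work
    break if queue.empty?
    queue.shift
    w.memory = 2 * 1024**3                        # a large report grows RSS past 2 GiB
  end
  over = workers.select { |w| !w.stopping && w.memory > MEMORY_THRESHOLD }
  over.each do |w|
    w.stopping = true                             # asked to stop, but keeps running
                                                  # until its current message completes
    workers << Worker.new(next_id, 0, false)      # replacement spawns immediately
    next_id += 1
  end
end

puts "reporting workers alive: #{workers.size} (configured :count is 2)"
# => 8, i.e. 6 stopping plus 2 fresh, matching the evm:status output below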


With the default widgets plus a second RBAC administrator (not a super administrator), this queues 6 messages that consume all 4 vCPUs on the appliance and more than 16GiB of memory in the 100k VM environment.

evm:status showing 6 Reporting workers being stopped, with 2 brand new ones on deck without any more work.

 Worker Type        | Status   |    ID |   PID | SPID  | Queue Name / URL      | Started On           | Last Heartbeat
--------------------+----------+-------+-------+-------+-----------------------+----------------------+----------------------
 MiqGenericWorker   | started  | 19030 |  2723 | 2761  | generic               | 2015-08-17T17:29:02Z | 2015-08-17T18:27:27Z
 MiqGenericWorker   | started  | 19031 |  2726 | 2763  | generic               | 2015-08-17T17:29:05Z | 2015-08-17T18:27:27Z
 MiqPriorityWorker  | started  | 19033 |  2732 | 2750  | generic               | 2015-08-17T17:28:59Z | 2015-08-17T18:27:28Z
 MiqPriorityWorker  | started  | 19032 |  2729 | 2749  | generic               | 2015-08-17T17:28:59Z | 2015-08-17T18:27:28Z
 MiqReportingWorker | stopping | 19046 | 13054 | 13065 | reporting             | 2015-08-17T18:16:52Z | 2015-08-17T18:21:56Z
 MiqReportingWorker | started  | 19050 | 15256 | 15272 | reporting             | 2015-08-17T18:26:45Z | 2015-08-17T18:27:27Z
 MiqReportingWorker | started  | 19049 | 15253 | 15273 | reporting             | 2015-08-17T18:26:47Z | 2015-08-17T18:27:28Z
 MiqReportingWorker | stopping | 19045 | 13051 | 13066 | reporting             | 2015-08-17T18:16:52Z | 2015-08-17T18:21:55Z
 MiqReportingWorker | stopping | 19041 |  2771 | 2836  | reporting             | 2015-08-17T17:29:21Z | 2015-08-17T18:15:57Z
 MiqReportingWorker | stopping | 19048 | 14361 | 14373 | reporting             | 2015-08-17T18:22:32Z | 2015-08-17T18:25:59Z
 MiqReportingWorker | stopping | 19047 | 14358 | 14374 | reporting             | 2015-08-17T18:22:33Z | 2015-08-17T18:25:56Z
 MiqReportingWorker | stopping | 19042 |  2774 | 2837  | reporting             | 2015-08-17T17:29:21Z | 2015-08-17T18:15:59Z
 MiqScheduleWorker  | started  | 19034 |  2738 | 2756  |                       | 2015-08-17T17:29:01Z | 2015-08-17T18:27:27Z
 MiqUiWorker        | started  | 19043 |  2777 |       | http://127.0.0.1:3000 | 2015-08-17T17:29:23Z | 2015-08-17T18:27:24Z




Sample top output during this event:

Note 0% idle cpu available, and large resident memory sizes of reporting workers.

top - 18:33:59 up  1:06,  4 users,  load average: 6.72, 6.31, 3.92
Tasks:   8 total,   6 running,   2 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.6%us,  0.3%sy, 98.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.1%st
Mem:  16325028k total, 14998164k used,  1326864k free,    46524k buffers
Swap:  9957372k total,        0k used,  9957372k free,  1222888k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
14358 root      27   7 2068m 1.8g 6160 R 84.6 11.4   5:12.60 /var/www/miq/vmdb/lib/workers/bin/worker.rb
13051 root      27   7 1774m 1.5g 6188 R 71.6  9.6   8:52.34 /var/www/miq/vmdb/lib/workers/bin/worker.rb
 2771 root      27   7 2571m 2.3g 6204 R 64.9 14.5  14:48.37 /var/www/miq/vmdb/lib/workers/bin/worker.rb
 2774 root      27   7 2561m 2.3g 6204 R 64.9 14.5  14:56.04 /var/www/miq/vmdb/lib/workers/bin/worker.rb
13054 root      27   7 1753m 1.5g 6184 R 53.9  9.4   8:44.48 /var/www/miq/vmdb/lib/workers/bin/worker.rb
14361 root      27   7 2069m 1.8g 6156 R 51.3 11.4   5:14.42 /var/www/miq/vmdb/lib/workers/bin/worker.rb
15256 root      27   7  380m 135m 6132 S  0.0  0.8   0:10.73 /var/www/miq/vmdb/lib/workers/bin/worker.rb
15253 root      27   7  365m 119m 6132 S  0.0  0.8   0:10.34 /var/www/miq/vmdb/lib/workers/bin/worker.rb

Comment 3 Jason Frey 2015-08-17 19:33:51 UTC
Side question, why is the report timeout set to 1 hour?  Seems excessive.

Comment 4 Jason Frey 2015-08-17 19:38:21 UTC
Alex,

Can you show the config settings for these workers (they should be in the log)? I'm curious about the kill algorithm and the threshold values there.

Comment 5 Alex Krzos 2015-08-17 20:24:29 UTC
The report settings are the defaults for those widgets, so 3600s is set for them, and that happens to not be enough time in this environment (100k VMs, of course) for the dashboard widgets to be built.

I was able to spawn a total of 8 report workers that were building three separate reports concurrently over two EVM groups (EvmGroup-administrator, EvmGroup-super_administrator).

A total of 6 of the large reports that end up timing out were queued. I progressively queued more reports, although due to the schedule worker this also occurs nightly on the appliance.

Config Settings per worker:

:count: 2
:gc_interval: 900
:heartbeat_freq: 60
:heartbeat_timeout: 120
:memory_threshold: 419430400
:nice_delta: 7
:parent_time_threshold: 180
:poll: 3
:poll_escalate_max: 30
:poll_method: :normal
:restart_interval: 0
:starting_timeout: 600
:cpu_usage_threshold: 100
:dequeue_method: :drb
:queue_timeout: 600

I will test bumping the memory threshold to see if this prevents spawning additional workers until all large-scale reporting widget messages are consumed. However, we would still want to prevent the condition where exceeding the memory threshold creates another worker that immediately begins working (if work is available) and consumes resources in what is likely already a resource-constrained environment.
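
For the test, the override I have in mind is roughly the following. The key names are taken from the settings above (and the 5.8 defaults quoted later in this bug); the exact nesting is an assumption and should be checked against the appliance's advanced settings before applying.

# Assumed shape of the advanced-settings override -- verify the nesting first.
:workers:
  :worker_base:
    :queue_worker_base:
      :reporting_worker:
        :count: 2
        :memory_threshold: 838860800   # 800 MiB, double the 419430400 default
        :nice_delta: 7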

Attached is an evm.log snippet showing all of the activity logged on the database appliance, which runs the reporting workers:

MessageID	Widget	Group	Report Worker PID
1019910	Guest OS Information	EvmGroup-administrator	2771
1019911	Guest OS Information	EvmGroup-super_administrator	2774
1020037	Vendor and Guest OS Chart	EvmGroup-administrator	13051
1020038	Vendor and Guest OS Chart	EvmGroup-super_administrator	13054
1020113	Top Storage Consumers	EvmGroup-administrator	14358
1020114	Top Storage Consumers	EvmGroup-super_administrator	14361

Comment 6 Alex Krzos 2015-08-17 20:25:09 UTC
Created attachment 1064065 [details]
Multiple workers spawning due to memory threshold exceeding evm.log

Comment 7 Alex Krzos 2015-08-17 20:28:10 UTC
Created attachment 1064066 [details]
memory usage due to reporting workers exceeding memory and spawning new workers

Comment 8 Alex Krzos 2015-08-17 20:28:44 UTC
Created attachment 1064067 [details]
cpu usage due to reporting workers exceeding memory and spawning new workers

Comment 9 Alex Krzos 2015-08-17 20:29:36 UTC
Updated graphs are attached showing the effect of more reports spawning additional workers and then consuming all CPU/memory resources on the appliance.

Comment 10 Alex Krzos 2015-08-18 13:11:11 UTC
This is not limited to CFME 5.4; I was able to reproduce it on 5.3.3.2.

I turned on an older 5.3 appliance with a large number of VMs in the vmdb. I turned on the reporting role, ensured 2 workers were allowed to spawn, and tuned the memory threshold down just in case there wasn't enough data in the database for the default reports to exceed memory. With the lower threshold, they did, and the same effect was witnessed:

[root@CFME-R99-Master-VMDB vmdb]# cat VERSION
5.3.3.2
[root@CFME-R99-Master-VMDB vmdb]# service evmserverd status
Checking EVM status...
 Zone    | Server Name          | Status  |             ID |  PID | SPID | URL                     | Started On           | Last Heartbeat       | Active Roles
---------+----------------------+---------+----------------+------+------+-------------------------+----------------------+----------------------+-----------------------------------------------------------------------
 default | Perf-MasterRegion-DB | started | 99000000000001 | 2383 | 2422 | druby://127.0.0.1:39795 | 2015-08-17T19:54:42Z | 2015-08-18T13:07:22Z | database_operations:database_owner:reporting:scheduler:user_interface

 Worker Type        | Status   |             ID |   PID | SPID  | Queue Name / URL    | Started On           | Last Heartbeat
--------------------+----------+----------------+-------+-------+---------------------+----------------------+----------------------
 MiqGenericWorker   | started  | 99000000000306 |  2646 | 2702  | generic             | 2015-08-17T19:55:02Z | 2015-08-18T13:07:37Z
 MiqGenericWorker   | started  | 99000000000305 |  2642 | 2703  | generic             | 2015-08-17T19:55:02Z | 2015-08-18T13:07:35Z
 MiqPriorityWorker  | started  | 99000000000307 |  2652 | 2686  | generic             | 2015-08-17T19:54:55Z | 2015-08-18T13:07:37Z
 MiqPriorityWorker  | started  | 99000000000308 |  2656 | 2684  | generic             | 2015-08-17T19:54:54Z | 2015-08-18T13:07:37Z
 MiqReportingWorker | stopping | 99000000000318 | 63360 | 63371 | reporting           | 2015-08-18T12:59:03Z | 2015-08-18T13:00:49Z
 MiqReportingWorker | stopping | 99000000000319 | 63982 | 64000 | reporting           | 2015-08-18T13:01:44Z | 2015-08-18T13:02:07Z
 MiqReportingWorker | stopping | 99000000000315 | 35746 | 35764 | reporting           | 2015-08-18T02:04:38Z | 2015-08-18T12:58:07Z
 MiqReportingWorker | started  | 99000000000322 | 64462 | 64489 | reporting           | 2015-08-18T13:03:27Z | 2015-08-18T13:07:35Z
 MiqReportingWorker | stopping | 99000000000320 | 63986 | 63999 | reporting           | 2015-08-18T13:01:43Z | 2015-08-18T13:02:07Z
 MiqReportingWorker | stopping | 99000000000317 | 63356 | 63370 | reporting           | 2015-08-18T12:59:03Z | 2015-08-18T13:00:49Z
 MiqReportingWorker | started  | 99000000000321 | 64459 | 64488 | reporting           | 2015-08-18T13:03:27Z | 2015-08-18T13:07:35Z
 MiqReportingWorker | stopping | 99000000000316 | 35749 | 35765 | reporting           | 2015-08-18T02:04:38Z | 2015-08-18T12:58:07Z
 MiqScheduleWorker  | started  | 99000000000310 |  2666 | 2694  |                     | 2015-08-17T19:54:57Z | 2015-08-18T13:07:19Z
 MiqUiWorker        | started  | 99000000000311 |  2670 |       | http://0.0.0.0:3000 | 2015-08-17T19:54:57Z | 2015-08-18T13:07:09Z

[root@CFME-R99-Master-VMDB ~]# top -p 63982,63986,35746,63356,63360,35749,64462,64459
top - 09:08:12 up 17:15,  2 users,  load average: 7.02, 5.70, 2.90
Tasks:   8 total,   6 running,   2 sleeping,   0 stopped,   0 zombie
Cpu(s):  8.7%us,  2.8%sy, 88.3%ni,  0.0%id,  0.1%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   5983704k total,  5837316k used,   146388k free,     1300k buffers
Swap:  9957372k total,  1271080k used,  8686292k free,   127360k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
35749 root      27   7  937m 489m 1512 R 78.3  8.4   8:01.61 /var/www/miq/vmdb/lib/workers/bin/worker.rb
63982 root      27   7 1195m 913m 2456 R 65.7 15.6   3:57.47 /var/www/miq/vmdb/lib/workers/bin/worker.rb
35746 root      27   7  934m 649m 1512 R 64.7 11.1   8:03.85 /var/www/miq/vmdb/lib/workers/bin/worker.rb
63986 root      27   7 1207m 924m 2452 R 59.1 15.8   3:59.58 /var/www/miq/vmdb/lib/workers/bin/worker.rb
63356 root      27   7 1145m 766m 2128 R 42.5 13.1   4:55.38 /var/www/miq/vmdb/lib/workers/bin/worker.rb
63360 root      27   7 1150m 809m 2128 R 36.2 13.8   4:51.38 /var/www/miq/vmdb/lib/workers/bin/worker.rb
64462 root      27   7  324m  41m 1824 S  0.3  0.7   0:09.08 /var/www/miq/vmdb/lib/workers/bin/worker.rb
64459 root      27   7  324m  39m 1824 S  0.0  0.7   0:08.94 /var/www/miq/vmdb/lib/workers/bin/worker.rb

Comment 11 Alex Krzos 2015-11-02 19:53:33 UTC
This issue is not limited to Reporting workers and not limited to 5.3/5.4.  I have reproduced the side effect on a 5.5.0.8 appliance with a different workload from reporting.

In this case a CFME 5.5.0.8 appliance is connected to a RHEVM provider and has C&U turned on. During the initial captures, one worker picks up a Storage.perf_capture on a specific RHEVM data storage domain, which in this environment can grow that worker's memory above the threshold and occupy the worker for an extended period of time. During that time frame the existing worker is asked to exit, but it does not do so until it completes its message. While that is occurring, a new collector is spawned, which immediately begins to pick up work and consume memory and CPU resources.

Attached is a graph of all collector workers' memory usage over time; it shows that pid 63470 overlaps with the new collector worker (pid 5773) that should have "replaced" pid 63470.

This creates a situation where an appliance could unknowingly allow itself to exceed all available memory and CPU resources, causing everything to slow down as the appliance moves memory into swap and contends for usable CPU time.

Comment 12 Alex Krzos 2015-11-02 19:54:59 UTC
Created attachment 1088711 [details]
Graph of Memory usage of collector workers over time displaying overlap of 3 workers when the desired threshold is 2 workers.

Comment 13 Alex Krzos 2015-11-13 20:01:50 UTC
I created a separate bug for the separate issue of multiple workers running concurrently when a worker exceeds its memory threshold during a long-running message.

https://bugzilla.redhat.com/show_bug.cgi?id=1281921

Comment 16 CFME Bot 2017-03-08 14:01:19 UTC
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/9dab1c435796c9945588d3a23d936a3794f156b6

commit 9dab1c435796c9945588d3a23d936a3794f156b6
Author:     Keenan Brock <kbrock>
AuthorDate: Tue Mar 7 17:18:46 2017 -0500
Commit:     Keenan Brock <kbrock>
CommitDate: Tue Mar 7 18:23:57 2017 -0500

    Don't pull back all records when detecting metrics
    
    This avoids `recs.to_a` and uses `recs.klass` instead
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1251259

 app/models/metric/helper.rb | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
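
Judging from the commit message, the shape of the change is roughly the following. This is only an illustration of the recs.to_a vs. recs.klass difference the commit names; the helper name here is hypothetical, not the actual code in app/models/metric/helper.rb.

# Illustrative only -- helper name is hypothetical.
def detect_metric_class(recs)
  # Before (memory-heavy): recs.to_a.first.class
  #   forces ActiveRecord to load every row in the relation just to look at
  #   the first record's class -- at 100k VMs that is a huge amount of data.
  # After (cheap): recs.klass
  #   an ActiveRecord::Relation already knows its model class, so no rows
  #   are fetched at all.
  recs.klass
end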

Comment 17 Keenan Brock 2017-03-17 16:09:39 UTC
There are 2 merged PRs that affect this:

https://github.com/ManageIQ/manageiq/pull/14224
https://github.com/ManageIQ/manageiq/pull/14285

I'm working on a 3rd, but that probably won't make the deadline

Comment 18 Keenan Brock 2017-03-17 21:10:14 UTC
Third done, but probably won't make it into this run

https://github.com/ManageIQ/manageiq/pull/14386

Comment 21 Keenan Brock 2017-03-20 21:38:11 UTC
- Halved the number of reports run for each widget.
- Widget reports are bringing back less data per run.

Comment 25 Archit Sharma 2017-04-12 10:33:59 UTC
>> $ egrep 'MiqReportingWorker' evm.log*| grep 'being deleted' | grep -i status | awk '{print $22}' | uniq
>> [stopped]

No messages found with 'aborted' or 'timed out' status. 

Also, a look at the intervals when the worker memory threshold was exceeded suggests that each time this happened, the messages were deleted from the queue without much delay.

Max and min memory consumed by the reporting worker process:
head -n 1 (max): [1000408000]
tail -n 1 (min): [972955000]

Comment 28 Archit Sharma 2017-04-12 10:54:42 UTC
CFME 5809 additional information:

:workers:
  :worker_base:
    :defaults:
      :count: 1
      :gc_interval: 15.minutes
      :heartbeat_freq: 10.seconds
      :heartbeat_timeout: 2.minutes
      :memory_threshold: 400.megabytes
      :nice_delta: 10
      :parent_time_threshold: 3.minutes
      :poll: 3.seconds
      :poll_escalate_max: 30.seconds
      :poll_method: :normal
      :restart_interval: 0.hours
      :starting_timeout: 10.minutes
      :stopping_timeout: 10.minutes

..
      :reporting_worker:
        :count: 2
        :nice_delta: 7

Comment 29 CFME Bot 2017-04-20 22:31:11 UTC
New commit detected on ManageIQ/manageiq/fine:
https://github.com/ManageIQ/manageiq/commit/f482aee3f7d554c4098964dafc43761096594a1f

commit f482aee3f7d554c4098964dafc43761096594a1f
Author:     Martin Povolny <mpovolny>
AuthorDate: Wed Apr 19 09:13:53 2017 +0200
Commit:     Satoe Imaishi <simaishi>
CommitDate: Thu Apr 20 18:28:26 2017 -0400

    Merge pull request #14497 from kbrock/widget_time_sensitive_2
    
    Mark widgets as non timezone sensitive
    (cherry picked from commit 03f378c89f43d3fcd10eedc015f0df7ad3367c45)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1251259

 product/dashboard/widgets/chart_hosts_summary_by_version.yaml         | 3 ++-
 product/dashboard/widgets/chart_number_of_nodes_per_cpu_cores.yaml    | 3 ++-
 product/dashboard/widgets/chart_pods_per_ready.yaml                   | 1 +
 product/dashboard/widgets/chart_virtual_infrastructure_platforms.yaml | 3 ++-
 product/dashboard/widgets/nodes_by_cpu_capacity.yaml                  | 1 +
 product/dashboard/widgets/pods_per_ready.yaml                         | 1 +
 product/dashboard/widgets/projects_by_number_of_containers.yaml       | 1 +
 product/dashboard/widgets/projects_by_number_of_pods.yaml             | 1 +
 product/dashboard/widgets/tenant_quotas.yaml                          | 4 +++-
 9 files changed, 14 insertions(+), 4 deletions(-)
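
Judging from the PR title and the one-line additions above, the per-widget change looks roughly like this. The flag name (timezone_matters) is inferred from "Mark widgets as non timezone sensitive" rather than copied from the diff, so treat it as an assumption and check the linked commit.

# product/dashboard/widgets/pods_per_ready.yaml (excerpt; flag name assumed)
timezone_matters: false   # the widget's content does not depend on the user's
                          # timezone, so it does not need to be generated once
                          # per timezone -- fewer report runs per widget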

