1296638 – Metrics Collector Workers memory threshold displayed as 200MiB in the Web UI, however they exit at 500MiB threshold

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	UI - OPS
Sub Component:
Version:	5.5.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	5.8.0
Assignee:	Harpreet Kataria
QA Contact:	Pradeep Kumar Surisetty
Docs Contact:
URL:
Whiteboard:	c&u:perf
Depends On:
Blocks:	1411478

Reported:	2016-01-07 17:31 UTC by Alex Krzos
Modified:	2017-04-12 09:26 UTC
CC List:	13 users
Fixed In Version:	5.8.0.0
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1411478
Environment:
Last Closed:	2017-04-12 09:26:47 UTC
Category:	---
Cloudforms Team:	---
Target Upstream Version:
Embargoed:

Description Alex Krzos 2016-01-07 17:31:46 UTC

Description of problem:
The Web UI displays a memory threshold of 200MiB for C&U Data Collectors. However, in all of my memory baseline tests in which a C&U Data Collector exceeds its memory threshold, the limit actually enforced appears to be 400MiB (the default for queue_worker_base).

I would recommend a minimum threshold of 400MiB for C&U collectors in VMware/RHEVM environments on 5.5. On 5.4 we seem to be able to get away with a 200-300MiB limit.

Version-Release number of selected component (if applicable):
5.4.4.2
5.5.0.13-2
5.5.2.0


How reproducible:
I can reproduce this with my C&U memory baseline tests with RHEVM providers since those collectors regularly exceed the memory threshold.

Steps to Reproduce:
1.
2.
3.

Actual results:
The Web UI displays a 200MiB threshold, while the workers are asked to exit at 400MiB.

Expected results:
The Web UI displays the default 400MiB.

Additional info:

It is unclear whether the setting displayed in the Web UI affects anything at all. I have not tested its functionality.


Relevant Log Lines from 5.5.2.0:
[----] I, [2016-01-07T10:10:30.807133 #34133:11af990]  INFO -- : MIQ(MiqQueue.put) Message id: [22667],  id: [], Zone: [default], Role: [], Server: [], Ident: [generic], Target id: [], Instance id: [], Task id: [], Command: [MiqEvent.raise_evm_event], Timeout: [600], Priority: [100], State: [ready], Deliver On: [], Data: [], Args: [["MiqServer", 1], "evm_worker_memory_exceeded", {:event_details=>"Worker [ManageIQ::Providers::Redhat::InfraManager::MetricsCollectorWorker] with ID: [18], PID: [48824], GUID: [f156dd68-b54d-11e5-9d80-001a4a223927] process memory usage [443781120] exceeded limit [419430400], requesting worker to exit", :type=>"ManageIQ::Providers::Redhat::InfraManager::MetricsCollectorWorker"}]
...
[----] I, [2016-01-07T10:12:01.508835 #48824:1229998]  INFO -- : MIQ(ManageIQ::Providers::Redhat::InfraManager::MetricsCollectorWorker#log_status) [C&U Metrics Collector for RHEV] Worker ID [18], PID [48824], GUID [f156dd68-b54d-11e5-9d80-001a4a223927], Last Heartbeat [2016-01-07 15:08:21 UTC], Process Info: Memory Usage [593125376], Memory Size [934707200], Memory % [3.56], CPU Time [42567.0], CPU % [0.27], Priority [23]
[----] I, [2016-01-07T10:12:01.509028 #48824:1229998]  INFO -- : MIQ(ManageIQ::Providers::Redhat::InfraManager::MetricsCollectorWorker::Runner) ID [18] PID [48824] GUID [f156dd68-b54d-11e5-9d80-001a4a223927] Exit request received. Worker exiting.
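For reference, the byte counts in the log lines above can be converted to MiB as below. This is only a sketch in plain Ruby; the helper name `to_mib` is made up for illustration and is not part of ManageIQ.

```ruby
# Hypothetical helper: convert a byte count to MiB (1 MiB = 1024 * 1024 bytes).
def to_mib(bytes)
  bytes / (1024.0 * 1024.0)
end

limit       = 419_430_400 # "exceeded limit" value from the MiqQueue.put line
usage       = 443_781_120 # "process memory usage" value from the same line
final_usage = 593_125_376 # "Memory Usage" from the later log_status line

puts format("limit:       %.1f MiB", to_mib(limit))
puts format("usage:       %.1f MiB", to_mib(usage))
puts format("final usage: %.1f MiB", to_mib(final_usage))
```

The limit works out to exactly 400 MiB, which matches the queue_worker_base default rather than the 200MiB shown in the Web UI.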

Comment 2 Keenan Brock 2016-01-08 12:45:01 UTC

Setting this value was broken by:

commit 49dc2581fed7acfc50c5a7d4c984289de0031906
Author: Keenan Brock <kbrock>
Date:   Sat Nov 21 14:33:53 2015 -0500

    path_to_my_worker_settings

I'm still tracking down where that variable is read and whether it is (or ever has been) respected.

Comment 3 Keenan Brock 2016-04-21 16:18:14 UTC

Alex,

Are you still seeing behavior like this?
We updated the code that reads/defaults these parameters.
We also rewrote the configuration system.

Comment 4 Alex Krzos 2016-05-10 13:56:31 UTC

(In reply to Keenan Brock from comment #3)
> Alex,
> 
> Are you still seeing behavior like this?
> We updated the code that reads/defaults these parameters.
> We also rewrote the configuration system.

Hi Keenan,

I reviewed 5.6.0.5 (beta2.4) and still see the memory threshold in the UI as 200MiB for C&U collectors, but I have witnessed RSS memory usage greater than 340MiB during tests with a large VMware provider. Additionally, I do not see a :memory_threshold: option configured under ems_metrics_collector_worker - defaults, so I assume it is defaulting to the queue_worker_base value, which is 400MiB.

Perhaps the patches haven't made their way into 5.6.0.5 yet?

Comment 5 Keenan Brock 2016-10-10 14:51:46 UTC

This is a configuration issue in the core.

Also suggesting moving this to 5.7

Comment 7 Joe Rafaniello 2016-11-22 21:48:58 UTC

Dan, I noticed this Bz while looking at another issue and thought it was in the queue area... Can you have someone look at this?

I tried tracking this one down but couldn't understand the code in app/views/ops/_settings_workers_tab.html.haml and
app/controllers/ops_controller/settings/common.rb

I believe the problem is that the UI code is walking the hashes for the existing settings and new settings and assuming a specific structure.

    :queue_worker_base:
      :defaults:
        :cpu_usage_threshold: 100.percent
        :dequeue_method: :drb
        :memory_threshold: 500.megabytes
        :poll_method: :normal
        :queue_timeout: 10.minutes
      :ems_metrics_collector_worker:
        :defaults:
          :count: 2
          :nice_delta: 3
          :poll_method: :escalate

I believe it's trying to look at
[:queue_worker_base][:ems_metrics_collector_worker][:defaults][:memory_threshold], failing to find it, and defaulting back to 200.megabytes.

I don't understand where 200 is coming from, though, since the fallback seems to be 400 megabytes if the value is not found (in common.rb:1024):

      qwb[:ems_metrics_collector_worker] ||= {}
      qwb[:ems_metrics_collector_worker][:defaults] ||= {}
      w = qwb[:ems_metrics_collector_worker][:defaults]
      raw = @edit[:current].get_raw_worker_setting(:MiqEmsMetricsCollectorWorker)
      w[:count] = raw[:defaults][:count] || 2
      w[:memory_threshold] = rails_method_to_human_size(raw[:defaults][:memory_threshold] || 400.megabytes)
      @sb[:ems_metrics_collector_threshold] = []
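The inheritance the server side appears to apply (a worker's own :defaults first, then the queue_worker_base :defaults) can be sketched in plain Ruby. This is an illustration only, not ManageIQ code: the hash mirrors the YAML above but uses plain integers instead of ActiveSupport's `500.megabytes`, and `effective_memory_threshold` is a hypothetical helper name.

```ruby
MEGABYTE = 1024 * 1024

# Simplified stand-in for the settings tree quoted above.
SETTINGS = {
  queue_worker_base: {
    defaults: { memory_threshold: 500 * MEGABYTE },
    ems_metrics_collector_worker: {
      defaults: { count: 2, nice_delta: 3 } # note: no :memory_threshold here
    }
  }
}.freeze

# Hypothetical helper: use the worker's own threshold when present,
# otherwise fall back to the threshold inherited from queue_worker_base.
def effective_memory_threshold(settings, worker_key)
  base = settings.fetch(:queue_worker_base)
  own  = base.dig(worker_key, :defaults, :memory_threshold)
  own || base.dig(:defaults, :memory_threshold)
end

threshold = effective_memory_threshold(SETTINGS, :ems_metrics_collector_worker)
puts "effective threshold: #{threshold / MEGABYTE} MB"
```

A UI that reads only the worker-level key without this fallback finds nothing there, which would be consistent with the mismatch described in this comment.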

Comment 8 Joe Rafaniello 2016-11-22 22:09:26 UTC

Correction:

Dan, I noticed this Bz while looking at another issue and thought it was in the WRONG queue/assignment... Can you have someone look at this?

Comment 10 Harpreet Kataria 2016-12-05 22:34:33 UTC

https://github.com/ManageIQ/manageiq/pull/12999

Comment 11 Joe Rafaniello 2016-12-06 15:00:35 UTC

Note, this 400 MB value reported in this BZ was subsequently modified to 500 in https://bugzilla.redhat.com/show_bug.cgi?id=1391687 via
https://github.com/ManageIQ/manageiq/pull/12484

Sorry, changing description to reflect that change.

Note that ems_refresh_core_worker, just like the Metrics Collector Workers and many other workers, inherits the 500 MB memory_threshold from queue_worker_base and would probably exhibit problems similar to those reported in this BZ.

Comment 12 CFME Bot 2016-12-06 17:17:00 UTC

New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/1f2687fc89328eccb37724dbeddf6311e9dffbac

commit 1f2687fc89328eccb37724dbeddf6311e9dffbac
Author:     Harpreet Kataria <hkataria>
AuthorDate: Mon Dec 5 17:02:16 2016 -0500
Commit:     Harpreet Kataria <hkataria>
CommitDate: Mon Dec 5 17:02:16 2016 -0500

    Added a missing default memeory threshold setting
    
    Added a missing default memeory threshold setting for C & U Data Collectors that was causing drop down to the firt item in list as selected value by default.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1296638

 config/settings.yml | 1 +
 1 file changed, 1 insertion(+)

Comment 14 Archit Sharma 2017-04-11 12:09:49 UTC

While connected to a 1000-VM RHVM environment, this seems to have been fixed in 5.8.0.x.

reference: https://gist.github.com/arcolife/648c83a7f53ee6a706dd8fda278080e1

[----] W, [2017-04-11T07:50:57.579104 #40011:1045140]  WARN -- : MIQ(MiqServer#validate_worker) Worker [ManageIQ::Providers::Redhat::InfraManager::MetricsCollectorWorker] with ID: [42], PID: [21634], GUID: [ef1e1b5a-1eac-11e7-8366-001a4a22391a] process memory usage [420320000] exceeded limit [419430400], requesting worker to exit


This checks out against the UI params:

      :ems_metrics_collector_worker:
        :defaults:
          :count: 2
          :memory_threshold: 400.megabytes
          :nice_delta: 3
          :poll_method: :escalate
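The comparison between the logged limit and the configured threshold can be reproduced with a short Ruby sketch. The regular expression and the helper name `parse_memory_exceeded` are assumptions made for illustration; only the message text is taken from the log line above, and 400.megabytes is written out as 400 * 1024 * 1024 so that no ActiveSupport is needed.

```ruby
# Message fragment copied from the validate_worker log line above.
LINE = 'process memory usage [420320000] exceeded limit [419430400], ' \
       'requesting worker to exit'

# Assumed pattern for this message format.
PATTERN = /process memory usage \[(\d+)\] exceeded limit \[(\d+)\]/

# Hypothetical helper: return usage and limit in bytes, or nil if no match.
def parse_memory_exceeded(line)
  match = PATTERN.match(line)
  return nil unless match

  { usage: match[1].to_i, limit: match[2].to_i }
end

configured = 400 * 1024 * 1024 # equivalent of 400.megabytes
parsed = parse_memory_exceeded(LINE)
puts "limit matches configured threshold: #{parsed[:limit] == configured}"
```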
