Bug 1457765 - Default capture_threshold value for OpenShift object types is too low
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: C&U Capacity and Utilization
Version: 5.8.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.9.0
Assignee: Yaacov Zamir
QA Contact: Gilad Shefer
URL:
Whiteboard: container:c&u
Depends On:
Blocks: 1478428
Reported: 2017-06-01 09:21 UTC by Peter McGowan
Modified: 2020-08-13 09:17 UTC (History)

Fixed In Version: 5.9.0.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1478428
Environment:
Last Closed: 2018-03-06 15:46:21 UTC
Category: ---
Cloudforms Team: Container Management
Target Upstream Version:
Embargoed:



Description Peter McGowan 2017-06-01 09:21:16 UTC
Description of problem:
The default capture_threshold value for the OpenShift object types container, container_group and node is 10 minutes, because these object types don't have their own capture_threshold definition in advanced settings. Maintaining this rate of ems_metrics_collector message processing requires a large number of C&U Data Collector worker processes, which in turn means scaling out to several CFME appliances just to manage OpenShift metrics.

Unless there is a good reason to keep the capture_threshold definition at 10 minutes, I'd suggest extending the settings yaml to add new default values for the three OpenShift object types, as follows:

:performance:
  :capture_threshold:
    :default: 10.minutes
    :ems_cluster: 50.minutes
    :host: 50.minutes
    :storage: 60.minutes
    :vm: 50.minutes
    :container: 60.minutes
    :container_group: 60.minutes
    :node: 60.minutes

This definition would set the collection interval to one hour, which requires far fewer C&U Data Collector worker processes.
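To put rough numbers on the proposal above, here is a minimal sketch of how the threshold affects queue load. It assumes each object is queued for capture at most once per capture_threshold period; the inventory counts are hypothetical and are not taken from this bug.

```python
# Hypothetical inventory sizes, for illustration only (not from this bug).
inventory = {"container": 5000, "container_group": 3000, "node": 50}


def captures_per_hour(object_count, threshold_minutes):
    """Upper bound on capture messages queued per hour for one object type,
    assuming each object is captured at most once per threshold period."""
    return object_count * 60 // threshold_minutes


def total_per_hour(inventory, thresholds):
    """Sum the per-type upper bounds across the whole inventory."""
    return sum(
        captures_per_hour(count, thresholds[object_type])
        for object_type, count in inventory.items()
    )


# Current behaviour: all three types fall back to the 10-minute default.
current = {object_type: 10 for object_type in inventory}
# Proposed behaviour: 60 minutes for each of the three types.
proposed = {object_type: 60 for object_type in inventory}

print("current :", total_per_hour(inventory, current))
print("proposed:", total_per_hour(inventory, proposed))
```

Under these assumptions the number of queued capture messages drops by a factor of six, whatever the inventory size.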

I'm not sure whether we'd need to add corresponding values to :capture_threshold_with_alerts:

Version-Release number of selected component (if applicable):
5.8.0.16

How reproducible:
Every time

Comment 2 Federico Simoncelli 2017-06-06 10:40:38 UTC
It is plausible (I have heard the same in several other cases) that the scalability issue here has to do with the queue: whether we collect every 10 or 50 minutes, the number of data points is unchanged, so the only thing that changes is the number of scheduled collections on the queue.
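The point about the queue can be illustrated with a small calculation. This is only a sketch: the 60-second sample interval is a hypothetical value chosen for the example, not a documented OpenShift or CFME setting.

```python
def per_object_per_day(sample_interval_seconds, threshold_minutes):
    """Return (data_points, collections, points_per_collection) for one
    object over 24 hours, given how often the provider samples a metric
    and how often CFME schedules a collection."""
    data_points = 24 * 3600 / sample_interval_seconds
    collections = 24 * 60 / threshold_minutes
    points_per_collection = threshold_minutes * 60 / sample_interval_seconds
    return data_points, collections, points_per_collection


# Hypothetical sample interval of 60 seconds (an assumption for this example).
for threshold in (10, 50):
    points, collections, batch = per_object_per_day(60, threshold)
    print(f"{threshold} min: {points:.0f} points, "
          f"{collections:.1f} collections, {batch:.0f} points per collection")
```

The total number of data points per object stays the same for both thresholds; only the number of collections placed on the queue, and the size of each batch, changes.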

At the time of writing, the long-term plan is to access the real-time metrics directly from OpenShift (Ad-hoc Metrics) and keep only hourly rollups in CFME for long-term reporting.

The initial 10-15 minute collection interval was chosen in order to have the freshest data in the C&U realtime pages.

If we're OK with giving up that freshness (+1 from me, considering the long-term goals and the new Ad-hoc Metrics page), then we can change the collection interval to 50 minutes.
I assume the correctness and availability of data for rollups was already assessed for other providers (which already have this threshold set to 50 minutes).

Loic, any concerns from your side?
(I'd be more concerned about scalability than freshness now.)

Comment 3 Loic Avenel 2017-06-06 11:10:40 UTC
(In reply to Federico Simoncelli from comment #2)

No concerns from me: long-term data will come from CF rollups, and short-term data will come directly from the provider's repository.

Comment 4 Yaacov Zamir 2017-06-06 12:16:27 UTC
Submitted upstream:
https://github.com/ManageIQ/manageiq/pull/15311

Comment 5 Yaacov Zamir 2017-06-29 10:19:56 UTC
Merged upstream:
https://github.com/ManageIQ/manageiq/pull/15311

