Bug 1457765

Summary:	Default capture_threshold value for OpenShift object types is too low
Product:	Red Hat CloudForms Management Engine	Reporter:	Peter McGowan <pmcgowan>
Component:	C&U Capacity and Utilization	Assignee:	Yaacov Zamir <yzamir>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Gilad Shefer <gshefer>
Severity:	high	Docs Contact:
Priority:	high
Version:	5.8.0	CC:	arcsharm, bsorota, jhardy, lavenel, obarenbo, snaim, tachoi
Target Milestone:	GA	Keywords:	TestOnly, ZStream
Target Release:	5.9.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	container:c&u
Fixed In Version:	5.9.0.1	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1478428		Environment:
Last Closed:	2018-03-06 15:46:21 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	Container Management	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1478428

Description Peter McGowan 2017-06-01 09:21:16 UTC

Description of problem:
The default capture_threshold value for the OpenShift object types container, container_group and node is 10 minutes, because these object types don't have their own capture_threshold definition in advanced settings. Maintaining this rate of ems_metrics_collector message processing requires a large number of C&U Data Collector worker processes, which means scaling out to several CFME appliances just to manage OpenShift metrics.

Unless there is a good reason to keep the capture_threshold definition at 10 minutes, I'd suggest extending the settings yaml to add new default values for the three OpenShift object types, as follows:

:performance:
  :capture_threshold:
    :default: 10.minutes
    :ems_cluster: 50.minutes
    :host: 50.minutes
    :storage: 60.minutes
    :vm: 50.minutes
    :container: 60.minutes
    :container_group: 60.minutes
    :node: 60.minutes

This definition would set the collection frequency to once per hour, which requires far fewer C&U Data Collector worker processes.
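
The fallback behaviour described above can be sketched as follows. This is only an illustration, not CFME code: the dictionary, the function names and the "at most 60 / threshold captures per hour" reading of capture_threshold are assumptions made for the example.

```python
# Thresholds in minutes, mirroring the current :capture_threshold: settings.
# The dict and function names are hypothetical, chosen for this sketch only.
CURRENT_THRESHOLDS = {
    "default": 10,
    "ems_cluster": 50,
    "host": 50,
    "storage": 60,
    "vm": 50,
}

# Proposed settings: the same values plus the three OpenShift object types.
PROPOSED_THRESHOLDS = dict(
    CURRENT_THRESHOLDS, container=60, container_group=60, node=60
)


def capture_threshold(object_type, settings):
    """Return the threshold for object_type, falling back to :default."""
    return settings.get(object_type, settings["default"])


def max_captures_per_hour(object_type, settings):
    """Assumed upper bound on captures per object per hour."""
    return 60 / capture_threshold(object_type, settings)


for kind in ("container", "container_group", "node"):
    before = max_captures_per_hour(kind, CURRENT_THRESHOLDS)
    after = max_captures_per_hour(kind, PROPOSED_THRESHOLDS)
    print(f"{kind}: {before:g} -> {after:g} captures per hour")
```

Under these assumptions, each OpenShift object drops from up to six captures per hour to one.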

I'm not sure whether we'd need to add corresponding values to :capture_threshold_with_alerts:

Version-Release number of selected component (if applicable):
5.8.0.16

How reproducible:
Every time

Comment 2 Federico Simoncelli 2017-06-06 10:40:38 UTC

It is plausible (I have heard the same in several other cases) that the scalability issue here lies in the queue: whether we collect every 10 or every 50 minutes, the number of data points is unchanged, so the only thing that changes is the number of scheduled collections on the queue.
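
A rough back-of-the-envelope sketch of that argument, assuming a hypothetical fleet of 1000 objects and a hypothetical 60-second real-time sample interval (neither number comes from this report):

```python
def hourly_load(num_objects, threshold_minutes, sample_interval_seconds=60):
    """Return (queued collections per hour, data points per hour).

    Assumes every object is collected once per threshold period and that
    each collection fetches all samples produced since the previous one.
    """
    collections = num_objects * 60 / threshold_minutes
    data_points = num_objects * 3600 // sample_interval_seconds
    return collections, data_points


for threshold in (10, 50):
    collections, data_points = hourly_load(1000, threshold)
    print(f"{threshold} min: {collections:g} collections, {data_points} data points")
```

With these numbers the data-point count is identical for both thresholds, while the number of queued collections differs by a factor of five.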

At the time of writing, the long-term plan is to access the real-time metrics directly from OpenShift (Ad-hoc Metrics) and keep only hourly rollups in CFME for long-term reporting.

The initial 10-15 minute collection interval was chosen to have the freshest data in the C&U realtime pages.

If we're OK with giving up that freshness (+1 from me, considering the long-term goals and the new Ad-hoc Metrics page), then we can change the collection interval to 50 minutes.
I assume the correctness of data availability for rollups was already assessed for other providers (which already have this threshold set to 50 minutes).

Loic, any concerns from your side?
(I'd be more concerned about scalability vs freshness now)

Comment 3 Loic Avenel 2017-06-06 11:10:40 UTC

(In reply to Federico Simoncelli from comment #2)
> It is feasible to believe (as I heard it in several other cases) that the
> scalability issue here has to do with the queue; whether we collect every 10
> or 50 minutes the number of data-points is unvaried, so the only thing that
> changes is the number of scheduled collections on the queue.
> 
> At the moment of this writing the long-term plan is to access the real-time
> metrics directly from OpenShift (Ad-hoc Metrics) and keep in CFME only
> hourly rollups for long-term reporting.
> 
> The initial 10-15 minutes collection was set in order to have the freshest
> data in C&U realtime pages.
> 
> If we're OK to give up that freshness (+1 from me considering the long term
> goals and the new Ad-hoc Metrics page) then we can change the collection
> interval to 50.
> I assume correctness of availability of data for rollups was already
> assessed for other providers (who already have this threshold set to 50
> minutes.
> 
> Loic any concerns from your side?
> (I'd be more concerned about scalability vs freshness now)

No concerns from me: long-term data will come from CF rollups, and short-term data will come directly from the provider repository.

Comment 4 Yaacov Zamir 2017-06-06 12:16:27 UTC

Submitted upstream:
https://github.com/ManageIQ/manageiq/pull/15311

Comment 5 Yaacov Zamir 2017-06-29 10:19:56 UTC

Merged upstream:
https://github.com/ManageIQ/manageiq/pull/15311