Bug 1457765
| Summary: | Default capture_threshold value for OpenShift object types is too low | |||
|---|---|---|---|---|
| Product: | Red Hat CloudForms Management Engine | Reporter: | Peter McGowan <pmcgowan> | |
| Component: | C&U Capacity and Utilization | Assignee: | Yaacov Zamir <yzamir> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Gilad Shefer <gshefer> | |
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | 5.8.0 | CC: | arcsharm, bsorota, jhardy, lavenel, obarenbo, snaim, tachoi | |
| Target Milestone: | GA | Keywords: | TestOnly, ZStream | |
| Target Release: | 5.9.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | container:c&u | |||
| Fixed In Version: | 5.9.0.1 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1478428 | Environment: | ||
| Last Closed: | 2018-03-06 15:46:21 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | Container Management | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1478428 | |||
It is plausible (I have heard the same in several other cases) that the scalability issue here has to do with the queue: whether we collect every 10 or every 50 minutes, the number of data points is unchanged, so the only thing that changes is the number of scheduled collections on the queue.

At the time of writing, the long-term plan is to access the real-time metrics directly from OpenShift (Ad-hoc Metrics) and keep only hourly rollups in CFME for long-term reporting.

The initial 10-15 minute collection interval was chosen in order to have the freshest data in the C&U realtime pages.

If we're OK to give up that freshness (+1 from me, considering the long-term goals and the new Ad-hoc Metrics page), then we can change the collection interval to 50 minutes. I assume the correctness and availability of data for rollups was already assessed for other providers (which already have this threshold set to 50 minutes).

Loic, any concerns from your side? (I'd be more concerned about scalability than freshness now.)

(In reply to Federico Simoncelli from comment #2)
> Loic any concerns from your side?
> (I'd be more concerned about scalability vs freshness now)

No concerns from me: long term will be CF rollups, short term will come directly from the provider repository.

Submitted upstream: https://github.com/ManageIQ/manageiq/pull/15311

Merged upstream: https://github.com/ManageIQ/manageiq/pull/15311
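To make the queue argument from comment #2 concrete, here is a rough back-of-the-envelope sketch. It assumes that each monitored object is queued for collection at most once per capture threshold interval; the object count is a made-up figure and the function name is purely illustrative, not anything from CFME.

```python
# Sketch only: assumes each monitored object is queued for collection once per
# capture_threshold interval. The object count below is a made-up figure.


def collections_per_hour(object_count: int, threshold_minutes: int) -> float:
    """Upper bound on collection messages queued per hour."""
    if threshold_minutes <= 0:
        raise ValueError("threshold_minutes must be positive")
    return object_count * 60 / threshold_minutes


if __name__ == "__main__":
    objects = 1000  # hypothetical number of containers, pods and nodes
    for minutes in (10, 50, 60):
        rate = collections_per_hour(objects, minutes)
        print(f"{minutes:>2} min threshold: {rate:.0f} messages/hour")
```

Under these assumptions the amount of metric data fetched stays the same, while the number of queue messages drops by the ratio of the two thresholds.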
Description of problem:

The default capture_threshold value for the OpenShift object types container, container_group and node is 10 minutes (since these object types don't have their own capture_threshold definition in advanced settings). Maintaining this rate of ems_metrics_collector message processing requires a large number of C&U Data Collector worker processes, which results in scaling out to several CFME appliances just to manage OpenShift metrics.

Unless there is a good reason to keep the capture_threshold definition at 10 minutes, I'd suggest extending the settings YAML to add new default values for the three OpenShift object types, as follows:

    :performance:
      :capture_threshold:
        :default: 10.minutes
        :ems_cluster: 50.minutes
        :host: 50.minutes
        :storage: 60.minutes
        :vm: 50.minutes
        :container: 60.minutes
        :container_group: 60.minutes
        :node: 60.minutes

This definition would set the frequency of collection to one hour, which requires far fewer C&U Data Collector worker processes.

I'm not sure whether we'd need to add corresponding values to :capture_threshold_with_alerts:

Version-Release number of selected component (if applicable):
5.8.0.16

How reproducible:
Every time
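The fallback behaviour described in the problem statement (object types without their own entry pick up the :default: value) can be illustrated with a small sketch. This is not the ManageIQ lookup code: the dictionaries simply mirror the settings from the description as plain minutes, and the function and variable names are hypothetical.

```python
# Sketch only: mirrors the settings in the description as plain minutes.
# This is not the ManageIQ lookup code; all names here are illustrative.

CURRENT_THRESHOLDS = {
    "default": 10,
    "ems_cluster": 50,
    "host": 50,
    "storage": 60,
    "vm": 50,
}

# Proposed settings: the same values plus the three OpenShift object types.
PROPOSED_THRESHOLDS = {
    **CURRENT_THRESHOLDS,
    "container": 60,
    "container_group": 60,
    "node": 60,
}


def capture_threshold_minutes(thresholds: dict, object_type: str) -> int:
    """Return the type-specific threshold, falling back to the default."""
    return thresholds.get(object_type, thresholds["default"])


if __name__ == "__main__":
    for object_type in ("container", "container_group", "node", "vm"):
        before = capture_threshold_minutes(CURRENT_THRESHOLDS, object_type)
        after = capture_threshold_minutes(PROPOSED_THRESHOLDS, object_type)
        print(f"{object_type}: {before} -> {after} minutes")
```

With the current settings the three OpenShift types resolve to the 10 minute default; with the proposed entries they resolve to 60 minutes, while the existing types are unaffected.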