Description of problem:

The default capture_threshold value for the OpenShift object types container, container_group and node is 10 minutes (since these object types don't have their own capture_threshold definition in advanced settings). Maintaining this rate of ems_metrics_collector message processing requires a large number of C&U Data Collector worker processes, which means scaling out several CFME appliances just to manage OpenShift metrics.

Unless there is a good reason to keep the capture_threshold at 10 minutes, I'd suggest extending the settings YAML to add new default values for the three OpenShift object types, as follows:

:performance:
  :capture_threshold:
    :default: 10.minutes
    :ems_cluster: 50.minutes
    :host: 50.minutes
    :storage: 60.minutes
    :vm: 50.minutes
    :container: 60.minutes
    :container_group: 60.minutes
    :node: 60.minutes

This definition would set the collection frequency to one hour, which requires far fewer C&U Data Collector worker processes. I'm not sure whether we'd need to add corresponding values to :capture_threshold_with_alerts:

Version-Release number of selected component (if applicable):
5.8.0.16

How reproducible:
Every time
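To illustrate why container, container_group and node currently fall back to 10 minutes, here is a minimal Ruby sketch of the kind of fallback lookup these settings imply. The constant and method names are illustrative only, not ManageIQ's actual implementation:

# Illustrative only: a simplified fallback lookup, not ManageIQ's actual code.
CAPTURE_THRESHOLD = {
  :default     => 10 * 60,  # seconds
  :ems_cluster => 50 * 60,
  :host        => 50 * 60,
  :storage     => 60 * 60,
  :vm          => 50 * 60
}.freeze

# Returns the capture threshold (in seconds) for an object type,
# falling back to :default when the type has no explicit entry.
def capture_threshold_for(object_type)
  CAPTURE_THRESHOLD.fetch(object_type.to_sym, CAPTURE_THRESHOLD[:default])
end

capture_threshold_for(:host)       # => 3000 (50 minutes)
capture_threshold_for(:container)  # => 600  (10 minutes: no :container entry, so the default applies)

The proposed change simply adds explicit :container, :container_group and :node keys so that these types resolve to 60 minutes instead of the 10-minute default.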
It is plausible (as I've heard in several other cases) that the scalability issue here has to do with the queue: whether we collect every 10 or 50 minutes, the number of data points is unchanged, so the only thing that changes is the number of scheduled collections on the queue.

At the time of writing, the long-term plan is to access the real-time metrics directly from OpenShift (Ad-hoc Metrics) and keep only hourly rollups in CFME for long-term reporting.

The initial 10-15 minute collection interval was set in order to have the freshest data in the C&U realtime pages.

If we're OK with giving up that freshness (+1 from me, considering the long-term goals and the new Ad-hoc Metrics page), then we can change the collection interval to 50 minutes. I assume the correctness and availability of data for rollups was already assessed for the other providers (which already have this threshold set to 50 minutes).

Loic, any concerns from your side? (I'd be more concerned about scalability than freshness now.)
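As a back-of-the-envelope illustration of the queue effect (the object count below is an assumption, not a measurement from this environment): the number of capture messages scheduled per hour scales with the capture frequency, while the total metric data collected stays the same.

# Back-of-the-envelope only; 5,000 monitored objects is an assumed figure.
objects = 5_000  # containers + container groups + nodes being captured

[10, 50, 60].each do |threshold_minutes|
  captures_per_hour = (objects * 60.0 / threshold_minutes).round
  puts "threshold #{threshold_minutes}m -> ~#{captures_per_hour} queued capture messages/hour"
end
# threshold 10m -> ~30000 queued capture messages/hour
# threshold 50m -> ~6000  queued capture messages/hour
# threshold 60m -> ~5000  queued capture messages/hour
#
# The metric samples gathered over the hour are the same in every case;
# only the number of scheduled collection messages (and worker dispatches) changes.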
(In reply to Federico Simoncelli from comment #2)
> Loic, any concerns from your side?
> (I'd be more concerned about scalability than freshness now.)

No concern from my side; long term the data will come from CF rollups, and short term it will come directly from the provider's metrics repository.
submitted upstream: https://github.com/ManageIQ/manageiq/pull/15311
merged upstream: https://github.com/ManageIQ/manageiq/pull/15311