1455186 – Fix precision and reliability of metrics collection for OpenShift

Bug 1455186 - Fix precision and reliability of metrics collection for OpenShift

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	C&U Capacity and Utilization
Sub Component:
Version:	5.8.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	5.10.0
Assignee:	Yaacov Zamir
QA Contact:	brahmani
Docs Contact:
URL:
Whiteboard:	container:c&u
Depends On:
Blocks:	1524626

Reported:	2017-05-24 12:27 UTC by Federico Simoncelli
Modified:	2020-07-16 09:38 UTC
CC List:	6 users
Fixed In Version:	5.10.0.0
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1524626
Environment:
Last Closed:	2018-06-21 20:19:50 UTC
Category:	---
Cloudforms Team:	Container Management
Target Upstream Version:
Embargoed:

Description Federico Simoncelli 2017-05-24 12:27:46 UTC

Description of problem:
Currently the CPU utilization for nodes and containers/pods is computed, in a complex and error-prone way, from the cpu-time and the number of cores of each machine.

We should switch to use (draft, possible changes):

- cpu/node_utilization for percentage of cpu usage on nodes 
- cpu/usage_rate for millicores used for nodes, containers and pods


By using the above metrics instead of complex computations we'll get higher precision and more reliability (no dependency on the number of cores of a node, on changes to that number, etc.).
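
To illustrate the difference, here is a minimal Python sketch. It is not the ManageIQ implementation: the function names, the nanosecond unit assumed for cumulative cpu-time, and the sample values are all hypothetical, chosen only to show why a percentage derived from cpu-time depends on the recorded core count while a directly reported rate such as cpu/usage_rate does not.

```python
def percent_from_cpu_time(cpu_time_start_ns, cpu_time_end_ns, interval_s, cores):
    """Derived approach (hypothetical): percent = core-seconds used / core-seconds available."""
    if not cores:
        # Without a core count the percentage cannot be computed at all.
        raise ValueError("cores not defined")
    used_core_seconds = (cpu_time_end_ns - cpu_time_start_ns) / 1e9
    return 100.0 * used_core_seconds / (interval_s * cores)


def millicores_from_usage_rate(usage_rate_millicores):
    """Direct approach (hypothetical): store the reported rate as-is, no core count needed."""
    return float(usage_rate_millicores)


# Assumed sample: 90 core-seconds consumed during a 60 s interval.
print(percent_from_cpu_time(0, 90_000_000_000, 60, 2))  # 75.0 with the correct core count
print(percent_from_cpu_time(0, 90_000_000_000, 60, 1))  # 150.0 if the recorded core count is stale
print(millicores_from_usage_rate(1500))                 # 1500.0, independent of the core count
```

In this sketch the same raw sample yields either a plausible or an impossible percentage depending only on the core count used, whereas the millicore value needs no such input.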


This will also let us get rid of the spurious errors that we see from time to time as a side effect of the complex computation and its dependency on the node's cores:

WARN -- : ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode name: [ocp-c07-node02.10.35.48.141.nip.io], id: [2] Timestamp: [2017-05-23T08:00:20Z], Column [cpu_usage_rate_average]: 'percent value 103.94 is out of range, resetting to 100.0'

ERROR -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) ContainerGroup(1000000019955) is not valid: Validation error: cores not defined
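
The first message is consistent with a derived percentage being clamped after the fact. A minimal Python sketch of such a clamp follows, assuming a hypothetical `clamp_percent` helper (this is not the actual ManageIQ code); it shows how an out-of-range value gets silently replaced, which hides the underlying imprecision rather than fixing it.

```python
def clamp_percent(value, low=0.0, high=100.0):
    """Hypothetical helper: force a percentage into [low, high].

    Returns the clamped value and a warning string (None when no clamping occurred).
    """
    if value > high:
        return high, f"percent value {value} is out of range, resetting to {high}"
    if value < low:
        return low, f"percent value {value} is out of range, resetting to {low}"
    return value, None


value, warning = clamp_percent(103.94)
print(value)    # 100.0
print(warning)  # percent value 103.94 is out of range, resetting to 100.0
```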


Moving to these two new metrics has dependencies on the DB schema (we probably need extra columns to save this information) and on chargeback reports.
Still, it's possible to make this change backward-compatible.

Comment 3 Yaacov Zamir 2017-06-21 12:11:45 UTC

Submitted upstream:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/47

Comment 6 Yaacov Zamir 2017-11-26 13:29:03 UTC

submitted upstream:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/159 - under development.

old patch is deprecated:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/47 - closed

Comment 7 Yaacov Zamir 2017-12-10 14:25:44 UTC

merged upstream:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/159
