Bug 1455186

Summary:	Fix precision and reliability of metrics collection for OpenShift
Product:	Red Hat CloudForms Management Engine	Reporter:	Federico Simoncelli <fsimonce>
Component:	C&U Capacity and Utilization	Assignee:	Yaacov Zamir <yzamir>
Status:	CLOSED CURRENTRELEASE	QA Contact:	brahmani
Severity:	high	Docs Contact:
Priority:	high
Version:	5.8.0	CC:	dajohnso, gekis, jhardy, lavenel, obarenbo, yzamir
Target Milestone:	GA	Keywords:	TestOnly
Target Release:	5.10.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	container:c&u
Fixed In Version:	5.10.0.0	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1524626 (view as bug list)		Environment:
Last Closed:	2018-06-21 20:19:50 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	Container Management	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1524626

Description Federico Simoncelli 2017-05-24 12:27:46 UTC

Description of problem:
Currently, CPU utilization for nodes and containers/pods is computed (in a complex and error-prone way) from the CPU time and the number of cores of each machine.

We should switch to using (draft, subject to change):

- cpu/node_utilization for the percentage of CPU usage on nodes
- cpu/usage_rate for the millicores used by nodes, containers and pods


By using the above metrics instead of complex computations, we'll get higher precision and more reliability (no dependence on the number of cores of a node, on changes to that number, etc.).


This will also allow us to get rid of the spurious errors that appear from time to time as a side effect of the complex computation and its dependency on the node's cores:

WARN -- : ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode name: [ocp-c07-node02.10.35.48.141.nip.io], id: [2] Timestamp: [2017-05-23T08:00:20Z], Column [cpu_usage_rate_average]: 'percent value 103.94 is out of range, resetting to 100.0'

ERROR -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) ContainerGroup(1000000019955) is not valid: Validation error: cores not defined
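
A minimal sketch of why the core-based computation is fragile, compared with reading the proposed metrics directly. The function names and sample values are hypothetical and are not taken from the ManageIQ code; the sketch also assumes cpu/node_utilization is reported as a fraction of node capacity and cpu/usage_rate in millicores.

```python
def cpu_percent_from_cpu_time(cpu_ns_start, cpu_ns_end, interval_s, cores):
    """Core-dependent estimate: CPU time used divided by available core-time."""
    if not cores:
        # Same failure mode as the "cores not defined" validation error.
        raise ValueError("cores not defined")
    used_core_seconds = (cpu_ns_end - cpu_ns_start) / 1e9
    return 100.0 * used_core_seconds / (interval_s * cores)


def cpu_percent_from_node_utilization(node_utilization):
    """Assumes cpu/node_utilization is a fraction (0.0 to 1.0) of node capacity."""
    return 100.0 * node_utilization


def cores_from_usage_rate(usage_rate_millicores):
    """Assumes cpu/usage_rate is expressed in millicores (1000 = one full core)."""
    return usage_rate_millicores / 1000.0


# Hypothetical sample: the recorded core count (2) is lower than what the node
# actually used during the interval, so the result exceeds 100 percent.
legacy = cpu_percent_from_cpu_time(0, 124_728_000_000, 60, 2)
print(round(legacy, 2))

# The direct metrics need no core count at all.
print(cpu_percent_from_node_utilization(0.5))
print(cores_from_usage_rate(1500))
```

With the direct metrics there is no division by a core count, so a stale or missing core count can no longer push the value out of range or fail validation.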


Moving to these two new metrics has a dependency on the DB schema (we probably need extra columns to store this information) and on chargeback reports.
Anyway, it's possible to make this change backward-compatible.
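
One way the backward-compatible option could look, as a sketch only: a reader prefers the new value when present and falls back to the existing percentage otherwise. The column cpu_usage_millicores and the helper cpu_cores_used are hypothetical; only cpu_usage_rate_average appears in the log output above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricRow:
    # Existing column (a percentage), as named in the warning above.
    cpu_usage_rate_average: Optional[float] = None
    # Hypothetical new column holding cpu/usage_rate in millicores.
    cpu_usage_millicores: Optional[float] = None


def cpu_cores_used(row: MetricRow, cores: Optional[int] = None) -> Optional[float]:
    """Return cores used, preferring the new column over the legacy one."""
    if row.cpu_usage_millicores is not None:
        return row.cpu_usage_millicores / 1000.0
    if row.cpu_usage_rate_average is not None and cores:
        # Legacy rows still need the core count to convert percent to cores.
        return row.cpu_usage_rate_average / 100.0 * cores
    return None


print(cpu_cores_used(MetricRow(cpu_usage_millicores=1500.0)))
print(cpu_cores_used(MetricRow(cpu_usage_rate_average=50.0), cores=4))
```

Rows written before the change keep working through the fallback branch, while new rows no longer depend on the core count.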

Comment 3 Yaacov Zamir 2017-06-21 12:11:45 UTC

Submitted upstream:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/47

Comment 6 Yaacov Zamir 2017-11-26 13:29:03 UTC

submitted upstream:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/159 - under development.

old patch is deprecated:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/47 - closed

Comment 7 Yaacov Zamir 2017-12-10 14:25:44 UTC

merged upstream:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/159