Bug 1455186

Summary: Fix precision and reliability of metrics collection for OpenShift
Product: Red Hat CloudForms Management Engine Reporter: Federico Simoncelli <fsimonce>
Component: C&U Capacity and Utilization Assignee: Yaacov Zamir <yzamir>
Status: CLOSED CURRENTRELEASE QA Contact: brahmani
Severity: high Docs Contact:
Priority: high    
Version: 5.8.0 CC: dajohnso, gekis, jhardy, lavenel, obarenbo, yzamir
Target Milestone: GA Keywords: TestOnly
Target Release: 5.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: container:c&u
Fixed In Version: 5.10.0.0 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1524626 Environment:
Last Closed: 2018-06-21 20:19:50 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: Container Management Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1524626    

Description Federico Simoncelli 2017-05-24 12:27:46 UTC
Description of problem:
Currently the CPU utilization for nodes and containers/pods is computed (in a complex and error-prone way) from the cpu-time and the number of cores of each machine.

We should switch to using (draft, subject to change):

- cpu/node_utilization for percentage of cpu usage on nodes 
- cpu/usage_rate for millicores used for nodes, containers and pods


Using the above metrics instead of the complex computation will give us higher precision and more reliability (no dependence on the number of cores of a node, on changes to that number, etc.).
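As a rough illustration of why the derived value depends on the core count, here is a minimal Python sketch. The function names, the nanosecond unit for cpu-time and the sample numbers are assumptions made for this example; they are not taken from the ManageIQ code.

```python
NS_PER_SECOND = 1_000_000_000


def derived_cpu_percent(cpu_time_delta_ns, interval_s, cores):
    """Derive a CPU percentage from a cpu-time delta (hypothetical helper).

    The result depends on `cores`, so a wrong or stale core count
    changes the percentage even though the load is the same.
    """
    capacity_ns = interval_s * NS_PER_SECOND * cores
    return cpu_time_delta_ns / capacity_ns * 100.0


def millicores_to_cores(usage_rate_millicores):
    """Convert a rate already reported in millicores into cores.

    No core count is involved: 1000 millicores is one core.
    """
    return usage_rate_millicores / 1000.0


# Same load (60 core-seconds of cpu-time in a 60 second interval),
# interpreted with two different core counts.
same_load_ns = 60 * NS_PER_SECOND
percent_with_2_cores = derived_cpu_percent(same_load_ns, 60, 2)
percent_with_4_cores = derived_cpu_percent(same_load_ns, 60, 4)

# A rate metric in millicores needs no such interpretation.
cores_used = millicores_to_cores(1000)
```

With a metric that is already expressed as a rate, the collector only stores what it receives, which is the point of the proposal.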


This will also allow us to get rid of the spurious errors that we see from time to time as a side effect of the complex computation and its dependency on the node's cores:

WARN -- : ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode name: [ocp-c07-node02.10.35.48.141.nip.io], id: [2] Timestamp: [2017-05-23T08:00:20Z], Column [cpu_usage_rate_average]: 'percent value 103.94 is out of range, resetting to 100.0'

ERROR -- : MIQ(ManageIQ::Providers::Kubernetes::ContainerManager::MetricsCapture#perf_collect_metrics) ContainerGroup(1000000019955) is not valid: Validation error: cores not defined
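Both messages above can arise from a derived percentage. The Python sketch below is only an assumption about the mechanism, not the actual ManageIQ implementation: the helper names are hypothetical, and the sample cpu-time value was chosen so that the arithmetic yields the 103.94 seen in the warning.

```python
NS_PER_SECOND = 1_000_000_000


def require_cores(cores):
    """Reject a record whose core count is missing (hypothetical check)."""
    if cores is None:
        raise ValueError("cores not defined")
    return cores


def clamp_percent(value):
    """Clamp a percentage to at most 100.0 and report when that happened."""
    if value > 100.0:
        warning = f"percent value {value:.2f} is out of range, resetting to 100.0"
        return 100.0, warning
    return value, None


# Invented sample: 124.728 core-seconds used in 60 seconds on a node
# that the inventory believes has 2 cores (for example a stale count).
cpu_time_delta_ns = 124_728_000_000
cores = require_cores(2)
derived = cpu_time_delta_ns / (60 * NS_PER_SECOND * cores) * 100.0
stored_value, warning = clamp_percent(derived)

# A record without a core count cannot be converted at all.
try:
    require_cores(None)
    missing_cores_error = None
except ValueError as exc:
    missing_cores_error = str(exc)
```

Neither failure mode exists when the stored value is a rate reported by the metrics source, since no core count enters the calculation.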


Moving to these two new metrics has a dependency on the DB schema (we probably need extra columns to store this information) and on chargeback reports.
In any case, it's possible to make this change backward-compatible.

Comment 3 Yaacov Zamir 2017-06-21 12:11:45 UTC
Submitted upstream:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/47

Comment 6 Yaacov Zamir 2017-11-26 13:29:03 UTC
submitted upstream:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/159 - under development.

old patch is deprecated:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/47 - closed

Comment 7 Yaacov Zamir 2017-12-10 14:25:44 UTC
merged upstream:
https://github.com/ManageIQ/manageiq-providers-kubernetes/pull/159