Bug 1905647 - Report physical core valid-for-subscription min/max/cumulative use to telemetry
Summary: Report physical core valid-for-subscription min/max/cumulative use to telemetry
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Telemeter
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Clayton Coleman
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-08 18:23 UTC by Clayton Coleman
Modified: 2021-02-24 15:41 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:41:14 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-monitoring-operator pull 1004 | 0 | None | closed | Bug 1905647: Calculate physical CPU core seconds used for consumption and report via telemetry | 2021-02-15 10:03:18 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:41:37 UTC

Description Clayton Coleman 2020-12-08 18:23:54 UTC
In preparation for enabling consumption tracking, we should export a metric from all clusters that represents the sum of physical cores covered under the subscription / usage agreement. We report the minimum and maximum over a 5m window to establish a floor and ceiling for use: scale-down may take several minutes, and we are willing to undercount to benefit users.
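
To illustrate the floor and ceiling idea, here is a minimal sketch of a min/max over a 5m window. The function name, the sample layout (timestamp in seconds, core count), and the scale-down data are assumptions made for illustration; the report does not show the actual rule expressions.

```python
WINDOW_SECONDS = 5 * 60  # the 5m window named in the description


def window_min_max(samples, now, window=WINDOW_SECONDS):
    """Return (min, max) of the values whose timestamp lies in [now - window, now].

    samples is a list of (timestamp_seconds, cores) pairs (hypothetical layout).
    """
    values = [v for t, v in samples if now - window <= t <= now]
    if not values:
        raise ValueError("no samples in window")
    return min(values), max(values)


# Hypothetical scale-down: capacity drops from 64 to 32 cores mid-window.
samples = [(0, 64), (60, 64), (120, 64), (180, 32), (240, 32), (300, 32)]
floor, ceiling = window_min_max(samples, now=300)
print(floor, ceiling)  # 32 64
```

Charging against the floor (32 here) is the choice that undercounts during a scale-down, while the ceiling records the most that could have been in use.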

Physical cores are our best estimate of real cores in use on a node that is not labelled as infra (excluded via agreement from use by general-purpose workloads) or control plane (when the control plane is considered unschedulable for general-purpose workloads). A physical core is an unshared CPU as seen by the operating system, ignoring hyperthreading or virtualization. In cloud environments a vCPU is often seen as a physical core, and the count may vary based on instance size.
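
The node selection described above can be sketched as follows. This is an illustration under stated assumptions, not the operator's implementation: the `Node` type, the `masters_schedulable` flag, and the choice to derive physical cores as logical CPUs divided by threads per core are all hypothetical. The two role labels are the conventional OpenShift node-role labels.

```python
from dataclasses import dataclass, field

INFRA_LABEL = "node-role.kubernetes.io/infra"
MASTER_LABEL = "node-role.kubernetes.io/master"


@dataclass
class Node:
    # Hypothetical node record used only for this sketch.
    name: str
    logical_cpus: int
    threads_per_core: int = 1
    labels: set = field(default_factory=set)


def physical_cores(node):
    """Collapse hyperthreads: logical CPUs divided by threads per core (assumption)."""
    return node.logical_cpus // node.threads_per_core


def workload_physical_cores(nodes, masters_schedulable=False):
    """Sum physical cores over nodes that can run general-purpose workloads."""
    total = 0
    for node in nodes:
        if INFRA_LABEL in node.labels:
            continue  # infra nodes are excluded by agreement
        if MASTER_LABEL in node.labels and not masters_schedulable:
            continue  # control plane counts only when it is schedulable
        total += physical_cores(node)
    return total


nodes = [
    Node("master-0", 8, 2, {MASTER_LABEL}),
    Node("infra-0", 8, 2, {INFRA_LABEL}),
    Node("worker-0", 16, 2),
    Node("worker-1", 4, 1),  # e.g. cloud vCPUs reported as physical cores
]
print(workload_physical_cores(nodes))  # 12
print(workload_physical_cores(nodes, masters_schedulable=True))  # 16
```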

The counter metric will be used to gain experience in real-world scenarios with the precision and failure modes of our current Prometheus -> Thanos aggregation (with or without direct write), in order to put error bounds on the precision of the behavior. In the long run, counters located close to the cluster are the most resilient metric option for assessing usage, since they can be interpolated even in the face of significant disruption to find average use.
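
A small sketch of why a counter survives disruption: average use between any two surviving samples is the counter delta divided by the elapsed time, regardless of how many samples in between were lost. The function name and the sample values are hypothetical, and counter resets are deliberately not handled here.

```python
def average_cores(first, last):
    """Average cores in use between two (timestamp_seconds, core_seconds) samples.

    Assumes a monotonically increasing counter with no reset in between.
    """
    (t0, c0), (t1, c1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (c1 - c0) / (t1 - t0)


# Hypothetical hour: 32 cores for 1800 s, then 64 cores for 1800 s.
# Every sample between t=300 and t=3600 was lost in transit.
surviving = [(0, 0), (300, 9600), (3600, 172800)]
print(average_cores(surviving[0], surviving[-1]))  # 48.0
```

A gauge sampled at the same three instants would only show 32, 32, and 64, with no way to recover how long each level was held.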

The metric names will be

workload:capacity_physical_cpu_cores - current physical cores available for workloads, excluding infra and unschedulable masters

cluster:usage:workload:capacity_physical_cpu_cores:min:5m - minimum over 5m window

cluster:usage:workload:capacity_physical_cpu_cores:max:5m - maximum over 5m window

cluster:usage:workload:capacity_physical_cpu_core_seconds - recording rule converting the gauge into a counter

The last 3 will be exposed via telemetry to simplify usage billing.
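
The gauge-to-counter conversion can be pictured with the following sketch, which accumulates core-seconds by holding each gauge value until the next sample arrives. This is a generic illustration; the function name and sample data are assumptions, and the report does not give the actual recording rule expression.

```python
def accumulate_core_seconds(gauge_samples):
    """Turn (timestamp_seconds, cores) gauge samples into a core-seconds counter.

    Each gauge value is assumed to hold until the next sample (left-hand sum).
    Returns a list of (timestamp_seconds, cumulative_core_seconds).
    """
    counter = 0.0
    out = []
    prev = None
    for t, cores in gauge_samples:
        if prev is not None:
            prev_t, prev_cores = prev
            counter += prev_cores * (t - prev_t)
        out.append((t, counter))
        prev = (t, cores)
    return out


# Hypothetical scale-up from 32 to 64 cores at t=120.
gauge = [(0, 32), (60, 32), (120, 64), (180, 64)]
print(accumulate_core_seconds(gauge))
# [(0, 0.0), (60, 1920.0), (120, 3840.0), (180, 7680.0)]
```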

This is part of 4.7 to gain experience, and may be backported to 4.6 as necessary.

Comment 4 Nick Stielau 2021-01-04 20:14:13 UTC
Setting to blocker-; this doesn't seem like something we'd push the OCP release for.

Comment 5 Junqi Zhao 2021-02-18 03:41:05 UTC
Tested with 4.7.0-0.nightly-2021-02-17-130606; the metrics can be found in both the cluster Prometheus server and the Telemeter server.

Comment 7 errata-xmlrpc 2021-02-24 15:41:14 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

