In preparation for enabling consumption tracking, we should export a metric from all clusters that represents the sum of physical cores covered under the subscription / usage agreement. We report the min and max over 5m to establish a floor and ceiling for use (scale down may take several minutes, and we are willing to undercount to benefit users).

Physical cores are our best estimate of the real cores in use on nodes that are not labelled as infra (excluded by agreement from use by general purpose workloads) or control plane (when the control plane is considered unschedulable for general purpose workloads). A physical core is an unshared CPU as seen by the operating system, ignoring hyperthreading or virtualization. In cloud environments a vCPU is often seen as a physical core and may vary based on instance size.

The counter metric will be used to gain real-world experience with the precision and failure modes of our current prometheus -> thanos aggregation (with or without direct write), so that we can put error bounds on its behavior. In the long run, counters located close to the cluster are the most resilient metric option for assessing usage, since they can be interpolated to find average use even in the face of significant disruption.

The metric names will be:

workload:capacity_physical_cpu_cores - current physical cores available for workloads, excluding infra and unschedulable masters

cluster:usage:workload:capacity_physical_cpu_cores:min:5m - minimum over a 5m window

cluster:usage:workload:capacity_physical_cpu_cores:max:5m - maximum over a 5m window

cluster:usage:workload:capacity_physical_cpu_core_seconds - recording rule converting the gauge into a counter

The last 3 will be exposed via telemetry to simplify usage billing.

Part of 4.7 to gain experience; may be backported to 4.6 as necessary.
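To illustrate why a core-seconds counter tolerates gaps, here is a minimal Python sketch of the arithmetic described above. It is not the actual recording rule: the function names, the 60s sample interval, and the sample values are all hypothetical, and the integration simply holds each gauge value until the next sample.

```python
from typing import List, Tuple

# (timestamp in seconds, value); hypothetical sample format for this sketch
Sample = Tuple[float, float]


def window_min_max(samples: List[Sample], now: float, window: float = 300.0) -> Tuple[float, float]:
    """Floor and ceiling of the gauge over the trailing window (now - window, now]."""
    values = [v for t, v in samples if now - window < t <= now]
    if not values:
        raise ValueError("no samples in window")
    return min(values), max(values)


def integrate_to_counter(samples: List[Sample]) -> List[Sample]:
    """Turn a cores gauge into a cumulative core-seconds counter.

    Assumption: each gauge value is held constant until the next sample.
    """
    total = 0.0
    counter = [(samples[0][0], 0.0)]
    for (t0, v0), (t1, _) in zip(samples, samples[1:]):
        total += v0 * (t1 - t0)
        counter.append((t1, total))
    return counter


def average_cores(start: Sample, end: Sample) -> float:
    """Average cores between two counter samples, however far apart they are."""
    (t0, c0), (t1, c1) = start, end
    return (c1 - c0) / (t1 - t0)


# Hypothetical gauge: 8 cores, scaling up to 16 for two minutes, then back to 8.
gauge = [(0.0, 8.0), (60.0, 8.0), (120.0, 16.0), (180.0, 16.0), (240.0, 8.0), (300.0, 8.0)]
counter = integrate_to_counter(gauge)

floor, ceiling = window_min_max(gauge, now=300.0)

# Even if every intermediate counter sample is lost, the first and last
# samples alone still recover the average over the whole interval.
avg = average_cores(counter[0], counter[-1])
print(floor, ceiling, avg)
```

The min and max only bound usage within the window, while the counter difference gives the average over any interval whose two endpoints survive, which is the property that makes it useful across scrape or remote-write disruptions.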
Setting to blocker-, this doesn't seem like something we'd push the OCP release for.
Tested with 4.7.0-0.nightly-2021-02-17-130606; the metrics could be found in both the cluster Prometheus server and the telemeter server.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633