Bug 1757159

Summary: Metering operator using old container metric labels for container cpu/memory usage metrics in 4.2
Product: OpenShift Container Platform Reporter: Chance Zibolski <chancez>
Component: Metering Operator Assignee: Chance Zibolski <chancez>
Status: CLOSED ERRATA QA Contact: Peter Ruan <pruan>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 4.2.0 CC: pruan, sd-operator-metering, talessio
Target Milestone: ---   
Target Release: 4.2.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1756548 Environment:
Last Closed: 2019-11-19 13:49:01 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1756548    
Bug Blocks:    

Description Chance Zibolski 2019-09-30 17:35:10 UTC
+++ This bug was initially created as a clone of Bug #1756548 +++

Description of problem: The metering reporting-operator uses the container_name label instead of container in its Prometheus queries for pod memory/CPU usage. The container_name label was deprecated in Kubernetes 1.14 and removed in 1.16, which OpenShift 4.3 is based on. Because the 4.2 metering operator may run against a 4.3 cluster before the operator itself is upgraded, it could keep using the deprecated labels on 4.3 until that upgrade happens. Metering would therefore be broken during the 4.2 to 4.3 upgrade unless we backport this fix to 4.2 so that it uses the Kubernetes 1.16 metric labels.


Version-Release number of selected component (if applicable): 4.2.x


How reproducible: Always


Steps to Reproduce:
1. Query Prometheus directly with each of the following queries:

sum(rate(container_cpu_usage_seconds_total{container_name!="POD",container_name!="",pod!=""}[1m])) BY (pod, namespace) + on (pod, namespace) group_left(node) (sum(kube_pod_info{pod_ip!="",node!="",host_ip!=""}) by (pod, namespace, node) * 0)

and

sum(container_memory_usage_bytes{container_name!="POD", container_name!="",pod!=""}) by (pod, namespace) + on (pod, namespace) group_left(node) (sum(kube_pod_info{pod_ip!="",node!="",host_ip!=""}) by (pod, namespace, node) * 0)

Both return no metrics in 4.3, but work in 4.2 and 4.1.

After investigation, the cause is that the container_name metric label was renamed to container in Kubernetes 1.14, and the old labels such as container_name and pod_name were removed in 1.16. We need to update our metrics queries to use container instead of container_name.
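
To illustrate the rename described above, here is a minimal Python sketch that rewrites the deprecated label names in a query string. It is not how the operator applies the fix (the actual change is in the linked pull request), and the helper name modernize_labels is hypothetical. It does a plain whole-word text replacement rather than parsing PromQL, so it would also rewrite a label name that happened to appear inside a quoted value.

```python
import re

# Deprecated label -> replacement, per the rename described in this bug
# (renamed in Kubernetes 1.14, old names removed in 1.16).
RENAMED_LABELS = {"container_name": "container", "pod_name": "pod"}

# Whole-word match so that metric names such as
# container_cpu_usage_seconds_total are left untouched.
_LABEL_RE = re.compile(r"\b(container_name|pod_name)\b")


def modernize_labels(query: str) -> str:
    """Return the query with deprecated label names replaced (hypothetical helper)."""
    return _LABEL_RE.sub(lambda m: RENAMED_LABELS[m.group(1)], query)


old = (
    'sum(container_memory_usage_bytes{container_name!="POD", '
    'container_name!="",pod!=""}) by (pod, namespace)'
)
print(modernize_labels(old))
# sum(container_memory_usage_bytes{container!="POD", container!="",pod!=""}) by (pod, namespace)
```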

Comment 1 Chance Zibolski 2019-09-30 17:45:15 UTC
Once https://github.com/operator-framework/operator-metering/pull/960 is cherry-picked into release-4.2 and built, the verification steps will differ from those for 4.3. In 4.2 this isn't breaking anything, so we only need to verify that the ReportDataSources pod-usage-cpu-cores and pod-usage-memory-bytes no longer reference container_name in their Prometheus queries and use "container" instead. Metrics should also be importing for both of these ReportDataSources.
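
The label check described in this comment could be scripted roughly as follows. This is only a sketch: the function name deprecated_labels_in is hypothetical, and it scans whatever text it is given (for example the YAML of a ReportDataSource) for whole-word occurrences of the old label names. It does not check that metrics are actually being imported.

```python
import re
from typing import List

# Label names that should no longer appear in the queries after the fix.
DEPRECATED = ("container_name", "pod_name")


def deprecated_labels_in(text: str) -> List[str]:
    """Return the deprecated label names found as whole words in text."""
    return [name for name in DEPRECATED if re.search(rf"\b{name}\b", text)]


# Shortened versions of the CPU query from this bug, before and after the fix.
stale = 'sum(rate(container_cpu_usage_seconds_total{container_name!="POD",container_name!="",pod!=""}[1m])) BY (pod, namespace)'
fixed = 'sum(rate(container_cpu_usage_seconds_total{container!="POD",container!="",pod!=""}[1m])) BY (pod, namespace)'

print(deprecated_labels_in(stale))  # ['container_name']
print(deprecated_labels_in(fixed))  # []
```

An empty list for both pod-usage-cpu-cores and pod-usage-memory-bytes would correspond to the expected result for the label part of the verification.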

Comment 4 Peter Ruan 2019-11-06 18:15:19 UTC
Verified with release 4.2
pruan@fedora-vm ~/workspace/gocode/src/github.com/operator-framework/operator-metering (fix_mirroring_registry_4.2●)$ oc get reportdatasource pod-usage-cpu-cores -o yaml | grep container
      sum(rate(container_cpu_usage_seconds_total{container!="POD",container!="",pod!=""}[1m])) BY (pod, namespace) + on (pod, namespace) group_left(node) (sum(kube_pod_info{pod_ip!="",node!="",host_ip!=""}) by (pod, namespace, node) * 0)
pruan@fedora-vm ~/workspace/gocode/src/github.com/operator-framework/operator-metering (fix_mirroring_registry_4.2●)$ oc get reportdatasource pod-usage-memory-bytes -o yaml | grep container
      sum(container_memory_usage_bytes{container!="POD", container!="",pod!=""}) by (pod, namespace) + on (pod, namespace) group_left(node) (sum(kube_pod_info{pod_ip!="",node!="",host_ip!=""}) by (pod, namespace, node) * 0)

Comment 6 errata-xmlrpc 2019-11-19 13:49:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3869