Description of problem:
A pod does not consist only of application containers; there is also a pause container, marked as `container="POD"`. The pause container's resources count toward overall pod resource consumption, so they should be included in the metrics.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Compare the output of the web UI with the output of `oc adm top pods`

Actual results:
The output values are not the same.

Expected results:
The output values are the same.

Additional info:
We need to remove the `container!="POD"` matcher from the queries in alerts, recording rules, and dashboards. More in https://coreos.slack.com/archives/C0VMT03S5/p1615550752086600
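For illustration, this is the kind of selector change involved (a sketch only; the actual alerts, recording rules, and dashboards live in kube-prometheus and the aggregation below is just a representative per-pod sum over the standard cAdvisor metric):

# current form: the pause container is filtered out of the per-pod total
sum by (namespace, pod) (container_memory_working_set_bytes{container!="POD", container!="", pod!=""})

# proposed form: only the empty container label (the pod-level cgroup series) is excluded,
# so the pause container is counted toward the pod total
sum by (namespace, pod) (container_memory_working_set_bytes{container!="", pod!=""})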
Increasing severity/priority to medium as this bug also affects autoscaling.
Since the PR has been merged upstream, the fix will land in 4.8 with the bump of kube-prometheus downstream. Closing as UPSTREAM.
Test with payload 4.8.0-0.nightly-2021-05-06-003426

# oc get cm prometheus-adapter-prometheus-config -oyaml
...
    "cpu":
      "containerLabel": "container"
      "containerQuery": "sum(irate(container_cpu_usage_seconds_total{<<.LabelMatchers>>,container!=\"\",pod!=\"\"}[5m])) by (<<.GroupBy>>)"
      "nodeQuery": "sum(1 - irate(node_cpu_seconds_total{mode=\"idle\"}[5m]) * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:{<<.LabelMatchers>>}) by (<<.GroupBy>>) or sum (1- irate(windows_cpu_time_total{mode=\"idle\", job=\"windows-exporter\",<<.LabelMatchers>>}[5m])) by (<<.GroupBy>>)"
    ...
    "memory":
      "containerLabel": "container"
      "containerQuery": "sum(container_memory_working_set_bytes{<<.LabelMatchers>>,container!=\"\",pod!=\"\"}) by (<<.GroupBy>>)"
      "nodeQuery": "sum(node_memory_MemTotal_bytes{job=\"node-exporter\",<<.LabelMatchers>>} - node_memory_MemAvailable_bytes{job=\"node-exporter\",<<.LabelMatchers>>}) by (<<.GroupBy>>) or sum(windows_cs_physical_memory_bytes{job=\"windows-exporter\",<<.LabelMatchers>>} - windows_memory_available_bytes{job=\"windows-exporter\",<<.LabelMatchers>>}) by (<<.GroupBy>>)"
...

# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=sum(container_memory_working_set_bytes{pod="prometheus-operator-7695b86877-bd4tk",namespace="openshift-monitoring"}) BY (container)/1024/1024' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "container": "POD"
        },
        "value": [
          1620281212.51,
          "0.171875"
        ]
      },
      {
        "metric": {
          "container": "kube-rbac-proxy"
        },
        "value": [
          1620281212.51,
          "19.5625"
        ]
      },
      {
        "metric": {
          "container": "prometheus-operator"
        },
        "value": [
          1620281212.51,
          "126.88671875"
        ]
      },
      {
        "metric": {},
        "value": [
          1620281212.51,
          "148.9921875"
        ]
      }
    ]
  }
}

# oc adm top pod prometheus-operator-7695b86877-bd4tk --containers
POD                                    NAME                  CPU(cores)   MEMORY(bytes)
prometheus-operator-7695b86877-bd4tk   POD                   0m           0Mi
prometheus-operator-7695b86877-bd4tk   kube-rbac-proxy       0m           19Mi
prometheus-operator-7695b86877-bd4tk   prometheus-operator   1m           126Mi

# oc adm top pod prometheus-operator-7695b86877-bd4tk
NAME                                   CPU(cores)   MEMORY(bytes)
prometheus-operator-7695b86877-bd4tk   1m           146Mi
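As a quick sanity check (a sketch only, assuming the same bearer token, in-cluster Prometheus URL, and test pod as above), the pod-level sum that includes the pause container lines up with what `oc adm top pod` reports:

# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query' --data-urlencode 'query=sum(container_memory_working_set_bytes{namespace="openshift-monitoring",pod="prometheus-operator-7695b86877-bd4tk",container!=""})/1024/1024' | jq -r '.data.result[0].value[1]'

This should return roughly 0.17 + 19.56 + 126.89 ≈ 146.6, matching the 146Mi shown by `oc adm top pod`, whereas adding `container!="POD"` to the selector drops the result to about 146.4.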
# oc get PodMetrics prometheus-operator-7695b86877-bd4tk -oyaml
apiVersion: metrics.k8s.io/v1beta1
containers:
- name: kube-rbac-proxy
  usage:
    cpu: "0"
    memory: 20036Ki
- name: prometheus-operator
  usage:
    cpu: 2m
    memory: 133432Ki
- name: POD
  usage:
    cpu: "0"
    memory: 176Ki
kind: PodMetrics
metadata:
  creationTimestamp: "2021-05-06T06:11:49Z"
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 0.47.0
    pod-template-hash: 7695b86877
  name: prometheus-operator-7695b86877-bd4tk
  namespace: openshift-monitoring
timestamp: "2021-05-06T06:11:49Z"
window: 5m0s
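The same numbers can also be read straight from the metrics API, which is what `oc adm top` and the HPA consume; a sketch against the test pod above, with jq used only to trim the output:

# oc get --raw /apis/metrics.k8s.io/v1beta1/namespaces/openshift-monitoring/pods/prometheus-operator-7695b86877-bd4tk | jq '.containers[] | {name, usage}'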
I don't think a backport is meaningful here since the bug has a fairly low impact on the product. To clarify, not accounting for the pause container's resource usage doesn't have any impact on the autoscaling pipeline, so the only benefit of this fix would be to make `oc adm top pods` more accurate. That said, the pause container's resource usage is so low compared to the actual application's resource usage that it is negligible. But maybe your customer has a use case that makes it non-negligible?
Yes, the HPA is also affected by this change, but the impact that the pause container's resource usage has on autoscaling is negligible, which is why I don't think this bug is worth backporting.
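For reference, this is how such a resource-based autoscaler is typically wired up (the deployment name and thresholds below are purely illustrative); the few hundred KiB used by the pause container barely moves the averaged utilization the HPA acts on:

# oc autoscale deployment/my-app --min=2 --max=10 --cpu-percent=80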
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438
*** Bug 2036003 has been marked as a duplicate of this bug. ***