Bug 1939547
| Summary: | Include container="POD" in resource queries | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Pawel Krupa <pkrupa> |
| Component: | Monitoring | Assignee: | Damien Grisonnet <dgrisonn> |
| Status: | CLOSED ERRATA | QA Contact: | hongyan li <hongyli> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.7 | CC: | alegrand, anowak, anpicker, dgrisonn, erooth, kakkoyun, lcosic, mdhanve, pkrupa, rsandu, skanniha, skrenger, spasquie |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | If this bug requires documentation, please select an appropriate Doc Type value. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 22:53:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Pawel Krupa
2021-03-16 15:29:59 UTC
Increasing severity/priority to medium as this bug also affects autoscaling.

Since the PR has been merged upstream, the fix will land in 4.8 with the bump of kube-prometheus downstream. Closing as UPSTREAM.

Tested with payload 4.8.0-0.nightly-2021-05-06-003426.
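One way to double-check that the cluster is actually running that payload before inspecting the adapter configuration is the standard cluster-version query (shown here as a hedged aside, not part of the recorded verification output):

# oc get clusterversion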
# oc get cm prometheus-adapter-prometheus-config -oyaml
...
"cpu":
"containerLabel": "container"
"containerQuery": "sum(irate(container_cpu_usage_seconds_total{<<.LabelMatchers>>,container!=\"\",pod!=\"\"}[5m])) by (<<.GroupBy>>)"
"nodeQuery": "sum(1 - irate(node_cpu_seconds_total{mode=\"idle\"}[5m]) * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:{<<.LabelMatchers>>}) by (<<.GroupBy>>) or sum (1- irate(windows_cpu_time_total{mode=\"idle\", job=\"windows-exporter\",<<.LabelMatchers>>}[5m])) by (<<.GroupBy>>)"
...
"memory":
"containerLabel": "container"
"containerQuery": "sum(container_memory_working_set_bytes{<<.LabelMatchers>>,container!=\"\",pod!=\"\"}) by (<<.GroupBy>>)"
"nodeQuery": "sum(node_memory_MemTotal_bytes{job=\"node-exporter\",<<.LabelMatchers>>} - node_memory_MemAvailable_bytes{job=\"node-exporter\",<<.LabelMatchers>>}) by (<<.GroupBy>>) or sum(windows_cs_physical_memory_bytes{job=\"windows-exporter\",<<.LabelMatchers>>} - windows_memory_available_bytes{job=\"windows-exporter\",<<.LabelMatchers>>}) by (<<.GroupBy>>)"
...
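Both containerQuery templates keep only the container!="" filter, so the pause container (exposed by cAdvisor with container="POD", as the verification below confirms) is now counted. At query time prometheus-adapter fills <<.LabelMatchers>> with matchers scoping the request to the selected namespace and pods, and <<.GroupBy>> with the labels it needs to break the result out per pod and per container (the containerLabel setting tells it which label identifies the container). As a rough sketch, with the exact matcher set assumed and modeled on the manual check below, the memory query for a single pod would expand to something like:

sum(container_memory_working_set_bytes{namespace="openshift-monitoring",pod="prometheus-operator-7695b86877-bd4tk",container!="",pod!=""}) by (pod, namespace, container)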
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=sum(container_memory_working_set_bytes{pod="prometheus-operator-7695b86877-bd4tk",namespace="openshift-monitoring"}) BY (container)/1024/1024'|jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 524 0 343 100 181 19055 10055 --:--:-- --:--:-- --:--:-- 29111
{
"status": "success",
"data": {
"resultType": "vector",
"result": [
{
"metric": {
"container": "POD"
},
"value": [
1620281212.51,
"0.171875"
]
},
{
"metric": {
"container": "kube-rbac-proxy"
},
"value": [
1620281212.51,
"19.5625"
]
},
{
"metric": {
"container": "prometheus-operator"
},
"value": [
1620281212.51,
"126.88671875"
]
},
{
"metric": {},
"value": [
1620281212.51,
"148.9921875"
]
}
]
}
}
# oc adm top pod prometheus-operator-7695b86877-bd4tk --containers
POD NAME CPU(cores) MEMORY(bytes)
prometheus-operator-7695b86877-bd4tk POD 0m 0Mi
prometheus-operator-7695b86877-bd4tk kube-rbac-proxy 0m 19Mi
prometheus-operator-7695b86877-bd4tk prometheus-operator 1m 126Mi
# oc adm top pod prometheus-operator-7695b86877-bd4tk
NAME CPU(cores) MEMORY(bytes)
prometheus-operator-7695b86877-bd4tk 1m 146Mi
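Cross-checking the numbers from the Prometheus query above: 0.171875 + 19.5625 + 126.88671875 ≈ 146.62 MiB, which lines up with the 146Mi that `oc adm top pod` reports for the whole pod now that the POD (pause) container is included in the sum. The fourth series with an empty label set (≈ 148.99 MiB) is most likely the pod-level cgroup sample, which carries no container label and is not filtered out by the manual query.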
# oc get PodMetrics prometheus-operator-7695b86877-bd4tk -oyaml
apiVersion: metrics.k8s.io/v1beta1
containers:
- name: kube-rbac-proxy
usage:
cpu: "0"
memory: 20036Ki
- name: prometheus-operator
usage:
cpu: 2m
memory: 133432Ki
- name: POD
usage:
cpu: "0"
memory: 176Ki
kind: PodMetrics
metadata:
creationTimestamp: "2021-05-06T06:11:49Z"
labels:
app.kubernetes.io/component: controller
app.kubernetes.io/name: prometheus-operator
app.kubernetes.io/part-of: openshift-monitoring
app.kubernetes.io/version: 0.47.0
pod-template-hash: 7695b86877
name: prometheus-operator-7695b86877-bd4tk
namespace: openshift-monitoring
timestamp: "2021-05-06T06:11:49Z"
window: 5m0s
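The same object can also be read straight from the resource metrics API served by prometheus-adapter; a minimal sketch using the pod from the example above:

# oc get --raw /apis/metrics.k8s.io/v1beta1/namespaces/openshift-monitoring/pods/prometheus-operator-7695b86877-bd4tk | jq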
I don't think a backport is meaningful here since the bug has fairly low impact on the product. To clarify, not accounting for the pause container's resource usage has no impact on the autoscaling pipeline, so the only benefit of this fix would be to make `oc adm top pods` more accurate. That said, the pause container's resource usage is so low compared to actual application resource usage that it is negligible. But maybe your customer has a use case that makes it non-negligible?

Yes, the HPA is also affected by this change, but the impact that the resource usage of the pause container has on autoscaling is negligible, which is why I don't think this bug is worth backporting.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

*** Bug 2036003 has been marked as a duplicate of this bug. ***
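For reference on the autoscaling remark above, a resource-type HPA is the consumer of these pod metrics via the metrics.k8s.io API. A minimal, hypothetical manifest (names, namespace, and target values are illustrative only, not taken from this bug):

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app        # hypothetical workload, for illustration only
  namespace: example-ns
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        # compared against usage reported by the metrics API, which after this
        # fix also counts the (tiny) pause container
        averageUtilization: 80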