Bug 1853467
| Summary: | container_fs_writes_total is inconsistent with CPU/memory in summarizing cgroup values | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Node | Assignee: | Harshal Patil <harpatil> |
| Node sub component: | Kubelet | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | aos-bugs, harpatil, jokerman, mpatel, nagrawal, tsweeney |
| Version: | 4.5 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 22:32:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
After we fix this, I expect to see an e2e test that verifies (on a pod doing stress IO):
1. that per-container IO is correct
2. that pod-level summarization of container IO is correct (pod IO >= sum of container IO)
3. that node IO summarization is correct (node IO for disks >= sum of pod IO)

Thanks, Harshal, for the reproducer.
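Checks 2 and 3 above could be expressed roughly as follows. This is only a sketch, not the actual e2e test: the function name `check_io_summaries`, the dictionary shapes, and the sample values are hypothetical, and it assumes the write counters have already been scraped into plain Python dictionaries. Check 1 (per-container correctness) needs a workload with a known write count, so it is left out here.

```python
from collections import defaultdict


def check_io_summaries(container_io, pod_io, node_io, pod_node):
    """Return a list of human-readable violations (empty list means OK).

    container_io: {(pod, container): writes}  per-container counters
    pod_io:       {pod: writes}               pod-level (container="") counters
    node_io:      {node: writes}              node-level disk counters
    pod_node:     {pod: node}                 where each pod is scheduled
    All shapes are assumptions made for this illustration.
    """
    violations = []

    # Check 2: pod-level IO must be at least the sum of its containers' IO.
    per_pod = defaultdict(int)
    for (pod, _container), writes in container_io.items():
        per_pod[pod] += writes
    for pod, container_sum in per_pod.items():
        reported = pod_io.get(pod)
        if reported is None:
            violations.append(f"pod {pod}: no pod-level series found")
        elif reported < container_sum:
            violations.append(
                f"pod {pod}: pod IO {reported} < container sum {container_sum}"
            )

    # Check 3: node-level IO must be at least the sum of its pods' IO.
    per_node = defaultdict(int)
    for pod, writes in pod_io.items():
        node = pod_node.get(pod)
        if node is None:
            violations.append(f"pod {pod}: unknown node")
            continue
        per_node[node] += writes
    for node, pod_sum in per_node.items():
        reported = node_io.get(node)
        if reported is None:
            violations.append(f"node {node}: no node-level series found")
        elif reported < pod_sum:
            violations.append(
                f"node {node}: node IO {reported} < pod sum {pod_sum}"
            )

    return violations


if __name__ == "__main__":
    # Hypothetical sample data; pod and node names are made up.
    containers = {("pg", "postgresql"): 1261}
    for message in check_io_summaries(
        containers, {"pg": 0}, {"n1": 2000}, {"pg": "n1"}
    ):
        print(message)
```

Using `>=` rather than `==` follows the comment above: IO can be charged to the pod cgroup itself (not to any container), so the pod total may legitimately exceed the container sum.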
I tested on 4.8.0-0.nightly-2021-04-21-231018
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.8.0-0.nightly-2021-04-21-231018 True False 173m Cluster version is 4.8.0-0.nightly-2021-04-21-231018
For the Prometheus query at the container level,
container_fs_writes_total{pod="postgresql-1-8f7nn", container="postgresql"}
container_fs_writes_total postgresql /dev/nvme0n1 https-metrics /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod93b8237d_b882_48f5_805d_003e02a5cfd7.slice/crio-97b4eb0829afae869dc78d88ce2482e32e6866130d8535b85e9042157e97f859.scope image-registry.openshift-image-registry.svc:5000/openshift/postgresql@sha256:ae810ec4cf64df30858ae03fee7c1542e0cc84aa09a23fa2d1759623647eec61 10.0.130.215:10250 kubelet /metrics/cadvisor k8s_postgresql_postgresql-1-8f7nn_app1_93b8237d-b882-48f5-805d-003e02a5cfd7_0 app1 ip-10-0-130-215.us-east-2.compute.internal postgresql-1-8f7nn openshift-monitoring/k8s kubelet 1261
and for the query at the pod level,
container_fs_writes_total{pod="postgresql-1-8f7nn", container=""}
container_fs_writes_total /dev/nvme0n1 https-metrics /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod93b8237d_b882_48f5_805d_003e02a5cfd7.slice 10.0.130.215:10250 kubelet /metrics/cadvisor app1 ip-10-0-130-215.us-east-2.compute.internal postgresql-1-8f7nn openshift-monitoring/k8s kubelet 1261
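The two results differ mainly in the cgroup path: the container-level series ends in a `crio-<id>.scope` component, while the pod-level series stops at the pod's `.slice`. A small helper along these lines can tell them apart when post-processing scraped samples. It is a sketch only: `classify_cgroup` is a hypothetical name, and the regular expressions assume the systemd-style cgroup layout with CRI-O that appears in the output above; other runtimes or cgroup drivers use different paths.

```python
import re

# Assumed layout (taken from the sample output above):
#   .../kubepods-<qos>-pod<uid_with_underscores>.slice[/crio-<container_id>.scope]
POD_RE = re.compile(r"kubepods-(?:[a-z]+-)?pod(?P<uid>[0-9a-f_]+)\.slice")
CONTAINER_RE = re.compile(r"/crio-(?P<cid>[0-9a-f]+)\.scope$")


def classify_cgroup(cgroup_id):
    """Return (kind, pod_uid, container_id) for a cgroup path.

    kind is "container", "pod", or "other". The pod UID is returned with
    hyphens, i.e. in the form it takes in the pod's metadata.
    """
    pod = POD_RE.search(cgroup_id)
    if pod is None:
        return ("other", None, None)
    uid = pod.group("uid").replace("_", "-")

    container = CONTAINER_RE.search(cgroup_id)
    if container is not None:
        return ("container", uid, container.group("cid"))
    if cgroup_id.endswith(pod.group(0)):
        return ("pod", uid, None)
    return ("other", uid, None)


if __name__ == "__main__":
    pod_path = (
        "/kubepods.slice/kubepods-burstable.slice/"
        "kubepods-burstable-pod93b8237d_b882_48f5_805d_003e02a5cfd7.slice"
    )
    print(classify_cgroup(pod_path))
```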
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438
CPU and memory for a pod cgroup are summarized in container_cpu_usage_seconds_total{container="",pod="X",namespace="Y"}. container_fs_writes_total is not summarized the same way: neither container="POD" nor container="" is reported as the sum of the IO metrics of the pod's containers. I suspect this is also broken for container_fs_reads_total, and it is broken for container_fs_writes_bytes_total. I suspect it is broken for all IO metrics. This is high severity because it means we can't actually query IO metrics correctly for a cgroup (where IO is accounted to a container OR the cgroup, for instance when processes run inside the pod cgroup but not in a container, like conmon or pulls). I suspect this is broken in 4.4 as well; setting this to 4.5 for now.