Bug 1853467 - container_fs_writes_total is inconsistent with CPU/memory in summarizing cgroup values
Summary: container_fs_writes_total is inconsistent with CPU/memory in summarizing cgroup values
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Harshal Patil
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-02 18:27 UTC by Clayton Coleman
Modified: 2021-07-27 22:32 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:32:27 UTC
Target Upstream Version:
Embargoed:


Links
System ID                               Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata RHSA-2021:2438   0        None      None    None     2021-07-27 22:32:43 UTC

Description Clayton Coleman 2020-07-02 18:27:05 UTC
CPU and memory for a pod cgroup are summarized at the pod level, e.g. in container_cpu_usage_seconds_total{container="",pod="X",namespace="Y"}.

container_fs_writes_total is not summarized the same way: neither container="POD" nor container="" is reported as the sum of the IO metrics of the containers. I suspect this is broken for container_fs_reads_total, and it is broken for container_fs_writes_bytes_total. I suspect it is broken for all IO metrics.

This is high severity because it means we can't correctly query IO metrics for a cgroup (where IO is accounted to a container OR to the pod cgroup itself, for instance when processes run inside the pod cgroup but not in a container, like conmon or pulls).
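For illustration, the expected symmetry in PromQL (a sketch assuming the standard kubelet /metrics/cadvisor label set; X and Y are placeholders):

# Pod-cgroup CPU is summarized under container="":
container_cpu_usage_seconds_total{container="", pod="X", namespace="Y"}

# The analogous pod-level IO series should cover IO from all containers
# plus pod-cgroup processes, but currently does not add up:
container_fs_writes_total{container="", pod="X", namespace="Y"}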

I suspect this is broken in 4.4 as well, setting this to 4.5 for now.

Comment 3 Clayton Coleman 2020-07-09 16:27:25 UTC
After we fix this I expect to see an e2e test that verifies (on a pod doing stress IO):

1. that per-container IO is correct
2. that pod-level IO summarization is correct (pod >= sum of container IO; see the sketch below)
3. that node-level IO summarization is correct (node IO for its disks >= sum of pod IO)
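A rough PromQL sketch of check 2 (assuming the standard cadvisor labels; an empty result means the invariant holds for every pod):

# flag any pod/device where the pod-level cgroup reports fewer writes
# than the sum of its containers
sum by(namespace, pod, device) (container_fs_writes_total{container=""})
  < sum by(namespace, pod, device) (container_fs_writes_total{container!="", container!="POD"})

Check 3 would compare the node's disk counters (e.g. from node_exporter) against the per-pod sums in the same way.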

Comment 27 Sunil Choudhary 2021-04-22 08:54:42 UTC
Thanks Harshal for the reproducer.

I tested on 4.8.0-0.nightly-2021-04-21-231018

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-21-231018   True        False         173m    Cluster version is 4.8.0-0.nightly-2021-04-21-231018


For the Prometheus query at the container level:

container_fs_writes_total{pod="postgresql-1-8f7nn", container="postgresql"}

container_fs_writes_total{container="postgresql", device="/dev/nvme0n1", endpoint="https-metrics", id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod93b8237d_b882_48f5_805d_003e02a5cfd7.slice/crio-97b4eb0829afae869dc78d88ce2482e32e6866130d8535b85e9042157e97f859.scope", image="image-registry.openshift-image-registry.svc:5000/openshift/postgresql@sha256:ae810ec4cf64df30858ae03fee7c1542e0cc84aa09a23fa2d1759623647eec61", instance="10.0.130.215:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_postgresql_postgresql-1-8f7nn_app1_93b8237d-b882-48f5-805d-003e02a5cfd7_0", namespace="app1", node="ip-10-0-130-215.us-east-2.compute.internal", pod="postgresql-1-8f7nn", prometheus="openshift-monitoring/k8s", service="kubelet"}  1261


and at the pod level:

container_fs_writes_total{pod="postgresql-1-8f7nn", container=""}

container_fs_writes_total{device="/dev/nvme0n1", endpoint="https-metrics", id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod93b8237d_b882_48f5_805d_003e02a5cfd7.slice", instance="10.0.130.215:10250", job="kubelet", metrics_path="/metrics/cadvisor", namespace="app1", node="ip-10-0-130-215.us-east-2.compute.internal", pod="postgresql-1-8f7nn", prometheus="openshift-monitoring/k8s", service="kubelet"}  1261
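Both series report 1261 writes for /dev/nvme0n1, so the pod-level cgroup now lines up with its single container. A sketch of a spot-check query (assuming the labels above; a negative value would indicate the original bug):

# pod-level writes minus the summed per-container writes; should be >= 0
sum by(device) (container_fs_writes_total{namespace="app1", pod="postgresql-1-8f7nn", container=""})
  - sum by(device) (container_fs_writes_total{namespace="app1", pod="postgresql-1-8f7nn", container!="", container!="POD"})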

Comment 30 errata-xmlrpc 2021-07-27 22:32:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

