In 4.9, container_fs_writes_total{id="/"} returns no metrics; in 4.8 it does. We need summary slice metrics for the global, system, kubepods, and QoS class slices in order to correctly attribute IO cost to the various subsystems, including the system slices. Individual cgroup metrics cannot simply be summed into these because various kernel and short-running processes are accounted to those scopes. For the majority of core cadvisor metrics, the following slices should show up:

container_fs_writes_total{id="/"}
container_fs_writes_total{id="/system.slice"}
container_fs_writes_total{id="/system.slice/crio.service"}
container_fs_writes_total{id="/kubepods.slice"}
container_fs_writes_total{id="/kubepods.slice/kubepods-besteffort.slice"}

The intent of dropping the container scope (in favor of pods) was that we didn't need the cardinality of that scope, only the pod level and the slices above it. This is a fundamental debuggability regression in 4.9 and must be fixed ASAP, because we can no longer figure out where IO use is going on the system.
This applies to most container_* metrics that we decide to keep on the pod scope.
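For illustration, a minimal PromQL sketch of the slice-level attribution these summary series enable; the 5m rate window and the slice regexes are illustrative choices:

  # write IOPS attributed to the top-level slices (system vs. kubepods)
  sum by (id) (rate(container_fs_writes_total{id=~"/(system|kubepods)\\.slice"}[5m]))

  # the same breakdown one level deeper, e.g. per system service or per QoS class
  sum by (id) (rate(container_fs_writes_total{id=~"/(system|kubepods)\\.slice/[^/]+"}[5m]))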
Ok, so reviewing the drop rule:

  - action: drop
    regex: (container_fs_.*|container_spec_.*|container_blkio_device_usage_total|container_file_descriptors|container_sockets|container_threads_max|container_threads|container_start_time_seconds|container_last_seen);;
    sourceLabels:
    - __name__
    - pod
    - namespace

container_fs_* and container_blkio_device_usage_total are pod centric (the drop rule is wrong and must change to drop only container-scoped series, relying on slice summarization from cgroups to cadvisor).

container_last_seen, container_start_time_seconds, container_threads_max, container_file_descriptors, and container_sockets are all container centric (the drop rule is correct, since cgroups doesn't sum these).

container_spec_* is probably no longer used (in a future release we can review and drop it), but it is container centric and the drop rule is correct.
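A minimal sketch of what corrected relabeling could look like under that reasoning: drop the fs/blkio series only when they carry a non-empty container label, and keep the pod-scoped drop for the container-centric series. This is illustrative only; the rules that actually shipped are shown in the verification further down.

  metricRelabelings:
  # drop fs/blkio series only at container scope, so the cgroup slice and pod
  # summaries survive (the ";.+" suffix matches a non-empty container label)
  - action: drop
    regex: (container_fs_.*|container_blkio_device_usage_total);.+
    sourceLabels:
    - __name__
    - container
  # keep dropping the purely container-centric series on the pod scope,
  # since cgroups does not sum these into slices
  - action: drop
    regex: (container_spec_.*|container_file_descriptors|container_sockets|container_threads_max|container_threads|container_start_time_seconds|container_last_seen);;
    sourceLabels:
    - __name__
    - pod
    - namespace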
Checked with 4.10.0-0.nightly-2021-09-26-233013: searching "count(container_fs_writes_total) by (id)" in Prometheus, the core cadvisor metrics mentioned in Comment 0 are found.
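A more targeted check along the same lines, restricted to the slice ids listed in Comment 0 (the id regex is illustrative):

  count(container_fs_writes_total{id=~"/|/system\\.slice|/system\\.slice/crio\\.service|/kubepods\\.slice|/kubepods\\.slice/kubepods-besteffort\\.slice"}) by (id)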
Moving back to assigned because of the discussion in https://github.com/openshift/cluster-monitoring-operator/pull/1395
Checked with 4.10.0-0.nightly-2021-09-28-220911:

# oc -n openshift-monitoring get servicemonitor kubelet -oyaml
...
  metricRelabelings:
  - action: drop
    regex: container_(network_tcp_usage_total|network_udp_usage_total|tasks_state|cpu_load_average_10s)
    sourceLabels:
    - __name__
  - action: drop
    regex: (container_spec_.*|container_file_descriptors|container_sockets|container_threads_max|container_threads|container_start_time_seconds|container_last_seen);;
    sourceLabels:
    - __name__
    - pod
    - namespace
  - action: drop
    regex: (container_blkio_device_usage_total);.+
    sourceLabels:
    - __name__
    - container
  - action: drop
    regex: container_memory_failures_total
    sourceLabels:
    - __name__
  - action: drop
    regex: (container_fs_.*);.+
    sourceLabels:
    - __name__
    - container
*************************************
Except for the results mentioned in Comment 6 and Comment 7, there is no container label for container_blkio_device_usage_total or container_fs_.*:

count(container_blkio_device_usage_total) by (container)
{}  1770

count(container_fs_writes_total) by (container)
{}  337
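A complementary check, assuming the same metric names, that the slice/pod summary series remain while the container-scoped series are gone:

  # summary series (empty container label) should still break down by slice/pod id
  count(container_fs_writes_total{container=""}) by (id)

  # container-scoped series should return no data after the ";.+" drop rules
  count(container_fs_writes_total{container!=""})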
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056