Created attachment 1611054 [details]
Pods by Average I/O time are all 0 ns

Description of problem:
Tested with 4.2.0-0.nightly-2019-09-02-172410. Under "Pods -> By Storage" in the web console, Pods by Average I/O time are all 0 ns.

The expression for "Pods -> By Storage" is
topk(20, sort_desc(avg by (pod_name)(irate(container_fs_io_time_seconds_total{container_name="POD", pod_name!=""}[1m]))))
but every series returned by container_fs_io_time_seconds_total{container_name="POD", pod_name!=""} has the value 0.

The attachment is the result for `container_fs_io_time_seconds_total`.

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-09-02-172410

How reproducible:
Always

Steps to Reproduce:
1. As cluster admin, go to "Home -> Dashboard", Overview page, "Top Consumers" panel, and select "Pods -> By Storage".

Actual results:
Pods by Average I/O time are all 0 ns

Expected results:

Additional info:
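To illustrate why the panel can only show 0 ns here: `irate` takes the per-second rate from the last two samples in the range, so a counter that never moves off 0 yields a rate of 0 for every pod. The following is a minimal sketch of that calculation, not Prometheus code; the function name and the sample data are made up for illustration.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp_seconds, value) samples.

    Simplified model of PromQL irate(); returns None when fewer than two
    samples are available or the timestamps do not advance.
    """
    if len(samples) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    if t_last <= t_prev:
        return None
    delta = v_last - v_prev
    if delta < 0:
        # Counter reset: treat the counter as having restarted from zero.
        delta = v_last
    return delta / (t_last - t_prev)


# Hypothetical samples: a counter stuck at 0 (what this bug reports)
# and a counter that is actually accumulating I/O time.
flat = [(0.0, 0.0), (30.0, 0.0), (60.0, 0.0)]
busy = [(0.0, 1.5), (30.0, 2.0), (60.0, 2.6)]

if __name__ == "__main__":
    print("flat:", irate(flat))
    print("busy:", irate(busy))
```

Since `avg` and `topk` only aggregate and rank these rates, an all-zero input stays all zero in the panel.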
Created attachment 1611055 [details] result for container_fs_io_time_seconds_total
The raw cgroup attribute for this, blkio.io_service_time_recursive, seems to be missing on RHCOS. https://github.com/openshift/origin/blob/5f82df03895827644fcfa3b37f260e0a29416022/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs/blkio.go#L199-L202 Is this a new thing (i.e. a regression)? I don't imagine that it is.
The kernel needs to be built with CONFIG_BFQ_CGROUP_DEBUG to enable blkio.io_service_time_recursive. Reassigning to RHCOS. Source: https://github.com/torvalds/linux/blob/04cbfba6208592999d7bfe6609ec01dc3fde73f5/block/bfq-cgroup.c#L1212-L1336
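A quick way to confirm on a given node whether the attribute is exposed is to look for the file under the cgroup v1 blkio mount. This is a rough sketch only: the helper name is hypothetical, and the default path assumes the blkio controller is mounted at /sys/fs/cgroup/blkio.

```python
from pathlib import Path


def missing_cgroup_files(cgroup_dir, required):
    """Return the names in `required` that are not regular files in `cgroup_dir`."""
    base = Path(cgroup_dir)
    return [name for name in required if not (base / name).is_file()]


if __name__ == "__main__":
    # Assumed mount point for the cgroup v1 blkio controller.
    missing = missing_cgroup_files(
        "/sys/fs/cgroup/blkio",
        ["blkio.io_service_time_recursive"],
    )
    print("missing:", missing)
```

An empty list means the kernel exposes the attribute; a non-empty list means any metric derived from it will have no data to report.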
We aren't building a specific kernel for RHCOS; we consume the same kernel that is used by traditional RHEL8 nodes. If the kernel needs to be built with specific options, you'll need to convince the RHEL folks to enable them. Did this previously work in 4.1? Or in 3.x? I'd be interested to see if the option was enabled for older RHEL8 or even RHEL7 kernels.
It looks like kernel options have changed between RHEL7 and RHEL8. See below showing RHCOS, RHEL8, and RHEL7:

```
$ rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8fe0a593ee9a808afb562b9202c7135526eab39eaab4c7075bc381a363994996
  CustomOrigin: Image generated via coreos-assembler
  Version: 43.80.20190930.0 (2019-09-30T14:58:05Z)
[core@localhost ~]$ rpm -q kernel
kernel-4.18.0-80.11.2.el8_0.x86_64
[core@localhost ~]$ ls -l /sys/fs/cgroup/blkio/
total 0
-r--r--r--. 1 root root 0 Oct 1 21:02 blkio.bfq.io_service_bytes
-r--r--r--. 1 root root 0 Oct 1 21:02 blkio.bfq.io_service_bytes_recursive
-r--r--r--. 1 root root 0 Oct 1 21:02 blkio.bfq.io_serviced
-r--r--r--. 1 root root 0 Oct 1 21:02 blkio.bfq.io_serviced_recursive
--w-------. 1 root root 0 Oct 1 21:02 blkio.reset_stats
-r--r--r--. 1 root root 0 Oct 1 21:02 blkio.throttle.io_service_bytes
-r--r--r--. 1 root root 0 Oct 1 21:02 blkio.throttle.io_service_bytes_recursive
-r--r--r--. 1 root root 0 Oct 1 21:02 blkio.throttle.io_serviced
-r--r--r--. 1 root root 0 Oct 1 21:02 blkio.throttle.io_serviced_recursive
-rw-r--r--. 1 root root 0 Oct 1 21:02 blkio.throttle.read_bps_device
-rw-r--r--. 1 root root 0 Oct 1 21:02 blkio.throttle.read_iops_device
-rw-r--r--. 1 root root 0 Oct 1 21:02 blkio.throttle.write_bps_device
-rw-r--r--. 1 root root 0 Oct 1 21:02 blkio.throttle.write_iops_device
-rw-r--r--. 1 root root 0 Oct 1 21:02 cgroup.clone_children
-rw-r--r--. 1 root root 0 Oct 1 21:02 cgroup.procs
-r--r--r--. 1 root root 0 Oct 1 21:02 cgroup.sane_behavior
-rw-r--r--. 1 root root 0 Oct 1 21:02 notify_on_release
-rw-r--r--. 1 root root 0 Oct 1 21:02 release_agent
-rw-r--r--. 1 root root 0 Oct 1 21:02 tasks

$ cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.0 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.0"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.0:GA"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.0
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.0"
[cloud-user@micah-rhel8-1001a ~]$ rpm -q kernel
kernel-4.18.0-80.el8.x86_64
[cloud-user@micah-rhel8-1001a ~]$ ls -l /sys/fs/cgroup/blkio/
total 0
-r--r--r--. 1 root root 0 Oct 1 17:07 blkio.bfq.io_service_bytes
-r--r--r--. 1 root root 0 Oct 1 17:07 blkio.bfq.io_service_bytes_recursive
-r--r--r--. 1 root root 0 Oct 1 17:07 blkio.bfq.io_serviced
-r--r--r--. 1 root root 0 Oct 1 17:07 blkio.bfq.io_serviced_recursive
--w-------. 1 root root 0 Oct 1 17:07 blkio.reset_stats
-r--r--r--. 1 root root 0 Oct 1 17:07 blkio.throttle.io_service_bytes
-r--r--r--. 1 root root 0 Oct 1 17:07 blkio.throttle.io_service_bytes_recursive
-r--r--r--. 1 root root 0 Oct 1 17:07 blkio.throttle.io_serviced
-r--r--r--. 1 root root 0 Oct 1 17:07 blkio.throttle.io_serviced_recursive
-rw-r--r--. 1 root root 0 Oct 1 17:07 blkio.throttle.read_bps_device
-rw-r--r--. 1 root root 0 Oct 1 17:07 blkio.throttle.read_iops_device
-rw-r--r--. 1 root root 0 Oct 1 17:07 blkio.throttle.write_bps_device
-rw-r--r--. 1 root root 0 Oct 1 17:07 blkio.throttle.write_iops_device
-rw-r--r--. 1 root root 0 Oct 1 17:07 cgroup.clone_children
-rw-r--r--. 1 root root 0 Oct 1 17:07 cgroup.procs
-r--r--r--. 1 root root 0 Oct 1 17:07 cgroup.sane_behavior
-rw-r--r--. 1 root root 0 Oct 1 17:07 notify_on_release
-rw-r--r--. 1 root root 0 Oct 1 17:07 release_agent
-rw-r--r--. 1 root root 0 Oct 1 17:07 tasks

$ cat /etc/os-release
NAME="Red Hat Enterprise Linux Server"
VERSION="7.7 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.7"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.7 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.7:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.7
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.7"
[cloud-user@micah-rhel7-1001a ~]$ rpm -q kernel
kernel-3.10.0-1062.el7.x86_64
[cloud-user@micah-rhel7-1001a ~]$ ls -l /sys/fs/cgroup/blkio/
total 0
-r--r--r--. 1 root root 0 Oct 1 19:48 blkio.io_merged
-r--r--r--. 1 root root 0 Oct 1 19:48 blkio.io_merged_recursive
-r--r--r--. 1 root root 0 Oct 1 19:48 blkio.io_queued
-r--r--r--. 1 root root 0 Oct 1 19:48 blkio.io_queued_recursive
-r--r--r--. 1 root root 0 Oct 1 19:48 blkio.io_service_bytes
-r--r--r--. 1 root root 0 Oct 1 19:48 blkio.io_service_bytes_recursive
-r--r--r--. 1 root root 0 Oct 1 19:48 blkio.io_service_time
-r--r--r--. 1 root root 0 Oct 1 19:48 blkio.io_service_time_recursive
-r--r--r--. 1 root root 0 Oct 1 19:48 blkio.io_serviced
-r--r--r--. 1 root root 0 Oct 1 19:48 blkio.io_serviced_recursive
-r--r--r--. 1 root root 0 Oct 1 19:48 blkio.io_wait_time
-r--r--r--. 1 root root 0 Oct 1 19:48 blkio.io_wait_time_recursive
-rw-r--r--. 1 root root 0 Oct 1 19:48 blkio.leaf_weight
-rw-r--r--. 1 root root 0 Oct 1 19:48 blkio.leaf_weight_device
--w-------. 1 root root 0 Oct 1 19:48 blkio.reset_stats
-r--r--r--. 1 root root 0 Oct 1 19:48 blkio.sectors
-r--r--r--. 1 root root 0 Oct 1 19:48 blkio.sectors_recursive
-r--r--r--. 1 root root 0 Oct 1 19:48 blkio.throttle.io_service_bytes
-r--r--r--. 1 root root 0 Oct 1 19:48 blkio.throttle.io_serviced
-rw-r--r--. 1 root root 0 Oct 1 19:48 blkio.throttle.read_bps_device
-rw-r--r--. 1 root root 0 Oct 1 19:48 blkio.throttle.read_iops_device
-rw-r--r--. 1 root root 0 Oct 1 19:48 blkio.throttle.write_bps_device
-rw-r--r--. 1 root root 0 Oct 1 19:48 blkio.throttle.write_iops_device
-r--r--r--. 1 root root 0 Oct 1 19:48 blkio.time
-r--r--r--. 1 root root 0 Oct 1 19:48 blkio.time_recursive
-rw-r--r--. 1 root root 0 Oct 1 19:48 blkio.weight
-rw-r--r--. 1 root root 0 Oct 1 19:48 blkio.weight_device
-rw-r--r--. 1 root root 0 Oct 1 19:48 cgroup.clone_children
--w--w--w-. 1 root root 0 Oct 1 19:48 cgroup.event_control
-rw-r--r--. 1 root root 0 Oct 1 19:48 cgroup.procs
-r--r--r--. 1 root root 0 Oct 1 19:48 cgroup.sane_behavior
-rw-r--r--. 1 root root 0 Oct 1 19:48 notify_on_release
-rw-r--r--. 1 root root 0 Oct 1 19:48 release_agent
-rw-r--r--. 1 root root 0 Oct 1 19:48 tasks
```

If the specific cgroup attribute (blkio.io_service_time_recursive) is required for OCP 4.3, it is basically impossible for us to deliver that. OCP 4.3 will be using RHEL 8.1 content and that set of RPMs has already been locked down. Is it possible to compute the metric using the cgroup attributes available to the RHEL8 kernel?
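For reference, the difference between the RHEL7 and RHEL8 listings above can be worked out with a simple set comparison. The sketch below hard-codes the blkio.* file names copied from those two `ls` outputs; the variable names are illustrative only.

```python
# blkio.* files from the RHEL 7.7 listing (kernel-3.10.0-1062.el7).
RHEL7_BLKIO = {
    "blkio.io_merged", "blkio.io_merged_recursive",
    "blkio.io_queued", "blkio.io_queued_recursive",
    "blkio.io_service_bytes", "blkio.io_service_bytes_recursive",
    "blkio.io_service_time", "blkio.io_service_time_recursive",
    "blkio.io_serviced", "blkio.io_serviced_recursive",
    "blkio.io_wait_time", "blkio.io_wait_time_recursive",
    "blkio.leaf_weight", "blkio.leaf_weight_device",
    "blkio.reset_stats",
    "blkio.sectors", "blkio.sectors_recursive",
    "blkio.throttle.io_service_bytes", "blkio.throttle.io_serviced",
    "blkio.throttle.read_bps_device", "blkio.throttle.read_iops_device",
    "blkio.throttle.write_bps_device", "blkio.throttle.write_iops_device",
    "blkio.time", "blkio.time_recursive",
    "blkio.weight", "blkio.weight_device",
}

# blkio.* files from the RHEL 8.0 / RHCOS listings (kernel-4.18.0-80).
RHEL8_BLKIO = {
    "blkio.bfq.io_service_bytes", "blkio.bfq.io_service_bytes_recursive",
    "blkio.bfq.io_serviced", "blkio.bfq.io_serviced_recursive",
    "blkio.reset_stats",
    "blkio.throttle.io_service_bytes",
    "blkio.throttle.io_service_bytes_recursive",
    "blkio.throttle.io_serviced", "blkio.throttle.io_serviced_recursive",
    "blkio.throttle.read_bps_device", "blkio.throttle.read_iops_device",
    "blkio.throttle.write_bps_device", "blkio.throttle.write_iops_device",
}

# Attributes present on RHEL7 but gone on RHEL8.
only_in_rhel7 = sorted(RHEL7_BLKIO - RHEL8_BLKIO)

# Time-based attributes on each side.
time_stats_rhel7 = sorted(n for n in RHEL7_BLKIO if "time" in n)
time_stats_rhel8 = sorted(n for n in RHEL8_BLKIO if "time" in n)

if __name__ == "__main__":
    print("only in RHEL7:", only_in_rhel7)
    print("time stats on RHEL7:", time_stats_rhel7)
    print("time stats on RHEL8:", time_stats_rhel8)
```

Going by these listings, the RHEL8 side exposes only byte and operation counters plus throttle settings, with no time-based attribute at all.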
Need info from Urvashi or Peter...
We see no way of calculating the required cgroup attribute from the ones enabled in RHEL 8. We will have to get this enabled in the kernel.
I've opened https://bugzilla.redhat.com/show_bug.cgi?id=1763288 with the kernel team for the inclusion of the necessary cgroup attribute. We'll keep this open to track the inclusion of the fixed kernel in RHCOS. However, it is unlikely to be fixed in the 4.3 time frame as the RHEL 8.1 content set has already been decided.
The kernel folks provided us with a custom kernel to test with, but it hasn't been a priority to test with it yet. Moving to 4.5.
I'm moving this to 4.6, but based on the last comment on the kernel BZ (https://bugzilla.redhat.com/show_bug.cgi?id=1763288#c13) it doesn't seem like the missing metric is a big deal. Especially if it means the kernel takes a performance hit.
We've kicked this BZ out from 4.3 all the way to 4.6. Comments in the related kernel BZ indicate that 1) the functionality provided by the missing metric is not critical to OCP and 2) enabling the cgroup attribute in the kernel to provide the missing metric may have performance implications. A "nice to have" metric that may cause performance regressions is not something we want to pursue. I've closed the related kernel BZ and am closing this BZ as well.