Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1748260

Summary: "Pods -> By Storage" in web console, Pods by Average I/O time are all 0 ns
Product: OpenShift Container Platform Reporter: Junqi Zhao <juzhao>
Component: RHCOS Assignee: Micah Abbott <miabbott>
Status: CLOSED WONTFIX QA Contact: Michael Nguyen <mnguyen>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.2.0 CC: alegrand, anpicker, aos-bugs, bbreard, dustymabe, erooth, imcleod, jligon, jokerman, kakkoyun, lcosic, mloibl, nstielau, pkrupa, rphillips, smilner, surbania, umohnani
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-06-16 19:14:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1763288    
Bug Blocks:    
Attachments:
Description                                      Flags
Pods by Average I/O time are all 0 ns            none
result for container_fs_io_time_seconds_total    none

Description Junqi Zhao 2019-09-03 08:43:29 UTC
Created attachment 1611054 [details]
Pods by Average I/O time are all 0 ns

Description of problem:
tested with 4.2.0-0.nightly-2019-09-02-172410
"Pods -> By Storage" in web console, Pods by Average I/O time are all 0 ns

The expression for "Pods -> By Storage" is
topk(20, sort_desc(avg by (pod_name)(irate(container_fs_io_time_seconds_total{container_name="POD", pod_name!=""}[1m]))))

but every result of container_fs_io_time_seconds_total{container_name="POD", pod_name!=""} is 0.

The attachment shows the result for `container_fs_io_time_seconds_total`.
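
To illustrate why an all-zero counter produces "0 ns" in the panel, here is a minimal sketch of how an irate-style calculation behaves. It only approximates PromQL's irate() (last two samples, with a simple counter-reset rule); the function name `irate` and the sample series `flat` and `busy` are hypothetical and not taken from the cluster.

```python
from typing import Sequence, Tuple

# One sample is (unix timestamp in seconds, counter value).
Sample = Tuple[float, float]


def irate(samples: Sequence[Sample]) -> float:
    """Per-second rate computed from the last two samples of a counter."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must increase")
    # A decrease is treated as a counter reset: the new value is the increase.
    delta = v1 if v1 < v0 else v1 - v0
    return delta / (t1 - t0)


# Hypothetical series: a counter stuck at 0 and a counter that advances.
flat = [(0.0, 0.0), (30.0, 0.0), (60.0, 0.0)]
busy = [(0.0, 1.0), (30.0, 1.5), (60.0, 3.0)]

print(irate(flat))
print(irate(busy))
```

A counter that never leaves 0 yields a rate of 0 for every pod, so averaging and topk() cannot produce anything other than 0.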


Version-Release number of selected component (if applicable):
with 4.2.0-0.nightly-2019-09-02-172410

How reproducible:
Always

Steps to Reproduce:
1. Cluster admin, "Home -> Dashboard", Overview page, "Top Consumers" panel, select "Pods -> By Storage" 

Actual results:
Pods by Average I/O time are all 0 ns

Expected results:


Additional info:

Comment 1 Junqi Zhao 2019-09-03 08:44:54 UTC
Created attachment 1611055 [details]
result for container_fs_io_time_seconds_total

Comment 3 Seth Jennings 2019-09-04 15:59:17 UTC
The raw cgroup attribute for this seems to be missing on RHCOS; blkio.io_service_time_recursive.

https://github.com/openshift/origin/blob/5f82df03895827644fcfa3b37f260e0a29416022/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs/blkio.go#L199-L202

Is this new (i.e., a regression)?  I don't imagine that it is.
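
A minimal sketch of how a missing cgroup file can end up as an empty (and therefore zero) metric. It assumes the collector treats an absent stats file as "no data" rather than as an error, which is an assumption and not confirmed in this bug. The function name `read_blkio_stat` and the sample values are hypothetical; the `major:minor operation value` line layout follows the cgroup v1 blkio statistics files.

```python
import tempfile
from pathlib import Path
from typing import Dict, Tuple


def read_blkio_stat(cgroup_dir: str, name: str) -> Dict[Tuple[str, str], int]:
    """Parse a cgroup v1 blkio stats file into {(device, operation): value}."""
    stats: Dict[Tuple[str, str], int] = {}
    try:
        text = (Path(cgroup_dir) / name).read_text()
    except FileNotFoundError:
        # Assumption: a missing attribute is reported as "no data", not an error.
        return stats
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skips the trailing "Total N" summary line
        device, operation, value = fields
        stats[(device, operation)] = int(value)
    return stats


# Demonstration against a temporary directory instead of /sys/fs/cgroup/blkio.
with tempfile.TemporaryDirectory() as tmp:
    name = "blkio.io_service_time_recursive"
    absent = read_blkio_stat(tmp, name)
    (Path(tmp) / name).write_text("8:0 Read 100\n8:0 Write 50\nTotal 150\n")
    present = read_blkio_stat(tmp, name)

print(absent)
print(present)
```

On a real node the directory would be a container's path under /sys/fs/cgroup/blkio/.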

Comment 4 Ryan Phillips 2019-10-01 14:28:02 UTC
The kernel needs to be built with CONFIG_BFQ_CGROUP_DEBUG to enable blkio.io_service_time_recursive.

Reassigning to RHCOS.

Source: https://github.com/torvalds/linux/blob/04cbfba6208592999d7bfe6609ec01dc3fde73f5/block/bfq-cgroup.c#L1212-L1336
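
A minimal sketch for checking whether a kernel config enables a given option. The helper `kconfig_value` is hypothetical, and the `sample` text is illustrative only; it is not copied from a RHEL 8 kernel config. It assumes the usual kconfig text format, where a set option appears as `CONFIG_X=y` and an unset one as `# CONFIG_X is not set`.

```python
import re
from typing import Optional


def kconfig_value(config_text: str, option: str) -> Optional[str]:
    """Return the option's value, "n" if explicitly unset, None if absent."""
    set_re = re.compile(r"^%s=(.*)$" % re.escape(option))
    unset_line = "# %s is not set" % option
    for line in config_text.splitlines():
        line = line.strip()
        if line == unset_line:
            return "n"
        match = set_re.match(line)
        if match:
            return match.group(1)
    return None


# Illustrative config fragment (assumed values, not from a real RHEL kernel).
sample = """\
CONFIG_IOSCHED_BFQ=y
CONFIG_BFQ_GROUP_IOSCHED=y
# CONFIG_BFQ_CGROUP_DEBUG is not set
"""

print(kconfig_value(sample, "CONFIG_BFQ_CGROUP_DEBUG"))
```

On a node, the text would typically come from the config file shipped with the installed kernel, for example under /boot; the exact path is an assumption that depends on the distribution.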

Comment 5 Micah Abbott 2019-10-01 14:53:25 UTC
We aren't building a specific kernel for RHCOS; we consume the same kernel that is used by traditional RHEL8 nodes.

If the kernel needs to be built with specific options, you'll need to convince the RHEL folks to enable them.

Did this previously work in 4.1?  Or in 3.x?  I'd be interested to see if the option was enabled for older RHEL8 or even RHEL7 kernels.

Comment 6 Micah Abbott 2019-10-02 00:04:16 UTC
It looks like kernel options have changed between RHEL7 and RHEL8.  See below showing RHCOS, RHEL8, and RHEL7:

```
$ rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8fe0a593ee9a808afb562b9202c7135526eab39eaab4c7075bc381a363994996
              CustomOrigin: Image generated via coreos-assembler
                   Version: 43.80.20190930.0 (2019-09-30T14:58:05Z)
[core@localhost ~]$ rpm -q kernel
kernel-4.18.0-80.11.2.el8_0.x86_64
[core@localhost ~]$ ls -l /sys/fs/cgroup/blkio/
total 0
-r--r--r--. 1 root root 0 Oct  1 21:02 blkio.bfq.io_service_bytes
-r--r--r--. 1 root root 0 Oct  1 21:02 blkio.bfq.io_service_bytes_recursive
-r--r--r--. 1 root root 0 Oct  1 21:02 blkio.bfq.io_serviced
-r--r--r--. 1 root root 0 Oct  1 21:02 blkio.bfq.io_serviced_recursive
--w-------. 1 root root 0 Oct  1 21:02 blkio.reset_stats
-r--r--r--. 1 root root 0 Oct  1 21:02 blkio.throttle.io_service_bytes
-r--r--r--. 1 root root 0 Oct  1 21:02 blkio.throttle.io_service_bytes_recursive
-r--r--r--. 1 root root 0 Oct  1 21:02 blkio.throttle.io_serviced
-r--r--r--. 1 root root 0 Oct  1 21:02 blkio.throttle.io_serviced_recursive
-rw-r--r--. 1 root root 0 Oct  1 21:02 blkio.throttle.read_bps_device
-rw-r--r--. 1 root root 0 Oct  1 21:02 blkio.throttle.read_iops_device
-rw-r--r--. 1 root root 0 Oct  1 21:02 blkio.throttle.write_bps_device
-rw-r--r--. 1 root root 0 Oct  1 21:02 blkio.throttle.write_iops_device
-rw-r--r--. 1 root root 0 Oct  1 21:02 cgroup.clone_children
-rw-r--r--. 1 root root 0 Oct  1 21:02 cgroup.procs
-r--r--r--. 1 root root 0 Oct  1 21:02 cgroup.sane_behavior
-rw-r--r--. 1 root root 0 Oct  1 21:02 notify_on_release
-rw-r--r--. 1 root root 0 Oct  1 21:02 release_agent
-rw-r--r--. 1 root root 0 Oct  1 21:02 tasks


$ cat /etc/os-release 
NAME="Red Hat Enterprise Linux"
VERSION="8.0 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.0"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.0:GA"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.0
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.0"
[cloud-user@micah-rhel8-1001a ~]$ rpm -q kernel
kernel-4.18.0-80.el8.x86_64
[cloud-user@micah-rhel8-1001a ~]$ ls -l /sys/fs/cgroup/blkio/
total 0
-r--r--r--. 1 root root 0 Oct  1 17:07 blkio.bfq.io_service_bytes
-r--r--r--. 1 root root 0 Oct  1 17:07 blkio.bfq.io_service_bytes_recursive
-r--r--r--. 1 root root 0 Oct  1 17:07 blkio.bfq.io_serviced
-r--r--r--. 1 root root 0 Oct  1 17:07 blkio.bfq.io_serviced_recursive
--w-------. 1 root root 0 Oct  1 17:07 blkio.reset_stats
-r--r--r--. 1 root root 0 Oct  1 17:07 blkio.throttle.io_service_bytes
-r--r--r--. 1 root root 0 Oct  1 17:07 blkio.throttle.io_service_bytes_recursive
-r--r--r--. 1 root root 0 Oct  1 17:07 blkio.throttle.io_serviced
-r--r--r--. 1 root root 0 Oct  1 17:07 blkio.throttle.io_serviced_recursive
-rw-r--r--. 1 root root 0 Oct  1 17:07 blkio.throttle.read_bps_device
-rw-r--r--. 1 root root 0 Oct  1 17:07 blkio.throttle.read_iops_device
-rw-r--r--. 1 root root 0 Oct  1 17:07 blkio.throttle.write_bps_device
-rw-r--r--. 1 root root 0 Oct  1 17:07 blkio.throttle.write_iops_device
-rw-r--r--. 1 root root 0 Oct  1 17:07 cgroup.clone_children
-rw-r--r--. 1 root root 0 Oct  1 17:07 cgroup.procs
-r--r--r--. 1 root root 0 Oct  1 17:07 cgroup.sane_behavior
-rw-r--r--. 1 root root 0 Oct  1 17:07 notify_on_release
-rw-r--r--. 1 root root 0 Oct  1 17:07 release_agent
-rw-r--r--. 1 root root 0 Oct  1 17:07 tasks

$ cat /etc/os-release 
NAME="Red Hat Enterprise Linux Server"
VERSION="7.7 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.7"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.7 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.7:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.7
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.7"
[cloud-user@micah-rhel7-1001a ~]$ rpm -q kernel
kernel-3.10.0-1062.el7.x86_64
[cloud-user@micah-rhel7-1001a ~]$  ls -l /sys/fs/cgroup/blkio/
total 0
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_merged
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_merged_recursive
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_queued
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_queued_recursive
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_service_bytes
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_service_bytes_recursive
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_service_time
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_service_time_recursive
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_serviced
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_serviced_recursive
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_wait_time
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_wait_time_recursive
-rw-r--r--. 1 root root 0 Oct  1 19:48 blkio.leaf_weight
-rw-r--r--. 1 root root 0 Oct  1 19:48 blkio.leaf_weight_device
--w-------. 1 root root 0 Oct  1 19:48 blkio.reset_stats
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.sectors
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.sectors_recursive
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.throttle.io_service_bytes
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.throttle.io_serviced
-rw-r--r--. 1 root root 0 Oct  1 19:48 blkio.throttle.read_bps_device
-rw-r--r--. 1 root root 0 Oct  1 19:48 blkio.throttle.read_iops_device
-rw-r--r--. 1 root root 0 Oct  1 19:48 blkio.throttle.write_bps_device
-rw-r--r--. 1 root root 0 Oct  1 19:48 blkio.throttle.write_iops_device
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.time
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.time_recursive
-rw-r--r--. 1 root root 0 Oct  1 19:48 blkio.weight
-rw-r--r--. 1 root root 0 Oct  1 19:48 blkio.weight_device
-rw-r--r--. 1 root root 0 Oct  1 19:48 cgroup.clone_children
--w--w--w-. 1 root root 0 Oct  1 19:48 cgroup.event_control
-rw-r--r--. 1 root root 0 Oct  1 19:48 cgroup.procs
-r--r--r--. 1 root root 0 Oct  1 19:48 cgroup.sane_behavior
-rw-r--r--. 1 root root 0 Oct  1 19:48 notify_on_release
-rw-r--r--. 1 root root 0 Oct  1 19:48 release_agent
-rw-r--r--. 1 root root 0 Oct  1 19:48 tasks
```

If the specific cgroup attribute (blkio.io_service_time_recursive) is required for OCP 4.3, it is basically impossible for us to deliver that.  OCP 4.3 will be using RHEL 8.1 content and that set of RPMs has already been locked down.

Is it possible to compute the metric using the cgroup attributes available to the RHEL8 kernel?
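
As a rough aid for the question above, the two listings can be compared as sets. The attribute names below are transcribed from the RHEL7 and RHEL8 `ls` output in this comment (only the `blkio.*` files); the variable names are hypothetical.

```python
def blkio(*suffixes: str) -> set:
    """Build a set of attribute names with the common "blkio." prefix."""
    return {"blkio." + s for s in suffixes}


# blkio.* files from the RHEL 7.7 listing (kernel-3.10.0-1062.el7).
RHEL7_BLKIO = blkio(
    "io_merged", "io_merged_recursive", "io_queued", "io_queued_recursive",
    "io_service_bytes", "io_service_bytes_recursive",
    "io_service_time", "io_service_time_recursive",
    "io_serviced", "io_serviced_recursive",
    "io_wait_time", "io_wait_time_recursive",
    "leaf_weight", "leaf_weight_device", "reset_stats",
    "sectors", "sectors_recursive",
    "throttle.io_service_bytes", "throttle.io_serviced",
    "throttle.read_bps_device", "throttle.read_iops_device",
    "throttle.write_bps_device", "throttle.write_iops_device",
    "time", "time_recursive", "weight", "weight_device",
)

# blkio.* files from the RHEL 8.0 listing (kernel-4.18.0-80.el8).
RHEL8_BLKIO = blkio(
    "bfq.io_service_bytes", "bfq.io_service_bytes_recursive",
    "bfq.io_serviced", "bfq.io_serviced_recursive", "reset_stats",
    "throttle.io_service_bytes", "throttle.io_service_bytes_recursive",
    "throttle.io_serviced", "throttle.io_serviced_recursive",
    "throttle.read_bps_device", "throttle.read_iops_device",
    "throttle.write_bps_device", "throttle.write_iops_device",
)

missing_on_rhel8 = sorted(RHEL7_BLKIO - RHEL8_BLKIO)
time_attrs_on_rhel8 = sorted(a for a in RHEL8_BLKIO if "time" in a)

print("\n".join(missing_on_rhel8))
print(time_attrs_on_rhel8)
```

In these listings, every attribute with "time" in its name appears only on the RHEL7 side; the RHEL8 files cover bytes and operation counts plus throttle settings.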

Comment 7 Ryan Phillips 2019-10-02 00:49:42 UTC
Need info from Urvashi or Peter...

Comment 8 Urvashi Mohnani 2019-10-14 19:15:05 UTC
We see no way of calculating the required cgroup attribute from the list enabled in RHEL 8. We will have to get this enabled in the kernel.

Comment 9 Micah Abbott 2019-10-18 17:00:25 UTC
I've opened https://bugzilla.redhat.com/show_bug.cgi?id=1763288 with the kernel team for the inclusion of the necessary cgroup attribute.

We'll keep this open to track the inclusion of the fixed kernel in RHCOS.  However, it is unlikely to be fixed in the 4.3 time frame as the RHEL 8.1 content set has already been decided.

Comment 10 Micah Abbott 2020-02-26 19:40:28 UTC
The kernel folks provided us with a custom kernel to test with, but it hasn't been a priority to test with it yet.  Moving to 4.5.

Comment 12 Micah Abbott 2020-05-15 18:25:45 UTC
I'm moving this to 4.6, but based on the last comment on the kernel BZ (https://bugzilla.redhat.com/show_bug.cgi?id=1763288#c13) it doesn't seem like the missing metric is a big deal.  Especially if it means the kernel takes a performance hit.

Comment 13 Micah Abbott 2020-06-16 19:14:59 UTC
We've kicked this BZ out from 4.3 all the way to 4.6.  Comments in the related kernel BZ indicate that 1) the functionality provided by the missing metric is not critical to OCP and 2) enabling the cgroup in the kernel to provide the missing metric may have performance implications.

A "nice to have" metric that may cause performance regressions is not something we want to pursue.  I've closed the related kernel BZ and am closing this BZ as well.