Bug 1748260 - "Pods -> By Storage" in web console, Pods by Average I/O time are all 0 ns
Summary: "Pods -> By Storage" in web console, Pods by Average I/O time are all 0 ns
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Micah Abbott
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On: 1763288
Blocks:
Reported: 2019-09-03 08:43 UTC by Junqi Zhao
Modified: 2020-06-16 19:14 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-16 19:14:59 UTC
Target Upstream Version:
Embargoed:


Attachments
Pods by Average I/O time are all 0 ns (20.20 KB, image/png)
2019-09-03 08:43 UTC, Junqi Zhao
result for container_fs_io_time_seconds_total (267.88 KB, text/plain)
2019-09-03 08:44 UTC, Junqi Zhao

Description Junqi Zhao 2019-09-03 08:43:29 UTC
Created attachment 1611054 [details]
Pods by Average I/O time are all 0 ns

Description of problem:
Tested with 4.2.0-0.nightly-2019-09-02-172410.
"Pods -> By Storage" in web console, Pods by Average I/O time are all 0 ns

The expression for "Pods -> By Storage" is
topk(20, sort_desc(avg by (pod_name)(irate(container_fs_io_time_seconds_total{container_name="POD", pod_name!=""}[1m]))))

but every result for container_fs_io_time_seconds_total{container_name="POD", pod_name!=""}
is 0.

The attached file is the result for `container_fs_io_time_seconds_total`.
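
To see why the panel shows 0 ns, it may help to sketch what `irate` does with a counter that never moves. The following is a minimal Python sketch, not Prometheus code: the function name `irate` and the `(timestamp, value)` sample layout are assumptions made for illustration. It takes the last two samples in the window and divides the increase by the elapsed time, so a series whose samples are all 0 always yields a rate of 0.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def irate(samples: List[Sample]) -> Optional[float]:
    """Approximate PromQL irate(): per-second rate from the last two samples.

    Returns None when the window holds fewer than two samples, since no
    rate can be computed in that case.
    """
    if len(samples) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    elapsed = t_last - t_prev
    if elapsed <= 0:
        return None
    # A counter that went down is treated as a reset: count from zero.
    increase = v_last if v_last < v_prev else v_last - v_prev
    return increase / elapsed


if __name__ == "__main__":
    # Hypothetical samples: the counter is stuck at 0, as in the attachment.
    stuck = [(0.0, 0.0), (30.0, 0.0), (60.0, 0.0)]
    print(irate(stuck))
```

Averaging and `topk` over such rates cannot produce anything other than 0, so the dashboard query itself is probably fine and the problem lies in the source metric.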


Version-Release number of selected component (if applicable):
with 4.2.0-0.nightly-2019-09-02-172410

How reproducible:
Always

Steps to Reproduce:
1. Cluster admin, "Home -> Dashboard", Overview page, "Top Consumers" panel, select "Pods -> By Storage" 

Actual results:
Pods by Average I/O time are all 0 ns

Expected results:


Additional info:

Comment 1 Junqi Zhao 2019-09-03 08:44:54 UTC
Created attachment 1611055 [details]
result for container_fs_io_time_seconds_total

Comment 3 Seth Jennings 2019-09-04 15:59:17 UTC
The raw cgroup attribute for this seems to be missing on RHCOS: blkio.io_service_time_recursive.

https://github.com/openshift/origin/blob/5f82df03895827644fcfa3b37f260e0a29416022/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs/blkio.go#L199-L202

Is this a new thing (i.e. a regression)?  I don't imagine that it is.
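
For context, here is a rough Python sketch of how a cgroup v1 blkio statistics file might be read, loosely following the behaviour of the runc code linked above, where a missing file is treated as "no data" rather than an error. The function names and the example path are hypothetical, and the sketch is not a port of the Go code. With `blkio.io_service_time_recursive` absent, the reader returns an empty list, which would leave the derived I/O time metric at 0.

```python
from typing import List, Tuple

# (major, minor, operation, value) for one line such as "8:0 Read 1234"
BlkioEntry = Tuple[int, int, str, int]


def parse_blkio_stat(text: str) -> List[BlkioEntry]:
    """Parse the per-device lines of a cgroup v1 blkio statistics file."""
    entries: List[BlkioEntry] = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the trailing "Total <value>" summary line.
        if len(fields) != 3 or ":" not in fields[0]:
            continue
        major, minor = fields[0].split(":", 1)
        entries.append((int(major), int(minor), fields[1], int(fields[2])))
    return entries


def read_blkio_stat(path: str) -> List[BlkioEntry]:
    """Read a blkio statistics file; a missing file yields no entries."""
    try:
        with open(path, "r") as f:
            return parse_blkio_stat(f.read())
    except FileNotFoundError:
        # Assumption: absence is not an error, the stat is simply empty.
        return []


if __name__ == "__main__":
    # Example path only; on the RHCOS node in this report it does not exist.
    print(read_blkio_stat("/sys/fs/cgroup/blkio/blkio.io_service_time_recursive"))
```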

Comment 4 Ryan Phillips 2019-10-01 14:28:02 UTC
The kernel needs to be built with CONFIG_BFQ_CGROUP_DEBUG to enable blkio.io_service_time_recursive.

Reassigning to RHCOS.

Source: https://github.com/torvalds/linux/blob/04cbfba6208592999d7bfe6609ec01dc3fde73f5/block/bfq-cgroup.c#L1212-L1336

Comment 5 Micah Abbott 2019-10-01 14:53:25 UTC
We aren't building a specific kernel for RHCOS; we consume the same kernel that is used by traditional RHEL8 nodes.

If the kernel needs to be built with specific options, you'll need to convince the RHEL folks to enable them.

Did this previously work in 4.1?  Or in 3.x?  I'd be interested to see if the option was enabled for older RHEL8 or even RHEL7 kernels.

Comment 6 Micah Abbott 2019-10-02 00:04:16 UTC
It looks like kernel options have changed between RHEL7 and RHEL8.  See below showing RHCOS, RHEL8, and RHEL7:

```
$ rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8fe0a593ee9a808afb562b9202c7135526eab39eaab4c7075bc381a363994996
              CustomOrigin: Image generated via coreos-assembler
                   Version: 43.80.20190930.0 (2019-09-30T14:58:05Z)
[core@localhost ~]$ rpm -q kernel
kernel-4.18.0-80.11.2.el8_0.x86_64
[core@localhost ~]$ ls -l /sys/fs/cgroup/blkio/
total 0
-r--r--r--. 1 root root 0 Oct  1 21:02 blkio.bfq.io_service_bytes
-r--r--r--. 1 root root 0 Oct  1 21:02 blkio.bfq.io_service_bytes_recursive
-r--r--r--. 1 root root 0 Oct  1 21:02 blkio.bfq.io_serviced
-r--r--r--. 1 root root 0 Oct  1 21:02 blkio.bfq.io_serviced_recursive
--w-------. 1 root root 0 Oct  1 21:02 blkio.reset_stats
-r--r--r--. 1 root root 0 Oct  1 21:02 blkio.throttle.io_service_bytes
-r--r--r--. 1 root root 0 Oct  1 21:02 blkio.throttle.io_service_bytes_recursive
-r--r--r--. 1 root root 0 Oct  1 21:02 blkio.throttle.io_serviced
-r--r--r--. 1 root root 0 Oct  1 21:02 blkio.throttle.io_serviced_recursive
-rw-r--r--. 1 root root 0 Oct  1 21:02 blkio.throttle.read_bps_device
-rw-r--r--. 1 root root 0 Oct  1 21:02 blkio.throttle.read_iops_device
-rw-r--r--. 1 root root 0 Oct  1 21:02 blkio.throttle.write_bps_device
-rw-r--r--. 1 root root 0 Oct  1 21:02 blkio.throttle.write_iops_device
-rw-r--r--. 1 root root 0 Oct  1 21:02 cgroup.clone_children
-rw-r--r--. 1 root root 0 Oct  1 21:02 cgroup.procs
-r--r--r--. 1 root root 0 Oct  1 21:02 cgroup.sane_behavior
-rw-r--r--. 1 root root 0 Oct  1 21:02 notify_on_release
-rw-r--r--. 1 root root 0 Oct  1 21:02 release_agent
-rw-r--r--. 1 root root 0 Oct  1 21:02 tasks


$ cat /etc/os-release 
NAME="Red Hat Enterprise Linux"
VERSION="8.0 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.0"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.0:GA"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.0
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.0"
[cloud-user@micah-rhel8-1001a ~]$ rpm -q kernel
kernel-4.18.0-80.el8.x86_64
[cloud-user@micah-rhel8-1001a ~]$ ls -l /sys/fs/cgroup/blkio/
total 0
-r--r--r--. 1 root root 0 Oct  1 17:07 blkio.bfq.io_service_bytes
-r--r--r--. 1 root root 0 Oct  1 17:07 blkio.bfq.io_service_bytes_recursive
-r--r--r--. 1 root root 0 Oct  1 17:07 blkio.bfq.io_serviced
-r--r--r--. 1 root root 0 Oct  1 17:07 blkio.bfq.io_serviced_recursive
--w-------. 1 root root 0 Oct  1 17:07 blkio.reset_stats
-r--r--r--. 1 root root 0 Oct  1 17:07 blkio.throttle.io_service_bytes
-r--r--r--. 1 root root 0 Oct  1 17:07 blkio.throttle.io_service_bytes_recursive
-r--r--r--. 1 root root 0 Oct  1 17:07 blkio.throttle.io_serviced
-r--r--r--. 1 root root 0 Oct  1 17:07 blkio.throttle.io_serviced_recursive
-rw-r--r--. 1 root root 0 Oct  1 17:07 blkio.throttle.read_bps_device
-rw-r--r--. 1 root root 0 Oct  1 17:07 blkio.throttle.read_iops_device
-rw-r--r--. 1 root root 0 Oct  1 17:07 blkio.throttle.write_bps_device
-rw-r--r--. 1 root root 0 Oct  1 17:07 blkio.throttle.write_iops_device
-rw-r--r--. 1 root root 0 Oct  1 17:07 cgroup.clone_children
-rw-r--r--. 1 root root 0 Oct  1 17:07 cgroup.procs
-r--r--r--. 1 root root 0 Oct  1 17:07 cgroup.sane_behavior
-rw-r--r--. 1 root root 0 Oct  1 17:07 notify_on_release
-rw-r--r--. 1 root root 0 Oct  1 17:07 release_agent
-rw-r--r--. 1 root root 0 Oct  1 17:07 tasks

$ cat /etc/os-release 
NAME="Red Hat Enterprise Linux Server"
VERSION="7.7 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.7"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.7 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.7:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.7
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.7"
[cloud-user@micah-rhel7-1001a ~]$ rpm -q kernel
kernel-3.10.0-1062.el7.x86_64
[cloud-user@micah-rhel7-1001a ~]$  ls -l /sys/fs/cgroup/blkio/
total 0
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_merged
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_merged_recursive
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_queued
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_queued_recursive
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_service_bytes
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_service_bytes_recursive
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_service_time
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_service_time_recursive
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_serviced
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_serviced_recursive
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_wait_time
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.io_wait_time_recursive
-rw-r--r--. 1 root root 0 Oct  1 19:48 blkio.leaf_weight
-rw-r--r--. 1 root root 0 Oct  1 19:48 blkio.leaf_weight_device
--w-------. 1 root root 0 Oct  1 19:48 blkio.reset_stats
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.sectors
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.sectors_recursive
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.throttle.io_service_bytes
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.throttle.io_serviced
-rw-r--r--. 1 root root 0 Oct  1 19:48 blkio.throttle.read_bps_device
-rw-r--r--. 1 root root 0 Oct  1 19:48 blkio.throttle.read_iops_device
-rw-r--r--. 1 root root 0 Oct  1 19:48 blkio.throttle.write_bps_device
-rw-r--r--. 1 root root 0 Oct  1 19:48 blkio.throttle.write_iops_device
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.time
-r--r--r--. 1 root root 0 Oct  1 19:48 blkio.time_recursive
-rw-r--r--. 1 root root 0 Oct  1 19:48 blkio.weight
-rw-r--r--. 1 root root 0 Oct  1 19:48 blkio.weight_device
-rw-r--r--. 1 root root 0 Oct  1 19:48 cgroup.clone_children
--w--w--w-. 1 root root 0 Oct  1 19:48 cgroup.event_control
-rw-r--r--. 1 root root 0 Oct  1 19:48 cgroup.procs
-r--r--r--. 1 root root 0 Oct  1 19:48 cgroup.sane_behavior
-rw-r--r--. 1 root root 0 Oct  1 19:48 notify_on_release
-rw-r--r--. 1 root root 0 Oct  1 19:48 release_agent
-rw-r--r--. 1 root root 0 Oct  1 19:48 tasks
```

If the specific cgroup attribute (blkio.io_service_time_recursive) is required for OCP 4.3, it is basically impossible for us to deliver that.  OCP 4.3 will be using RHEL 8.1 content and that set of RPMs has already been locked down.

Is it possible to compute the metric using the cgroup attributes available to the RHEL8 kernel?
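
The difference between the three listings can be summarised with a small check. This is an illustrative Python sketch only: the function name is hypothetical, and the required attribute is assumed to be just `blkio.io_service_time_recursive`, per comment 3.

```python
import os
from typing import Iterable, List

# Assumption: the only attribute needed for the I/O time metric.
REQUIRED_ATTRS = ["blkio.io_service_time_recursive"]


def missing_blkio_attrs(present: Iterable[str],
                        required: Iterable[str] = REQUIRED_ATTRS) -> List[str]:
    """Return the required attribute names that are absent from `present`."""
    have = set(present)
    return [name for name in required if name not in have]


if __name__ == "__main__":
    blkio_dir = "/sys/fs/cgroup/blkio"
    names = os.listdir(blkio_dir) if os.path.isdir(blkio_dir) else []
    print(missing_blkio_attrs(names))
```

Fed the file names from the RHCOS or RHEL 8 listing, the check reports the attribute as missing; fed the RHEL 7 listing, it reports nothing missing.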

Comment 7 Ryan Phillips 2019-10-02 00:49:42 UTC
Need info from Urvashi or Peter...

Comment 8 Urvashi Mohnani 2019-10-14 19:15:05 UTC
We see no way of calculating the required cgroup attribute from the list enabled in RHEL 8. We will have to get this enabled in the kernel.

Comment 9 Micah Abbott 2019-10-18 17:00:25 UTC
I've opened https://bugzilla.redhat.com/show_bug.cgi?id=1763288 with the kernel team for the inclusion of the necessary cgroup attribute.

We'll keep this open to track the inclusion of the fixed kernel in RHCOS.  However, it is unlikely to be fixed in the 4.3 time frame as the RHEL 8.1 content set has already been decided.

Comment 10 Micah Abbott 2020-02-26 19:40:28 UTC
The kernel folks provided us with a custom kernel to test with, but it hasn't been a priority to test with it yet.  Moving to 4.5.

Comment 12 Micah Abbott 2020-05-15 18:25:45 UTC
I'm moving this to 4.6, but based on the last comment on the kernel BZ (https://bugzilla.redhat.com/show_bug.cgi?id=1763288#c13) it doesn't seem like the missing metric is a big deal, especially if enabling it means the kernel takes a performance hit.

Comment 13 Micah Abbott 2020-06-16 19:14:59 UTC
We've kicked this BZ out from 4.3 all the way to 4.6.  Comments in the related kernel BZ indicate that 1) the functionality provided by the missing metric is not critical to OCP and 2) enabling the cgroup in the kernel to provide the missing metric may have performance implications.

A "nice to have" metric that may cause performance regressions is not something we want to pursue.  I've closed the related kernel BZ and am closing this BZ as well.

