Bug 1673787 - Grafana DISK IO metrics are empty due to not matching disk name patterns
Summary: Grafana DISK IO metrics are empty due to not matching disk name patterns
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.1.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 1678645
Blocks:
 
Reported: 2019-02-08 05:26 UTC by Daein Park
Modified: 2019-06-04 10:43 UTC (History)
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:42:43 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 None None None 2019-06-04 10:43:51 UTC

Description Daein Park 2019-02-08 05:26:18 UTC
Description of problem:

The Disk IO metrics are empty on the Grafana dashboard of Prometheus Cluster Monitoring [0] when the disk device name has the "vd" prefix.

[0] Prometheus Cluster Monitoring 
    [https://docs.openshift.com/container-platform/3.11/install_config/prometheus_cluster_monitoring.html]

Version-Release number of selected component (if applicable):

# oc version
oc v3.11.69
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

openshift v3.11.69
kubernetes v1.11.0+d4cacc0

# images
ose-cluster-monitoring-operator:v3.11
ose-prometheus-operator:v3.11
...

How reproducible:

When the virtual disk device name's prefix is "vd", this can always be reproduced, e.g. on RHV or some guest OSes on OpenStack.

e.g.>

# ls -1 /dev/vd*
/dev/vda
/dev/vda1
/dev/vda2
/dev/vdb
/dev/vdb1
/dev/vdc
/dev/vdd

Steps to Reproduce:
1.
2.
3.

Actual results:

The following metrics are empty:

* Disk IO Utilisation
* Disk IO Saturation

Expected results:

The metrics should be displayed normally.

Additional info:

I've found that the related recording rules are already fixed in the master branch, as follows.
But I don't know when this fix will be backported to v3.11.

[https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/prometheus-k8s/rules.yaml#L214-L233]
~~~
      record: node:node_memory_swap_io_bytes:sum_rate
    - expr: |
        avg(irate(node_disk_io_time_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+"}[1m]))
      record: :node_disk_utilisation:avg_irate
    - expr: |
        avg by (node) (
          irate(node_disk_io_time_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+"}[1m])
        * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        )
      record: node:node_disk_utilisation:avg_irate
    - expr: |
        avg(irate(node_disk_io_time_weighted_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+"}[1m]) / 1e3)
      record: :node_disk_saturation:avg_irate
    - expr: |
        avg by (node) (
          irate(node_disk_io_time_weighted_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+"}[1m]) / 1e3
        * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        )
~~~

Comment 1 Frederic Branczyk 2019-02-08 13:24:20 UTC
Unfortunately due to how the dependencies work and evolved, it's not trivial to backport this. We're likely to only ship this fix in 4.0, not 3.11.

Comment 4 Junqi Zhao 2019-03-01 09:04:56 UTC
@Frederic
It seems we missed one device pattern.
I checked in a 3.11 env and found it has device="dm-0"; there may also be "dm-1", "dm-2" devices, e.g.:
$ ls -l /dev/dm*
brw-rw----. 1 root disk 253, 0 Mar  1 08:11 /dev/dm-0
brw-rw----. 1 root disk 253, 1 Mar  1 08:11 /dev/dm-1
brw-rw----. 1 root disk 253, 2 Mar  1 08:11 /dev/dm-2

node_disk_io_time_ms in 3.11 also reports these devices, e.g.:
node_disk_io_time_ms{device="dm-0",endpoint="https",instance="10.0.77.93:9100",job="node-exporter",namespace="openshift-monitoring",pod="node-exporter-9znkd",service="node-exporter"}	933001
node_disk_io_time_ms{device="vda",endpoint="https",instance="10.0.76.252:9100",job="node-exporter",namespace="openshift-monitoring",pod="node-exporter-k5vxn",service="node-exporter"}	59668

But the Prometheus rules do not cover this kind of device, e.g.:
record: node:node_disk_saturation:avg_irate
expr: avg
  by(node) (irate(node_disk_io_time_weighted_seconds_total{device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+",job="node-exporter"}[1m])
  / 1000 * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:)
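The observation above can be checked directly; a minimal sketch (again using Python's `re.fullmatch` to approximate Prometheus's anchored matching) confirms that device-mapper names fall through the pattern used in this rule:

```python
import re

# Device pattern used by the recording rule above.
RULE_PATTERN = re.compile(r"nvme.+|rbd.+|sd.+|vd.+|xvd.+")

# No alternative starts with "dm-", so device-mapper devices are excluded
# from the recorded series even though node-exporter reports them.
for dev in ("dm-0", "dm-1", "dm-2"):
    assert RULE_PATTERN.fullmatch(dev) is None, dev
print("dm-* devices are excluded by the rule pattern")
```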


Shall we add this device to the Prometheus rules? Same question for https://bugzilla.redhat.com/show_bug.cgi?id=1680517#c3

Reference:
https://superuser.com/questions/131519/what-is-this-dm-0-device

Comment 5 Frederic Branczyk 2019-03-01 09:18:30 UTC
Yes let's add them. Given that these are disk io stats, I think we can safely assume that these are only storage devices (my understanding is devicemapper devices can otherwise be pretty much anything). We'll make sure to adapt.

Comment 6 Junqi Zhao 2019-03-01 10:56:21 UTC
(In reply to Frederic Branczyk from comment #5)
> Yes let's add them. Given that these are disk io stats, I think we can
> safely assume that these are only storage devices (my understanding is
> devicemapper devices can otherwise be pretty much anything). We'll make sure
> to adapt.

Thanks. We also need to backport this to 3.11, since 3.11 has the same issue, as already mentioned in Bug 1680517.

Comment 9 Junqi Zhao 2019-03-08 07:39:49 UTC
The device names are correct now and also include devicemapper devices:

device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+"

payload: 4.0.0-0.nightly-2019-03-06-074438
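A quick sketch comparing the earlier pattern with the updated one (Python's `re.fullmatch` standing in for Prometheus's anchored matching) shows the `dm-.+` alternative closes the gap reported in comment 4 without losing the vd* coverage:

```python
import re

OLD = re.compile(r"nvme.+|rbd.+|sd.+|vd.+|xvd.+")
NEW = re.compile(r"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+")

assert OLD.fullmatch("dm-0") is None      # missed before the fix
assert NEW.fullmatch("dm-0") is not None  # now covered by dm-.+
assert NEW.fullmatch("vda") is not None   # vd.+ coverage unchanged
print("updated pattern covers both vd* and dm-* devices")
```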

Comment 13 errata-xmlrpc 2019-06-04 10:42:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

