Bug 1673787 - Grafana DISK IO metrics are empty due to not matching disk name patterns
Summary: Grafana DISK IO metrics are empty due to not matching disk name patterns
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.1.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 1678645
Blocks:
 
Reported: 2019-02-08 05:26 UTC by Daein Park
Modified: 2019-06-04 10:43 UTC (History)
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:42:43 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 None None None 2019-06-04 10:43:51 UTC

Description Daein Park 2019-02-08 05:26:18 UTC
Description of problem:

The Disk IO metrics are empty on the Grafana dashboard of Prometheus Cluster Monitoring [0] when the disk device name has the "vd" prefix.

[0] Prometheus Cluster Monitoring 
    [https://docs.openshift.com/container-platform/3.11/install_config/prometheus_cluster_monitoring.html]

Version-Release number of selected component (if applicable):

# oc version
oc v3.11.69
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

openshift v3.11.69
kubernetes v1.11.0+d4cacc0

# images
ose-cluster-monitoring-operator:v3.11
ose-prometheus-operator:v3.11
...

How reproducible:

When the virtual disk device name's prefix is "vd", this can always be reproduced, e.g. on RHV or some guest OSes on OpenStack.

e.g.>

# ls -1 /dev/vd*
/dev/vda
/dev/vda1
/dev/vda2
/dev/vdb
/dev/vdb1
/dev/vdc
/dev/vdd

Steps to Reproduce:
1.
2.
3.

Actual results:

The following metrics are empty:

* Disk IO Utilisation
* Disk IO Saturation

Expected results:

The metrics should be displayed normally.

Additional info:

I've found that the related recording rules are already fixed in the master branch, as follows.
But I don't know when this fix will be backported to v3.11.

[https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/prometheus-k8s/rules.yaml#L214-L233]
~~~
      record: node:node_memory_swap_io_bytes:sum_rate
    - expr: |
        avg(irate(node_disk_io_time_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+"}[1m]))
      record: :node_disk_utilisation:avg_irate
    - expr: |
        avg by (node) (
          irate(node_disk_io_time_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+"}[1m])
        * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        )
      record: node:node_disk_utilisation:avg_irate
    - expr: |
        avg(irate(node_disk_io_time_weighted_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+"}[1m]) / 1e3)
      record: :node_disk_saturation:avg_irate
    - expr: |
        avg by (node) (
          irate(node_disk_io_time_weighted_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+"}[1m]) / 1e3
        * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        )
~~~

Comment 1 Frederic Branczyk 2019-02-08 13:24:20 UTC
Unfortunately due to how the dependencies work and evolved, it's not trivial to backport this. We're likely to only ship this fix in 4.0, not 3.11.

Comment 4 Junqi Zhao 2019-03-01 09:04:56 UTC
@Frederic
It seems we missed one device pattern.
I checked in a 3.11 env and found it has device="dm-0"; there may also be "dm-1", "dm-2" devices, e.g.:
$ ls -l /dev/dm*
brw-rw----. 1 root disk 253, 0 Mar  1 08:11 /dev/dm-0
brw-rw----. 1 root disk 253, 1 Mar  1 08:11 /dev/dm-1
brw-rw----. 1 root disk 253, 2 Mar  1 08:11 /dev/dm-2

node_disk_io_time_ms in 3.11 also reports these devices, e.g.:
node_disk_io_time_ms{device="dm-0",endpoint="https",instance="10.0.77.93:9100",job="node-exporter",namespace="openshift-monitoring",pod="node-exporter-9znkd",service="node-exporter"}	933001
node_disk_io_time_ms{device="vda",endpoint="https",instance="10.0.76.252:9100",job="node-exporter",namespace="openshift-monitoring",pod="node-exporter-k5vxn",service="node-exporter"}	59668

But the Prometheus rules do not cover this kind of device, e.g.:
record: node:node_disk_saturation:avg_irate
expr: avg
  by(node) (irate(node_disk_io_time_weighted_seconds_total{device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+",job="node-exporter"}[1m])
  / 1000 * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:)
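The observation above can be checked directly; a minimal sketch (again using Python's `re.fullmatch` to approximate Prometheus's anchored matching) confirms that device-mapper names fall through the pattern used in this rule:

```python
import re

# Device pattern used by the recording rule above.
RULE_PATTERN = re.compile(r"nvme.+|rbd.+|sd.+|vd.+|xvd.+")

# No alternative starts with "dm-", so device-mapper devices are excluded
# from the recorded series even though node-exporter reports them.
for dev in ("dm-0", "dm-1", "dm-2"):
    assert RULE_PATTERN.fullmatch(dev) is None, dev
print("dm-* devices are excluded by the rule pattern")
```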


Shall we add this device to the Prometheus rules? Same question for https://bugzilla.redhat.com/show_bug.cgi?id=1680517#c3

Reference:
https://superuser.com/questions/131519/what-is-this-dm-0-device

Comment 5 Frederic Branczyk 2019-03-01 09:18:30 UTC
Yes let's add them. Given that these are disk io stats, I think we can safely assume that these are only storage devices (my understanding is devicemapper devices can otherwise be pretty much anything). We'll make sure to adapt.

Comment 6 Junqi Zhao 2019-03-01 10:56:21 UTC
(In reply to Frederic Branczyk from comment #5)
> Yes let's add them. Given that these are disk io stats, I think we can
> safely assume that these are only storage devices (my understanding is
> devicemapper devices can otherwise be pretty much anything). We'll make sure
> to adapt.

Thanks. We also need to backport this to 3.11, since 3.11 has the same issue, as already mentioned in Bug 1680517.

Comment 9 Junqi Zhao 2019-03-08 07:39:49 UTC
The device names are correct now and also include devicemapper devices:

device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+"

payload: 4.0.0-0.nightly-2019-03-06-074438
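A quick sketch comparing the earlier pattern with the updated one (Python's `re.fullmatch` standing in for Prometheus's anchored matching) shows the `dm-.+` alternative closes the gap reported in comment 4 without losing the vd* coverage:

```python
import re

OLD = re.compile(r"nvme.+|rbd.+|sd.+|vd.+|xvd.+")
NEW = re.compile(r"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+")

assert OLD.fullmatch("dm-0") is None      # missed before the fix
assert NEW.fullmatch("dm-0") is not None  # now covered by dm-.+
assert NEW.fullmatch("vda") is not None   # vd.+ coverage unchanged
print("updated pattern covers both vd* and dm-* devices")
```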

Comment 13 errata-xmlrpc 2019-06-04 10:42:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

