Description of problem:
The disk I/O metrics are empty on the Grafana dashboard of Prometheus Cluster Monitoring [0] when a disk device is named with the "vd" prefix.

[0] Prometheus Cluster Monitoring
https://docs.openshift.com/container-platform/3.11/install_config/prometheus_cluster_monitoring.html

Version-Release number of selected component (if applicable):
# oc version
oc v3.11.69
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

openshift v3.11.69
kubernetes v1.11.0+d4cacc0

# images
ose-cluster-monitoring-operator:v3.11
ose-prometheus-operator:v3.11
...

How reproducible:
Always, when the virtual disk device names use the "vd" prefix, e.g. on RHV or some guest OSes on OpenStack.

e.g.>
# ls -1 /dev/vd*
/dev/vda
/dev/vda1
/dev/vda2
/dev/vdb
/dev/vdb1
/dev/vdc
/dev/vdd

Steps to Reproduce:
1.
2.
3.

Actual results:
The following metrics are empty:
* Disk IO Utilisation
* Disk IO Saturation

Expected results:
The metrics are displayed as usual.

Additional info:
The related recording rules have already been fixed on the master branch as follows, but I don't know when this fix will be backported to v3.11.
https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/prometheus-k8s/rules.yaml#L214-L233
~~~
  record: node:node_memory_swap_io_bytes:sum_rate
- expr: |
    avg(irate(node_disk_io_time_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+"}[1m]))
  record: :node_disk_utilisation:avg_irate
- expr: |
    avg by (node) (
      irate(node_disk_io_time_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+"}[1m])
    * on (namespace, pod) group_left(node)
      node_namespace_pod:kube_pod_info:
    )
  record: node:node_disk_utilisation:avg_irate
- expr: |
    avg(irate(node_disk_io_time_weighted_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+"}[1m]) / 1e3)
  record: :node_disk_saturation:avg_irate
- expr: |
    avg by (node) (
      irate(node_disk_io_time_weighted_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+"}[1m]) / 1e3
    * on (namespace, pod) group_left(node)
      node_namespace_pod:kube_pod_info:
    )
~~~
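As an aside, Prometheus label matchers are fully anchored RE2 expressions, so `device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+"` must match the whole device name. A quick sketch using Python's re module (close enough to RE2 for this pattern; the device list below is just illustrative) shows that the fixed pattern covers vd-prefixed devices:

```python
import re

# Device regex from the fixed recording rules. Prometheus anchors label
# matchers, so re.fullmatch is the closest Python equivalent.
PATTERN = re.compile(r"nvme.+|rbd.+|sd.+|vd.+|xvd.+")

devices = ["vda", "vdb1", "sda", "xvda", "nvme0n1", "dm-0"]
matched = [d for d in devices if PATTERN.fullmatch(d)]
print(matched)  # vd-prefixed devices now match; dm-0 still does not
```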
Unfortunately due to how the dependencies work and evolved, it's not trivial to backport this. We're likely to only ship this fix in 4.0, not 3.11.
@Frederic
It seems we missed one device. I checked in a 3.11 env and found it has device="dm-0"; there may also be "dm-1", "dm-2", etc. devices, e.g.:

$ ls -l /dev/dm*
brw-rw----. 1 root disk 253, 0 Mar  1 08:11 /dev/dm-0
brw-rw----. 1 root disk 253, 1 Mar  1 08:11 /dev/dm-1
brw-rw----. 1 root disk 253, 2 Mar  1 08:11 /dev/dm-2

node_disk_io_time_ms in 3.11 also reports these devices, e.g.:

node_disk_io_time_ms{device="dm-0",endpoint="https",instance="10.0.77.93:9100",job="node-exporter",namespace="openshift-monitoring",pod="node-exporter-9znkd",service="node-exporter"} 933001
node_disk_io_time_ms{device="vda",endpoint="https",instance="10.0.76.252:9100",job="node-exporter",namespace="openshift-monitoring",pod="node-exporter-k5vxn",service="node-exporter"} 59668

But the Prometheus rules do not cover this kind of device, e.g.:

record: node:node_disk_saturation:avg_irate
expr: avg by(node) (irate(node_disk_io_time_weighted_seconds_total{device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+",job="node-exporter"}[1m]) / 1000 * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:)

Shall we add these devices to the Prometheus rules? Same question for https://bugzilla.redhat.com/show_bug.cgi?id=1680517#c3

Reference: https://superuser.com/questions/131519/what-is-this-dm-0-device
Yes let's add them. Given that these are disk io stats, I think we can safely assume that these are only storage devices (my understanding is devicemapper devices can otherwise be pretty much anything). We'll make sure to adapt.
(In reply to Frederic Branczyk from comment #5)
> Yes let's add them. Given that these are disk io stats, I think we can
> safely assume that these are only storage devices (my understanding is
> devicemapper devices can otherwise be pretty much anything). We'll make sure
> to adapt.

Thanks. We also need to backport this to 3.11, since 3.11 has the same issue, as already mentioned in Bug 1680517.
The device names are correct now and also include devicemapper devices:

device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+"

payload: 4.0.0-0.nightly-2019-03-06-074438
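A minimal before/after check of the two patterns (Python's re module with fullmatch standing in for Prometheus's anchored matching; the device names are illustrative):

```python
import re

# Old pattern from the 3.11 rules vs. the updated one in the 4.0
# payload, which adds devicemapper devices (dm-N).
OLD = re.compile(r"nvme.+|rbd.+|sd.+|vd.+|xvd.+")
NEW = re.compile(r"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+")

for dev in ["vda", "dm-0", "dm-1", "dm-2"]:
    old_hit = OLD.fullmatch(dev) is not None
    new_hit = NEW.fullmatch(dev) is not None
    print(f"{dev}: old={old_hit} new={new_hit}")
```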
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758