Bug 1673787
| Summary: | Grafana DISK IO metrics are empty due to not matching disk name patterns | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Daein Park <dapark> |
| Component: | Monitoring | Assignee: | Sergiusz Urbaniak <surbania> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.11.0 | CC: | fbranczy, mloibl, surbania |
| Target Milestone: | --- | | |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-04 10:42:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1678645 | | |
| Bug Blocks: | | | |
Unfortunately, due to how the dependencies work and have evolved, it's not trivial to backport this. We're likely to ship this fix only in 4.0, not 3.11. @Frederic
It seems we missed one device.
I checked in a 3.11 env and found it has device="dm-0"; there may also be "dm-1", "dm-2", etc. devices, e.g.:
~~~
$ ls -l /dev/dm*
brw-rw----. 1 root disk 253, 0 Mar  1 08:11 /dev/dm-0
brw-rw----. 1 root disk 253, 1 Mar  1 08:11 /dev/dm-1
brw-rw----. 1 root disk 253, 2 Mar  1 08:11 /dev/dm-2
~~~
node_disk_io_time_ms in 3.11 also detects this device, e.g.:
~~~
node_disk_io_time_ms{device="dm-0",endpoint="https",instance="10.0.77.93:9100",job="node-exporter",namespace="openshift-monitoring",pod="node-exporter-9znkd",service="node-exporter"} 933001
node_disk_io_time_ms{device="vda",endpoint="https",instance="10.0.76.252:9100",job="node-exporter",namespace="openshift-monitoring",pod="node-exporter-k5vxn",service="node-exporter"} 59668
~~~
But the Prometheus recording rules do not match this kind of device, e.g.:
~~~
record: node:node_disk_saturation:avg_irate
expr: avg by(node) (
    irate(node_disk_io_time_weighted_seconds_total{device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+",job="node-exporter"}[1m]) / 1000
  * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:
)
~~~
Shall we add this device to the Prometheus rules? Same question for https://bugzilla.redhat.com/show_bug.cgi?id=1680517#c3
Reference:
https://superuser.com/questions/131519/what-is-this-dm-0-device
Yes, let's add them. Given that these are disk IO stats, I think we can safely assume that these are only storage devices (my understanding is devicemapper devices can otherwise be pretty much anything). We'll make sure to adapt.

(In reply to Frederic Branczyk from comment #5)
> Yes let's add them. Given that these are disk io stats, I think we can
> safely assume that these are only storage devices (my understanding is
> devicemapper devices can otherwise be pretty much anything). We'll make sure
> to adapt.

Thanks. We also need to backport this to 3.11, since 3.11 has the same issue, as already mentioned in Bug 1680517.

Device names are correct now and also include devicemapper devices:
~~~
device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+"
~~~

Payload: 4.0.0-0.nightly-2019-03-06-074438

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
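For reference, applying the corrected regex to the saturation rule quoted earlier would look roughly as follows (a sketch assembled from the regex confirmed above; the actual rules.yaml shipped in the payload may differ in formatting):
~~~
record: node:node_disk_saturation:avg_irate
expr: avg by(node) (
    irate(node_disk_io_time_weighted_seconds_total{device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+",job="node-exporter"}[1m]) / 1000
  * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:
)
~~~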
Description of problem:

The disk IO metrics are empty on the Grafana dashboard of Prometheus Cluster Monitoring [0] when the disk device is named with the "vd" prefix pattern.

[0] Prometheus Cluster Monitoring
[https://docs.openshift.com/container-platform/3.11/install_config/prometheus_cluster_monitoring.html]

Version-Release number of selected component (if applicable):

~~~
# oc version
oc v3.11.69
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

openshift v3.11.69
kubernetes v1.11.0+d4cacc0

# images
ose-cluster-monitoring-operator:v3.11
ose-prometheus-operator:v3.11
...
~~~

How reproducible:

Always, when the virtual disk device name has the "vd" prefix, e.g. on RHEV or some guest OSes on OpenStack:

~~~
# ls -1 /dev/vd*
/dev/vda
/dev/vda1
/dev/vda2
/dev/vdb
/dev/vdb1
/dev/vdc
/dev/vdd
~~~

Steps to Reproduce:
1.
2.
3.

Actual results:

The following metrics are empty:
* Disk IO Utilisation
* Disk IO Saturation

Expected results:

The metrics are displayed normally.

Additional info:

I've found that the related recording rules are already fixed in the master branch, as follows, but I don't know when this will be backported to v3.11.

[https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/prometheus-k8s/rules.yaml#L214-L233]

~~~
  record: node:node_memory_swap_io_bytes:sum_rate
- expr: |
    avg(irate(node_disk_io_time_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+"}[1m]))
  record: :node_disk_utilisation:avg_irate
- expr: |
    avg by (node) (
      irate(node_disk_io_time_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+"}[1m])
      * on (namespace, pod) group_left(node)
      node_namespace_pod:kube_pod_info:
    )
  record: node:node_disk_utilisation:avg_irate
- expr: |
    avg(irate(node_disk_io_time_weighted_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+"}[1m]) / 1e3)
  record: :node_disk_saturation:avg_irate
- expr: |
    avg by (node) (
      irate(node_disk_io_time_weighted_seconds_total{job="node-exporter",device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+"}[1m]) / 1e3
      * on (namespace, pod) group_left(node)
      node_namespace_pod:kube_pod_info:
    )
~~~
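To verify the symptom on an affected 3.11 cluster, one could evaluate the utilisation expression directly against virtio devices (a hedged sketch, not part of the original report; 3.11's node-exporter exposes node_disk_io_time_ms rather than node_disk_io_time_seconds_total, hence the division by 1e3):
~~~
# Sketch of a manual check: average disk IO utilisation across virtio
# (vd*) disks, computed from the raw 3.11 metric. If this returns data
# while the Disk IO Utilisation panel is empty, the recording rule's
# device regex is what is filtering the series out.
avg(irate(node_disk_io_time_ms{job="node-exporter", device=~"vd.+"}[1m]) / 1e3)
~~~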