vsphere-problem-detector does not report these metrics:

* vsphere_esxi_version_total
* vsphere_node_hw_version_total

They were lost as a result of the refactoring in https://github.com/openshift/vsphere-problem-detector/pull/14
Hi Jan,

I tried to verify this bug with 4.7.0-0.nightly-2021-01-18-000316 and have two questions about the fix:

1. vsphere_node_hw_version_total and vsphere_esxi_version_total are Gauge metrics. According to the Prometheus naming rules, the `_total` suffix is not a good fit for this kind of metric.
2. vsphere_esxi_version_total should show the number of ESXi hosts with a given version. We have 5 nodes in the cluster, but the reported value is "4".

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=vsphere_esxi_version_total'|jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   463  100   463    0     0  12513      0 --:--:-- --:--:-- --:--:-- 12513
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "vsphere_esxi_version_total",
          "container": "vsphere-problem-detector-operator",
          "endpoint": "vsphere-metrics",
          "instance": "10.130.0.33:8444",
          "job": "vsphere-problem-detector-metrics",
          "namespace": "openshift-cluster-storage-operator",
          "pod": "vsphere-problem-detector-operator-66c49b9f99-l55dh",
          "service": "vsphere-problem-detector-metrics",
          "version": "7.0.0"
        },
        "value": [
          1610965275.889,
          "4"
        ]
      }
    ]
  }
}

From the log, it checked 5 nodes:

$ oc -n openshift-cluster-storage-operator logs vsphere-problem-detector-operator-66c49b9f99-l55dh|grep "ESXi version"
I0118 08:29:14.712016 1 node_esxi_version.go:77] Node control-plane-0 runs on host host-129890 (10.3.32.7) with ESXi version: 7.0.0
I0118 08:29:14.712655 1 node_esxi_version.go:77] Node compute-1 runs on host host-137116 (10.3.32.6) with ESXi version: 7.0.0
I0118 08:29:14.715153 1 node_esxi_version.go:77] Node control-plane-1 runs on host host-135481 (10.3.32.4) with ESXi version: 7.0.0
I0118 08:29:14.726766 1 node_esxi_version.go:77] Node control-plane-2 runs on host host-126079 (10.3.32.8) with ESXi version: 7.0.0
I0118 08:30:15.289148 1 node_esxi_version.go:77] Node compute-1 runs on host host-137116 (10.3.32.6) with ESXi version: 7.0.0
I0118 08:30:15.295039 1 node_esxi_version.go:77] Node control-plane-0 runs on host host-129890 (10.3.32.7) with ESXi version: 7.0.0
I0118 08:30:15.295985 1 node_esxi_version.go:77] Node control-plane-2 runs on host host-126079 (10.3.32.8) with ESXi version: 7.0.0
I0118 08:30:15.308656 1 node_esxi_version.go:77] Node control-plane-1 runs on host host-135481 (10.3.32.4) with ESXi version: 7.0.0
I0118 08:32:15.911649 1 node_esxi_version.go:77] Node control-plane-0 runs on host host-129890 (10.3.32.7) with ESXi version: 7.0.0
I0118 08:32:15.915054 1 node_esxi_version.go:77] Node control-plane-2 runs on host host-126079 (10.3.32.8) with ESXi version: 7.0.0
I0118 08:32:15.918128 1 node_esxi_version.go:77] Node control-plane-1 runs on host host-135481 (10.3.32.4) with ESXi version: 7.0.0
I0118 08:32:15.919247 1 node_esxi_version.go:77] Node compute-0 runs on host host-137116 (10.3.32.6) with ESXi version: 7.0.0
I0118 08:36:16.893037 1 node_esxi_version.go:77] Node control-plane-2 runs on host host-126079 (10.3.32.8) with ESXi version: 7.0.0
I0118 08:36:16.893411 1 node_esxi_version.go:77] Node control-plane-0 runs on host host-129890 (10.3.32.7) with ESXi version: 7.0.0
I0118 08:36:16.894299 1 node_esxi_version.go:77] Node control-plane-1 runs on host host-135481 (10.3.32.4) with ESXi version: 7.0.0
I0118 08:36:16.894612 1 node_esxi_version.go:77] Node compute-0 runs on host host-137116 (10.3.32.6) with ESXi version: 7.0.0

The metric vsphere_node_hw_version_total is correct:

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=vsphere_node_hw_version_total'|jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   470  100   470    0     0  15161      0 --:--:-- --:--:-- --:--:-- 15161
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "vsphere_node_hw_version_total",
          "container": "vsphere-problem-detector-operator",
          "endpoint": "vsphere-metrics",
          "hw_version": "vmx-13",
          "instance": "10.130.0.33:8444",
          "job": "vsphere-problem-detector-metrics",
          "namespace": "openshift-cluster-storage-operator",
          "pod": "vsphere-problem-detector-operator-66c49b9f99-l55dh",
          "service": "vsphere-problem-detector-metrics"
        },
        "value": [
          1610965305.145,
          "5"
        ]
      }
    ]
  }
}

$ oc -n openshift-cluster-storage-operator logs vsphere-problem-detector-operator-66c49b9f99-l55dh|grep vmx-13
I0118 08:29:14.688266 1 node_hw_version.go:54] Node compute-1 has HW version vmx-13
I0118 08:29:14.690931 1 node_hw_version.go:54] Node control-plane-1 has HW version vmx-13
I0118 08:29:14.699177 1 node_hw_version.go:54] Node control-plane-0 has HW version vmx-13
I0118 08:29:14.716041 1 node_hw_version.go:54] Node control-plane-2 has HW version vmx-13
I0118 08:29:14.720668 1 node_hw_version.go:54] Node compute-0 has HW version vmx-13
I0118 08:30:15.278742 1 node_hw_version.go:54] Node compute-1 has HW version vmx-13
I0118 08:30:15.280171 1 node_hw_version.go:54] Node control-plane-0 has HW version vmx-13
I0118 08:30:15.280651 1 node_hw_version.go:54] Node compute-0 has HW version vmx-13
I0118 08:30:15.281217 1 node_hw_version.go:54] Node control-plane-2 has HW version vmx-13
I0118 08:30:15.298352 1 node_hw_version.go:54] Node control-plane-1 has HW version vmx-13
I0118 08:32:15.897825 1 node_hw_version.go:54] Node control-plane-0 has HW version vmx-13
I0118 08:32:15.899533 1 node_hw_version.go:54] Node control-plane-2 has HW version vmx-13
I0118 08:32:15.901314 1 node_hw_version.go:54] Node compute-0 has HW version vmx-13
I0118 08:32:15.904082 1 node_hw_version.go:54] Node compute-1 has HW version vmx-13
I0118 08:32:15.906359 1 node_hw_version.go:54] Node control-plane-1 has HW version vmx-13
I0118 08:36:16.877253 1 node_hw_version.go:54] Node compute-0 has HW version vmx-13
I0118 08:36:16.877747 1 node_hw_version.go:54] Node control-plane-2 has HW version vmx-13
I0118 08:36:16.879018 1 node_hw_version.go:54] Node control-plane-0 has HW version vmx-13
I0118 08:36:16.882203 1 node_hw_version.go:54] Node control-plane-1 has HW version vmx-13
I0118 08:36:16.890006 1 node_hw_version.go:54] Node compute-1 has HW version vmx-13
vsphere_esxi_version_total counts the ESXi hosts that run at least one node, not the nodes themselves. From the logs, compute-0 and compute-1 run on the same host, therefore that host is reported only once:

Node compute-0 runs on host host-137116 (10.3.32.6) with ESXi version: 7.0.0
Node compute-1 runs on host host-137116 (10.3.32.6) with ESXi version: 7.0.0

Regarding _total: with https://github.com/openshift/vsphere-problem-detector/pull/24, it's basically a flag, where the value "1" means there is an error and "0" means no error. It's not the number of errors, therefore the _total suffix does not apply there.
@Jan thanks for the explanation. I'll mark the bug as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633