Bug 1915859 - vsphere-problem-detector: does not report ESXi host version nor VM HW version
Summary: vsphere-problem-detector: does not report ESXi host version nor VM HW version
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Jan Safranek
QA Contact: Qin Ping
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-01-13 15:05 UTC by Jan Safranek
Modified: 2021-02-24 15:53 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:52:44 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Github openshift vsphere-problem-detector pull 21 0 None closed Bug 1915859: Fix reporting of node metrics 2021-02-18 16:05:21 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:53:00 UTC

Description Jan Safranek 2021-01-13 15:05:12 UTC
vsphere-problem-detector does not report these metrics:
* vsphere_esxi_version_total
* vsphere_node_hw_version_total

They were lost as a result of the refactoring in https://github.com/openshift/vsphere-problem-detector/pull/14

Comment 2 Qin Ping 2021-01-18 10:33:21 UTC
Hi Jan,

Tried to verify this bug with 4.7.0-0.nightly-2021-01-18-000316 and have 2 questions about the fix:

1. vsphere_node_hw_version_total and vsphere_esxi_version_total are Gauge metrics. According to the Prometheus naming rules, the `_total` suffix is not a good fit for this kind of metric.
2. vsphere_esxi_version_total should show the number of ESXi hosts with a given version. We have 5 nodes in the cluster, but the reported value is "4".
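
The naming concern in point 1 can be sketched as a small check. This is only an illustration: `suffix_warnings` is a hypothetical helper, not part of vsphere-problem-detector or any Prometheus tooling, and the metric types are taken from the statement above that both metrics are gauges. It relies on the Prometheus convention that `_total` marks an accumulating counter.

```python
# Hypothetical lint helper for illustration only; it is not part of
# vsphere-problem-detector or of any Prometheus library.
def suffix_warnings(metrics):
    """Return names of non-counter metrics that end in `_total`."""
    return [
        name
        for name, kind in metrics.items()
        if kind != "counter" and name.endswith("_total")
    ]


# Metric types as described in this comment (both are gauges).
reported = {
    "vsphere_esxi_version_total": "gauge",
    "vsphere_node_hw_version_total": "gauge",
}

# Both names are flagged because a gauge carries the counter suffix.
print(suffix_warnings(reported))
```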

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=vsphere_esxi_version_total'|jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   463  100   463    0     0  12513      0 --:--:-- --:--:-- --:--:-- 12513
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "vsphere_esxi_version_total",
          "container": "vsphere-problem-detector-operator",
          "endpoint": "vsphere-metrics",
          "instance": "10.130.0.33:8444",
          "job": "vsphere-problem-detector-metrics",
          "namespace": "openshift-cluster-storage-operator",
          "pod": "vsphere-problem-detector-operator-66c49b9f99-l55dh",
          "service": "vsphere-problem-detector-metrics",
          "version": "7.0.0"
        },
        "value": [
          1610965275.889,
          "4"
        ]
      }
    ]
  }
}
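
For reference, the value can be read out of this kind of response programmatically. The following is a minimal sketch, assuming the instant-query JSON layout shown above; `hosts_per_version` is a hypothetical helper name and the embedded response is a trimmed copy of the output that keeps only the fields the sketch reads.

```python
import json

# Trimmed copy of the response above; only the fields this sketch reads are kept.
response = json.loads("""
{"status": "success",
 "data": {"resultType": "vector",
          "result": [{"metric": {"__name__": "vsphere_esxi_version_total",
                                 "version": "7.0.0"},
                      "value": [1610965275.889, "4"]}]}}
""")


def hosts_per_version(resp):
    """Map each `version` label to its sample value in an instant-vector result."""
    if resp["status"] != "success":
        raise ValueError("query failed")
    # Sample values arrive as strings, so convert them explicitly.
    return {
        sample["metric"]["version"]: int(float(sample["value"][1]))
        for sample in resp["data"]["result"]
    }


print(hosts_per_version(response))  # {'7.0.0': 4}
```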

From the log, it checked 5 nodes:
$ oc -n openshift-cluster-storage-operator logs vsphere-problem-detector-operator-66c49b9f99-l55dh|grep "ESXi version"
I0118 08:29:14.712016       1 node_esxi_version.go:77] Node control-plane-0 runs on host host-129890 (10.3.32.7) with ESXi version: 7.0.0
I0118 08:29:14.712655       1 node_esxi_version.go:77] Node compute-1 runs on host host-137116 (10.3.32.6) with ESXi version: 7.0.0
I0118 08:29:14.715153       1 node_esxi_version.go:77] Node control-plane-1 runs on host host-135481 (10.3.32.4) with ESXi version: 7.0.0
I0118 08:29:14.726766       1 node_esxi_version.go:77] Node control-plane-2 runs on host host-126079 (10.3.32.8) with ESXi version: 7.0.0
I0118 08:30:15.289148       1 node_esxi_version.go:77] Node compute-1 runs on host host-137116 (10.3.32.6) with ESXi version: 7.0.0
I0118 08:30:15.295039       1 node_esxi_version.go:77] Node control-plane-0 runs on host host-129890 (10.3.32.7) with ESXi version: 7.0.0
I0118 08:30:15.295985       1 node_esxi_version.go:77] Node control-plane-2 runs on host host-126079 (10.3.32.8) with ESXi version: 7.0.0
I0118 08:30:15.308656       1 node_esxi_version.go:77] Node control-plane-1 runs on host host-135481 (10.3.32.4) with ESXi version: 7.0.0
I0118 08:32:15.911649       1 node_esxi_version.go:77] Node control-plane-0 runs on host host-129890 (10.3.32.7) with ESXi version: 7.0.0
I0118 08:32:15.915054       1 node_esxi_version.go:77] Node control-plane-2 runs on host host-126079 (10.3.32.8) with ESXi version: 7.0.0
I0118 08:32:15.918128       1 node_esxi_version.go:77] Node control-plane-1 runs on host host-135481 (10.3.32.4) with ESXi version: 7.0.0
I0118 08:32:15.919247       1 node_esxi_version.go:77] Node compute-0 runs on host host-137116 (10.3.32.6) with ESXi version: 7.0.0
I0118 08:36:16.893037       1 node_esxi_version.go:77] Node control-plane-2 runs on host host-126079 (10.3.32.8) with ESXi version: 7.0.0
I0118 08:36:16.893411       1 node_esxi_version.go:77] Node control-plane-0 runs on host host-129890 (10.3.32.7) with ESXi version: 7.0.0
I0118 08:36:16.894299       1 node_esxi_version.go:77] Node control-plane-1 runs on host host-135481 (10.3.32.4) with ESXi version: 7.0.0
I0118 08:36:16.894612       1 node_esxi_version.go:77] Node compute-0 runs on host host-137116 (10.3.32.6) with ESXi version: 7.0.0

For the metric vsphere_node_hw_version_total, it's correct:
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=vsphere_node_hw_version_total'|jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   470  100   470    0     0  15161      0 --:--:-- --:--:-- --:--:-- 15161
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "vsphere_node_hw_version_total",
          "container": "vsphere-problem-detector-operator",
          "endpoint": "vsphere-metrics",
          "hw_version": "vmx-13",
          "instance": "10.130.0.33:8444",
          "job": "vsphere-problem-detector-metrics",
          "namespace": "openshift-cluster-storage-operator",
          "pod": "vsphere-problem-detector-operator-66c49b9f99-l55dh",
          "service": "vsphere-problem-detector-metrics"
        },
        "value": [
          1610965305.145,
          "5"
        ]
      }
    ]
  }
}

$ oc -n openshift-cluster-storage-operator logs vsphere-problem-detector-operator-66c49b9f99-l55dh|grep vmx-13
I0118 08:29:14.688266       1 node_hw_version.go:54] Node compute-1 has HW version vmx-13
I0118 08:29:14.690931       1 node_hw_version.go:54] Node control-plane-1 has HW version vmx-13
I0118 08:29:14.699177       1 node_hw_version.go:54] Node control-plane-0 has HW version vmx-13
I0118 08:29:14.716041       1 node_hw_version.go:54] Node control-plane-2 has HW version vmx-13
I0118 08:29:14.720668       1 node_hw_version.go:54] Node compute-0 has HW version vmx-13
I0118 08:30:15.278742       1 node_hw_version.go:54] Node compute-1 has HW version vmx-13
I0118 08:30:15.280171       1 node_hw_version.go:54] Node control-plane-0 has HW version vmx-13
I0118 08:30:15.280651       1 node_hw_version.go:54] Node compute-0 has HW version vmx-13
I0118 08:30:15.281217       1 node_hw_version.go:54] Node control-plane-2 has HW version vmx-13
I0118 08:30:15.298352       1 node_hw_version.go:54] Node control-plane-1 has HW version vmx-13
I0118 08:32:15.897825       1 node_hw_version.go:54] Node control-plane-0 has HW version vmx-13
I0118 08:32:15.899533       1 node_hw_version.go:54] Node control-plane-2 has HW version vmx-13
I0118 08:32:15.901314       1 node_hw_version.go:54] Node compute-0 has HW version vmx-13
I0118 08:32:15.904082       1 node_hw_version.go:54] Node compute-1 has HW version vmx-13
I0118 08:32:15.906359       1 node_hw_version.go:54] Node control-plane-1 has HW version vmx-13
I0118 08:36:16.877253       1 node_hw_version.go:54] Node compute-0 has HW version vmx-13
I0118 08:36:16.877747       1 node_hw_version.go:54] Node control-plane-2 has HW version vmx-13
I0118 08:36:16.879018       1 node_hw_version.go:54] Node control-plane-0 has HW version vmx-13
I0118 08:36:16.882203       1 node_hw_version.go:54] Node control-plane-1 has HW version vmx-13
I0118 08:36:16.890006       1 node_hw_version.go:54] Node compute-1 has HW version vmx-13

Comment 3 Jan Safranek 2021-01-19 12:39:09 UTC
vsphere_esxi_version_total counts the hosts that run at least one node. From the logs, compute-0 and compute-1 run on the same host, therefore the host is reported only once.

Node compute-0 runs on host host-137116 (10.3.32.6) with ESXi version: 7.0.0
Node compute-1 runs on host host-137116 (10.3.32.6) with ESXi version: 7.0.0
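
The difference between counting nodes and counting hosts can be reproduced from the log messages quoted in comment 2. This is a rough sketch of the counting logic only, not the detector's actual Go implementation; `PATTERN` and `count_nodes_and_hosts` are names made up for the illustration, and the log prefix (timestamp, file name) has been stripped.

```python
import re
from collections import defaultdict

# One message per node, taken from the operator log in comment 2.
LOG_LINES = [
    "Node control-plane-0 runs on host host-129890 (10.3.32.7) with ESXi version: 7.0.0",
    "Node control-plane-1 runs on host host-135481 (10.3.32.4) with ESXi version: 7.0.0",
    "Node control-plane-2 runs on host host-126079 (10.3.32.8) with ESXi version: 7.0.0",
    "Node compute-0 runs on host host-137116 (10.3.32.6) with ESXi version: 7.0.0",
    "Node compute-1 runs on host host-137116 (10.3.32.6) with ESXi version: 7.0.0",
]

PATTERN = re.compile(r"Node (\S+) runs on host (\S+) \(\S+\) with ESXi version: (\S+)")


def count_nodes_and_hosts(lines):
    """Return (number of distinct nodes, {ESXi version: number of distinct hosts})."""
    nodes = set()
    hosts_by_version = defaultdict(set)
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # skip unrelated log lines
        node, host, version = match.groups()
        nodes.add(node)
        # A set keeps each host once, no matter how many nodes run on it.
        hosts_by_version[version].add(host)
    return len(nodes), {v: len(h) for v, h in hosts_by_version.items()}


print(count_nodes_and_hosts(LOG_LINES))  # (5, {'7.0.0': 4})
```

Five nodes map to four distinct hosts because compute-0 and compute-1 share host-137116, which matches the reported value of "4".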

Regarding _total: with https://github.com/openshift/vsphere-problem-detector/pull/24, it's basically a flag, where the value "1" means there is an error and "0" means no error. It's not a number of errors, therefore the _total suffix does not apply there.

Comment 4 Qin Ping 2021-01-20 05:26:28 UTC
@Jan thanks for the explanation. I'll mark the bug as verified.

Comment 7 errata-xmlrpc 2021-02-24 15:52:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

