+++ This bug was initially created as a clone of Bug #1948037 +++

Windows nodes are not reporting some key node-level metrics that summarize node info via telemetry. Specifically, windows_exporter reports a number of metrics as windows_* instead of under the name of their direct node_exporter equivalent, node_*.

The node_cpu_info metric is required to calculate the node_role_os_version_machine:cpu_capacity_cores:sum recording rule, which we use to report cores per OS. In OpenShift, windows_exporter needs to report windows_cpu_info (consistent with node_exporter, if it isn't already), and windows_cpu_info must be renamed to node_cpu_info, which should fix the recording rule.

To verify, check via prometheus that a cluster reports node_role_os_version_machine:cpu_capacity_cores:sum with label_node_openshift_io_os_id="Windows", which will be sent to telemetry.

Also, I recommend reviewing all rules in cluster-monitoring-operator to ensure no additional node_* metrics are missing and must be restored / renamed. Our goal should be that metrics that can be identical between Windows and Linux nodes are consistent with Linux.
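The verification described above could be scripted roughly as follows. This is only a sketch: it assumes the JSON body of a Prometheus instant query (`/api/v1/query`) for the recording rule has already been fetched and decoded, and the helper names `cores_by_os` and `windows_cores_reported` are made up for illustration, not part of any OpenShift tooling.

```python
from collections import defaultdict

RULE = "node_role_os_version_machine:cpu_capacity_cores:sum"
OS_LABEL = "label_node_openshift_io_os_id"


def cores_by_os(response):
    """Sum the recording rule's samples per OS id.

    `response` is the decoded JSON of an instant query for RULE.
    """
    if response.get("status") != "success":
        raise ValueError("query failed: %r" % response.get("error"))
    totals = defaultdict(float)
    for series in response["data"]["result"]:
        os_id = series["metric"].get(OS_LABEL, "<missing>")
        # An instant-vector sample is [unix_timestamp, "value-as-string"].
        totals[os_id] += float(series["value"][1])
    return dict(totals)


def windows_cores_reported(response):
    """True if at least one series carries the Windows OS id with cores > 0."""
    return cores_by_os(response).get("Windows", 0) > 0
```

If `windows_cores_reported` returns False on a cluster that has Windows nodes, the node_cpu_info rename is probably still missing.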
This bug has been verified on 4.7.0-0.nightly-2021-05-05-092347 and passed, thanks.

Version-Release number of selected component (if applicable):
WMCO built from https://github.com/openshift/windows-machine-config-operator/commit/aa9379f01d65bfa12f38edf005e885ba61b7c7fe
OCP version 4.7.0-0.nightly-2021-05-05-092347

Steps:
1. Install the WMCO operator on OCP 4.7 and make sure the WMCO namespace is monitored by selecting the checkbox "Enable Operator recommended cluster monitoring on this Namespace".
2. Create a Windows machineset and scale up Windows nodes.
3. Check that the cluster reports node_role_os_version_machine:cpu_capacity_cores:sum with label_node_openshift_io_os_id="Windows" via prometheus, e.g. search `node_role_os_version_machine:cpu_capacity_cores:sum` in https://prometheus-k8s-openshift-monitoring.apps.sgao-7.qe.devcluster.openshift.com/graph, got:
   node_role_os_version_machine:cpu_capacity_cores:sum{label_kubernetes_io_arch="amd64", label_node_hyperthread_enabled="false", label_node_openshift_io_os_id="Windows"} 2
   node_role_os_version_machine:cpu_capacity_cores:sum{label_kubernetes_io_arch="amd64", label_node_hyperthread_enabled="true", label_node_openshift_io_os_id="rhcos"} 3
   node_role_os_version_machine:cpu_capacity_cores:sum{label_kubernetes_io_arch="amd64", label_node_hyperthread_enabled="true", label_node_openshift_io_os_id="rhcos", label_node_role_kubernetes_io_master="true"} 6
4. Check rules in https://prometheus-k8s-openshift-monitoring.apps.sgao-a1.qe.devcluster.openshift.com/rules, did not find anything wrong.
node.rules
(columns: Rule, State, Error, Last Evaluation, Evaluation Time; the Error column is empty for every rule)

record: node_namespace_pod:kube_pod_info:
expr: topk by(namespace, pod) (1, max by(node, namespace, pod) (label_replace(kube_pod_info{job="kube-state-metrics",node!=""}, "pod", "$1", "pod", "(.*)")))
OK    22.647s ago    3.702ms

record: node:node_num_cpu:sum
expr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job="node-exporter"} * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:))
OK    22.644s ago    2.859ms

record: :node_memory_MemAvailable_bytes:sum
expr: sum by(cluster) (node_memory_MemAvailable_bytes{job="node-exporter"} or (node_memory_Buffers_bytes{job="node-exporter"} + node_memory_Cached_bytes{job="node-exporter"} + node_memory_MemFree_bytes{job="node-exporter"} + node_memory_Slab_bytes{job="node-exporter"}))
OK    22.642s ago    0.549ms

windows.rules
(columns: Rule, State, Error, Last Evaluation, Evaluation Time; the Error column is empty for every rule)

record: instance:node_cpu_utilisation:rate1m
expr: avg without(core, mode) (rate(windows_cpu_time_total{mode="idle"}[1m]))
OK    30.5s ago    0.335ms

record: instance:node_cpu:rate:sum
expr: sum by(instance) (rate(windows_cpu_time_total{mode!="iowait",mode="idle"}[3m]))
OK    30.5s ago    0.174ms

record: node_filesystem_size_bytes
expr: windows_logical_disk_size_bytes
OK    30.5s ago    0.086ms

record: node_filesystem_avail_bytes
expr: windows_logical_disk_free_bytes
OK    30.5s ago    0.070ms

record: node_network_receive_bytes_total
expr: rate(windows_net_bytes_received_total[1m])
OK    30.5s ago    0.116ms

record: node_network_transmit_bytes_total
expr: rate(windows_net_bytes_sent_total[1m])
OK    30.4s ago    0.089ms

record: node_filesystem_free_bytes
expr: windows_logical_disk_free_bytes
OK    30.4s ago    0.080ms

record: node_memory_MemAvailable_bytes
expr: windows_memory_available_bytes
OK    30.4s ago    0.095ms

record: node_memory_MemTotal_bytes
expr: windows_cs_physical_memory_bytes
OK    30.4s ago    0.079ms

record: node_cpu_info
expr: windows_cpu_info
OK    30.4s ago    0.084ms
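For the broader review of node_* coverage, a small check over the windows.rules listing above might look like the sketch below. The record/expr pairs are copied from that listing (the two instance:* CPU rules are left out because they do not record node_* names); the helper names and the `required` list passed to `missing_records` are hypothetical and would need to come from whatever the cluster-monitoring-operator rules actually consume.

```python
import re

# Record name -> expression, copied from the windows.rules listing above.
WINDOWS_RULES = {
    "node_filesystem_size_bytes": "windows_logical_disk_size_bytes",
    "node_filesystem_avail_bytes": "windows_logical_disk_free_bytes",
    "node_network_receive_bytes_total": "rate(windows_net_bytes_received_total[1m])",
    "node_network_transmit_bytes_total": "rate(windows_net_bytes_sent_total[1m])",
    "node_filesystem_free_bytes": "windows_logical_disk_free_bytes",
    "node_memory_MemAvailable_bytes": "windows_memory_available_bytes",
    "node_memory_MemTotal_bytes": "windows_cs_physical_memory_bytes",
    "node_cpu_info": "windows_cpu_info",
}

# A bare Prometheus metric name, with no selector, function, or operator.
BARE_METRIC = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def plain_renames(rules):
    """Rules whose expression is just another metric name (a pure rename)."""
    return {rec: expr for rec, expr in rules.items()
            if BARE_METRIC.fullmatch(expr.strip())}


def missing_records(rules, required):
    """Required node_* names that no rule in `rules` records."""
    return sorted(set(required) - set(rules))
```

Here `plain_renames(WINDOWS_RULES)` includes node_cpu_info but not the two network rules, which wrap their source metric in `rate(...)` and so change its meaning rather than just its name.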
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Windows Container Support for Red Hat OpenShift 2.0.1 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2130