Bug 1948037 - Telemetry info not completely available to identify windows nodes
Summary: Telemetry info not completely available to identify windows nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Windows Containers
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Mansi Kulkarni
QA Contact: gaoshang
URL:
Whiteboard:
Depends On:
Blocks: 1955319
Reported: 2021-04-09 19:51 UTC by Clayton Coleman
Modified: 2021-08-03 20:29 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1955319
Environment:
Last Closed: 2021-08-03 20:29:16 UTC
Target Upstream Version:
Embargoed:


Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift windows-machine-config-operator pull 425	0	None	closed	Bug 1948037: Enable and rename metric windows_cpu_info	2021-04-30 14:33:35 UTC
Red Hat Product Errata	RHSA-2021:3001	0	None	None	None	2021-08-03 20:29:49 UTC

Description Clayton Coleman 2021-04-09 19:51:52 UTC
Windows nodes are not reporting some key node-level metrics that summarize node info via telemetry.

Specifically, windows_exporter reports a number of metrics as windows_* instead of as their direct node_exporter equivalents, node_*. The node_cpu_info metric is required to calculate the node_role_os_version_machine:cpu_capacity_cores:sum recording rule, which we use to report cores per OS.

In OpenShift, windows_exporter needs to report windows_cpu_info (consistent with node_exporter, if it isn't already), and windows_cpu_info must then be renamed to node_cpu_info, which should fix the recording rule.

To verify, check that a cluster reports node_role_os_version_machine:cpu_capacity_cores:sum with label_node_openshift_io_os_id="Windows" via Prometheus, which is what will be sent to telemetry.
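
The verification query can also be scripted. What follows is a minimal sketch, not part of the original report: it assumes the standard Prometheus HTTP API instant-query endpoint (/api/v1/query), and the helper names build_query_url and reported_cores, as well as the example host, are hypothetical. It only builds the query URL and sums the samples in a response body; authentication and the HTTP request itself are deliberately left out.

```python
import json
from urllib.parse import urlencode

# Recording rule named in this report.
RULE = "node_role_os_version_machine:cpu_capacity_cores:sum"


def build_query_url(base_url: str, os_id: str = "Windows") -> str:
    """Return an instant-query URL that selects the rule for one OS id.

    base_url is a placeholder for the cluster's Prometheus route.
    """
    promql = f'{RULE}{{label_node_openshift_io_os_id="{os_id}"}}'
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def reported_cores(response_body: str) -> float:
    """Sum the sample values of an instant-query (vector) JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    # Each vector sample carries its value as [timestamp, "string value"].
    return sum(float(sample["value"][1]) for sample in payload["data"]["result"])
```

A result of zero samples for os_id="Windows" on a cluster that has Windows nodes would indicate the problem described here.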

Also, I recommend reviewing all rules in cluster-monitoring-operator to ensure no additional node_* metrics are missing and need to be restored or renamed. Our goal should be that metrics that can be identical between Windows and Linux nodes SHOULD be consistent with Linux.

Comment 1 gaoshang 2021-05-07 08:10:50 UTC
This bug has been verified on OCP 4.8.0-0.nightly-2021-05-06-210840 and passed, thanks.

Version-Release number of selected component (if applicable):
WMCO built from https://github.com/openshift/windows-machine-config-operator/commit/1ca41c250ff937d1543559ba19e805a7473d45bf
OCP version 4.8.0-0.nightly-2021-05-06-210840


Steps:

1. Install WMCO on OCP 4.8 and make sure the WMCO namespace is monitored by selecting the checkbox "Enable Operator recommended cluster monitoring on this Namespace".

2. Create a Windows machineset and scale up Windows nodes.

3. Check that the cluster reports node_role_os_version_machine:cpu_capacity_cores:sum with label_node_openshift_io_os_id="Windows" via Prometheus.

e.g.
Searching for `node_role_os_version_machine:cpu_capacity_cores:sum` in https://prometheus-k8s-openshift-monitoring.apps.sgao-a1.qe.devcluster.openshift.com/graph returned:

node_role_os_version_machine:cpu_capacity_cores:sum{label_kubernetes_io_arch="amd64", label_node_hyperthread_enabled="false", label_node_openshift_io_os_id="Windows"}                                           2
node_role_os_version_machine:cpu_capacity_cores:sum{label_kubernetes_io_arch="amd64", label_node_hyperthread_enabled="true", label_node_openshift_io_os_id="rhcos"}                                              3
node_role_os_version_machine:cpu_capacity_cores:sum{label_kubernetes_io_arch="amd64", label_node_hyperthread_enabled="true", label_node_openshift_io_os_id="rhcos", label_node_role_kubernetes_io_master="true"} 6

4. Check the rules in https://prometheus-k8s-openshift-monitoring.apps.sgao-a1.qe.devcluster.openshift.com/rules; nothing looked wrong.

node.rules
Rule	State	Error	Last Evaluation	Evaluation Time
record: node_namespace_pod:kube_pod_info:
expr: topk by(namespace, pod) (1, max by(node, namespace, pod) (label_replace(kube_pod_info{job="kube-state-metrics",node!=""}, "pod", "$1", "pod", "(.*)")))
OK		8.731s ago	3.970ms
record: node:node_num_cpu:sum
expr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job="node-exporter"} * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, node_namespace_pod:kube_pod_info:)))
OK		8.727s ago	3.897ms
record: :node_memory_MemAvailable_bytes:sum
expr: sum by(cluster) (node_memory_MemAvailable_bytes{job="node-exporter"} or (node_memory_Buffers_bytes{job="node-exporter"} + node_memory_Cached_bytes{job="node-exporter"} + node_memory_MemFree_bytes{job="node-exporter"} + node_memory_Slab_bytes{job="node-exporter"}))
OK		8.724s ago	0.650ms


windows.rules
Rule	State	Error	Last Evaluation	Evaluation Time
record: instance:node_cpu_utilisation:rate1m
expr: avg without(core, mode) (rate(windows_cpu_time_total{mode="idle"}[1m]))
OK		4m 13s ago	0.276ms
record: instance:node_cpu:rate:sum
expr: sum by(instance) (rate(windows_cpu_time_total{mode!="iowait",mode="idle"}[3m]))
OK		4m 13s ago	0.177ms
record: node_filesystem_size_bytes
expr: windows_logical_disk_size_bytes
OK		4m 13s ago	0.092ms
record: node_filesystem_avail_bytes
expr: windows_logical_disk_free_bytes
OK		4m 13s ago	0.087ms
record: node_network_receive_bytes_total
expr: rate(windows_net_bytes_received_total[1m])
OK		4m 13s ago	0.122ms
record: node_network_transmit_bytes_total
expr: rate(windows_net_bytes_sent_total[1m])
OK		4m 13s ago	0.096ms
record: node_filesystem_free_bytes
expr: windows_logical_disk_free_bytes
OK		4m 13s ago	0.084ms
record: node_memory_MemAvailable_bytes
expr: windows_memory_available_bytes
OK		4m 13s ago	0.087ms
record: node_memory_MemTotal_bytes
expr: windows_cs_physical_memory_bytes
OK		4m 13s ago	0.078ms
record: node_cpu_info
expr: windows_cpu_info
OK		4m 13s ago	0.102ms
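
The rule review suggested in the description can be approximated from the two listings above. The following is a rough sketch under stated assumptions, not the procedure used for this verification: the expressions and recorded names are copied from the listings, while the helper name uncovered_node_metrics and the regular expression are illustrative. It lists raw node_* metrics referenced by node.rules that windows.rules does not record. The output is only a candidate list for manual review, since these node.rules expressions are also restricted to job="node-exporter".

```python
import re

# Two expressions copied from the node.rules listing above.
NODE_RULE_EXPRS = [
    'count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job="node-exporter"} '
    '* on(namespace, pod) group_left(node) topk by(namespace, pod) (1, node_namespace_pod:kube_pod_info:)))',
    'sum by(cluster) (node_memory_MemAvailable_bytes{job="node-exporter"} or '
    '(node_memory_Buffers_bytes{job="node-exporter"} + node_memory_Cached_bytes{job="node-exporter"} '
    '+ node_memory_MemFree_bytes{job="node-exporter"} + node_memory_Slab_bytes{job="node-exporter"}))',
]

# Names recorded by the windows.rules listing above.
WINDOWS_RECORDED = {
    "instance:node_cpu_utilisation:rate1m",
    "instance:node_cpu:rate:sum",
    "node_filesystem_size_bytes",
    "node_filesystem_avail_bytes",
    "node_network_receive_bytes_total",
    "node_network_transmit_bytes_total",
    "node_filesystem_free_bytes",
    "node_memory_MemAvailable_bytes",
    "node_memory_MemTotal_bytes",
    "node_cpu_info",
}

# A node_* identifier that is not a segment of a colon-separated
# recording-rule name such as node_namespace_pod:kube_pod_info:.
RAW_NODE_METRIC = re.compile(r"(?<![\w:])node_[A-Za-z0-9_]+(?![\w:])")


def uncovered_node_metrics(exprs, recorded):
    """Return sorted node_* metrics used in exprs but absent from recorded."""
    referenced = set()
    for expr in exprs:
        referenced.update(RAW_NODE_METRIC.findall(expr))
    return sorted(referenced - set(recorded))
```

Calling uncovered_node_metrics(NODE_RULE_EXPRS, WINDOWS_RECORDED) gives the names to look at by hand; node_cpu_info does not appear because windows.rules now records it.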

Comment 4 errata-xmlrpc 2021-08-03 20:29:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Platform for Windows Containers 3.0.0 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3001

