+++ This bug was initially created as a clone of Bug #1920903 +++

Description of problem:
The command
# oc adm top
isn't reporting any metrics on the Windows node.

oc adm top nodes
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-135-193.us-east-2.compute.internal   179m         11%    2498Mi          37%
ip-10-0-146-98.us-east-2.compute.internal    587m         16%    6352Mi          43%
ip-10-0-165-247.us-east-2.compute.internal   210m         14%    2798Mi          42%
ip-10-0-174-255.us-east-2.compute.internal   722m         20%    6418Mi          43%
ip-10-0-203-0.us-east-2.compute.internal     432m         28%    2817Mi          42%
ip-10-0-208-133.us-east-2.compute.internal   662m         18%    6191Mi          42%
ip-10-0-136-210.us-east-2.compute.internal   <unknown>

Version-Release number of selected component (if applicable):
4.7

How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP on AWS
2. Configure WMCO
3. Add a Windows node to the existing nodes

Actual results:
No reporting of Windows node metrics - status unknown

Expected results:
Same reporting as for the other (Linux) nodes' metrics

Additional info:
oc adm node-logs -u crio ip-10-0-136-210.us-east-2.compute.internal
Get-WinEvent : There is not an event provider on the localhost computer that matches "crio".
At line:1 char:1
+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (crio:String) [Get-WinEvent], Exception
    + FullyQualifiedErrorId : NoMatchingProvidersFound,Microsoft.PowerShell.Commands.GetWinEventCommand

Get-WinEvent : The specified providers do not write events to any of the specified logs.
At line:1 char:1
+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: ( [Get-WinEvent], Exception
    + FullyQualifiedErrorId : LogsAndProvidersDontOverlap,Microsoft.PowerShell.Commands.GetWinEventCommand

Get-WinEvent : The parameter is incorrect
At line:1 char:1
+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: ( [Get-WinEvent], EventLogException
    + FullyQualifiedErrorId : System.Diagnostics.Eventing.Reader.EventLogException,Microsoft.PowerShell.Commands.GetWinEventCommand

[root@osboxes windows-machine-config-operator]# oc adm node-logs -u kubelet ip-10-0-136-210.us-east-2.compute.internal
Get-WinEvent : There is not an event provider on the localhost computer that matches "kubelet".
At line:1 char:1
+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (kubelet:String) [Get-WinEvent], Exception
    + FullyQualifiedErrorId : NoMatchingProvidersFound,Microsoft.PowerShell.Commands.GetWinEventCommand

Get-WinEvent : The specified providers do not write events to any of the specified logs.
At line:1 char:1
+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: ( [Get-WinEvent], Exception
    + FullyQualifiedErrorId : LogsAndProvidersDontOverlap,Microsoft.PowerShell.Commands.GetWinEventCommand

Get-WinEvent : The parameter is incorrect
At line:1 char:1
+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: ( [Get-WinEvent], EventLogException
    + FullyQualifiedErrorId : System.Diagnostics.Eventing.Reader.EventLogException,Microsoft.PowerShell.Commands.GetWinEventCommand

--- Additional comment from Mansi Kulkarni on 2021-03-25 16:19:10 UTC ---

The prometheus-adapter used by CMO currently has node-exporter-specific fields specified in the ConfigMap it uses, which results in resource metrics not being reported for Windows nodes. Opened https://github.com/prometheus-operator/kube-prometheus/pull/1058 against the upstream https://github.com/prometheus-operator/kube-prometheus repository to add a fix for this issue.

--- Additional comment from Aravindh Puthiyaparambil on 2021-04-15 14:35:56 UTC ---

Raising the priority on this as it breaks HCA and HPA.

--- Additional comment from Mansi Kulkarni on 2021-04-20 20:45:08 UTC ---

Merged upstream fix for this issue against kube-prometheus: https://github.com/prometheus-operator/kube-prometheus/pull/1058
The fix will be picked up downstream with the PR open against the CMO repo: https://github.com/openshift/cluster-monitoring-operator/pull/1127
>oc adm top node
NAME                                         CPU(cores)   CPU%        MEMORY(bytes)   MEMORY%
ip-10-0-148-35.us-east-2.compute.internal    746m         21%         7566Mi          51%
ip-10-0-156-90.us-east-2.compute.internal    370m         24%         3463Mi          52%
ip-10-0-173-191.us-east-2.compute.internal   104m         6%          1932Mi          29%
ip-10-0-184-76.us-east-2.compute.internal    698m         19%         7261Mi          49%
ip-10-0-203-37.us-east-2.compute.internal    466m         31%         4802Mi          73%
ip-10-0-207-184.us-east-2.compute.internal   518m         14%         5547Mi          37%
ip-10-0-133-203.us-east-2.compute.internal   <unknown>    <unknown>   <unknown>       <unknown>
ip-10-0-132-187.us-east-2.compute.internal   <unknown>    <unknown>   <unknown>       <unknown>

Server Version: 4.7.0-0.nightly-2021-05-17-040457
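For anyone re-checking this, a minimal sketch of a filter that lists only the nodes still showing <unknown> in `oc adm top node` output. The function name `unknown_nodes` is hypothetical, and it assumes the output begins with the usual NAME header row.

```shell
# Hypothetical helper (not part of oc): reads `oc adm top node` style
# output on stdin and prints the name of every node whose row contains
# <unknown>. Assumes the first line is the NAME/CPU/MEMORY header.
unknown_nodes() {
  awk 'NR > 1 && /<unknown>/ { print $1 }'
}
```

Usage would be along the lines of `oc adm top node | unknown_nodes`; empty output means every listed node is reporting metrics.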
@rrasouli since the fix was merged on May 14th, it might not be available on a nightly and would have to be tested on a CI cluster. Could you provide more details on how the operator was installed? It should be built from the release-4.7 branch of WMCO; the released 2.0.0 version of WMCO does not include the latest developments with metrics configuration.
@rrasouli tested this out on the latest CI cluster and it worked.

Server version: 4.7.0-0.ci-2021-05-17-153541

Steps:
1. Install the WMCO operator by building from the release-4.7 operator branch on OCP 4.7; ensure cluster monitoring is enabled in the operator namespace.
2. Create a Windows machineset and scale up Windows nodes.
3. Check that `oc adm top nodes` reports metrics for the Windows nodes.

>oc adm top node
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-134-241.us-east-2.compute.internal   285m         19%    3401Mi          51%
ip-10-0-139-95.us-east-2.compute.internal    529m         15%    5918Mi          40%
ip-10-0-152-127.us-east-2.compute.internal   90m          6%     1533Mi          22%
ip-10-0-164-118.us-east-2.compute.internal   671m         19%    6029Mi          41%
ip-10-0-170-159.us-east-2.compute.internal   219m         14%    3432Mi          51%
ip-10-0-212-23.us-east-2.compute.internal    174m         11%    2702Mi          40%
ip-10-0-220-59.us-east-2.compute.internal    718m         20%    6681Mi          45%

>oc adm top node -l kubernetes.io/os=windows
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-152-127.us-east-2.compute.internal   91m          6%     1521Mi          22%

Can you verify this?
@rrasouli To test this, please ensure that the commit that adds this fix to release-4.7, "Bug 1952149: oc adm top reporting unknown status for Windows node" [https://github.com/openshift/cluster-monitoring-operator/pull/1130/commits/1c9b296b55fc36175d39b4e7230a5c0674db69fa], is part of the cluster payload.
@rrasouli The WMCO should be built by pulling in the latest from the release-4.7 branch, since some renaming changes related to the metrics job went in (windows-machine-config-operator-metrics -> windows-exporter). Please make sure the commits that were part of this change are pulled in when building the operator: https://github.com/openshift/windows-machine-config-operator/pull/353/commits
"version": "2.0.1+ae13f4c" was built from the latest 4.7 branch.

Server Version: 4.7.0-0.nightly-2021-05-17-040457

Indeed, after a few minutes the metrics are working:

oc adm top node --selector=beta.kubernetes.io/os=windows
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-154-238.us-east-2.compute.internal   1119m        74%    1569Mi          23%
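Since the metrics only showed up after a few minutes, a check like this may need to be retried. Below is a rough sketch of a polling wrapper. The function name `wait_for_metrics` and its argument order are made up for illustration; it simply reruns whatever command it is given until the output is non-empty and has no <unknown> in it.

```shell
# Hypothetical helper: rerun a command until its output is non-empty and
# contains no <unknown>, or until the attempts run out.
#   $1 = maximum number of attempts
#   $2 = seconds to sleep between attempts
#   remaining arguments = the command to run
# Note: plain POSIX sh, so the variables below are not function-local.
wait_for_metrics() {
  attempts="$1"
  interval="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if out="$("$@" 2>/dev/null)"; then
      case "$out" in
        ''|*'<unknown>'*) ;;                      # nothing yet, or metrics missing: retry
        *) printf '%s\n' "$out"; return 0 ;;      # all rows report metrics
      esac
    fi
    [ "$i" -lt "$attempts" ] && sleep "$interval"
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_for_metrics 30 10 oc adm top node -l kubernetes.io/os=windows` would poll every 10 seconds for up to 30 attempts and return non-zero if the Windows node never reports.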
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.12 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:1561