Bug 1952149 - oc adm top reporting unknown status for Windows node
Summary: oc adm top reporting unknown status for Windows node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.z
Assignee: Mansi Kulkarni
QA Contact: Ronnie Rasouli
URL:
Whiteboard:
Depends On: 1920903
Blocks:
Reported: 2021-04-21 15:41 UTC by Mansi Kulkarni
Modified: 2021-05-24 17:15 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1920903
Environment:
Last Closed: 2021-05-24 17:14:37 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1130 0 None open Bug 1952149: oc adm top reporting unknown status for Windows node 2021-05-07 08:51:21 UTC
Red Hat Product Errata RHSA-2021:1561 0 None None None 2021-05-24 17:15:10 UTC

Description Mansi Kulkarni 2021-04-21 15:41:50 UTC
+++ This bug was initially created as a clone of Bug #1920903 +++

Description of problem:

The command
# oc adm top

does not report any metrics for the Windows node:

oc adm top nodes
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-135-193.us-east-2.compute.internal   179m         11%    2498Mi          37%
ip-10-0-146-98.us-east-2.compute.internal    587m         16%    6352Mi          43%
ip-10-0-165-247.us-east-2.compute.internal   210m         14%    2798Mi          42%
ip-10-0-174-255.us-east-2.compute.internal   722m         20%    6418Mi          43%
ip-10-0-203-0.us-east-2.compute.internal     432m         28%    2817Mi          42%
ip-10-0-208-133.us-east-2.compute.internal   662m         18%    6191Mi          42%
ip-10-0-136-210.us-east-2.compute.internal   <unknown>
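A quick way to pick out the affected nodes from output like the above is to filter for rows containing `<unknown>`. The following is only a minimal sketch: `unknown_nodes` is a hypothetical helper name (not part of `oc`), and it assumes the column layout shown in this report.

```shell
# Hypothetical helper (not part of oc): print the names of nodes whose
# metrics show as <unknown>. Reads `oc adm top nodes` output on stdin.
unknown_nodes() {
  # Skip the header row, then print the first column of every row
  # that contains the literal string "<unknown>".
  awk 'NR > 1 && /<unknown>/ { print $1 }'
}

# Usage (requires a logged-in cluster):
#   oc adm top nodes | unknown_nodes
```

Run against the listing in this report, it should print only the Windows node's name.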

Version-Release number of selected component (if applicable):
4.7

How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP on AWS
2. Configure WMCO
3. Add Windows node to existing nodes

Actual results:

No metrics are reported for the Windows node; its status shows as <unknown>.

Expected results:

Metrics are reported for the Windows node in the same way as for the Linux nodes.

Additional info:

oc adm  node-logs -u crio ip-10-0-136-210.us-east-2.compute.internal

Get-WinEvent : There is not an event provider on the localhost computer that matches "crio".

At line:1 char:1

+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...

+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    + CategoryInfo          : ObjectNotFound: (crio:String) [Get-WinEvent], Exception

    + FullyQualifiedErrorId : NoMatchingProvidersFound,Microsoft.PowerShell.Commands.GetWinEventCommand

 

Get-WinEvent : The specified providers do not write events to any of the specified logs.

At line:1 char:1

+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...

+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    + CategoryInfo          : InvalidArgument: ( [Get-WinEvent], Exception

    + FullyQualifiedErrorId : LogsAndProvidersDontOverlap,Microsoft.PowerShell.Commands.GetWinEventCommand

 

Get-WinEvent : The parameter is incorrect

At line:1 char:1

+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...

+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    + CategoryInfo          : NotSpecified: ( [Get-WinEvent], EventLogException

    + FullyQualifiedErrorId : System.Diagnostics.Eventing.Reader.EventLogException,Microsoft.PowerShell.Commands.GetWi

   nEventCommand

 

[root@osboxes windows-machine-config-operator]# oc adm  node-logs -u kubelet  ip-10-0-136-210.us-east-2.compute.internal

Get-WinEvent : There is not an event provider on the localhost computer that matches "kubelet".

At line:1 char:1

+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...

+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    + CategoryInfo          : ObjectNotFound: (kubelet:String) [Get-WinEvent], Exception

    + FullyQualifiedErrorId : NoMatchingProvidersFound,Microsoft.PowerShell.Commands.GetWinEventCommand

 

Get-WinEvent : The specified providers do not write events to any of the specified logs.

At line:1 char:1

+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...

+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    + CategoryInfo          : InvalidArgument: ( [Get-WinEvent], Exception

    + FullyQualifiedErrorId : LogsAndProvidersDontOverlap,Microsoft.PowerShell.Commands.GetWinEventCommand

 

Get-WinEvent : The parameter is incorrect

At line:1 char:1

+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...

+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    + CategoryInfo          : NotSpecified: ( [Get-WinEvent], EventLogException

    + FullyQualifiedErrorId : System.Diagnostics.Eventing.Reader.EventLogException,Microsoft.PowerShell.Commands.GetWi

   nEventCommand

--- Additional comment from Mansi Kulkarni on 2021-03-25 16:19:10 UTC ---

The prometheus-adapter used by CMO currently specifies node-exporter-specific fields in the configMap it uses, so resource metrics are not reported for Windows nodes.
I opened https://github.com/prometheus-operator/kube-prometheus/pull/1058 against the upstream https://github.com/prometheus-operator/kube-prometheus repository to fix this issue.

--- Additional comment from Aravindh Puthiyaparambil on 2021-04-15 14:35:56 UTC ---

Raising the priority on this as it breaks HCA and HPA.

--- Additional comment from Mansi Kulkarni on 2021-04-20 20:45:08 UTC ---

The upstream fix for this issue has been merged into kube-prometheus: https://github.com/prometheus-operator/kube-prometheus/pull/1058
The fix will be picked up downstream by the PR open against the CMO repo: https://github.com/openshift/cluster-monitoring-operator/pull/1127

Comment 4 Ronnie Rasouli 2021-05-18 09:41:12 UTC
>oc adm top node
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-148-35.us-east-2.compute.internal    746m         21%    7566Mi          51%
ip-10-0-156-90.us-east-2.compute.internal    370m         24%    3463Mi          52%
ip-10-0-173-191.us-east-2.compute.internal   104m         6%     1932Mi          29%
ip-10-0-184-76.us-east-2.compute.internal    698m         19%    7261Mi          49%
ip-10-0-203-37.us-east-2.compute.internal    466m         31%    4802Mi          73%
ip-10-0-207-184.us-east-2.compute.internal   518m         14%    5547Mi          37%
ip-10-0-133-203.us-east-2.compute.internal   <unknown>                           <unknown>               <unknown>               <unknown>
ip-10-0-132-187.us-east-2.compute.internal   <unknown>                           <unknown>               <unknown>               <unknown>

Server Version: 4.7.0-0.nightly-2021-05-17-040457

Comment 5 Mansi Kulkarni 2021-05-18 18:15:54 UTC
@rrasouli since the fix was merged on May 14th, it might not be available in a nightly yet and would have to be tested on a CI cluster. Could you provide more details on how the operator was installed? It should be built from the release-4.7 branch of WMCO; the released 2.0.0 version of WMCO does not include the latest changes to the metrics configuration.

Comment 6 Mansi Kulkarni 2021-05-19 13:44:34 UTC
@rrasouli I tested this on a recent CI cluster and it worked.

Server version: 4.7.0-0.ci-2021-05-17-153541

Steps:

1. Install the WMCO operator by building it from the release-4.7 operator branch on OCP 4.7, and ensure cluster monitoring is enabled in the operator namespace.

2. Create Windows machineset and scale up Windows nodes

3. Run `oc adm top nodes`; it should report metrics for the Windows nodes.

>oc adm top node
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
ip-10-0-134-241.us-east-2.compute.internal   285m         19%    3401Mi          51%       
ip-10-0-139-95.us-east-2.compute.internal    529m         15%    5918Mi          40%       
ip-10-0-152-127.us-east-2.compute.internal   90m          6%     1533Mi          22%       
ip-10-0-164-118.us-east-2.compute.internal   671m         19%    6029Mi          41%       
ip-10-0-170-159.us-east-2.compute.internal   219m         14%    3432Mi          51%       
ip-10-0-212-23.us-east-2.compute.internal    174m         11%    2702Mi          40%       
ip-10-0-220-59.us-east-2.compute.internal    718m         20%    6681Mi          45%  

>oc adm top node -l kubernetes.io/os=windows
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
ip-10-0-152-127.us-east-2.compute.internal   91m          6%     1521Mi          22% 

Can you verify this?

Comment 7 Mansi Kulkarni 2021-05-19 14:46:37 UTC
@rrasouli Please ensure that the commit adding this fix to release-4.7, "Bug 1952149: oc adm top reporting unknown status for Windows node" [https://github.com/openshift/cluster-monitoring-operator/pull/1130/commits/1c9b296b55fc36175d39b4e7230a5c0674db69fa], is part of the cluster payload used to test this.

Comment 8 Mansi Kulkarni 2021-05-19 15:03:07 UTC
@rrasouli WMCO should be built from the latest release-4.7 branch, since the metrics job was renamed from windows-machine-config-operator-metrics to windows-exporter. Please make sure the commits that were part of this change are pulled in when building the operator: https://github.com/openshift/windows-machine-config-operator/pull/353/commits

Comment 9 Ronnie Rasouli 2021-05-20 06:22:03 UTC
Version 2.0.1+ae13f4c was built from the latest 4.7 branch.
Server Version: 4.7.0-0.nightly-2021-05-17-040457

Indeed, after a few minutes the metrics are working:

oc adm top node --selector=beta.kubernetes.io/os=windows
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-154-238.us-east-2.compute.internal   1119m        74%    1569Mi          23%
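
Because the metrics only appeared after a few minutes, a verification run may need to poll rather than check once. The following is a rough sketch of such a loop: `wait_for_metrics` is a hypothetical helper name, and the attempt count and delay in the usage line are arbitrary assumptions, not values taken from this bug.

```shell
# Hypothetical helper: re-run a command until its output is non-empty and
# contains no "<unknown>" entries.
# Arguments: max attempts, delay in seconds, then the command to run.
wait_for_metrics() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Capture the command's stdout; errors are ignored and retried.
    out=$("$@" 2>/dev/null)
    if [ -n "$out" ] && ! printf '%s\n' "$out" | grep -q '<unknown>'; then
      printf '%s\n' "$out"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  # Metrics never became available within the allowed attempts.
  return 1
}

# Usage (requires a logged-in cluster; 30 attempts and 10s are assumptions):
#   wait_for_metrics 30 10 oc adm top node --selector=beta.kubernetes.io/os=windows
```

The helper returns non-zero if the output still contains `<unknown>` after the last attempt, so it can gate a test script.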

Comment 11 errata-xmlrpc 2021-05-24 17:14:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.12 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1561
