1952149 – oc adm top reporting unknown status for Windows node

Summary: oc adm top reporting unknown status for Windows node

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.7.z
Assignee:	Mansi Kulkarni
QA Contact:	Ronnie Rasouli
Docs Contact:
URL:
Whiteboard:
Depends On:	1920903
Blocks:

Reported:	2021-04-21 15:41 UTC by Mansi Kulkarni
Modified:	2021-05-24 17:15 UTC
CC List:	15 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1920903
Environment:
Last Closed:	2021-05-24 17:14:37 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-monitoring-operator pull 1130	0	None	open	Bug 1952149: oc adm top reporting unknown status for Windows node	2021-05-07 08:51:21 UTC
Red Hat Product Errata	RHSA-2021:1561	0	None	None	None	2021-05-24 17:15:10 UTC

Description Mansi Kulkarni 2021-04-21 15:41:50 UTC

+++ This bug was initially created as a clone of Bug #1920903 +++

Description of problem:

The `oc adm top` command does not report any metrics for the Windows node:

oc adm top nodes
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-135-193.us-east-2.compute.internal   179m         11%    2498Mi          37%
ip-10-0-146-98.us-east-2.compute.internal    587m         16%    6352Mi          43%
ip-10-0-165-247.us-east-2.compute.internal   210m         14%    2798Mi          42%
ip-10-0-174-255.us-east-2.compute.internal   722m         20%    6418Mi          43%
ip-10-0-203-0.us-east-2.compute.internal     432m         28%    2817Mi          42%
ip-10-0-208-133.us-east-2.compute.internal   662m         18%    6191Mi          42%
ip-10-0-136-210.us-east-2.compute.internal   <unknown>
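
Rows like the last one above can be picked out mechanically. The following is a minimal sketch in Python, assuming the table text has already been captured from `oc adm top nodes`; the function name `nodes_without_metrics` is illustrative and not part of any OpenShift tooling.

```python
def nodes_without_metrics(top_output):
    """Return node names whose row in `oc adm top nodes` output lacks metrics.

    A row counts as missing metrics when any column after the name is
    "<unknown>" or when fewer than four value columns are present.
    """
    missing = []
    for line in top_output.strip().splitlines():
        fields = line.split()
        if not fields or fields[0] == "NAME":
            continue  # skip blank lines and the header row
        name, values = fields[0], fields[1:]
        if len(values) < 4 or "<unknown>" in values:
            missing.append(name)
    return missing


# Two rows copied from the output in this report.
sample = """\
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-135-193.us-east-2.compute.internal   179m         11%    2498Mi          37%
ip-10-0-136-210.us-east-2.compute.internal   <unknown>
"""
print(nodes_without_metrics(sample))
```

With the sample rows, only the Windows node `ip-10-0-136-210.us-east-2.compute.internal` is returned.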

Version-Release number of selected component (if applicable):
4.7

How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP on AWS
2. Configure WMCO
3. Add Windows node to existing nodes

Actual results:

No reporting of Windows node metrics - status unknown

Expected results:

The same metrics reporting as for the Linux nodes

Additional info:

oc adm  node-logs -u crio ip-10-0-136-210.us-east-2.compute.internal

Get-WinEvent : There is not an event provider on the localhost computer that matches "crio".
At line:1 char:1
+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (crio:String) [Get-WinEvent], Exception
    + FullyQualifiedErrorId : NoMatchingProvidersFound,Microsoft.PowerShell.Commands.GetWinEventCommand

Get-WinEvent : The specified providers do not write events to any of the specified logs.
At line:1 char:1
+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: ( [Get-WinEvent], Exception
    + FullyQualifiedErrorId : LogsAndProvidersDontOverlap,Microsoft.PowerShell.Commands.GetWinEventCommand

Get-WinEvent : The parameter is incorrect
At line:1 char:1
+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: ( [Get-WinEvent], EventLogException
    + FullyQualifiedErrorId : System.Diagnostics.Eventing.Reader.EventLogException,Microsoft.PowerShell.Commands.GetWinEventCommand

[root@osboxes windows-machine-config-operator]# oc adm  node-logs -u kubelet  ip-10-0-136-210.us-east-2.compute.internal

Get-WinEvent : There is not an event provider on the localhost computer that matches "kubelet".
At line:1 char:1
+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (kubelet:String) [Get-WinEvent], Exception
    + FullyQualifiedErrorId : NoMatchingProvidersFound,Microsoft.PowerShell.Commands.GetWinEventCommand

Get-WinEvent : The specified providers do not write events to any of the specified logs.
At line:1 char:1
+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: ( [Get-WinEvent], Exception
    + FullyQualifiedErrorId : LogsAndProvidersDontOverlap,Microsoft.PowerShell.Commands.GetWinEventCommand

Get-WinEvent : The parameter is incorrect
At line:1 char:1
+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: ( [Get-WinEvent], EventLogException
    + FullyQualifiedErrorId : System.Diagnostics.Eventing.Reader.EventLogException,Microsoft.PowerShell.Commands.GetWinEventCommand

--- Additional comment from Mansi Kulkarni on 2021-03-25 16:19:10 UTC ---

The ConfigMap used by the prometheus-adapter that CMO deploys currently specifies node-exporter-specific fields, so resource metrics are not reported for Windows nodes.
I opened https://github.com/prometheus-operator/kube-prometheus/pull/1058 against the upstream https://github.com/prometheus-operator/kube-prometheus repository to fix this issue.
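
To illustrate the failure mode described in this comment, here is a toy model in Python. It is not the prometheus-adapter configuration or code: the sample values, node names, and the `cpu_usage` function are made up, the metric names `node_cpu_seconds_total` (node-exporter) and `windows_cpu_time_total` (windows_exporter) stand in as examples of exporter-specific series, and the fallback is only an assumption about the general shape of such a fix.

```python
# Toy series store: metric name -> {node name: value}. Values are made up.
SERIES = {
    "node_cpu_seconds_total": {"linux-node": 0.18},    # node-exporter (Linux)
    "windows_cpu_time_total": {"windows-node": 0.09},  # windows_exporter
}


def cpu_usage(node, metric_names):
    """Return the first value found for `node` among `metric_names`, else None."""
    for metric in metric_names:
        value = SERIES.get(metric, {}).get(node)
        if value is not None:
            return value
    return None  # no series matched; in this toy model that plays the role of <unknown>


# A lookup that only knows the node-exporter series finds nothing for Windows.
linux_only = ["node_cpu_seconds_total"]
print(cpu_usage("linux-node", linux_only))
print(cpu_usage("windows-node", linux_only))

# Adding a Windows series as a fallback covers both kinds of node.
with_fallback = ["node_cpu_seconds_total", "windows_cpu_time_total"]
print(cpu_usage("windows-node", with_fallback))
```

The point of the sketch is only that a lookup restricted to one exporter's series has no answer for nodes that run a different exporter.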

--- Additional comment from Aravindh Puthiyaparambil on 2021-04-15 14:35:56 UTC ---

Raising the priority on this as it breaks HCA and HPA

--- Additional comment from Mansi Kulkarni on 2021-04-20 20:45:08 UTC ---

The upstream fix for this issue has been merged into kube-prometheus: https://github.com/prometheus-operator/kube-prometheus/pull/1058
The fix will be picked up downstream by the PR open against the CMO repo: https://github.com/openshift/cluster-monitoring-operator/pull/1127

Comment 4 Ronnie Rasouli 2021-05-18 09:41:12 UTC

>oc adm top node
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-148-35.us-east-2.compute.internal    746m         21%    7566Mi          51%
ip-10-0-156-90.us-east-2.compute.internal    370m         24%    3463Mi          52%
ip-10-0-173-191.us-east-2.compute.internal   104m         6%     1932Mi          29%
ip-10-0-184-76.us-east-2.compute.internal    698m         19%    7261Mi          49%
ip-10-0-203-37.us-east-2.compute.internal    466m         31%    4802Mi          73%
ip-10-0-207-184.us-east-2.compute.internal   518m         14%    5547Mi          37%
ip-10-0-133-203.us-east-2.compute.internal   <unknown>    <unknown>   <unknown>   <unknown>
ip-10-0-132-187.us-east-2.compute.internal   <unknown>    <unknown>   <unknown>   <unknown>

Server Version: 4.7.0-0.nightly-2021-05-17-040457

Comment 5 Mansi Kulkarni 2021-05-18 18:15:54 UTC

@rrasouli Since the fix was merged on May 14th, it might not yet be available in a nightly build and would have to be tested on a CI cluster. Could you provide more details on how the operator was installed? It should be built from the release-4.7 branch of WMCO; the released 2.0.0 version of WMCO does not include the latest developments in metrics configuration.

Comment 6 Mansi Kulkarni 2021-05-19 13:44:34 UTC

@rrasouli I tested this on the latest CI cluster and it worked.

Server version: 4.7.0-0.ci-2021-05-17-153541

Steps:

1. Install the WMCO operator by building it from the release-4.7 operator branch on OCP 4.7, and ensure cluster monitoring is enabled in the operator namespace.

2. Create a Windows machineset and scale up Windows nodes.

3. Check that `oc adm top nodes` reports metrics for the Windows nodes.

>oc adm top node
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
ip-10-0-134-241.us-east-2.compute.internal   285m         19%    3401Mi          51%       
ip-10-0-139-95.us-east-2.compute.internal    529m         15%    5918Mi          40%       
ip-10-0-152-127.us-east-2.compute.internal   90m          6%     1533Mi          22%       
ip-10-0-164-118.us-east-2.compute.internal   671m         19%    6029Mi          41%       
ip-10-0-170-159.us-east-2.compute.internal   219m         14%    3432Mi          51%       
ip-10-0-212-23.us-east-2.compute.internal    174m         11%    2702Mi          40%       
ip-10-0-220-59.us-east-2.compute.internal    718m         20%    6681Mi          45%  

>oc adm top node -l kubernetes.io/os=windows
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
ip-10-0-152-127.us-east-2.compute.internal   91m          6%     1521Mi          22% 

Can you verify this?

Comment 7 Mansi Kulkarni 2021-05-19 14:46:37 UTC

@rrasouli Please ensure that the commit that adds this fix to release-4.7, "Bug 1952149: oc adm top reporting unknown status for Windows node" [https://github.com/openshift/cluster-monitoring-operator/pull/1130/commits/1c9b296b55fc36175d39b4e7230a5c0674db69fa], is part of the cluster payload when testing this.

Comment 8 Mansi Kulkarni 2021-05-19 15:03:07 UTC

@rrasouli WMCO should be built from the latest release-4.7 branch, since some renaming changes related to the metrics job went in there (windows-machine-config-operator-metrics -> windows-exporter). Please make sure the commits that were part of this change are pulled in when building the operator: https://github.com/openshift/windows-machine-config-operator/pull/353/commits

Comment 9 Ronnie Rasouli 2021-05-20 06:22:03 UTC

WMCO version 2.0.1+ae13f4c was built from the latest 4.7 branch
Server Version: 4.7.0-0.nightly-2021-05-17-040457

Indeed, after a few minutes the metrics are working:

oc adm top node --selector=beta.kubernetes.io/os=windows
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-154-238.us-east-2.compute.internal   1119m        74%    1569Mi          23%

Comment 11 errata-xmlrpc 2021-05-24 17:14:37 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.12 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1561
