Bug 1920903
| Summary: | oc adm top reporting unknown status for Windows node | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Ronnie Rasouli <rrasouli> |
| Component: | Monitoring | Assignee: | Sergiusz Urbaniak <surbania> |
| Status: | CLOSED ERRATA | QA Contact: | Ronnie Rasouli <rrasouli> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.7 | CC: | alegrand, anpicker, aos-bugs, aravindh, erooth, jfajersk, juzhao, kakkoyun, lcosic, mankulka, obulatov, pkrupa, sgao, spasquie, vhire |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1952149 (view as bug list) | Environment: | |
| Last Closed: | 2021-07-27 22:36:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1952149 | | |
Description

Ronnie Rasouli 2021-01-27 08:51:03 UTC

Raising the priority on this as it breaks HCA and HPA.

The upstream fix for this issue has been merged against kube-prometheus:
https://github.com/prometheus-operator/kube-prometheus/pull/1058

The fix will be picked up downstream with the PR open against the CMO repo:
https://github.com/openshift/cluster-monitoring-operator/pull/1127

Checked with 4.8.0-0.nightly-2021-05-06-003426, `oc adm top` still reports unknown status for the Windows nodes:

```
# oc get no --show-labels | grep windows | awk '{print $1}'
ip-10-0-146-241.us-east-2.compute.internal
ip-10-0-158-141.us-east-2.compute.internal

# oc adm top node
W0506 05:43:09.237978   15140 top_node.go:119] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
NAME                                         CPU(cores)   CPU%        MEMORY(bytes)   MEMORY%
ip-10-0-149-206.us-east-2.compute.internal   1014m        28%         6834Mi          46%
ip-10-0-159-183.us-east-2.compute.internal   196m         5%          1434Mi          9%
ip-10-0-167-210.us-east-2.compute.internal   1025m        29%         5362Mi          36%
ip-10-0-171-40.us-east-2.compute.internal    924m         26%         7585Mi          52%
ip-10-0-213-181.us-east-2.compute.internal   761m         21%         5653Mi          38%
ip-10-0-218-55.us-east-2.compute.internal    361m         10%         4082Mi          27%
ip-10-0-146-241.us-east-2.compute.internal   <unknown>    <unknown>   <unknown>       <unknown>
ip-10-0-158-141.us-east-2.compute.internal   <unknown>    <unknown>   <unknown>       <unknown>
```

Checked that the fix is in the payload:

```
# docker pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e0645d0cc8ef475f9f9e2bda659886164d34e02395c2c39722ae12728f276b25
...
```
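As a side note on the units in the `oc adm top` output above: CPU is reported as millicores ("1014m") and memory in binary suffixes ("6834Mi"), and the percentage columns divide usage by the node's allocatable capacity. A minimal sketch of that conversion follows; it is illustrative only (the suffix table and the 3500m allocatable figure are assumptions, not the actual oc/apimachinery quantity parser):

```python
# Sketch: decode the Kubernetes quantity strings seen in `oc adm top` output
# (e.g. "1014m" CPU, "6834Mi" memory) into base units. Simplified for
# illustration -- not the actual oc/apimachinery implementation.

SUFFIXES = {
    "m": 0.001,        # millicores
    "Ki": 1024,        # binary memory suffixes
    "Mi": 1024 ** 2,
    "Gi": 1024 ** 3,
}

def parse_quantity(q: str) -> float:
    """Return the value of a quantity string in base units (cores or bytes)."""
    for suffix, factor in SUFFIXES.items():
        if q.endswith(suffix):
            return float(q[: -len(suffix)]) * factor
    return float(q)

def usage_percent(used: str, allocatable: str) -> int:
    """Percentage as shown in the CPU%/MEMORY% columns (truncated)."""
    return int(parse_quantity(used) / parse_quantity(allocatable) * 100)

# A node using 1014m of a hypothetical 3500m allocatable sits at 28%:
print(usage_percent("1014m", "3500m"))  # -> 28
```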
```
Digest: sha256:e0645d0cc8ef475f9f9e2bda659886164d34e02395c2c39722ae12728f276b25
Status: Downloaded newer image for quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e0645d0cc8ef475f9f9e2bda659886164d34e02395c2c39722ae12728f276b25

# docker images
REPOSITORY                                       TAG      IMAGE ID       CREATED        SIZE
quay.io/openshift-release-dev/ocp-v4.0-art-dev   <none>   269449901989   35 hours ago   306 MB

# docker inspect 269449901989 | grep "io.openshift.build.commit.url"
"io.openshift.build.commit.url": "https://github.com/openshift/images/commit/bcab0f7337420343611546aae2634eaf0d36c33e",
"io.openshift.build.commit.url": "https://github.com/openshift/cluster-monitoring-operator/commit/4d6bf3d9ed8187ed13854fce3d75d32a0525b1db",
```

```
# oc adm top node ip-10-0-146-241.us-east-2.compute.internal --loglevel=10
...
I0506 06:06:17.247607   15757 round_trippers.go:435] curl -k -v -XGET -H "Accept: application/json, */*" -H "User-Agent: oc/4.8.0 (linux/amd64) kubernetes/7cae9e8" 'https://api.juzhao-0506.qe.devcluster.openshift.com:6443/apis/metrics.k8s.io/v1beta1/nodes/ip-10-0-146-241.us-east-2.compute.internal'
I0506 06:06:17.294720   15757 round_trippers.go:454] GET https://api.juzhao-0506.qe.devcluster.openshift.com:6443/apis/metrics.k8s.io/v1beta1/nodes/ip-10-0-146-241.us-east-2.compute.internal 404 Not Found in 47 milliseconds
I0506 06:06:17.294752   15757 round_trippers.go:460] Response Headers:
I0506 06:06:17.294767   15757 round_trippers.go:463]     Audit-Id: 2858f3a6-72b5-47b5-aa19-d75b24087825
I0506 06:06:17.294772   15757 round_trippers.go:463]     Cache-Control: no-cache, private
I0506 06:06:17.294776   15757 round_trippers.go:463]     Cache-Control: no-cache, private
I0506 06:06:17.294780   15757 round_trippers.go:463]     Content-Type: application/json
I0506 06:06:17.294783   15757 round_trippers.go:463]     Date: Thu, 06 May 2021 10:28:27 GMT
I0506 06:06:17.294787   15757 round_trippers.go:463]     X-Kubernetes-Pf-Flowschema-Uid: e1d427ce-6ee5-4370-8f58-942550853b5d
I0506 06:06:17.294791   15757 round_trippers.go:463]     X-Kubernetes-Pf-Prioritylevel-Uid: 13dc42b2-5d4b-464d-afbc-5a8e1d88a047
I0506 06:06:17.294795   15757 round_trippers.go:463]     Content-Length: 306
I0506 06:06:17.294819   15757 request.go:1123] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"nodemetrics.metrics.k8s.io \"ip-10-0-146-241.us-east-2.compute.internal\" not found","reason":"NotFound","details":{"name":"ip-10-0-146-241.us-east-2.compute.internal","group":"metrics.k8s.io","kind":"nodemetrics"},"code":404}
I0506 06:06:17.295335   15757 helpers.go:216] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "nodemetrics.metrics.k8s.io \"ip-10-0-146-241.us-east-2.compute.internal\" not found",
  "reason": "NotFound",
  "details": {
    "name": "ip-10-0-146-241.us-east-2.compute.internal",
    "group": "metrics.k8s.io",
    "kind": "nodemetrics"
  },
  "code": 404
}]
F0506 06:06:17.295367   15757 helpers.go:115] Error from server (NotFound): nodemetrics.metrics.k8s.io "ip-10-0-146-241.us-east-2.compute.internal" not found
goroutine 1 [running]:
...
```

Windows node:

```
# oc get nodemetrics.metrics.k8s.io/ip-10-0-146-241.us-east-2.compute.internal
Error from server (NotFound): nodemetrics.metrics.k8s.io "ip-10-0-146-241.us-east-2.compute.internal" not found
```

CoreOS node:

```
# oc get nodemetrics.metrics.k8s.io/ip-10-0-149-206.us-east-2.compute.internal
NAME                                         CPU    MEMORY      WINDOW
ip-10-0-149-206.us-east-2.compute.internal   752m   7346348Ki   1m0s
```

Aravindh Puthiyaparambil:

@juzhao did you use the WMCO version from OperatorHub to test this? If yes, that does not have the necessary fixes on the WMCO side. You need to use the operator built from master. It will be easier for @sgao or @rrasouli to test and verify this. I hope one of you can pick this off Junqi's plate.

(In reply to Aravindh Puthiyaparambil from comment #6)
> @juzhao did you use the WMCO version from OperatorHub to test
> this? If yes, that does not have the necessary fixes on the WMCO side. You
> need to use the operator built from master.
> It will be easier for
> @sgao or @rrasouli to test and verify this. I hope one of
> you can pick this off Junqi's plate.

I did not use the WMCO version from OperatorHub; we have a Jenkins job which can add Windows nodes when building the cluster.

@aravindh @juzhao By default, clusters installed by the QE Jenkins job did not monitor the WMCO namespace. I fixed that, and it now works with monitoring enabled. This bug has been verified on OCP 4.8.0-0.nightly-2021-05-06-210840 and passed, thanks.

Version-Release number of selected component (if applicable):
WMCO built from https://github.com/openshift/windows-machine-config-operator/commit/1ca41c250ff937d1543559ba19e805a7473d45bf
OCP version 4.8.0-0.nightly-2021-05-06-210840

Steps:
1. Install the WMCO operator on OCP 4.8 and make sure the WMCO namespace is monitored by selecting the checkbox "Enable Operator recommended cluster monitoring on this Namespace".
2. Create a Windows machineset and scale up Windows nodes.
3. Check that `oc adm top nodes` reports the Windows nodes:

```
# oc get nodes -owide -l kubernetes.io/os=windows
NAME                                        STATUS   ROLES    AGE   VERSION                            INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION    CONTAINER-RUNTIME
ip-10-0-129-15.us-east-2.compute.internal   Ready    worker   10m   v1.21.0-rc.0.1190+e22a836a8b2659   10.0.129.15   <none>        Windows Server 2019 Datacenter   10.0.17763.1879   docker://20.10.0

# oc adm top nodes
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-129-15.us-east-2.compute.internal    1086m        72%    1593Mi          23%
ip-10-0-130-153.us-east-2.compute.internal   362m         24%    3962Mi          59%
ip-10-0-141-42.us-east-2.compute.internal    1063m        30%    8406Mi          57%
ip-10-0-171-168.us-east-2.compute.internal   709m         20%    6091Mi          41%
ip-10-0-177-52.us-east-2.compute.internal    84m          5%     1373Mi          20%
ip-10-0-203-106.us-east-2.compute.internal   464m         30%    4826Mi          72%
ip-10-0-219-57.us-east-2.compute.internal    849m         24%    7477Mi          51%

# oc get nodemetrics ip-10-0-129-15.us-east-2.compute.internal
NAME                                        CPU    MEMORY      WINDOW
ip-10-0-129-15.us-east-2.compute.internal   104m   1560580Ki   1m0s
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
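For reference, the `<unknown>` entries seen earlier follow directly from how `oc adm top node` works: it reads per-node NodeMetrics objects from the metrics.k8s.io API, and a node with no NodeMetrics object (the 404 NotFound shown in the trace) simply has nothing to print. A minimal sketch of that degradation is below; the `node_metrics` store and `top_row` helper are hypothetical stand-ins for illustration, not the oc implementation:

```python
# Sketch: how a top-node style table degrades to <unknown> when the
# metrics API has no NodeMetrics object for a node. Illustrative only --
# not the actual `oc adm top node` code.

# Hypothetical metrics store: node name -> (cpu, memory), standing in for
# GET /apis/metrics.k8s.io/v1beta1/nodes/<name>; absence models the 404.
node_metrics = {
    "ip-10-0-149-206.us-east-2.compute.internal": ("1014m", "6834Mi"),
    # the Windows node is absent: the API returns 404 NotFound for it
}

nodes = [
    "ip-10-0-149-206.us-east-2.compute.internal",
    "ip-10-0-146-241.us-east-2.compute.internal",  # Windows node
]

def top_row(node: str) -> tuple:
    """Return one table row; <unknown> columns when metrics are missing."""
    metrics = node_metrics.get(node)  # None models the 404 NotFound response
    if metrics is None:
        return (node, "<unknown>", "<unknown>")
    cpu, mem = metrics
    return (node, cpu, mem)

for row in (top_row(n) for n in nodes):
    print("{:45} {:12} {:12}".format(*row))
```

The fix tracked by this bug made the metrics pipeline produce NodeMetrics for Windows nodes as well, so the missing-object branch no longer triggers for them.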