Description of problem:
The command `oc adm top` isn't reporting any metrics on the Windows node.

# oc adm top nodes
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-135-193.us-east-2.compute.internal   179m         11%    2498Mi          37%
ip-10-0-146-98.us-east-2.compute.internal    587m         16%    6352Mi          43%
ip-10-0-165-247.us-east-2.compute.internal   210m         14%    2798Mi          42%
ip-10-0-174-255.us-east-2.compute.internal   722m         20%    6418Mi          43%
ip-10-0-203-0.us-east-2.compute.internal     432m         28%    2817Mi          42%
ip-10-0-208-133.us-east-2.compute.internal   662m         18%    6191Mi          42%
ip-10-0-136-210.us-east-2.compute.internal   <unknown>

Version-Release number of selected component (if applicable):
4.7

How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP on AWS
2. Configure WMCO
3. Add a Windows node to the existing nodes

Actual results:
No metrics are reported for the Windows node - its status is <unknown>.

Expected results:
The same metrics reporting as for the other (Linux) nodes.

Additional info:

# oc adm node-logs -u crio ip-10-0-136-210.us-east-2.compute.internal
Get-WinEvent : There is not an event provider on the localhost computer that matches "crio".
At line:1 char:1
+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (crio:String) [Get-WinEvent], Exception
    + FullyQualifiedErrorId : NoMatchingProvidersFound,Microsoft.PowerShell.Commands.GetWinEventCommand
Get-WinEvent : The specified providers do not write events to any of the specified logs.
At line:1 char:1
+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: ( [Get-WinEvent], Exception
    + FullyQualifiedErrorId : LogsAndProvidersDontOverlap,Microsoft.PowerShell.Commands.GetWinEventCommand
Get-WinEvent : The parameter is incorrect
At line:1 char:1
+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: ( [Get-WinEvent], EventLogException
    + FullyQualifiedErrorId : System.Diagnostics.Eventing.Reader.EventLogException,Microsoft.PowerShell.Commands.GetWinEventCommand

[root@osboxes windows-machine-config-operator]# oc adm node-logs -u kubelet ip-10-0-136-210.us-east-2.compute.internal
Get-WinEvent : There is not an event provider on the localhost computer that matches "kubelet".
At line:1 char:1
+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (kubelet:String) [Get-WinEvent], Exception
    + FullyQualifiedErrorId : NoMatchingProvidersFound,Microsoft.PowerShell.Commands.GetWinEventCommand
Get-WinEvent : The specified providers do not write events to any of the specified logs.
At line:1 char:1
+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: ( [Get-WinEvent], Exception
    + FullyQualifiedErrorId : LogsAndProvidersDontOverlap,Microsoft.PowerShell.Commands.GetWinEventCommand
Get-WinEvent : The parameter is incorrect
At line:1 char:1
+ Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName=' ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: ( [Get-WinEvent], EventLogException
    + FullyQualifiedErrorId : System.Diagnostics.Eventing.Reader.EventLogException,Microsoft.PowerShell.Commands.GetWinEventCommand
Raising the priority on this as it breaks HCA and HPA.
Merged the upstream fix for this issue against kube-prometheus: https://github.com/prometheus-operator/kube-prometheus/pull/1058

The fix will be picked up downstream with the PR open against the CMO repo: https://github.com/openshift/cluster-monitoring-operator/pull/1127
Checked with 4.8.0-0.nightly-2021-05-06-003426; `oc adm top` still reports <unknown> status for the Windows nodes.

# oc get no --show-labels | grep windows | awk '{print $1}'
ip-10-0-146-241.us-east-2.compute.internal
ip-10-0-158-141.us-east-2.compute.internal

# oc adm top node
W0506 05:43:09.237978   15140 top_node.go:119] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
NAME                                         CPU(cores)   CPU%        MEMORY(bytes)   MEMORY%
ip-10-0-149-206.us-east-2.compute.internal   1014m        28%         6834Mi          46%
ip-10-0-159-183.us-east-2.compute.internal   196m         5%          1434Mi          9%
ip-10-0-167-210.us-east-2.compute.internal   1025m        29%         5362Mi          36%
ip-10-0-171-40.us-east-2.compute.internal    924m         26%         7585Mi          52%
ip-10-0-213-181.us-east-2.compute.internal   761m         21%         5653Mi          38%
ip-10-0-218-55.us-east-2.compute.internal    361m         10%         4082Mi          27%
ip-10-0-146-241.us-east-2.compute.internal   <unknown>    <unknown>   <unknown>       <unknown>
ip-10-0-158-141.us-east-2.compute.internal   <unknown>    <unknown>   <unknown>       <unknown>

Checked that the fix is in the payload:
*******************************************
# docker pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e0645d0cc8ef475f9f9e2bda659886164d34e02395c2c39722ae12728f276b25
...
Digest: sha256:e0645d0cc8ef475f9f9e2bda659886164d34e02395c2c39722ae12728f276b25
Status: Downloaded newer image for quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e0645d0cc8ef475f9f9e2bda659886164d34e02395c2c39722ae12728f276b25

# docker images
REPOSITORY                                       TAG      IMAGE ID       CREATED        SIZE
quay.io/openshift-release-dev/ocp-v4.0-art-dev   <none>   269449901989   35 hours ago   306 MB

# docker inspect 269449901989 | grep "io.openshift.build.commit.url"
        "io.openshift.build.commit.url": "https://github.com/openshift/images/commit/bcab0f7337420343611546aae2634eaf0d36c33e",
        "io.openshift.build.commit.url": "https://github.com/openshift/cluster-monitoring-operator/commit/4d6bf3d9ed8187ed13854fce3d75d32a0525b1db",
*******************************************

# oc adm top node ip-10-0-146-241.us-east-2.compute.internal --loglevel=10
...
I0506 06:06:17.247607   15757 round_trippers.go:435] curl -k -v -XGET -H "Accept: application/json, */*" -H "User-Agent: oc/4.8.0 (linux/amd64) kubernetes/7cae9e8" 'https://api.juzhao-0506.qe.devcluster.openshift.com:6443/apis/metrics.k8s.io/v1beta1/nodes/ip-10-0-146-241.us-east-2.compute.internal'
I0506 06:06:17.294720   15757 round_trippers.go:454] GET https://api.juzhao-0506.qe.devcluster.openshift.com:6443/apis/metrics.k8s.io/v1beta1/nodes/ip-10-0-146-241.us-east-2.compute.internal 404 Not Found in 47 milliseconds
I0506 06:06:17.294752   15757 round_trippers.go:460] Response Headers:
I0506 06:06:17.294767   15757 round_trippers.go:463]     Audit-Id: 2858f3a6-72b5-47b5-aa19-d75b24087825
I0506 06:06:17.294772   15757 round_trippers.go:463]     Cache-Control: no-cache, private
I0506 06:06:17.294776   15757 round_trippers.go:463]     Cache-Control: no-cache, private
I0506 06:06:17.294780   15757 round_trippers.go:463]     Content-Type: application/json
I0506 06:06:17.294783   15757 round_trippers.go:463]     Date: Thu, 06 May 2021 10:28:27 GMT
I0506 06:06:17.294787   15757 round_trippers.go:463]     X-Kubernetes-Pf-Flowschema-Uid: e1d427ce-6ee5-4370-8f58-942550853b5d
I0506 06:06:17.294791   15757 round_trippers.go:463]     X-Kubernetes-Pf-Prioritylevel-Uid: 13dc42b2-5d4b-464d-afbc-5a8e1d88a047
I0506 06:06:17.294795   15757 round_trippers.go:463]     Content-Length: 306
I0506 06:06:17.294819   15757 request.go:1123] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"nodemetrics.metrics.k8s.io \"ip-10-0-146-241.us-east-2.compute.internal\" not found","reason":"NotFound","details":{"name":"ip-10-0-146-241.us-east-2.compute.internal","group":"metrics.k8s.io","kind":"nodemetrics"},"code":404}
I0506 06:06:17.295335   15757 helpers.go:216] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "nodemetrics.metrics.k8s.io \"ip-10-0-146-241.us-east-2.compute.internal\" not found",
  "reason": "NotFound",
  "details": {
    "name": "ip-10-0-146-241.us-east-2.compute.internal",
    "group": "metrics.k8s.io",
    "kind": "nodemetrics"
  },
  "code": 404
}]
F0506 06:06:17.295367   15757 helpers.go:115] Error from server (NotFound): nodemetrics.metrics.k8s.io "ip-10-0-146-241.us-east-2.compute.internal" not found
goroutine 1 [running]:
...

Windows node:
# oc get nodemetrics.metrics.k8s.io/ip-10-0-146-241.us-east-2.compute.internal
Error from server (NotFound): nodemetrics.metrics.k8s.io "ip-10-0-146-241.us-east-2.compute.internal" not found

CoreOS node:
# oc get nodemetrics.metrics.k8s.io/ip-10-0-149-206.us-east-2.compute.internal
NAME                                         CPU    MEMORY      WINDOW
ip-10-0-149-206.us-east-2.compute.internal   752m   7346348Ki   1m0s
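As an aside on reading the output above: the affected nodes can be pulled mechanically out of the `oc adm top node` listing, and the failure reason out of the 404 Status body. A minimal sketch, using sample data copied (and abridged) from this report; the `/tmp` paths are illustrative and `python3` on the workstation is an assumption:

```shell
# Rows copied (abridged) from the `oc adm top node` output above
cat <<'EOF' > /tmp/top.txt
ip-10-0-149-206.us-east-2.compute.internal 1014m 28% 6834Mi 46%
ip-10-0-146-241.us-east-2.compute.internal <unknown> <unknown> <unknown> <unknown>
EOF

# Print only the nodes whose metrics are unreported
awk '$2 == "<unknown>" {print $1}' /tmp/top.txt
# -> ip-10-0-146-241.us-east-2.compute.internal

# Abridged 404 Status body from the --loglevel=10 trace above
BODY='{"kind":"Status","apiVersion":"v1","status":"Failure","reason":"NotFound","code":404}'

# Extract the failure reason (jq would work equally well)
echo "$BODY" | python3 -c 'import json, sys; print(json.load(sys.stdin)["reason"])'
# -> NotFound
```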
@juzhao did you use the WMCO version from OperatorHub to test this? If yes, that does not have the necessary fixes on the WMCO side. You need to use the operator built from master. It will be easier for @sgao or @rrasouli to test and verify this. I hope one of you can pick this up off Junqi's plate.
(In reply to Aravindh Puthiyaparambil from comment #6)
> @juzhao did you use the WMCO version from OperatorHub to test this? If yes,
> that does not have the necessary fixes on the WMCO side. You need to use the
> operator built from master. It will be easier for @sgao or @rrasouli to test
> verify this. I hope one of you can pick this off Junqi's plate.

I did not use the WMCO version from OperatorHub; we have a Jenkins job that can add Windows nodes when building the cluster.
@aravindh @juzhao By default, the cluster installed by the QE Jenkins job did not monitor the WMCO namespace; I fixed that and it works now with monitoring enabled. This bug has been verified on OCP 4.8.0-0.nightly-2021-05-06-210840 and passed, thanks.

Version-Release number of selected component (if applicable):
WMCO built from https://github.com/openshift/windows-machine-config-operator/commit/1ca41c250ff937d1543559ba19e805a7473d45bf
OCP version 4.8.0-0.nightly-2021-05-06-210840

Steps:
1. Install the WMCO operator on OCP 4.8, making sure the WMCO namespace is monitored by selecting the checkbox "Enable Operator recommended cluster monitoring on this Namespace".
2. Create a Windows machineset and scale up Windows nodes.
3. Check that `oc adm top nodes` reports metrics for the Windows nodes.

# oc get nodes -owide -l kubernetes.io/os=windows
NAME                                        STATUS   ROLES    AGE   VERSION                            INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION    CONTAINER-RUNTIME
ip-10-0-129-15.us-east-2.compute.internal   Ready    worker   10m   v1.21.0-rc.0.1190+e22a836a8b2659   10.0.129.15   <none>        Windows Server 2019 Datacenter   10.0.17763.1879   docker://20.10.0

# oc adm top nodes
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-129-15.us-east-2.compute.internal    1086m        72%    1593Mi          23%
ip-10-0-130-153.us-east-2.compute.internal   362m         24%    3962Mi          59%
ip-10-0-141-42.us-east-2.compute.internal    1063m        30%    8406Mi          57%
ip-10-0-171-168.us-east-2.compute.internal   709m         20%    6091Mi          41%
ip-10-0-177-52.us-east-2.compute.internal    84m          5%     1373Mi          20%
ip-10-0-203-106.us-east-2.compute.internal   464m         30%    4826Mi          72%
ip-10-0-219-57.us-east-2.compute.internal    849m         24%    7477Mi          51%

# oc get nodemetrics ip-10-0-129-15.us-east-2.compute.internal
NAME                                        CPU    MEMORY      WINDOW
ip-10-0-129-15.us-east-2.compute.internal   104m   1560580Ki   1m0s
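The check in step 3 amounts to: every node labeled kubernetes.io/os=windows must appear in `oc adm top nodes` with concrete numbers. An offline sketch of that comparison using node names copied from above; on a live cluster the two lists would instead come from `oc get nodes -l kubernetes.io/os=windows` and the first column of `oc adm top nodes` (the `/tmp` file names are illustrative):

```shell
# Windows nodes, as listed above
printf '%s\n' \
  ip-10-0-129-15.us-east-2.compute.internal | sort > /tmp/windows_nodes.txt

# Nodes reporting metrics in `oc adm top nodes`, as listed above (abridged)
printf '%s\n' \
  ip-10-0-129-15.us-east-2.compute.internal \
  ip-10-0-130-153.us-east-2.compute.internal \
  ip-10-0-141-42.us-east-2.compute.internal | sort > /tmp/top_nodes.txt

# Windows nodes missing from the metrics listing; empty output means verified
comm -23 /tmp/windows_nodes.txt /tmp/top_nodes.txt
```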
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days