Bug 1920903
| Summary: | oc adm top reporting unknown status for Windows node | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Ronnie Rasouli <rrasouli> |
| Component: | Monitoring | Assignee: | Sergiusz Urbaniak <surbania> |
| Status: | CLOSED ERRATA | QA Contact: | Ronnie Rasouli <rrasouli> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.7 | CC: | alegrand, anpicker, aos-bugs, aravindh, erooth, jfajersk, juzhao, kakkoyun, lcosic, mankulka, obulatov, pkrupa, sgao, spasquie, vhire |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1952149 (view as bug list) | Environment: | |
| Last Closed: | 2021-07-27 22:36:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1952149 | | |
Description

Ronnie Rasouli 2021-01-27 08:51:03 UTC

Raising the priority on this as it breaks HCA and HPA.

The upstream fix for this issue has been merged against kube-prometheus:
https://github.com/prometheus-operator/kube-prometheus/pull/1058

The fix will be picked up downstream with the PR open against the CMO repo:
https://github.com/openshift/cluster-monitoring-operator/pull/1127

Checked with 4.8.0-0.nightly-2021-05-06-003426, `oc adm top` still reports unknown status for the Windows nodes:

```
# oc get no --show-labels | grep windows | awk '{print $1}'
ip-10-0-146-241.us-east-2.compute.internal
ip-10-0-158-141.us-east-2.compute.internal

# oc adm top node
W0506 05:43:09.237978   15140 top_node.go:119] Using json format to get metrics. Next release will switch to protocol-buffers, switch early by passing --use-protocol-buffers flag
NAME                                         CPU(cores)   CPU%        MEMORY(bytes)   MEMORY%
ip-10-0-149-206.us-east-2.compute.internal   1014m        28%         6834Mi          46%
ip-10-0-159-183.us-east-2.compute.internal   196m         5%          1434Mi          9%
ip-10-0-167-210.us-east-2.compute.internal   1025m        29%         5362Mi          36%
ip-10-0-171-40.us-east-2.compute.internal    924m         26%         7585Mi          52%
ip-10-0-213-181.us-east-2.compute.internal   761m         21%         5653Mi          38%
ip-10-0-218-55.us-east-2.compute.internal    361m         10%         4082Mi          27%
ip-10-0-146-241.us-east-2.compute.internal   <unknown>    <unknown>   <unknown>       <unknown>
ip-10-0-158-141.us-east-2.compute.internal   <unknown>    <unknown>   <unknown>       <unknown>
```

Checked that the fix is in the payload:

```
# docker pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e0645d0cc8ef475f9f9e2bda659886164d34e02395c2c39722ae12728f276b25
...
```
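As a side note on the units in the `oc adm top` output above: CPU is reported as millicores ("1014m") and memory in binary suffixes ("6834Mi"), and the percentage columns divide usage by the node's allocatable capacity. A minimal sketch of that conversion follows; it is illustrative only (the suffix table and the 3500m allocatable figure are assumptions, not the actual oc/apimachinery quantity parser):

```python
# Sketch: decode the Kubernetes quantity strings seen in `oc adm top` output
# (e.g. "1014m" CPU, "6834Mi" memory) into base units. Simplified for
# illustration -- not the actual oc/apimachinery implementation.

SUFFIXES = {
    "m": 0.001,        # millicores
    "Ki": 1024,        # binary memory suffixes
    "Mi": 1024 ** 2,
    "Gi": 1024 ** 3,
}

def parse_quantity(q: str) -> float:
    """Return the value of a quantity string in base units (cores or bytes)."""
    for suffix, factor in SUFFIXES.items():
        if q.endswith(suffix):
            return float(q[: -len(suffix)]) * factor
    return float(q)

def usage_percent(used: str, allocatable: str) -> int:
    """Percentage as shown in the CPU%/MEMORY% columns (truncated)."""
    return int(parse_quantity(used) / parse_quantity(allocatable) * 100)

# A node using 1014m of a hypothetical 3500m allocatable sits at 28%:
print(usage_percent("1014m", "3500m"))  # -> 28
```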
```
Digest: sha256:e0645d0cc8ef475f9f9e2bda659886164d34e02395c2c39722ae12728f276b25
Status: Downloaded newer image for quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e0645d0cc8ef475f9f9e2bda659886164d34e02395c2c39722ae12728f276b25

# docker images
REPOSITORY                                       TAG      IMAGE ID       CREATED        SIZE
quay.io/openshift-release-dev/ocp-v4.0-art-dev   <none>   269449901989   35 hours ago   306 MB

# docker inspect 269449901989 | grep "io.openshift.build.commit.url"
"io.openshift.build.commit.url": "https://github.com/openshift/images/commit/bcab0f7337420343611546aae2634eaf0d36c33e",
"io.openshift.build.commit.url": "https://github.com/openshift/cluster-monitoring-operator/commit/4d6bf3d9ed8187ed13854fce3d75d32a0525b1db",
```

```
# oc adm top node ip-10-0-146-241.us-east-2.compute.internal --loglevel=10
...
I0506 06:06:17.247607   15757 round_trippers.go:435] curl -k -v -XGET -H "Accept: application/json, */*" -H "User-Agent: oc/4.8.0 (linux/amd64) kubernetes/7cae9e8" 'https://api.juzhao-0506.qe.devcluster.openshift.com:6443/apis/metrics.k8s.io/v1beta1/nodes/ip-10-0-146-241.us-east-2.compute.internal'
I0506 06:06:17.294720   15757 round_trippers.go:454] GET https://api.juzhao-0506.qe.devcluster.openshift.com:6443/apis/metrics.k8s.io/v1beta1/nodes/ip-10-0-146-241.us-east-2.compute.internal 404 Not Found in 47 milliseconds
I0506 06:06:17.294752   15757 round_trippers.go:460] Response Headers:
I0506 06:06:17.294767   15757 round_trippers.go:463]     Audit-Id: 2858f3a6-72b5-47b5-aa19-d75b24087825
I0506 06:06:17.294772   15757 round_trippers.go:463]     Cache-Control: no-cache, private
I0506 06:06:17.294776   15757 round_trippers.go:463]     Cache-Control: no-cache, private
I0506 06:06:17.294780   15757 round_trippers.go:463]     Content-Type: application/json
I0506 06:06:17.294783   15757 round_trippers.go:463]     Date: Thu, 06 May 2021 10:28:27 GMT
I0506 06:06:17.294787   15757 round_trippers.go:463]     X-Kubernetes-Pf-Flowschema-Uid: e1d427ce-6ee5-4370-8f58-942550853b5d
I0506 06:06:17.294791   15757 round_trippers.go:463]     X-Kubernetes-Pf-Prioritylevel-Uid: 13dc42b2-5d4b-464d-afbc-5a8e1d88a047
I0506 06:06:17.294795   15757 round_trippers.go:463]     Content-Length: 306
I0506 06:06:17.294819   15757 request.go:1123] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"nodemetrics.metrics.k8s.io \"ip-10-0-146-241.us-east-2.compute.internal\" not found","reason":"NotFound","details":{"name":"ip-10-0-146-241.us-east-2.compute.internal","group":"metrics.k8s.io","kind":"nodemetrics"},"code":404}
I0506 06:06:17.295335   15757 helpers.go:216] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "nodemetrics.metrics.k8s.io \"ip-10-0-146-241.us-east-2.compute.internal\" not found",
  "reason": "NotFound",
  "details": {
    "name": "ip-10-0-146-241.us-east-2.compute.internal",
    "group": "metrics.k8s.io",
    "kind": "nodemetrics"
  },
  "code": 404
}]
F0506 06:06:17.295367   15757 helpers.go:115] Error from server (NotFound): nodemetrics.metrics.k8s.io "ip-10-0-146-241.us-east-2.compute.internal" not found
goroutine 1 [running]:
...
```

Windows node:

```
# oc get nodemetrics.metrics.k8s.io/ip-10-0-146-241.us-east-2.compute.internal
Error from server (NotFound): nodemetrics.metrics.k8s.io "ip-10-0-146-241.us-east-2.compute.internal" not found
```

CoreOS node:

```
# oc get nodemetrics.metrics.k8s.io/ip-10-0-149-206.us-east-2.compute.internal
NAME                                         CPU    MEMORY      WINDOW
ip-10-0-149-206.us-east-2.compute.internal   752m   7346348Ki   1m0s
```

Aravindh Puthiyaparambil:

@juzhao did you use the WMCO version from OperatorHub to test this? If yes, that does not have the necessary fixes on the WMCO side. You need to use the operator built from master. It will be easier for @sgao or @rrasouli to test and verify this. I hope one of you can pick this off Junqi's plate.

(In reply to Aravindh Puthiyaparambil from comment #6)
> @juzhao did you use the WMCO version from OperatorHub to test
> this? If yes, that does not have the necessary fixes on the WMCO side. You
> need to use the operator built from master.
> It will be easier for
> @sgao or @rrasouli to test and verify this. I hope one of
> you can pick this off Junqi's plate.

I did not use the WMCO version from OperatorHub; we have a Jenkins job which can add Windows nodes when building the cluster.

@aravindh @juzhao By default, clusters installed by the QE Jenkins job did not monitor the WMCO namespace. I fixed that, and it now works with monitoring enabled. This bug has been verified on OCP 4.8.0-0.nightly-2021-05-06-210840 and passed, thanks.

Version-Release number of selected component (if applicable):
WMCO built from https://github.com/openshift/windows-machine-config-operator/commit/1ca41c250ff937d1543559ba19e805a7473d45bf
OCP version 4.8.0-0.nightly-2021-05-06-210840

Steps:
1. Install the WMCO operator on OCP 4.8 and make sure the WMCO namespace is monitored by selecting the checkbox "Enable Operator recommended cluster monitoring on this Namespace".
2. Create a Windows machineset and scale up Windows nodes.
3. Check that `oc adm top nodes` reports the Windows nodes:

```
# oc get nodes -owide -l kubernetes.io/os=windows
NAME                                        STATUS   ROLES    AGE   VERSION                            INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION    CONTAINER-RUNTIME
ip-10-0-129-15.us-east-2.compute.internal   Ready    worker   10m   v1.21.0-rc.0.1190+e22a836a8b2659   10.0.129.15   <none>        Windows Server 2019 Datacenter   10.0.17763.1879   docker://20.10.0

# oc adm top nodes
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-129-15.us-east-2.compute.internal    1086m        72%    1593Mi          23%
ip-10-0-130-153.us-east-2.compute.internal   362m         24%    3962Mi          59%
ip-10-0-141-42.us-east-2.compute.internal    1063m        30%    8406Mi          57%
ip-10-0-171-168.us-east-2.compute.internal   709m         20%    6091Mi          41%
ip-10-0-177-52.us-east-2.compute.internal    84m          5%     1373Mi          20%
ip-10-0-203-106.us-east-2.compute.internal   464m         30%    4826Mi          72%
ip-10-0-219-57.us-east-2.compute.internal    849m         24%    7477Mi          51%

# oc get nodemetrics ip-10-0-129-15.us-east-2.compute.internal
NAME                                        CPU    MEMORY      WINDOW
ip-10-0-129-15.us-east-2.compute.internal   104m   1560580Ki   1m0s
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
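For reference, the `<unknown>` entries seen earlier follow directly from how `oc adm top node` works: it reads per-node NodeMetrics objects from the metrics.k8s.io API, and a node with no NodeMetrics object (the 404 NotFound shown in the trace) simply has nothing to print. A minimal sketch of that degradation is below; the `node_metrics` store and `top_row` helper are hypothetical stand-ins for illustration, not the oc implementation:

```python
# Sketch: how a top-node style table degrades to <unknown> when the
# metrics API has no NodeMetrics object for a node. Illustrative only --
# not the actual `oc adm top node` code.

# Hypothetical metrics store: node name -> (cpu, memory), standing in for
# GET /apis/metrics.k8s.io/v1beta1/nodes/<name>; absence models the 404.
node_metrics = {
    "ip-10-0-149-206.us-east-2.compute.internal": ("1014m", "6834Mi"),
    # the Windows node is absent: the API returns 404 NotFound for it
}

nodes = [
    "ip-10-0-149-206.us-east-2.compute.internal",
    "ip-10-0-146-241.us-east-2.compute.internal",  # Windows node
]

def top_row(node: str) -> tuple:
    """Return one table row; <unknown> columns when metrics are missing."""
    metrics = node_metrics.get(node)  # None models the 404 NotFound response
    if metrics is None:
        return (node, "<unknown>", "<unknown>")
    cpu, mem = metrics
    return (node, cpu, mem)

for row in (top_row(n) for n in nodes):
    print("{:45} {:12} {:12}".format(*row))
```

The fix tracked by this bug made the metrics pipeline produce NodeMetrics for Windows nodes as well, so the missing-object branch no longer triggers for them.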