Bug 1898134 - Descheduler logs show absolute values instead of percentage when LowNodeUtilization strategy is applied
Summary: Descheduler logs show absolute values instead of percentage when LowNodeUtilization strategy is applied
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Jan Chaloupka
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-16 13:41 UTC by RamaKasturi
Modified: 2021-02-24 15:33 UTC
CC List: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:33:27 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github kubernetes-sigs descheduler pull 441 0 None closed LowNodeUtilization: show node usage in percentage as well 2021-01-12 05:13:45 UTC
Github openshift descheduler pull 46 0 None closed bug 1898134: Sync with upstream 2021-01-12 05:13:45 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:33:55 UTC

Description RamaKasturi 2020-11-16 13:41:25 UTC
Description of problem:
When the LowNodeUtilization strategy is applied, the descheduler logs show absolute values instead of percentages:
I1116 13:11:32.733207       1 node.go:46] "Node lister returned empty list, now fetch directly"
I1116 13:11:33.628776       1 lownodeutilization.go:257] "Node is appropriately utilized" node="ip-10-0-131-111.us-east-2.compute.internal" usage=map[cpu:2015m memory:6379Mi pods:47]
I1116 13:11:33.628811       1 lownodeutilization.go:254] "Node is overutilized" node="ip-10-0-138-187.us-east-2.compute.internal" usage=map[cpu:1418m memory:5745Mi pods:26]
I1116 13:11:33.628829       1 lownodeutilization.go:251] "Node is underutilized" node="ip-10-0-178-31.us-east-2.compute.internal" usage=map[cpu:625m memory:2645Mi pods:15]
I1116 13:11:33.628845       1 lownodeutilization.go:257] "Node is appropriately utilized" node="ip-10-0-188-98.us-east-2.compute.internal" usage=map[cpu:2040m memory:6269Mi pods:39]
I1116 13:11:33.628858       1 lownodeutilization.go:251] "Node is underutilized" node="ip-10-0-195-214.us-east-2.compute.internal" usage=map[cpu:1714m memory:4879Mi pods:21]
I1116 13:11:33.628872       1 lownodeutilization.go:254] "Node is overutilized" node="ip-10-0-201-202.us-east-2.compute.internal" usage=map[cpu:1368m memory:5798Mi pods:31]
I1116 13:11:33.628889       1 lownodeutilization.go:116] "Criteria for a node under utilization" CPU=50 Mem=50 Pods=20
I1116 13:11:33.628902       1 lownodeutilization.go:123] "Total number of underutilized nodes" totalNumber=2
I1116 13:11:33.628913       1 lownodeutilization.go:140] "Criteria for a node above target utilization" CPU=75 Mem=75 Pods=50
I1116 13:11:33.628922       1 lownodeutilization.go:143] "Number of nodes above target utilization" totalNumber=2
I1116 13:11:33.628941       1 lownodeutilization.go:293] "Total capacity to be moved" CPU=1411 Mem=8821616640 Pods=214
I1116 13:11:33.629101       1 lownodeutilization.go:304] "AllPods" %!s(int=31)="nonRemovablePods" %!s(int=18)="removablePods" %!s(int=13)="(MISSING)"
I1116 13:11:33.629113       1 lownodeutilization.go:311] "Evicting pods based on priority, if they have same priority, they'll be evicted based on QoS tiers"
I1116 13:11:33.658387       1 evictions.go:117] "Evicted pod" pod="openshift-kube-descheduler-operator/descheduler-operator-5749b85c57-8ws7r" reason=" (LowNodeUtilization)"
I1116 13:11:33.683626       1 evictions.go:117] "Evicted pod" pod="knarra/hello1-2-558xq" reason=" (LowNodeUtilization)"
I1116 13:11:33.745456       1 evictions.go:117] "Evicted pod" pod="knarra/hello1-2-f6vwc" reason=" (LowNodeUtilization)"
I1116 13:11:33.820340       1 evictions.go:117] "Evicted pod" pod="knarra/hello1-2-jfvgl" reason=" (LowNodeUtilization)"
I1116 13:11:33.853316       1 evictions.go:117] "Evicted pod" pod="knarra/hello1-2-kr26k" reason=" (LowNodeUtilization)"
I1116 13:11:33.853410       1 lownodeutilization.go:315] "Evicted pods from node" node="ip-10-0-201-202.us-east-2.compute.internal" evictedPods=5 usage=map[cpu:968m memory:4774Mi pods:26]
I1116 13:11:33.853544       1 lownodeutilization.go:304] "AllPods" %!s(int=26)="nonRemovablePods" %!s(int=18)="removablePods" %!s(int=8)="(MISSING)"
I1116 13:11:33.853559       1 lownodeutilization.go:311] "Evicting pods based on priority, if they have same priority, they'll be evicted based on QoS tiers"
I1116 13:11:33.900100       1 evictions.go:117] "Evicted pod" pod="knarra/hello1-2-4jf46" reason=" (LowNodeUtilization)"
I1116 13:11:33.955342       1 evictions.go:117] "Evicted pod" pod="knarra/hello1-2-hhfrt" reason=" (LowNodeUtilization)"
I1116 13:11:33.984306       1 evictions.go:117] "Evicted pod" pod="knarra/hello1-2-ksg9m" reason=" (LowNodeUtilization)"
I1116 13:11:34.025505       1 evictions.go:117] "Evicted pod" pod="knarra/hello1-2-lw9mp" reason=" (LowNodeUtilization)"
I1116 13:11:34.025634       1 lownodeutilization.go:315] "Evicted pods from node" node="ip-10-0-138-187.us-east-2.compute.internal" evictedPods=4 usage=map[cpu:1018m memory:4721Mi pods:22]
I1116 13:11:34.025650       1 lownodeutilization.go:153] "Total number of pods evicted" evictedPods=9


Version-Release number of selected component (if applicable):
[knarra@knarra openshift-client-linux-4.7.0-0.nightly-2020-11-10-023606]$ oc get csv -n openshift-kube-descheduler-operator
NAME                                                   DISPLAY                     VERSION                 REPLACES   PHASE
clusterkubedescheduleroperator.4.7.0-202011121509.p0   Kube Descheduler Operator   4.7.0-202011121509.p0              Succeeded
[knarra@knarra openshift-client-linux-4.7.0-0.nightly-2020-11-10-023606]$ ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-11-12-200927   True        False         7h19m   Cluster version is 4.7.0-0.nightly-2020-11-12-200927


How reproducible:
Always

Steps to Reproduce:
1. Install latest 4.7 cluster
2. Create policy.cfg as below (see the sketch after these steps for how the thresholds are interpreted)
[knarra@knarra openshift-client-linux-4.7.0-0.nightly-2020-11-10-023606]$ cat policy.cfg 
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds:
           "cpu" : 20
           "memory": 20
           "pods": 20
         targetThresholds:
           "cpu" : 50
           "memory": 50
           "pods": 50

3. Create the configmap by running the command below:
oc create configmap --from-file=policy.cfg descheduler-policy -n openshift-kube-descheduler-operator
4. Specify descheduler-policy in the kubedescheduler cluster object
5. Run the command oc logs -f <cluster>
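
For reference, a minimal sketch (not the descheduler's actual code) of how the LowNodeUtilization strategy interprets the percentage thresholds in policy.cfg above, assuming the upstream semantics: a node is underutilized only when EVERY resource falls below "thresholds", and overutilized when ANY resource exceeds "targetThresholds".

```go
package main

import "fmt"

// percentages maps a resource name to its usage as a percent (0-100).
type percentages map[string]float64

func classify(usage, thresholds, targetThresholds percentages) string {
	underutilized := true
	for res, limit := range thresholds {
		if usage[res] >= limit {
			underutilized = false // at least one resource at/over the low mark
		}
	}
	overutilized := false
	for res, limit := range targetThresholds {
		if usage[res] > limit {
			overutilized = true // at least one resource over the target
		}
	}
	switch {
	case underutilized:
		return "Node is underutilized"
	case overutilized:
		return "Node is overutilized"
	default:
		return "Node is appropriately utilized"
	}
}

func main() {
	thresholds := percentages{"cpu": 20, "memory": 20, "pods": 20}
	targets := percentages{"cpu": 50, "memory": 50, "pods": 50}
	// 75% CPU exceeds the 50% target, so the node is overutilized.
	fmt.Println(classify(percentages{"cpu": 75, "memory": 45, "pods": 17}, thresholds, targets))
}
```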

Actual results:
The following results are shown in the logs:
I1116 13:13:34.742237       1 node.go:46] "Node lister returned empty list, now fetch directly"
I1116 13:13:35.328287       1 lownodeutilization.go:257] "Node is appropriately utilized" node="ip-10-0-131-111.us-east-2.compute.internal" usage=map[cpu:2015m memory:6379Mi pods:47]
I1116 13:13:35.328319       1 lownodeutilization.go:257] "Node is appropriately utilized" node="ip-10-0-138-187.us-east-2.compute.internal" usage=map[cpu:1018m memory:4721Mi pods:22]
I1116 13:13:35.328334       1 lownodeutilization.go:254] "Node is overutilized" node="ip-10-0-178-31.us-east-2.compute.internal" usage=map[cpu:1205m memory:3837Mi pods:20]
I1116 13:13:35.328349       1 lownodeutilization.go:257] "Node is appropriately utilized" node="ip-10-0-188-98.us-east-2.compute.internal" usage=map[cpu:2040m memory:6269Mi pods:39]
I1116 13:13:35.328362       1 lownodeutilization.go:251] "Node is underutilized" node="ip-10-0-195-214.us-east-2.compute.internal" usage=map[cpu:1714m memory:4879Mi pods:21]
I1116 13:13:35.328374       1 lownodeutilization.go:254] "Node is overutilized" node="ip-10-0-201-202.us-east-2.compute.internal" usage=map[cpu:1068m memory:5030Mi pods:27]
I1116 13:13:35.328391       1 lownodeutilization.go:116] "Criteria for a node under utilization" CPU=50 Mem=50 Pods=20
I1116 13:13:35.328403       1 lownodeutilization.go:123] "Total number of underutilized nodes" totalNumber=1
I1116 13:13:35.328417       1 lownodeutilization.go:140] "Criteria for a node above target utilization" CPU=75 Mem=75 Pods=50
I1116 13:13:35.328428       1 lownodeutilization.go:143] "Number of nodes above target utilization" totalNumber=2
I1116 13:13:35.328448       1 lownodeutilization.go:293] "Total capacity to be moved" CPU=911 Mem=6377252864 Pods=104
I1116 13:13:35.328596       1 lownodeutilization.go:304] "AllPods" %!s(int=27)="nonRemovablePods" %!s(int=18)="removablePods" %!s(int=9)="(MISSING)"
I1116 13:13:35.328609       1 lownodeutilization.go:311] "Evicting pods based on priority, if they have same priority, they'll be evicted based on QoS tiers"
I1116 13:13:35.435582       1 evictions.go:117] "Evicted pod" pod="openshift-console/downloads-76b785c674-wqzg8" reason=" (LowNodeUtilization)"
I1116 13:13:35.435729       1 lownodeutilization.go:315] "Evicted pods from node" node="ip-10-0-201-202.us-east-2.compute.internal" evictedPods=1 usage=map[cpu:1058m memory:4980Mi pods:26]
I1116 13:13:35.435870       1 lownodeutilization.go:304] "AllPods" %!s(int=20)="nonRemovablePods" %!s(int=12)="removablePods" %!s(int=8)="(MISSING)"
I1116 13:13:35.435882       1 lownodeutilization.go:311] "Evicting pods based on priority, if they have same priority, they'll be evicted based on QoS tiers"
I1116 13:13:35.435937       1 lownodeutilization.go:315] "Evicted pods from node" node="ip-10-0-178-31.us-east-2.compute.internal" evictedPods=0 usage=map[cpu:1205m memory:3837Mi pods:20]
I1116 13:13:35.435947       1 lownodeutilization.go:153] "Total number of pods evicted" evictedPods=1


Expected results:
Descheduler logs should show percentages instead of absolute values when the LowNodeUtilization strategy is applied.

Additional info:
The previous version showed percentages instead of absolute values:

I0205 13:42:01.102467       1 lownodeutilization.go:145] Node "ip-10-0-162-253.us-east-2.compute.internal" is over utilized with usage: api.ResourceThresholds{"cpu":88.8, "memory":43.1717846880164, "pods":8.8}
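
For reference, a hedged sketch of the conversion this percentage output implies: divide each resource's absolute usage by the node's allocatable capacity and scale to 0-100. The allocatable figures below are hypothetical, chosen only so the output reproduces the 88.8 (cpu) and 8.8 (pods) values in the log line above.

```go
package main

import "fmt"

// usagePercentage converts absolute per-resource usage into a percentage
// of the node's allocatable capacity, which is the kind of value the fix
// (upstream PR 441) logs alongside the raw numbers.
func usagePercentage(usage, allocatable map[string]float64) map[string]float64 {
	pct := make(map[string]float64, len(usage))
	for res, used := range usage {
		if capacity := allocatable[res]; capacity > 0 {
			pct[res] = used / capacity * 100
		}
	}
	return pct
}

func main() {
	// Hypothetical node: 3552m of 4000m CPU used, 22 of 250 pods scheduled.
	usage := map[string]float64{"cpu": 3552, "pods": 22}
	allocatable := map[string]float64{"cpu": 4000, "pods": 250}
	pct := usagePercentage(usage, allocatable)
	fmt.Printf("cpu=%.1f%% pods=%.1f%%\n", pct["cpu"], pct["pods"]) // cpu=88.8% pods=8.8%
}
```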

Comment 2 RamaKasturi 2020-11-24 13:55:19 UTC
Tried verifying the bug here, and I see that the values currently shown in the cluster pod log cannot be mapped back to what is present in the params. So moving back.

For example, if the log says CPU is overutilized with 0.5, how do we map that back to a param with the cpu targetThreshold set to 50? We would need to get back to the old way, where the log says "overutilized with 54" when the cpu targetThreshold is set to 50.

Based on the above, moving the bug back to the ASSIGNED state.
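
A worked illustration of the off-by-100 mismatch described above (the allocatable figure is hypothetical): the Nov 24 build logged the raw usage/allocatable fraction, while the policy thresholds are expressed on a 0-100 scale, so the logged value has to be multiplied by 100 before it can be compared to a targetThreshold of 50.

```go
package main

import "fmt"

func main() {
	// Hypothetical node: 1714m CPU used out of 3428m allocatable.
	fraction := 1714.0 / 3428.0 // 0.5 -- the raw fraction the Nov 24 build logged
	percent := fraction * 100   // 50  -- directly comparable to targetThresholds "cpu": 50
	fmt.Println(fraction, percent)
}
```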

Comment 3 Jan Chaloupka 2021-01-08 13:56:46 UTC
The latest master head in sigs.k8s.io/descheduler:
```
$ go test -v sigs.k8s.io/descheduler/pkg/descheduler/strategies -run TestLowNodeUtilization/without_priorities$
=== RUN   TestLowNodeUtilization
=== RUN   TestLowNodeUtilization/without_priorities
I0108 14:54:36.868024 1823919 lownodeutilization.go:271] "Node is overutilized" node="n1" usage=map[cpu:3200m memory:0 pods:8] usagePercentage=map[cpu:100 memory:0 pods:88.88888888888889]
I0108 14:54:36.879223 1823919 lownodeutilization.go:268] "Node is underutilized" node="n2" usage=map[cpu:0 memory:0 pods:0] usagePercentage=map[cpu:0 memory:0 pods:0]
I0108 14:54:36.879299 1823919 lownodeutilization.go:274] "Node is appropriately utilized" node="n3" usage=map[cpu:0 memory:0 pods:0] usagePercentage=map[cpu:0 memory:0 pods:0]
--- PASS: TestLowNodeUtilization (0.01s)
    --- PASS: TestLowNodeUtilization/without_priorities (0.01s)
PASS
ok  	sigs.k8s.io/descheduler/pkg/descheduler/strategies	0.040s
```

Rama, can you share your output for comparison and point out which number(s) is/are incorrectly printed?

Comment 4 RamaKasturi 2021-01-12 13:47:35 UTC
@jan, the bug was moved to FAILED_QA on Nov 24th and then back to the ASSIGNED state; now the output looks fine, though I am not sure whether the rebase fixed the issue or something else did. Below is what I see in the logs now, with percentages shown alongside the absolute values.

I0112 13:40:38.978218       1 lownodeutilization.go:271] "Node is overutilized" node="ip-10-0-130-98.us-east-2.compute.internal" usage=map[cpu:2019m memory:6600Mi pods:42] usagePercentage=map[cpu:75 memory:45.19219118736223 pods:16.8]
I0112 13:40:38.978262       1 lownodeutilization.go:271] "Node is overutilized" node="ip-10-0-158-14.us-east-2.compute.internal" usage=map[cpu:854m memory:4279Mi pods:23] usagePercentage=map[cpu:50 memory:64.59145942012834 pods:9.2]
I0112 13:40:38.978284       1 lownodeutilization.go:271] "Node is overutilized" node="ip-10-0-165-204.us-east-2.compute.internal" usage=map[cpu:2086m memory:6760Mi pods:48] usagePercentage=map[cpu:75 memory:46.28775945857102 pods:19.2]
I0112 13:40:38.978305       1 lownodeutilization.go:271] "Node is overutilized" node="ip-10-0-174-151.us-east-2.compute.internal" usage=map[cpu:754m memory:4494Mi pods:23] usagePercentage=map[cpu:50 memory:67.83672213079424 pods:9.2]
I0112 13:40:38.978328       1 lownodeutilization.go:274] "Node is appropriately utilized" node="ip-10-0-206-249.us-east-2.compute.internal" usage=map[cpu:1729m memory:5100Mi pods:22] usagePercentage=map[cpu:50 memory:34.92123864477991 pods:8.8]
I0112 13:40:38.978365       1 lownodeutilization.go:274] "Node is appropriately utilized" node="ip-10-0-208-166.us-east-2.compute.internal" usage=map[cpu:555m memory:2386Mi pods:18] usagePercentage=map[cpu:50 memory:36.01664458434826 pods:7.2]
I0112 13:40:38.978385       1 lownodeutilization.go:116] "Criteria for a node under utilization" CPU=20 Mem=20 Pods=20
I0112 13:40:38.978396       1 lownodeutilization.go:120] "No node is underutilized, nothing to do here, you might tune your thresholds further"

For comparison, below is what I saw in the logs on Nov 24th when I moved the bug back; hope that helps:
================================================================================================
I1124 13:24:59.039858       1 lownodeutilization.go:274] "Node is appropriately utilized" node="ip-10-0-137-48.us-east-2.compute.internal" usage=map[cpu:534m memory:2140Mi pods:15] usagePercentage=map[cpu:0.5 memory:0.32667508637355114 pods:0.06]
I1124 13:24:59.039893       1 lownodeutilization.go:274] "Node is appropriately utilized" node="ip-10-0-149-201.us-east-2.compute.internal" usage=map[cpu:1714m memory:4879Mi pods:21] usagePercentage=map[cpu:0.5 memory:0.33773087071240104 pods:0.084]
I1124 13:24:59.039912       1 lownodeutilization.go:271] "Node is overutilized" node="ip-10-0-180-241.us-east-2.compute.internal" usage=map[cpu:2014m memory:6409Mi pods:44] usagePercentage=map[cpu:0.75 memory:0.4385396480892954 pods:0.176]
I1124 13:24:59.039930       1 lownodeutilization.go:271] "Node is overutilized" node="ip-10-0-181-129.us-east-2.compute.internal" usage=map[cpu:750m memory:3853Mi pods:22] usagePercentage=map[cpu:0.5 memory:0.5807213565676511 pods:0.088]
I1124 13:24:59.039948       1 lownodeutilization.go:271] "Node is overutilized" node="ip-10-0-198-240.us-east-2.compute.internal" usage=map[cpu:837m memory:4583Mi pods:23] usagePercentage=map[cpu:0.5 memory:0.6907480521254411 pods:0.092]
I1124 13:24:59.039970       1 lownodeutilization.go:271] "Node is overutilized" node="ip-10-0-218-137.us-east-2.compute.internal" usage=map[cpu:2051m memory:6289Mi pods:43] usagePercentage=map[cpu:0.75 memory:0.4353324756387148 pods:0.172]
I1124 13:24:59.039987       1 lownodeutilization.go:116] "Criteria for a node under utilization" CPU=20 Mem=20 Pods=20
I1124 13:24:59.039995       1 lownodeutilization.go:120] "No node is underutilized, nothing to do here, you might tune your thresholds further"

Comment 6 RamaKasturi 2021-01-13 13:34:38 UTC
As per comment 4, moving the bug to the VERIFIED state. Below are the CSV and payload used to verify the bug.

[knarra@knarra openshift-client-linux-4.7.0-0.nightly-2021-01-10-070949]$ ./oc get csv -n openshift-kube-descheduler-operator
NAME                                                   DISPLAY                     VERSION                 REPLACES   PHASE
clusterkubedescheduleroperator.4.7.0-202101090212.p0   Kube Descheduler Operator   4.7.0-202101090212.p0              Succeeded

[knarra@knarra openshift-client-linux-4.7.0-0.nightly-2021-01-10-070949]$ ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-10-070949   True        False         2d4h    Cluster version is 4.7.0-0.nightly-2021-01-10-070949

Below logs are seen in descheduler logs:
=======================================
I0113 13:05:00.160538       1 lownodeutilization.go:274] "Node is appropriately utilized" node="dyan47-gfkth-master-0" usage=map[cpu:2038m memory:6453Mi pods:45] usagePercentage=map[cpu:37.5 memory:20.797755410385307 pods:18]
I0113 13:05:00.160589       1 lownodeutilization.go:274] "Node is appropriately utilized" node="dyan47-gfkth-master-1" usage=map[cpu:1719m memory:5000Mi pods:22] usagePercentage=map[cpu:25 memory:16.11479576195979 pods:8.8]
I0113 13:05:00.160608       1 lownodeutilization.go:274] "Node is appropriately utilized" node="dyan47-gfkth-master-2" usage=map[cpu:1989m memory:6474Mi pods:45] usagePercentage=map[cpu:25 memory:20.865437552585536 pods:18]
I0113 13:05:00.160625       1 lownodeutilization.go:274] "Node is appropriately utilized" node="dyan47-gfkth-worker-centralus1-78sdd" usage=map[cpu:809m memory:4199Mi pods:32] usagePercentage=map[cpu:25 memory:28.18073522256338 pods:12.8]
I0113 13:05:00.160643       1 lownodeutilization.go:274] "Node is appropriately utilized" node="dyan47-gfkth-worker-centralus2-4jsvl" usage=map[cpu:795m memory:5538Mi pods:36] usagePercentage=map[cpu:25 memory:37.167161624804955 pods:14.4]
I0113 13:05:00.160660       1 lownodeutilization.go:268] "Node is underutilized" node="dyan47-gfkth-worker-centralus3-lb8mj" usage=map[cpu:535m memory:2286Mi pods:26] usagePercentage=map[cpu:25 memory:15.3420244626768 pods:10.4]

Comment 9 errata-xmlrpc 2021-02-24 15:33:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

