Bug 1579200

Summary: [Pod_public_851] Descheduler pod should be a critical pod to make itself not be evicted
Product: OpenShift Container Platform
Reporter: weiwei jiang <wjiang>
Component: Installer
Assignee: Vadim Rutkovsky <vrutkovs>
Status: CLOSED ERRATA
QA Contact: weiwei jiang <wjiang>
Severity: medium
Priority: medium
Version: 3.10.0
CC: aos-bugs, jokerman, mmccomas
Target Milestone: ---   
Target Release: 3.10.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2018-07-30 19:15:42 UTC
Type: Bug

Description weiwei jiang 2018-05-17 07:14:25 UTC
Description of problem:
After setting up a cluster with the descheduler enabled, the descheduler cronjob evicts its own pod, because that pod is not on the eviction blacklist.

Version-Release number of selected component (if applicable):
# oc version
oc v3.10.0-0.47.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-14-232.ec2.internal:8443
openshift v3.10.0-0.47.0
kubernetes v1.10.0+b81c8f8


How reproducible:
Always

Steps to Reproduce:
1. Set up OpenShift with the following inventory parameters (see the policy sketch after these steps):
  openshift_descheduler_install: true
  openshift_descheduler_image_prefix: xxxx:443/openshift3/ose-
  openshift_descheduler_image_version: v3.10
  openshift_descheduler_strategies_dict: "{'remove_duplicates': True, 'remove_pods_violating_inter_pod_anti_affinity': True, 'low_node_utilization': True}"
  openshift_descheduler_cronjob_node_selector: ""
  openshift_descheduler_cronjob_schedule: "*/5 * * * *"
2. Create some pods so that the low_node_utilization strategy has work to do
3. Check the descheduler log
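
(For context, a minimal sketch of the descheduler policy that the three strategy
switches above roughly map to, assuming the upstream kubernetes-incubator
descheduler v1alpha1 policy format; the LowNodeUtilization threshold numbers are
illustrative placeholders, not the values the installer role actually renders.)

# Illustrative policy sketch only -- not taken from the cluster
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true
  "RemovePodsViolatingInterPodAntiAffinity":
    enabled: true
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:        # a node below all of these counts as under utilized
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:  # a node above any of these is an eviction candidate
          "cpu": 50
          "memory": 50
          "pods": 50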

Actual results:
# oc logs -f descheduler-cronjob-1526540100-tjkdd
I0517 06:55:10.340245       1 reflector.go:202] Starting reflector *v1.Node (1h0m0s) from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:84                                                                                                                                                       
I0517 06:55:10.340368       1 reflector.go:240] Listing and watching *v1.Node from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:84
I0517 06:55:10.440492       1 duplicates.go:50] Processing node: "ip-172-18-4-140.ec2.internal"
I0517 06:55:10.496043       1 duplicates.go:54] "ReplicationController/hello-1"
I0517 06:55:10.496074       1 duplicates.go:65] Evicted pod: "hello-1-4q9zs" (<nil>)
I0517 06:55:10.496080       1 duplicates.go:65] Evicted pod: "hello-1-54xhw" (<nil>)
I0517 06:55:10.496085       1 duplicates.go:65] Evicted pod: "hello-1-5xv2m" (<nil>)
I0517 06:55:10.496089       1 duplicates.go:65] Evicted pod: "hello-1-8f85b" (<nil>)
I0517 06:55:10.496094       1 duplicates.go:65] Evicted pod: "hello-1-94bdz" (<nil>)
I0517 06:55:10.496098       1 duplicates.go:65] Evicted pod: "hello-1-bl7k7" (<nil>)
I0517 06:55:10.496104       1 duplicates.go:65] Evicted pod: "hello-1-n5htm" (<nil>)
I0517 06:55:10.496108       1 duplicates.go:65] Evicted pod: "hello-1-rr28r" (<nil>)
I0517 06:55:10.496113       1 duplicates.go:65] Evicted pod: "hello-1-vkvfw" (<nil>)
I0517 06:55:10.496117       1 duplicates.go:65] Evicted pod: "hello-1-vvvt9" (<nil>)
I0517 06:55:10.496121       1 duplicates.go:65] Evicted pod: "hello-1-x5rk8" (<nil>)
I0517 06:55:10.496126       1 duplicates.go:65] Evicted pod: "hello-1-xq4kj" (<nil>)
I0517 06:55:10.496130       1 duplicates.go:50] Processing node: "ip-172-18-14-232.ec2.internal"
I0517 06:55:10.504325       1 duplicates.go:54] "ReplicationController/hello-1"
I0517 06:55:10.504353       1 duplicates.go:65] Evicted pod: "hello-1-hswrv" (<nil>)
I0517 06:55:10.504360       1 duplicates.go:65] Evicted pod: "hello-1-jtw9g" (<nil>)
I0517 06:55:10.504365       1 duplicates.go:65] Evicted pod: "hello-1-kbznn" (<nil>)
I0517 06:55:10.504369       1 duplicates.go:65] Evicted pod: "hello-1-kn5m2" (<nil>)
I0517 06:55:10.504374       1 duplicates.go:65] Evicted pod: "hello-1-m67q9" (<nil>)
I0517 06:55:10.504378       1 duplicates.go:65] Evicted pod: "hello-1-sn5r8" (<nil>)
I0517 06:55:10.504383       1 duplicates.go:65] Evicted pod: "hello-1-zsl2q" (<nil>)
I0517 06:55:10.504388       1 duplicates.go:50] Processing node: "ip-172-18-8-195.ec2.internal"
I0517 06:55:10.512867       1 duplicates.go:54] "ReplicationController/hello-1"
I0517 06:55:10.512897       1 duplicates.go:65] Evicted pod: "hello-1-9pxrm" (<nil>)
I0517 06:55:10.512904       1 duplicates.go:65] Evicted pod: "hello-1-b8qtn" (<nil>)
I0517 06:55:10.512908       1 duplicates.go:65] Evicted pod: "hello-1-jqdbk" (<nil>)
I0517 06:55:10.512913       1 duplicates.go:65] Evicted pod: "hello-1-l5wsj" (<nil>)
I0517 06:55:10.512917       1 duplicates.go:65] Evicted pod: "hello-1-mq87n" (<nil>)
I0517 06:55:10.512922       1 duplicates.go:65] Evicted pod: "hello-1-s2qkz" (<nil>)
I0517 06:55:10.512926       1 duplicates.go:65] Evicted pod: "hello-1-sptzm" (<nil>)
I0517 06:55:10.512931       1 duplicates.go:65] Evicted pod: "hello-1-vfhfq" (<nil>)
I0517 06:55:10.542384       1 lownodeutilization.go:144] Node "ip-172-18-4-140.ec2.internal" is over utilized with usage: api.ResourceThresholds{"cpu":40, "memory":25.873227362806837, "pods":7.6}                                                                                                                         
I0517 06:55:10.542436       1 lownodeutilization.go:149] allPods:19, nonRemovablePods:3, bePods:2, bPods:14, gPods:0
I0517 06:55:10.542523       1 lownodeutilization.go:141] Node "ip-172-18-14-232.ec2.internal" is under utilized with usage: api.ResourceThresholds{"cpu":27.5, "memory":16.775785028577992, "pods":7.6}                                                                                                                     
I0517 06:55:10.542543       1 lownodeutilization.go:149] allPods:19, nonRemovablePods:9, bePods:1, bPods:9, gPods:0
I0517 06:55:10.542625       1 lownodeutilization.go:144] Node "ip-172-18-8-195.ec2.internal" is over utilized with usage: api.ResourceThresholds{"cpu":32.5, "memory":27.495062326909224, "pods":6.4}
I0517 06:55:10.542643       1 lownodeutilization.go:149] allPods:16, nonRemovablePods:5, bePods:0, bPods:11, gPods:0
I0517 06:55:10.542649       1 lownodeutilization.go:65] Criteria for a node under utilization: CPU: 28, Mem: 20, Pods: 8
I0517 06:55:10.542707       1 lownodeutilization.go:72] Total number of underutilized nodes: 1
I0517 06:55:10.542712       1 lownodeutilization.go:89] Criteria for a node above target utilization: CPU: 30, Mem: 25, Pods: 7
I0517 06:55:10.542717       1 lownodeutilization.go:91] Total number of nodes above target utilization: 2
I0517 06:55:10.542731       1 lownodeutilization.go:183] Total capacity to be moved: CPU:100, Mem:1.3612236799999998e+09, Pods:-1.4999999999999991
I0517 06:55:10.542745       1 lownodeutilization.go:184] ********Number of pods evicted from each node:***********
I0517 06:55:10.542751       1 lownodeutilization.go:191] evicting pods from node "ip-172-18-4-140.ec2.internal" with usage: api.ResourceThresholds{"cpu":40, "memory":25.873227362806837, "pods":7.6}
I0517 06:55:10.542763       1 lownodeutilization.go:230] Evicted pod: "asb-1-42nq7" (<nil>)
I0517 06:55:10.542769       1 lownodeutilization.go:244] updated node usage: api.ResourceThresholds{"cpu":40, "memory":25.873227362806837, "pods":7.199999999999999}
I0517 06:55:10.542778       1 lownodeutilization.go:230] Evicted pod: "descheduler-cronjob-1526540100-tjkdd" (<nil>)
I0517 06:55:10.542784       1 lownodeutilization.go:244] updated node usage: api.ResourceThresholds{"cpu":40, "memory":25.873227362806837, "pods":6.799999999999999}
I0517 06:55:10.542792       1 lownodeutilization.go:230] Evicted pod: "hello-1-2gbpx" (<nil>)
I0517 06:55:10.542798       1 lownodeutilization.go:244] updated node usage: api.ResourceThresholds{"cpu":37.5, "memory":24.251399202944313, "pods":6.399999999999999}
I0517 06:55:10.542806       1 lownodeutilization.go:230] Evicted pod: "hello-1-4q9zs" (<nil>)
I0517 06:55:10.542811       1 lownodeutilization.go:244] updated node usage: api.ResourceThresholds{"memory":22.62957104308179, "pods":5.999999999999998, "cpu":35}
I0517 06:55:10.542818       1 lownodeutilization.go:230] Evicted pod: "hello-1-54xhw" (<nil>)
I0517 06:55:10.542824       1 lownodeutilization.go:244] updated node usage: api.ResourceThresholds{"cpu":32.5, "memory":21.007742883219265, "pods":5.599999999999998}
I0517 06:55:10.542832       1 lownodeutilization.go:230] Evicted pod: "hello-1-5xv2m" (<nil>)
I0517 06:55:10.542838       1 lownodeutilization.go:244] updated node usage: api.ResourceThresholds{"cpu":30, "memory":19.38591472335674, "pods":5.1999999999999975}
I0517 06:55:10.542847       1 lownodeutilization.go:202] 18 pods evicted from node "ip-172-18-4-140.ec2.internal" with usage map[memory:19.38591472335674 pods:5.1999999999999975 cpu:30]
I0517 06:55:10.542869       1 lownodeutilization.go:191] evicting pods from node "ip-172-18-8-195.ec2.internal" with usage: api.ResourceThresholds{"cpu":32.5, "memory":27.495062326909224, "pods":6.4}
I0517 06:55:10.542880       1 lownodeutilization.go:230] Evicted pod: "hello-1-4rmtt" (<nil>)
I0517 06:55:10.542885       1 lownodeutilization.go:244] updated node usage: api.ResourceThresholds{"cpu":30, "memory":25.873233765690614, "pods":6}
I0517 06:55:10.542893       1 lownodeutilization.go:230] Evicted pod: "hello-1-9pxrm" (<nil>)
I0517 06:55:10.542898       1 lownodeutilization.go:244] updated node usage: api.ResourceThresholds{"cpu":27.5, "memory":24.251405204472004, "pods":5.6}
I0517 06:55:10.542907       1 lownodeutilization.go:202] 10 pods evicted from node "ip-172-18-8-195.ec2.internal" with usage map[cpu:27.5 memory:24.251405204472004 pods:5.6]
I0517 06:55:10.542916       1 lownodeutilization.go:94] Total number of pods evicted: 28
I0517 06:55:10.542923       1 pod_antiaffinity.go:45] Processing node: "ip-172-18-4-140.ec2.internal"
I0517 06:55:10.550841       1 pod_antiaffinity.go:45] Processing node: "ip-172-18-14-232.ec2.internal"
I0517 06:55:10.558162       1 pod_antiaffinity.go:45] Processing node: "ip-172-18-8-195.ec2.internal"

Expected results:
The descheduler pod should be treated as a critical pod and not be evicted by its own run.

Additional info:



Comment 2 Vadim Rutkovsky 2018-05-21 08:22:13 UTC
Fix is available in openshift-ansible-3.10.0-0.48.0

Comment 3 weiwei jiang 2018-05-23 10:01:43 UTC
Checked with openshift-ansible-3.10.0-0.50.0; the descheduler pod now has the scheduler.alpha.kubernetes.io/critical-pod annotation.
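
(For reference, a minimal sketch of where that annotation sits in the descheduler
CronJob's pod template; every field value other than the annotation key is an
assumption for illustration, not copied from the shipped openshift-ansible template.)

# Illustrative sketch only -- not the actual generated object
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: descheduler-cronjob
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          annotations:
            # The descheduler's eviction logic treats pods carrying this
            # annotation as non-removable, so the job no longer evicts itself.
            scheduler.alpha.kubernetes.io/critical-pod: ""
        spec:
          containers:
          - name: descheduler
            # Image name assumed from the inventory prefix/version in the report.
            image: xxxx:443/openshift3/ose-descheduler:v3.10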

Comment 5 errata-xmlrpc 2018-07-30 19:15:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816