Bug 1658582

Summary: When the strategy changes, descheduler-operator cannot update the configmap in time
Product: OpenShift Container Platform
Component: Node
Version: 4.1.0
Target Release: 4.1.0
Hardware: x86_64
OS: Linux
Severity: high
Priority: unspecified
Status: CLOSED ERRATA
Reporter: MinLi <minmli>
Assignee: ravig <rgudimet>
QA Contact: Xiaoli Tian <xtian>
CC: aos-bugs, jokerman, minmli, mmccomas, rgudimet
Type: Bug
Last Closed: 2019-06-04 10:41:14 UTC

Description MinLi 2018-12-12 13:04:36 UTC
Description of problem:
When the strategy changes, the descheduler-operator does not update the configmap; instead it deletes the configmap outright.
Only about 8 minutes later is the configmap generated again.

Version-Release number of selected component (if applicable):
oc v4.0.0-0.95.0
kubernetes v1.11.0+8afe8f3cf9
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-7-155.ec2.internal:8443
openshift v4.0.0-0.95.0
kubernetes v1.11.0+8afe8f3cf9

How reproducible:
always

Steps to Reproduce:
1. Download the code from GitHub: https://github.com/openshift/descheduler-operator
2. Deploy the descheduler-operator:
#oc create -f deploy/namespace.yaml
#oc project openshift-descheduler-operator
#oc create -f deploy/crds/descheduler_v1alpha1_descheduler_crd.yaml
#oc create -f deploy/service_account.yaml
#oc create -f deploy/rbac.yaml
#oc create -f deploy/operator.yaml
#oc create -f deploy/crds/descheduler_v1alpha1_descheduler_cr.yaml 
where descheduler_v1alpha1_descheduler_cr.yaml is:
apiVersion: descheduler.io/v1alpha1
kind: Descheduler
metadata:
  name: example-descheduler-1
spec:
  schedule: "*/1 * * * ?"
  strategies:
    - name: "lownodeutilization"
      params:
       - name: "cputhreshold"
         value: "10"
       - name: "memorythreshold"
         value: "20"
       - name: "memorytargetthreshold"
         value: "40"
    - name: "duplicates"


3. Check the configmap:
#oc describe cm example-descheduler-1
Name:         example-descheduler-1
Namespace:    openshift-descheduler-operator
Labels:       <none>
Annotations:  <none>

Data
====
policy.yaml:
----
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds:
           cpu: 10
           memory: 20
         targetThresholds:
           memory: 40
         numberOfNodes: 0
  "RemoveDuplicates":
     enabled: true
Events:  <none>

4. Modify the strategies of the Descheduler CR:
# oc edit deschedulers.descheduler.io example-descheduler-1
changing it as follows:
apiVersion: descheduler.io/v1alpha1
kind: Descheduler
metadata:
  name: example-descheduler-1
spec:
  schedule: "*/1 * * * ?"
  strategies:
    - name: "lownodeutilization"
      params:
       - name: "cputhreshold"
         value: "10"
       - name: "memorythreshold"
         value: "20"
       - name: "memorytargetthreshold"
         value: "40"   (change 40 to 30)
    - name: "duplicates"  (delete this line)

5. Check whether the configmap was updated.

Actual results:
5. No configmap is found:
#oc get cm 
No resources found.

Expected results:
5. The configmap is updated immediately to reflect the edited strategies.
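That is, right after the edit the configmap's policy.yaml should look like the following (memory targetThreshold 30, RemoveDuplicates dropped), which matches what the operator eventually regenerates in the logs below:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds:
           cpu: 10
           memory: 20
         targetThresholds:
           memory: 30
         numberOfNodes: 0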

Additional info:
The descheduler-operator logs show:
# oc logs descheduler-operator-965cb8f7f-kdkk8
...
strategies:
  "LowNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds:
           cpu: 10
           memory: 20
         targetThresholds:
           memory: 30
         numberOfNodes: 0
2018/12/12 12:23:26 Strategy mismatch in configmap. Delete it
2018/12/12 12:23:26 Inside generated descheduler job

About 8 minutes later, the up-to-date configmap is generated again. The configmap should be updated immediately when the strategy changes.

Comment 1 ravig 2018-12-13 02:57:55 UTC
By default, we do not run the reconcile loop aggressively. I can make it more frequent, but I believe that would become too aggressive.
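For context, a minimal sketch of where such a periodic reconcile interval typically lives, assuming a controller-runtime based operator (an assumption; the package name, function, and the 10-minute value below are illustrative only, not the descheduler-operator's actual code or setting):

package operator

import (
	"time"

	"sigs.k8s.io/controller-runtime/pkg/client/config"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func newManager() (manager.Manager, error) {
	// Resync interval: every watched object is re-reconciled at least this
	// often, even if no change event was observed. A smaller value reacts
	// faster but makes the loop more aggressive against the API server.
	syncPeriod := 10 * time.Minute // illustrative value only

	cfg, err := config.GetConfig()
	if err != nil {
		return nil, err
	}
	return manager.New(cfg, manager.Options{SyncPeriod: &syncPeriod})
}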

Comment 2 MinLi 2018-12-17 07:58:28 UTC
The deletion of the configmap also blocks the start of the descheduler job pods, which is more serious than an overly aggressive reconcile loop. @ravig

[root@ip-172-18-12-194 ~]# oc get pod 
NAME                                     READY     STATUS              RESTARTS   AGE
descheduler-operator-965cb8f7f-jb49v     1/1       Running             0          31m
example-descheduler-1-1545032400-5546r   0/1       ContainerCreating   0          12m
example-descheduler-1-1545032520-zp5qd   0/1       ContainerCreating   0          10m

#oc  describe  pod example-descheduler-1-1545032400-5546r
.....
Events:
  Type     Reason       Age               From                                   Message
  ----     ------       ----              ----                                   -------
  Normal   Scheduled    1m                default-scheduler                      Successfully assigned openshift-descheduler-operator/example-descheduler-1-1545032880-258km to ip-172-18-7-162.ec2.internal
  Warning  FailedMount  21s (x8 over 1m)  kubelet, ip-172-18-7-162.ec2.internal  MountVolume.SetUp failed for volume "policy-volume" : configmaps "example-descheduler-1" not found
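For context, the job pod consumes the configmap as a volume roughly like the following (a sketch inferred from the event above, not the operator's exact manifest), which is why a deleted configmap leaves the pod stuck in ContainerCreating:

volumes:
  - name: policy-volume
    configMap:
      name: example-descheduler-1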

Comment 3 ravig 2018-12-21 03:47:10 UTC
https://github.com/openshift/descheduler-operator/pull/37


The above PR should have fixed it.
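For illustration only, and not necessarily what PR #37 actually does: one way to close the gap is to update the existing configmap in place when the policy changes, instead of deleting it and waiting for the next periodic resync. A minimal sketch using a controller-runtime client (the function and variable names are hypothetical):

package operator

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// syncPolicyConfigMap updates the operator-owned configmap in place when the
// rendered policy no longer matches, so descheduler job pods never observe a
// missing configmap between a delete and a later re-create.
func syncPolicyConfigMap(ctx context.Context, c client.Client, cm *corev1.ConfigMap, desiredPolicy string) error {
	if cm.Data["policy.yaml"] == desiredPolicy {
		return nil // already up to date
	}
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	cm.Data["policy.yaml"] = desiredPolicy
	return c.Update(ctx, cm)
}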

Comment 4 MinLi 2018-12-26 09:32:03 UTC
verified!

Version info:
oc v4.0.0-alpha.0+85a0623-808
kubernetes v1.11.0+85a0623
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://wsun-qe-api.origin-ci-int-aws.dev.rhcloud.com:6443
kubernetes v1.11.0+85a0623

Comment 5 MinLi 2019-01-04 07:36:14 UTC
This problem reproduces in a recent version, but with different symptoms. Please check it, @ravig.

Version info:
oc v4.0.0-0.123.0
kubernetes v1.11.0+4d56dbaf21
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://ip-172-18-5-72.ec2.internal:8443
openshift v4.0.0-0.123.0
kubernetes v1.11.0+4d56dbaf21


Phenomenon:
When the configmap is updated, the update appears to succeed immediately. After about 8 minutes, however, a new configmap is regenerated and the pre-update content is restored.
The log of the descheduler-operator pod also shows the old policy strategy.

logs:
2019/01/04 06:57:36 Creating descheduler job
2019/01/04 06:57:36 Validating descheduler flags
2019/01/04 06:57:36 Creating a new cron job openshift-descheduler-operator/example-descheduler-1
=================================================================================================(time after update)
2019/01/04 07:13:19 Reconciling Descheduler openshift-descheduler-operator/example-descheduler-1
2019/01/04 07:13:19 cputhreshold 10
2019/01/04 07:13:19 memorythreshold 20
2019/01/04 07:13:19 memorytargetthreshold 30
2019/01/04 07:13:19 
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "nodeaffinity":
     enabled: true, 
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds:
           cpu: 10
           memory: 20
         targetThresholds:
           memory: 30
         numberOfNodes: 0
2019/01/04 07:13:19 Strategy mismatch in configmap. Delete it
2019/01/04 07:13:20 Validating descheduler flags
2019/01/04 07:13:20 Flags mismatch for descheduler. Delete cronjob
2019/01/04 07:13:20 Reconciling Descheduler openshift-descheduler-operator/example-descheduler-1
2019/01/04 07:13:20 Creating config map
2019/01/04 07:13:20 cputhreshold 10
2019/01/04 07:13:20 memorythreshold 20
2019/01/04 07:13:20 memorytargetthreshold 30
2019/01/04 07:13:20   "LowNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds:
           cpu: 10
           memory: 20
         targetThresholds:
           memory: 30
         numberOfNodes: 0
2019/01/04 07:13:20 Creating a new configmap openshift-descheduler-operator/example-descheduler-1
2019/01/04 07:13:20 Validating descheduler flags
2019/01/04 07:13:20 Flags mismatch for descheduler. Delete cronjob
2019/01/04 07:13:20 Error while deleting cronjob
2019/01/04 07:13:21 Reconciling Descheduler openshift-descheduler-operator/example-descheduler-1
2019/01/04 07:13:21 cputhreshold 10
2019/01/04 07:13:21 memorythreshold 20
2019/01/04 07:13:21 memorytargetthreshold 30
2019/01/04 07:13:21 
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds:
           cpu: 10
           memory: 20
         targetThresholds:
           memory: 30
         numberOfNodes: 0, 
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
     enabled: true
     params:
       nodeResourceUtilizationThresholds:
         thresholds:
           cpu: 10
           memory: 20
         targetThresholds:
           memory: 30
         numberOfNodes: 0
2019/01/04 07:13:21 Creating descheduler job
2019/01/04 07:13:21 Validating descheduler flags
2019/01/04 07:13:21 Creating a new cron job openshift-descheduler-operator/example-descheduler-1

Comment 7 MinLi 2019-01-08 07:43:46 UTC
@ravig, I think my reproduction steps were not appropriate.
I updated the cm directly instead of going through the CR, so the cm was never actually updated.
Sorry for the inconvenience.
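In other words, strategy changes have to go through the Descheduler CR; the operator owns the configmap and reverts direct edits on its next reconcile:

# oc edit deschedulers.descheduler.io example-descheduler-1   (correct: the operator regenerates the configmap)
# oc edit cm example-descheduler-1                            (not persisted: reverted on the next reconcile)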

May I change the bug status to "verified"?

Comment 8 ravig 2019-01-08 14:52:04 UTC
No problem, please go ahead and modify the status.

Comment 9 MinLi 2019-01-09 09:43:40 UTC
verified!

version info:
oc v4.0.0-0.130.0
kubernetes v1.11.0+f67f40dbad
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-10-163.ec2.internal:8443
openshift v4.0.0-0.130.0
kubernetes v1.11.0+f67f40dbad

Comment 12 errata-xmlrpc 2019-06-04 10:41:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758