Bug 1658588

Summary: descheduler job should print log of evicting pod
Product: OpenShift Container Platform
Reporter: MinLi <minmli>
Component: Node
Assignee: ravig <rgudimet>
Status: CLOSED ERRATA
QA Contact: Xiaoli Tian <xtian>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.1.0
CC: aos-bugs, jokerman, mmccomas
Target Milestone: ---
Target Release: 4.1.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-04 10:41:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description MinLi 2018-12-12 13:20:01 UTC
Description of problem:
The descheduler job should log each pod it evicts; currently there is no log output when pods are evicted.

Version-Release number of selected component (if applicable):
oc v4.0.0-0.95.0
kubernetes v1.11.0+8afe8f3cf9
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-7-155.ec2.internal:8443
openshift v4.0.0-0.95.0
kubernetes v1.11.0+8afe8f3cf9

How reproducible:
always 

Steps to Reproduce:
1. Download the code from GitHub: https://github.com/openshift/descheduler-operator
2. Deploy the descheduler-operator as follows:
#oc create -f deploy/namespace.yaml
#oc project openshift-descheduler-operator
#oc create -f deploy/crds/descheduler_v1alpha1_descheduler_crd.yaml
#oc create -f deploy/service_account.yaml
#oc create -f deploy/rbac.yaml
#oc create -f deploy/operator.yaml
#oc create -f deploy/crds/descheduler_v1alpha1_descheduler_cr.yaml 
where descheduler_v1alpha1_descheduler_cr.yaml contains:
apiVersion: descheduler.io/v1alpha1
kind: Descheduler
metadata:
  name: example-descheduler-1
spec:
  schedule: "*/1 * * * ?"
  strategies:
    - name: "duplicates"
3. Create a deploymentconfig and replication controller:
oc run hello --image=openshift/hello-openshift:latest --replicas=6

4. Check whether the duplicate pods have been evicted:
#oc get pod 
NAME                                     READY     STATUS              RESTARTS   AGE
descheduler-operator-965cb8f7f-kdkk8     1/1       Running             0          27m
example-descheduler-1-1544617080-88jb2   0/1       Completed           0          2m
example-descheduler-1-1544617140-r4cbb   0/1       Completed           0          1m
example-descheduler-1-1544617200-zj9th   0/1       Completed           0          12s
hello-1-2xsc5                            1/1       Running             0          8m
hello-1-6jptk                            0/1       ContainerCreating   0          7s
hello-1-f2qm4                            0/1       Terminating         0          1m
hello-1-gv2cv                            0/1       ContainerCreating   0          7s
hello-1-jvj67                            0/1       ContainerCreating   0          7s
hello-1-jwxrp                            0/1       ContainerCreating   0          7s
hello-1-qrw5j                            0/1       Terminating         0          1m
hello-1-rdphn                            0/1       ContainerCreating   0          7s
hello-1-rlhn2                            0/1       Terminating         0          1m

5. Check the descheduler job log for evicted duplicate pods (one way is sketched below).
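   A minimal sketch of one way to do this, assuming the newest example-descheduler-1-* pod belongs to the latest job run:
   #oc logs "$(oc get pods -o name --sort-by=.metadata.creationTimestamp | grep example-descheduler-1 | tail -1)"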

Actual results:
5. There is no log of evicted pods. (But from the result of step 4, we can confirm that the descheduler did evict the duplicate pods correctly.)
# oc logs example-descheduler-1-1544617200-zj9th

Expected results:
5. The eviction of pods is logged, just like in the 3.11 version:

I0110 05:50:42.087277       1 reflector.go:202] Starting reflector *v1.Node (1h0m0s) from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I0110 05:50:42.087391       1 reflector.go:240] Listing and watching *v1.Node from github.com/kubernetes-incubator/descheduler/pkg/descheduler/node/node.go:83
I0110 05:50:42.187377       1 duplicates.go:49] Processing node: "qe-wjiang-node-registry-router-1"
I0110 05:50:42.256960       1 duplicates.go:53] "ReplicationController/hello-1"
I0110 05:50:42.264316       1 duplicates.go:61] Evicted pod: "hello-1-d6dvf" (<nil>)
I0110 05:50:42.274861       1 duplicates.go:61] Evicted pod: "hello-1-f997x" (<nil>)
I0110 05:50:42.284345       1 duplicates.go:61] Evicted pod: "hello-1-fdtj5" (<nil>)
I0110 05:50:42.297100       1 duplicates.go:61] Evicted pod: "hello-1-lq752" (<nil>)
I0110 05:50:42.336787       1 duplicates.go:61] Evicted pod: "hello-1-rmdk5" (<nil>)

Additional info:

Comment 1 ravig 2018-12-13 02:53:57 UTC
I have a couple of questions here:

- In 3.11, can you provide the descheduler-cronjob description? Our docs explicitly say to run the descheduler as a job: https://docs.openshift.com/container-platform/3.11/admin_guide/scheduling/descheduler.html (unrelated question)
- When run on 3.11, was the log level increased? (You can find that from the description again.)

Comment 2 MinLi 2018-12-18 07:11:00 UTC
(In reply to ravig from comment #1)
> I have a couple of questions here:
> 
> - In 3.11, can you provide the descheduler-cronjob description? Our docs
> explicitly say to run the descheduler as a job:
> https://docs.openshift.com/container-platform/3.11/admin_guide/scheduling/
> descheduler.html (unrelated question)

Yes, in 3.11 the cronjob generates jobs that run the descheduler at a fixed interval. We create the cronjob by passing the following parameters to the Ansible installer:
openshift_ansible_vars:
  openshift_descheduler_install: true
  openshift_descheduler_strategies_dict: "{'remove_duplicates': True, 'remove_pods_violating_inter_pod_anti_affinity': True, 'low_node_utilization': True}"
  openshift_descheduler_cronjob_node_selector: ""
  openshift_descheduler_cronjob_schedule: "*/2 * * * *"

descheduler-cronjob description:
[root@qe-minmli-gcemaster-etcd-1 ~]# oc get cronjob descheduler-cronjob -o yaml 
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  creationTimestamp: 2018-12-18T02:18:33Z
  name: descheduler-cronjob
  namespace: openshift-descheduler
  resourceVersion: "46315"
  selfLink: /apis/batch/v1beta1/namespaces/openshift-descheduler/cronjobs/descheduler-cronjob
  uid: 380feecd-026b-11e9-84f0-42010af00003
spec:
  concurrencyPolicy: Allow
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        metadata:
          annotations:
            scheduler.alpha.kubernetes.io/critical-pod: ""
          creationTimestamp: null
          name: descheduler-cronjob
        spec:
          containers:
          - args:
            - --policy-config-file=/policy-dir/policy.yaml
            - --v=5
            - --dry-run
            command:
            - /bin/descheduler
            image: registry.reg-aws.openshift.com:443/openshift3/ose-descheduler:v3.11
            imagePullPolicy: IfNotPresent
            name: descheduler
            resources: {}
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /policy-dir
              name: policy-volume
          dnsPolicy: ClusterFirst
          priorityClassName: system-cluster-critical
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext: {}
          serviceAccount: descheduler-sa
          serviceAccountName: descheduler-sa
          terminationGracePeriodSeconds: 30
          volumes:
          - configMap:
              defaultMode: 420
              name: descheduler-policy-configmap
            name: policy-volume
  schedule: '*/2 * * * *'
  successfulJobsHistoryLimit: 3
  suspend: false
status:
  lastScheduleTime: 2018-12-18T07:02:00Z

configmap description:
[root@qe-minmli-gcemaster-etcd-1 ~]# oc get cm descheduler-policy-configmap -o yaml 
apiVersion: v1
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      "RemoveDuplicates":
         enabled: true
      "RemovePodsViolatingInterPodAntiAffinity":
         enabled: true
      "LowNodeUtilization":
         enabled: true
         params:
           nodeResourceUtilizationThresholds:
             thresholds:
               "cpu" : 40
               "memory": 40
               "pods": 40
             targetThresholds:
               "cpu" : 70
               "memory": 70
               "pods": 70
kind: ConfigMap
metadata:
  creationTimestamp: 2018-12-18T02:18:30Z
  name: descheduler-policy-configmap
  namespace: openshift-descheduler
  resourceVersion: "4390"
  selfLink: /api/v1/namespaces/openshift-descheduler/configmaps/descheduler-policy-configmap
  uid: 36175f3f-026b-11e9-84f0-42010af00003



> - When run on 3.11, was the log level increased? (You can find that from
> the description again.)
No, DEBUG_LOGLEVEL=5 (/etc/origin/master/master.env)

Version-Release number of selected component:
oc v3.11.57
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-minmli-gcemaster-etcd-1:8443
openshift v3.11.57
kubernetes v1.11.0+d4cacc0

Comment 3 ravig 2018-12-18 13:35:33 UTC
Thanks for the output, MinLi. The following section of the job config is what causes the increased logging:

 - args:
     - --policy-config-file=/policy-dir/policy.yaml
     - --v=5

While I don't mind increasing the logging level, I have an in-flight PR, https://github.com/openshift/descheduler-operator/pull/36, through which you can pass the --v (or log-level) option to the descheduler binary from the CR.

So, either way, we need to update the docs to show that we are creating a cron job instead of a job. Once the PR merges, we should be able to add a flag that shows the log output.
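
As an illustration only, a hedged sketch of what such a pass-through from the CR might look like once that PR lands (the "flags" field name below is a hypothetical placeholder; check the merged PR for the actual schema):

apiVersion: descheduler.io/v1alpha1
kind: Descheduler
metadata:
  name: example-descheduler-1
spec:
  schedule: "*/1 * * * ?"
  strategies:
    - name: "duplicates"
  # hypothetical field: would be forwarded to the descheduler binary as CLI
  # arguments (e.g. --v=5) so that evictions show up in the job pod logs
  flags:
    - name: "--v"
      value: "5"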

Comment 4 ravig 2018-12-21 03:45:16 UTC
https://github.com/openshift/descheduler-operator/pull/38

This PR should have fixed the issue.

Comment 5 MinLi 2018-12-26 09:07:29 UTC
verified!

Version info:
oc v4.0.0-alpha.0+85a0623-808
kubernetes v1.11.0+85a0623
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://wsun-qe-api.origin-ci-int-aws.dev.rhcloud.com:6443
kubernetes v1.11.0+85a0623

Comment 8 errata-xmlrpc 2019-06-04 10:41:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758