Bug 1833329

| Summary: | Descheduler should remove crashlooping pods | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Mike Dame <mdame> |
| Component: | kube-scheduler | Assignee: | Mike Dame <mdame> |
| Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.5 | CC: | aos-bugs, jchaloup, mfojtik |
| Target Milestone: | --- | | |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Enhancement |
| Doc Text: | Feature: Descheduler should remove pods which exceed a certain number of restarts. Reason: Constantly restarting pods are often crashlooping, and an eviction may place them onto a node where they are able to run. Result: The RemovePodsHavingTooManyRestarts strategy is now available for this. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-07-13 17:36:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Mike Dame
2020-05-08 12:50:07 UTC
Please see the updated README in the descheduler operator on how to configure this new strategy for verification. Specifically, it will require adding a section like this to the operator config:

- name: "RemovePodsHavingTooManyRestarts"
  params:
  - name: "PodRestartThreshold"
    value: "10"
  - name: "IncludeInitContainers"
    value: "false"
Switching to POST to include https://github.com/openshift/cluster-kube-descheduler-operator/pull/110, which fixes a README typo for the strategy. The correct strategy param is now IncludingInitContainers (renamed from IncludeInitContainers):

- name: "RemovePodsHavingTooManyRestarts"
  params:
  - name: "PodRestartThreshold"
    value: "10"
  - name: "IncludingInitContainers"
    value: "false"

1) Could enable the RemovePodsHavingTooManyRestarts strategy and set the podRestartThreshold value, and see that the values propagate well to the configmap (a sketch of where this strategy snippet sits in the full operator custom resource follows the configmaps below):
apiVersion: v1
data:
  policy.yaml: |
    strategies:
      RemovePodsHavingTooManyRestarts:
        enabled: true
        params:
          podsHavingTooManyRestarts:
            podRestartThreshold: 4

apiVersion: v1
data:
  policy.yaml: |
    strategies:
      RemovePodsHavingTooManyRestarts:
        enabled: true
        params:
          podsHavingTooManyRestarts:
            includingInitContainers: true
            podRestartThreshold: 4
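The strategy snippet above goes into the descheduler operator's configuration. A minimal sketch of a KubeDescheduler custom resource carrying it, assuming the 4.5-era CR layout described in the operator README (the apiVersion, namespace, and interval below are assumptions for illustration, not values taken from this bug):

apiVersion: operator.openshift.io/v1beta1   # assumed 4.5-era API version; check the operator README
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  deschedulingIntervalSeconds: 3600          # assumed interval; any value works
  strategies:
  - name: "RemovePodsHavingTooManyRestarts"
    params:
    - name: "PodRestartThreshold"
      value: "10"
    - name: "IncludingInitContainers"
      value: "false"

Once applied, the operator renders the strategy into the policy.yaml configmap shown above.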
2) Created a ReplicationController with replicas set to 3 and podRestartThreshold set to 4; after 4 restarts, the descheduler evicts the pods (a sketch of a similar crashlooping ReplicationController follows the log output below):
I0518 14:15:01.282015 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-ktg87
I0518 14:15:01.397058 1 evictions.go:99] Evicted pod: "nginx-7gvfz" in namespace "test"
I0518 14:15:01.397811 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"test", Name:"nginx-7gvfz", UID:"0d29e248-1f51-4f53-9924-94cc4def1ab0", APIVersion:"v1", ResourceVersion:"85188", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0518 14:15:01.475036 1 evictions.go:99] Evicted pod: "nginx-c4c5l" in namespace "test"
I0518 14:15:01.475530 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"test", Name:"nginx-c4c5l", UID:"6ffdbe53-ce1a-4421-baea-035f12108bbf", APIVersion:"v1", ResourceVersion:"85226", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0518 14:15:01.515833 1 evictions.go:99] Evicted pod: "nginx-ft92r" in namespace "test"
I0518 14:15:01.515854 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-nq7fr
I0518 14:15:01.516015 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"test", Name:"nginx-ft92r", UID:"a914c8a9-9f4c-4a51-8bc7-e6525c716ecd", APIVersion:"v1", ResourceVersion:"85181", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0518 14:15:01.578982 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-v95d7
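For reference, a ReplicationController that crashloops quickly, in the spirit of the test above, could look like the following sketch. The image and command are illustrative assumptions, not the exact manifest used in this verification:

apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
  namespace: test
spec:
  replicas: 3
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: crasher
        image: registry.access.redhat.com/ubi8/ubi-minimal   # assumed image; any small image works
        # exit shortly after start so the kubelet keeps restarting the container
        # and the pod's restart count climbs past podRestartThreshold
        command: ["/bin/sh", "-c", "sleep 10; exit 1"]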
3) Added includingInitContainers to the strategy as well and set it to "true"; once the init container restarts 4 times, the descheduler evicts the pod (a sketch of a similar Deployment with a failing init container follows the log output below).
[ramakasturinarra@dhcp35-60 ocp_files]$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
initcontainer-db54dc85b-fc5mv 0/1 Terminating 4 2m7s 10.129.2.65 knarra-518f-f5j6b-worker-ktg87 <none> <none>
initcontainer-db54dc85b-m29vv 0/1 Init:0/1 0 5s <none> knarra-518f-f5j6b-worker-ktg87 <none> <none>
I0518 14:25:41.497582 1 evictions.go:99] Evicted pod: "initcontainer-db54dc85b-7q5jp" in namespace "test"
I0518 14:25:41.497606 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-nq7fr
I0518 14:25:41.497935 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"test", Name:"initcontainer-db54dc85b-7q5jp", UID:"2b52a5c4-ed5e-491d-98fd-674b7521317b", APIVersion:"v1", ResourceVersion:"88801", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0518 14:25:41.582640 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-v95d7
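Similarly, a Deployment whose init container keeps failing, along the lines of the one used in this step, could be sketched as follows (images and commands are assumptions for illustration):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: initcontainer
  namespace: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: initcontainer
  template:
    metadata:
      labels:
        app: initcontainer
    spec:
      initContainers:
      - name: failing-init
        image: registry.access.redhat.com/ubi8/ubi-minimal   # assumed image
        # the init container exits non-zero, so the pod cycles through Init:Error /
        # Init:CrashLoopBackOff and its init-container restart count keeps growing
        command: ["/bin/sh", "-c", "sleep 10; exit 1"]
      containers:
      - name: app
        image: registry.access.redhat.com/ubi8/ubi-minimal
        command: ["/bin/sh", "-c", "sleep infinity"]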
4) When includingInitContainers is included and set to "false", the descheduler does not evict the pod even after the init container restarts more than 4 times, since init container restarts are not counted.
[ramakasturinarra@dhcp35-60 ocp_files]$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
initcontainer-db54dc85b-m29vv 0/1 Init:Error 5 3m16s 10.129.2.66 knarra-518f-f5j6b-worker-ktg87 <none> <none>
I0518 14:30:53.079707 1 node.go:45] node lister returned empty list, now fetch directly
I0518 14:30:53.092628 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-master-0
I0518 14:30:53.275930 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-master-1
I0518 14:30:53.378496 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-master-2
I0518 14:30:53.481785 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-ktg87
I0518 14:30:53.578795 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-nq7fr
I0518 14:30:53.680792 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-v95d7
5) When includingInitContainers was set back to "true", the init container pod is terminated and the descheduler evicts it, since the number of restarts exceeds the podRestartThreshold value set.
I0518 14:33:01.375038 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-nq7fr
I0518 14:33:01.375133 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"test", Name:"initcontainer-db54dc85b-m29vv", UID:"cc856cfc-7dfc-4541-8128-44a47f53bc94", APIVersion:"v1", ResourceVersion:"91001", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
Based on the above results, moving the bug to the verified state.
Verified with the payload below:
==================================
[ramakasturinarra@dhcp35-60 ocp_files]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-17-235851   True        False         3h27m   Cluster version is 4.5.0-0.nightly-2020-05-17-235851

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409