As described in https://issues.redhat.com/browse/WRKLDS-145, the descheduler operator should support the new RemovePodsHavingTooManyRestarts strategy, which is meant to remove crashlooping pods.
Please see the updated README in the descheduler operator for how to configure this new strategy for verification. Specifically, it will require adding a section like this to the operator config:

- name: "RemovePodsHavingTooManyRestarts"
  params:
  - name: "PodRestartThreshold"
    value: "10"
  - name: "IncludeInitContainers"
    value: "false"
Switching to POST to include https://github.com/openshift/cluster-kube-descheduler-operator/pull/110, which fixes a README typo for the strategy. The correct strategy params are now (IncludeInitContainers renamed to IncludingInitContainers):

- name: "RemovePodsHavingTooManyRestarts"
  params:
  - name: "PodRestartThreshold"
    value: "10"
  - name: "IncludingInitContainers"
    value: "false"
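For context, a minimal sketch of where this strategy section sits in the full KubeDescheduler CR, assuming the operator.openshift.io/v1beta1 API, the CR name "cluster", and the openshift-kube-descheduler-operator namespace used by the operator around this release (the interval value is only an example):

apiVersion: operator.openshift.io/v1beta1
kind: KubeDescheduler
metadata:
  name: cluster                              # assumed default CR name
  namespace: openshift-kube-descheduler-operator
spec:
  deschedulingIntervalSeconds: 1800          # example value, not part of this bug
  strategies:
  - name: "RemovePodsHavingTooManyRestarts"
    params:
    - name: "PodRestartThreshold"
      value: "10"
    - name: "IncludingInitContainers"
      value: "false"

The operator renders these params into the podsHavingTooManyRestarts block of the descheduler's policy.yaml configmap, as shown in the verification below.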
1) Enabled the RemovePodsHavingTooManyRestarts strategy and set the podRestartThreshold value, and verified that the values propagate correctly to the configmap:

apiVersion: v1
data:
  policy.yaml: |
    strategies:
      RemovePodsHavingTooManyRestarts:
        enabled: true
        params:
          podsHavingTooManyRestarts:
            podRestartThreshold: 4

apiVersion: v1
data:
  policy.yaml: |
    strategies:
      RemovePodsHavingTooManyRestarts:
        enabled: true
        params:
          podsHavingTooManyRestarts:
            includingInitContainers: true
            podRestartThreshold: 4

2) Created a replicationcontroller with replicas set to '3' and the podRestartThreshold value at 4, and I see that after 4 restarts the descheduler evicts the pods:

I0518 14:15:01.282015       1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-ktg87
I0518 14:15:01.397058       1 evictions.go:99] Evicted pod: "nginx-7gvfz" in namespace "test"
I0518 14:15:01.397811       1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"test", Name:"nginx-7gvfz", UID:"0d29e248-1f51-4f53-9924-94cc4def1ab0", APIVersion:"v1", ResourceVersion:"85188", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0518 14:15:01.475036       1 evictions.go:99] Evicted pod: "nginx-c4c5l" in namespace "test"
I0518 14:15:01.475530       1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"test", Name:"nginx-c4c5l", UID:"6ffdbe53-ce1a-4421-baea-035f12108bbf", APIVersion:"v1", ResourceVersion:"85226", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0518 14:15:01.515833       1 evictions.go:99] Evicted pod: "nginx-ft92r" in namespace "test"
I0518 14:15:01.515854       1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-nq7fr
I0518 14:15:01.516015       1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"test", Name:"nginx-ft92r", UID:"a914c8a9-9f4c-4a51-8bc7-e6525c716ecd", APIVersion:"v1", ResourceVersion:"85181", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0518 14:15:01.578982       1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-v95d7
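The replicationcontroller used in step 2 is not included in this comment; a minimal sketch of one way to produce such a crashlooping workload (the name, namespace, image, and failing command are assumptions, not the exact manifest used in the test) could be:

apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx            # assumed name, matching the pod names in the logs above
  namespace: test
spec:
  replicas: 3
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        # Exit shortly after starting so kubelet keeps restarting the container
        # and the pod's restart count eventually exceeds podRestartThreshold (4 here).
        command: ["sh", "-c", "sleep 10; exit 1"]

Once a pod's total container restart count passes the threshold, the strategy marks it for eviction on the next descheduling run, which is what the eviction log lines above show.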
3) Added includingInitContainers to the strategy as well and set the value to "true"; once the init container restarts 4 times, the descheduler evicts the pod (a hypothetical reconstruction of this workload appears at the end of this comment):

[ramakasturinarra@dhcp35-60 ocp_files]$ oc get pods -o wide
NAME                            READY   STATUS        RESTARTS   AGE    IP            NODE                             NOMINATED NODE   READINESS GATES
initcontainer-db54dc85b-fc5mv   0/1     Terminating   4          2m7s   10.129.2.65   knarra-518f-f5j6b-worker-ktg87   <none>           <none>
initcontainer-db54dc85b-m29vv   0/1     Init:0/1      0          5s     <none>        knarra-518f-f5j6b-worker-ktg87   <none>           <none>

I0518 14:25:41.497582       1 evictions.go:99] Evicted pod: "initcontainer-db54dc85b-7q5jp" in namespace "test"
I0518 14:25:41.497606       1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-nq7fr
I0518 14:25:41.497935       1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"test", Name:"initcontainer-db54dc85b-7q5jp", UID:"2b52a5c4-ed5e-491d-98fd-674b7521317b", APIVersion:"v1", ResourceVersion:"88801", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0518 14:25:41.582640       1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-v95d7

4) When includingInitContainers is included and the value is set to "false", even if the init container restarts more than 4 times, the descheduler does not evict the pod, since init container restarts are excluded:

[ramakasturinarra@dhcp35-60 ocp_files]$ oc get pods -o wide
NAME                            READY   STATUS       RESTARTS   AGE     IP            NODE                             NOMINATED NODE   READINESS GATES
initcontainer-db54dc85b-m29vv   0/1     Init:Error   5          3m16s   10.129.2.66   knarra-518f-f5j6b-worker-ktg87   <none>           <none>

I0518 14:30:53.079707       1 node.go:45] node lister returned empty list, now fetch directly
I0518 14:30:53.092628       1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-master-0
I0518 14:30:53.275930       1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-master-1
I0518 14:30:53.378496       1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-master-2
I0518 14:30:53.481785       1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-ktg87
I0518 14:30:53.578795       1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-nq7fr
I0518 14:30:53.680792       1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-v95d7

5) When includingInitContainers was set back to "true", I see that the init container's pod terminates and the descheduler evicts it, since the number of restarts is above the podRestartThreshold value set:

I0518 14:33:01.375038       1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-nq7fr
I0518 14:33:01.375133       1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"test", Name:"initcontainer-db54dc85b-m29vv", UID:"cc856cfc-7dfc-4541-8128-44a47f53bc94", APIVersion:"v1", ResourceVersion:"91001", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler

Based on the above results, moving the bug to verified state.
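The "initcontainer" workload from steps 3-5 is likewise not part of this comment; a minimal sketch, assuming it is a Deployment whose init container always fails (names and images are assumptions), could be:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: initcontainer        # assumed name, matching the pod names in the output above
  namespace: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: initcontainer
  template:
    metadata:
      labels:
        app: initcontainer
    spec:
      initContainers:
      - name: fail-init
        image: busybox
        # Always fail, so the init container keeps restarting and the pod never
        # leaves the Init phase; these restarts only count toward the threshold
        # when includingInitContainers is "true".
        command: ["sh", "-c", "sleep 5; exit 1"]
      containers:
      - name: app
        image: nginx

With includingInitContainers set to "false", those restarts are ignored and the pod survives, which matches the step 4 output above.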
Verified with the payload below:
==================================
[ramakasturinarra@dhcp35-60 ocp_files]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-17-235851   True        False         3h27m   Cluster version is 4.5.0-0.nightly-2020-05-17-235851
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409