Bug 1833329

| Summary: | Descheduler should remove crashlooping pods | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Mike Dame <mdame> |
| Component: | kube-scheduler | Assignee: | Mike Dame <mdame> |
| Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.5 | CC: | aos-bugs, jchaloup, mfojtik |
| Target Milestone: | --- | | |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Enhancement |
| Doc Text: | Feature: Descheduler should remove pods which exceed a certain number of restarts. Reason: Constantly restarting pods are often crashlooping, and an eviction may place them onto a node where they are able to run. Result: The RemovePodsHavingTooManyRestarts strategy is now available for this. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-07-13 17:36:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Mike Dame
2020-05-08 12:50:07 UTC
Please see the updated README in the descheduler operator on how to configure this new strategy for verification. Specifically, it will require adding a section like this to the operator config:

- name: "RemovePodsHavingTooManyRestarts"
  params:
  - name: "PodRestartThreshold"
    value: "10"
  - name: "IncludeInitContainers"
    value: "false"
Switching to POST to include https://github.com/openshift/cluster-kube-descheduler-operator/pull/110, which fixes a README typo for the strategy. The correct strategy param is now IncludingInitContainers (renamed from IncludeInitContainers):

- name: "RemovePodsHavingTooManyRestarts"
  params:
  - name: "PodRestartThreshold"
    value: "10"
  - name: "IncludingInitContainers"
    value: "false"

1) Could enable the RemovePodsHavingTooManyRestarts strategy and set the podRestartThreshold value, and see that the values propagate well to the configmap (a sketch of where this strategy snippet sits in the full operator custom resource follows the configmaps below):
apiVersion: v1
data:
  policy.yaml: |
    strategies:
      RemovePodsHavingTooManyRestarts:
        enabled: true
        params:
          podsHavingTooManyRestarts:
            podRestartThreshold: 4

apiVersion: v1
data:
  policy.yaml: |
    strategies:
      RemovePodsHavingTooManyRestarts:
        enabled: true
        params:
          podsHavingTooManyRestarts:
            includingInitContainers: true
            podRestartThreshold: 4
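The strategy snippet above goes into the descheduler operator's configuration. A minimal sketch of a KubeDescheduler custom resource carrying it, assuming the 4.5-era CR layout described in the operator README (the apiVersion, namespace, and interval below are assumptions for illustration, not values taken from this bug):

apiVersion: operator.openshift.io/v1beta1   # assumed 4.5-era API version; check the operator README
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  deschedulingIntervalSeconds: 3600          # assumed interval; any value works
  strategies:
  - name: "RemovePodsHavingTooManyRestarts"
    params:
    - name: "PodRestartThreshold"
      value: "10"
    - name: "IncludingInitContainers"
      value: "false"

Once applied, the operator renders the strategy into the policy.yaml configmap shown above.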
2) Created a ReplicationController with replicas set to 3 and podRestartThreshold set to 4; after 4 restarts, the descheduler evicts the pods (a sketch of a similar crashlooping ReplicationController follows the log output below):
I0518 14:15:01.282015 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-ktg87
I0518 14:15:01.397058 1 evictions.go:99] Evicted pod: "nginx-7gvfz" in namespace "test"
I0518 14:15:01.397811 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"test", Name:"nginx-7gvfz", UID:"0d29e248-1f51-4f53-9924-94cc4def1ab0", APIVersion:"v1", ResourceVersion:"85188", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0518 14:15:01.475036 1 evictions.go:99] Evicted pod: "nginx-c4c5l" in namespace "test"
I0518 14:15:01.475530 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"test", Name:"nginx-c4c5l", UID:"6ffdbe53-ce1a-4421-baea-035f12108bbf", APIVersion:"v1", ResourceVersion:"85226", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0518 14:15:01.515833 1 evictions.go:99] Evicted pod: "nginx-ft92r" in namespace "test"
I0518 14:15:01.515854 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-nq7fr
I0518 14:15:01.516015 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"test", Name:"nginx-ft92r", UID:"a914c8a9-9f4c-4a51-8bc7-e6525c716ecd", APIVersion:"v1", ResourceVersion:"85181", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0518 14:15:01.578982 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-v95d7
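For reference, a ReplicationController that crashloops quickly, in the spirit of the test above, could look like the following sketch. The image and command are illustrative assumptions, not the exact manifest used in this verification:

apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
  namespace: test
spec:
  replicas: 3
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: crasher
        image: registry.access.redhat.com/ubi8/ubi-minimal   # assumed image; any small image works
        # exit shortly after start so the kubelet keeps restarting the container
        # and the pod's restart count climbs past podRestartThreshold
        command: ["/bin/sh", "-c", "sleep 10; exit 1"]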
3) Added includingInitContainers to the strategy as well and set it to "true"; once the init container restarts 4 times, the descheduler evicts the pod (a sketch of a similar Deployment with a failing init container follows the log output below).
[ramakasturinarra@dhcp35-60 ocp_files]$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
initcontainer-db54dc85b-fc5mv 0/1 Terminating 4 2m7s 10.129.2.65 knarra-518f-f5j6b-worker-ktg87 <none> <none>
initcontainer-db54dc85b-m29vv 0/1 Init:0/1 0 5s <none> knarra-518f-f5j6b-worker-ktg87 <none> <none>
I0518 14:25:41.497582 1 evictions.go:99] Evicted pod: "initcontainer-db54dc85b-7q5jp" in namespace "test"
I0518 14:25:41.497606 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-nq7fr
I0518 14:25:41.497935 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"test", Name:"initcontainer-db54dc85b-7q5jp", UID:"2b52a5c4-ed5e-491d-98fd-674b7521317b", APIVersion:"v1", ResourceVersion:"88801", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0518 14:25:41.582640 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-v95d7
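Similarly, a Deployment whose init container keeps failing, along the lines of the one used in this step, could be sketched as follows (images and commands are assumptions for illustration):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: initcontainer
  namespace: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: initcontainer
  template:
    metadata:
      labels:
        app: initcontainer
    spec:
      initContainers:
      - name: failing-init
        image: registry.access.redhat.com/ubi8/ubi-minimal   # assumed image
        # the init container exits non-zero, so the pod cycles through Init:Error /
        # Init:CrashLoopBackOff and its init-container restart count keeps growing
        command: ["/bin/sh", "-c", "sleep 10; exit 1"]
      containers:
      - name: app
        image: registry.access.redhat.com/ubi8/ubi-minimal
        command: ["/bin/sh", "-c", "sleep infinity"]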
4) When includingInitContainers is included and set to "false", the descheduler does not evict the pod even after the init container restarts more than 4 times, since init container restarts are not counted.
[ramakasturinarra@dhcp35-60 ocp_files]$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
initcontainer-db54dc85b-m29vv 0/1 Init:Error 5 3m16s 10.129.2.66 knarra-518f-f5j6b-worker-ktg87 <none> <none>
I0518 14:30:53.079707 1 node.go:45] node lister returned empty list, now fetch directly
I0518 14:30:53.092628 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-master-0
I0518 14:30:53.275930 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-master-1
I0518 14:30:53.378496 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-master-2
I0518 14:30:53.481785 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-ktg87
I0518 14:30:53.578795 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-nq7fr
I0518 14:30:53.680792 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-v95d7
5) When includingInitContainers was set back to "true", the init container pod is terminated and the descheduler evicts it, since the number of restarts exceeds the podRestartThreshold value set.
I0518 14:33:01.375038 1 toomanyrestarts.go:40] Processing node: knarra-518f-f5j6b-worker-nq7fr
I0518 14:33:01.375133 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"test", Name:"initcontainer-db54dc85b-m29vv", UID:"cc856cfc-7dfc-4541-8128-44a47f53bc94", APIVersion:"v1", ResourceVersion:"91001", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
Based on the above results, moving the bug to the verified state.
Verified with the payload below:
==================================
[ramakasturinarra@dhcp35-60 ocp_files]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-17-235851   True        False         3h27m   Cluster version is 4.5.0-0.nightly-2020-05-17-235851

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409