Bug 1843039
| Summary: | Add support for rescheduled pods with the same name in drain | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jaspreet Kaur <jkaur> |
| Component: | Cloud Compute | Assignee: | Alberto <agarcial> |
| Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | high | CC: | agarcial, apjagtap, mfuruta, mimccune |
| Version: | 3.11.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 3.11.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-06-17 20:21:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
```
$ oc version
oc v3.11.232
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-31-132-249.us-east-2.compute.internal:8443
openshift v3.11.219
kubernetes v1.11.0+d4cacc0
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2477

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
Description of problem:

The cluster autoscaler failed to scale down empty nodes, logging the following:

```
I0519 09:00:54.693432 1 scale_down.go:488] Scale-down: removing node node.example.com, utilization: 0.4111111111111111, pods to reschedule: ...,kafka-test/threadlauncher-kafka-2,...
I0519 09:00:54.723654 1 delete.go:53] Successfully added toBeDeletedTaint on node node.example.com
I0519 09:00:55.048903 1 request.go:481] Throttling request took 324.976105ms, request: POST:https://172.30.0.1:443/api/v1/namespaces/kafka-test/pods/threadlauncher-kafka-2/eviction
E0519 09:01:36.751622 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
E0519 09:01:41.886046 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
E0519 09:01:46.920844 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
E0519 09:01:51.976428 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
...
E0519 09:11:10.971390 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
E0519 09:11:16.056618 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
E0519 09:11:21.093897 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
E0519 09:11:26.145509 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
I0519 09:11:31.180610 1 delete.go:106] Releasing taint {Key:ToBeDeletedByClusterAutoscaler Value:1589878854 Effect:NoSchedule TimeAdded:<nil>} on node ip-10-100-218-67.ap-northeast-1.compute.internal
I0519 09:11:31.205876 1 delete.go:119] Successfully released toBeDeletedTaint on node ip-10-100-218-67.ap-northeast-1.compute.internal
E0519 09:11:31.205919 1 scale_down.go:506] Failed to delete ip-10-100-218-67.ap-northeast-1.compute.internal: Failed to drain node /ip-10-100-218-67.ap-northeast-1.compute.internal: pods remaining after timeout
```

The log repeatedly reports that threadlauncher-kafka-2 has not been deleted yet. However, according to our investigation, the pod had been successfully deleted from the scale-down node node.example.com. The pod the autoscaler kept finding was a replacement recreated under the same name after the eviction: note its CreationTimestamp (2020-05-19 09:01:32 UTC) is later than the eviction request (09:00:55). Based on this investigation we needed to add support for rescheduled pods with the same name in drain, as sketched below.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.
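For illustration, the following is a minimal sketch of the idea behind the fix, assuming a client-go clientset: while waiting for evicted pods to disappear, compare the returned pod's identity (UID and bound node), not just its name, so that a same-name pod recreated elsewhere (as StatefulSet controllers do, which matches the stable name threadlauncher-kafka-2) no longer blocks the drain. The helper name `waitForPodsGone`, its signature, and the 5-second poll interval are illustrative assumptions, not the autoscaler's actual code; the real change is in the upstream PR linked under "Additional info".

```go
package drain

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForPodsGone (hypothetical helper) polls until every evicted pod is
// gone from the drained node. A pod that still exists under the same name
// but with a different UID, or already bound to another node, is a
// rescheduled replacement and is treated as gone.
func waitForPodsGone(ctx context.Context, client kubernetes.Interface, drainedNode string, evicted []*corev1.Pod, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		remaining := false
		for _, pod := range evicted {
			returned, err := client.CoreV1().Pods(pod.Namespace).Get(ctx, pod.Name, metav1.GetOptions{})
			if apierrors.IsNotFound(err) {
				continue // pod fully deleted
			}
			if err != nil {
				return err
			}
			// Without this identity check, a controller that recreates the
			// pod under the same name keeps the loop reporting "Not deleted
			// yet" until the drain times out, as seen in this bug.
			if returned.UID != pod.UID || returned.Spec.NodeName != drainedNode {
				continue // same name, but a new pod rescheduled elsewhere
			}
			remaining = true
			break
		}
		if !remaining {
			return nil
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("pods remaining after timeout on node %s", drainedNode)
}
```

Checking the UID (or the node the pod is bound to) is what distinguishes the original pod from its same-name successor; name and namespace alone cannot, because StatefulSet pod names are stable across rescheduling.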
Actual results:

The cluster autoscaler failed to scale down empty nodes, timing out with "pods remaining after timeout".

Expected results:

The cluster autoscaler should successfully scale down empty nodes.

Additional info:

Related upstream PR: https://github.com/kubernetes/autoscaler/pull/830