Description of problem:

The cluster autoscaler failed to scale down empty nodes, logging the following:

I0519 09:00:54.693432 1 scale_down.go:488] Scale-down: removing node node.example.com, utilization: 0.4111111111111111, pods to reschedule: ...,kafka-test/threadlauncher-kafka-2,...
I0519 09:00:54.723654 1 delete.go:53] Successfully added toBeDeletedTaint on node node.example.com
I0519 09:00:55.048903 1 request.go:481] Throttling request took 324.976105ms, request: POST:https://172.30.0.1:443/api/v1/namespaces/kafka-test/pods/threadlauncher-kafka-2/eviction
E0519 09:01:36.751622 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
E0519 09:01:41.886046 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
E0519 09:01:46.920844 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
E0519 09:01:51.976428 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
...
E0519 09:11:10.971390 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
E0519 09:11:16.056618 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
E0519 09:11:21.093897 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
E0519 09:11:26.145509 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
I0519 09:11:31.180610 1 delete.go:106] Releasing taint {Key:ToBeDeletedByClusterAutoscaler Value:1589878854 Effect:NoSchedule TimeAdded:<nil>} on node ip-10-100-218-67.ap-northeast-1.compute.internal
I0519 09:11:31.205876 1 delete.go:119] Successfully released toBeDeletedTaint on node ip-10-100-218-67.ap-northeast-1.compute.internal
E0519 09:11:31.205919 1 scale_down.go:506] Failed to delete ip-10-100-218-67.ap-northeast-1.compute.internal: Failed to drain node /ip-10-100-218-67.ap-northeast-1.compute.internal: pods remaining after timeout

The log repeatedly reported that threadlauncher-kafka-2 had not been deleted yet. However, according to our investigation, the Pod had already been successfully deleted from the scale-down node node.example.com. The Pod the autoscaler kept finding was a replacement rescheduled under the same name (note its CreationTimestamp of 09:01:32, after the eviction was issued), so the name-based drain check never succeeded and the scale-down timed out.

As per the investigation, we needed to: add support for rescheduled pods with the same name in drain.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
The cluster autoscaler failed to scale down empty nodes; the drain timed out with "pods remaining after timeout".

Expected results:
The cluster autoscaler should scale down the empty node successfully.

Additional info:
Related upstream PR: https://github.com/kubernetes/autoscaler/pull/830
$ oc version
oc v3.11.232
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-31-132-249.us-east-2.compute.internal:8443
openshift v3.11.219
kubernetes v1.11.0+d4cacc0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2477
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days