Bug 1843039 - Add support for rescheduled pods with the same name in drain
Summary: Add support for rescheduled pods with the same name in drain
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 3.11.z
Assignee: Alberto
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-02 15:33 UTC by Jaspreet Kaur
Modified: 2023-10-06 20:22 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-17 20:21:27 UTC
Target Upstream Version:
Embargoed:


Links:
Github openshift kubernetes-autoscaler pull 155 (closed): Bug 1843039: Add support for rescheduled pods with the same name in drain (last updated 2021-01-18 07:42:05 UTC)
Red Hat Product Errata RHBA-2020:2477 (last updated 2020-06-17 20:21:37 UTC)

Description Jaspreet Kaur 2020-06-02 15:33:53 UTC
Description of problem:

The cluster autoscaler failed to scale down a node, logging the following messages:

     I0519 09:00:54.693432       1 scale_down.go:488] Scale-down: removing node node.example.com, utilization: 0.4111111111111111, pods to reschedule: ...,kafka-test/threadlauncher-kafka-2,...
     I0519 09:00:54.723654       1 delete.go:53] Successfully added toBeDeletedTaint on node node.example.com
     I0519 09:00:55.048903       1 request.go:481] Throttling request took 324.976105ms, request: POST:https://172.30.0.1:443/api/v1/namespaces/kafka-test/pods/threadlauncher-kafka-2/eviction
     E0519 09:01:36.751622       1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
     E0519 09:01:41.886046       1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
     E0519 09:01:46.920844       1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
     E0519 09:01:51.976428       1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
     ...
     E0519 09:11:10.971390       1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
     E0519 09:11:16.056618       1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
     E0519 09:11:21.093897       1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
     E0519 09:11:26.145509       1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
     I0519 09:11:31.180610       1 delete.go:106] Releasing taint {Key:ToBeDeletedByClusterAutoscaler Value:1589878854 Effect:NoSchedule TimeAdded:<nil>} on node ip-10-100-218-67.ap-northeast-1.compute.internal
     I0519 09:11:31.205876       1 delete.go:119] Successfully released toBeDeletedTaint on node ip-10-100-218-67.ap-northeast-1.compute.internal
     E0519 09:11:31.205919       1 scale_down.go:506] Failed to delete ip-10-100-218-67.ap-northeast-1.compute.internal: Failed to drain node /ip-10-100-218-67.ap-northeast-1.compute.internal: pods remaining after timeout

The log reports that threadlauncher-kafka-2 had not been deleted yet.
However, our investigation found that the pod had already been evicted from the node being scaled down (node.example.com). The pod named in the "Not deleted yet" messages is a same-named replacement that was rescheduled elsewhere, as shown by its CreationTimestamp (2020-05-19 09:01:32 UTC), which is later than the start of the drain (09:00:54 UTC).


Per the investigation, the fix needed is to add support for rescheduled pods with the same name in drain, so that a same-named replacement pod is not mistaken for the original pod that is still pending deletion (see the sketch below).
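
A rough sketch of the idea follows (in Go, since the autoscaler is written in Go). It is not the actual upstream patch: the function name podEffectivelyDeleted and the drainStart parameter are hypothetical, and only the CreationTimestamp comparison reflects the behaviour described in this bug and in the upstream PR. The point is that the "is the pod gone yet?" check must not treat a replacement pod, created under the same name after the drain began, as the original still-undeleted pod.

     // Hypothetical sketch: decide whether the pod we tried to evict is gone.
     // A StatefulSet recreates its pods under the same namespace/name, so the
     // pod returned by a Get after eviction may be a brand-new replacement
     // scheduled onto another node. Comparing its CreationTimestamp with the
     // time the drain started distinguishes the two cases.
     package drain

     import (
         "time"

         corev1 "k8s.io/api/core/v1"
     )

     // podEffectivelyDeleted (hypothetical name) reports whether the evicted pod
     // can be considered gone: either no object exists any more, or the object
     // that now exists under the same namespace/name was created after the drain
     // began and is therefore a rescheduled replacement, not the evicted pod.
     func podEffectivelyDeleted(current *corev1.Pod, drainStart time.Time) bool {
         if current == nil {
             return true // the pod no longer exists at all
         }
         return current.ObjectMeta.CreationTimestamp.Time.After(drainStart)
     }

Applied to the log above, the replacement pod's CreationTimestamp (09:01:32) is later than the moment the ToBeDeletedByClusterAutoscaler taint was added (09:00:54), so such a check would report the original threadlauncher-kafka-2 as deleted instead of repeating "Not deleted yet" until the drain fails with "pods remaining after timeout".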



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results: The cluster autoscaler failed to scale down the node; the drain timed out with "pods remaining after timeout".


Expected results: The cluster autoscaler should scale the node down successfully.

Additional info:

Related upstream PR: https://github.com/kubernetes/autoscaler/pull/830

Comment 4 sunzhaohua 2020-06-16 09:38:13 UTC
$ oc version
oc v3.11.232
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-31-132-249.us-east-2.compute.internal:8443
openshift v3.11.219
kubernetes v1.11.0+d4cacc0

Comment 6 errata-xmlrpc 2020-06-17 20:21:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2477

Comment 8 Red Hat Bugzilla 2023-09-14 06:01:34 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

