Description of problem:

The cluster autoscaler failed to scale down empty nodes, logging the following:

I0519 09:00:54.693432 1 scale_down.go:488] Scale-down: removing node node.example.com, utilization: 0.4111111111111111, pods to reschedule: ...,kafka-test/threadlauncher-kafka-2,...
I0519 09:00:54.723654 1 delete.go:53] Successfully added toBeDeletedTaint on node node.example.com
I0519 09:00:55.048903 1 request.go:481] Throttling request took 324.976105ms, request: POST:https://172.30.0.1:443/api/v1/namespaces/kafka-test/pods/threadlauncher-kafka-2/eviction
E0519 09:01:36.751622 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
E0519 09:01:41.886046 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
E0519 09:01:46.920844 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
E0519 09:01:51.976428 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
...
E0519 09:11:10.971390 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
E0519 09:11:16.056618 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
E0519 09:11:21.093897 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
E0519 09:11:26.145509 1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
I0519 09:11:31.180610 1 delete.go:106] Releasing taint {Key:ToBeDeletedByClusterAutoscaler Value:1589878854 Effect:NoSchedule TimeAdded:<nil>} on node ip-10-100-218-67.ap-northeast-1.compute.internal
I0519 09:11:31.205876 1 delete.go:119] Successfully released toBeDeletedTaint on node ip-10-100-218-67.ap-northeast-1.compute.internal
E0519 09:11:31.205919 1 scale_down.go:506] Failed to delete ip-10-100-218-67.ap-northeast-1.compute.internal: Failed to drain node /ip-10-100-218-67.ap-northeast-1.compute.internal: pods remaining after timeout

The log repeatedly reported that threadlauncher-kafka-2 had not been deleted yet. However, according to our investigation, the Pod had already been successfully deleted from the scale-down node node.example.com. The Pod the autoscaler kept finding was a replacement rescheduled under the same name (note its CreationTimestamp of 09:01:32, after the eviction was issued), so the name-based drain check never succeeded and the scale-down timed out.

As per the investigation, we needed to: add support for rescheduled pods with the same name in drain.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
The cluster autoscaler failed to scale down empty nodes; the drain timed out with "pods remaining after timeout".

Expected results:
The cluster autoscaler should scale down the empty node successfully.

Additional info:
Related upstream PR: https://github.com/kubernetes/autoscaler/pull/830
$ oc version
oc v3.11.232
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-31-132-249.us-east-2.compute.internal:8443
openshift v3.11.219
kubernetes v1.11.0+d4cacc0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2477
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days