Bug 1843039

Summary: Add support for rescheduled pods with the same name in drain
Product: OpenShift Container Platform
Reporter: Jaspreet Kaur <jkaur>
Component: Cloud Compute
Assignee: Alberto <agarcial>
Cloud Compute sub component: Other Providers
QA Contact: sunzhaohua <zhsun>
Status: CLOSED ERRATA
Docs Contact:
Severity: urgent
Priority: high
CC: agarcial, apjagtap, mfuruta, mimccune
Version: 3.11.0
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-06-17 20:21:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Jaspreet Kaur 2020-06-02 15:33:53 UTC
Description of problem:

The cluster autoscaler failed to scale down underutilized nodes, logging the following messages:

     I0519 09:00:54.693432       1 scale_down.go:488] Scale-down: removing node node.example.com, utilization: 0.4111111111111111, pods to reschedule: ...,kafka-test/threadlauncher-kafka-2,...
     I0519 09:00:54.723654       1 delete.go:53] Successfully added toBeDeletedTaint on node node.example.com
     I0519 09:00:55.048903       1 request.go:481] Throttling request took 324.976105ms, request: POST:https://172.30.0.1:443/api/v1/namespaces/kafka-test/pods/threadlauncher-kafka-2/eviction
     E0519 09:01:36.751622       1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
     E0519 09:01:41.886046       1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
     E0519 09:01:46.920844       1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
     E0519 09:01:51.976428       1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
     ...
     E0519 09:11:10.971390       1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
     E0519 09:11:16.056618       1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
     E0519 09:11:21.093897       1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
     E0519 09:11:26.145509       1 scale_down.go:766] Not deleted yet &Pod{...{Name:threadlauncher-kafka-2,...,Namespace:kafka-test,...,CreationTimestamp:2020-05-19 09:01:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil...
     I0519 09:11:31.180610       1 delete.go:106] Releasing taint {Key:ToBeDeletedByClusterAutoscaler Value:1589878854 Effect:NoSchedule TimeAdded:<nil>} on node ip-10-100-218-67.ap-northeast-1.compute.internal
     I0519 09:11:31.205876       1 delete.go:119] Successfully released toBeDeletedTaint on node ip-10-100-218-67.ap-northeast-1.compute.internal
     E0519 09:11:31.205919       1 scale_down.go:506] Failed to delete ip-10-100-218-67.ap-northeast-1.compute.internal: Failed to drain node /ip-10-100-218-67.ap-northeast-1.compute.internal: pods remaining after timeout

The log showed that threadlauncher-kafka-2 had not been deleted yet.
However, according to our investigation, the pod had already been successfully evicted from the node being scaled down (node.example.com). Note that the CreationTimestamp in the repeated "Not deleted yet" messages (2020-05-19 09:01:32 UTC) is later than the eviction request at 09:00:55, which indicates the pod the autoscaler kept finding was a replacement recreated under the same name by its controller (the ordinal name threadlauncher-kafka-2 suggests a StatefulSet), not the original pod.
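To make the failure mode concrete, here is a minimal, hypothetical sketch in Go of how a pre-fix "wait for deletion" check behaves; the package, type, and function names are illustrative and not taken from the actual cluster-autoscaler source. Because the check only asks whether a pod with the evicted pod's namespace/name still exists, a StatefulSet pod recreated under the same name on another node keeps it answering "not deleted yet" until the drain times out:

    // Hypothetical sketch of the pre-fix "wait for deletion" check; names are
    // illustrative, not the actual cluster-autoscaler code.
    package drain

    import (
        corev1 "k8s.io/api/core/v1"
        apierrors "k8s.io/apimachinery/pkg/api/errors"
    )

    // podLookup abstracts the API call that re-fetches a pod by namespace/name.
    type podLookup func(namespace, name string) (*corev1.Pod, error)

    // stillPendingDeletion mirrors the behaviour behind the repeated
    // "Not deleted yet" messages: as long as *any* pod with the evicted pod's
    // namespace/name exists, the drain is treated as unfinished. A StatefulSet
    // recreating threadlauncher-kafka-2 on another node therefore keeps this
    // returning true until "pods remaining after timeout".
    func stillPendingDeletion(get podLookup, evicted *corev1.Pod) (bool, error) {
        _, err := get(evicted.Namespace, evicted.Name)
        if apierrors.IsNotFound(err) {
            return false, nil // the pod is gone; deletion finished
        }
        if err != nil {
            return false, err
        }
        // A pod with the same name still exists => "Not deleted yet".
        return true, nil
    }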


As per the investigation, we needed to add support for rescheduled pods with the same name in drain (sketched below).
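The general idea of the change, again as a hypothetical sketch rather than the real implementation (the exact conditions are in the upstream PR linked under "Additional info"), is to stop treating "a pod with the same name exists" as "the pod was not deleted". If the pod found under that name is demonstrably a new object, the original pod must already have been deleted and the drain can count it as done:

    // Hypothetical sketch of the added check; field choices are illustrative.
    package drain

    import (
        "time"

        corev1 "k8s.io/api/core/v1"
    )

    // isRescheduledReplacement reports whether the pod currently found under the
    // evicted pod's namespace/name is a new object recreated by its controller
    // (here, the StatefulSet behind threadlauncher-kafka-2) rather than the
    // original pod. When it returns true, the drain loop should treat the
    // original pod as deleted instead of logging "Not deleted yet" until timeout.
    func isRescheduledReplacement(evicted, current *corev1.Pod, evictionTime time.Time) bool {
        // A different UID, or a creation timestamp later than the eviction
        // request, can only belong to a replacement pod.
        return current.UID != evicted.UID || current.CreationTimestamp.After(evictionTime)
    }

With a check along these lines, the node in the log above would have been removed once threadlauncher-kafka-2 was recreated elsewhere, instead of failing the drain with "pods remaining after timeout".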



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results: The cluster autoscaler failed to scale down the node, failing the drain with "pods remaining after timeout".


Expected results: The cluster autoscaler should successfully scale down the node.

Additional info:

Related:

Upstream PR:  https://github.com/kubernetes/autoscaler/pull/830

Comment 4 sunzhaohua 2020-06-16 09:38:13 UTC
$ oc version
oc v3.11.232
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-31-132-249.us-east-2.compute.internal:8443
openshift v3.11.219
kubernetes v1.11.0+d4cacc0

Comment 6 errata-xmlrpc 2020-06-17 20:21:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2477

Comment 8 Red Hat Bugzilla 2023-09-14 06:01:34 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days