1736293 – GCP machine-controller does not honor requeue delay

Bug 1736293 - GCP machine-controller does not honor requeue delay

Summary: GCP machine-controller does not honor requeue delay

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.2.0
Assignee:	Michael Gugino
QA Contact:	Jianwei Hou
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2019-08-01 16:47 UTC by Michael Gugino
Modified:	2019-10-16 06:34 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-10-16 06:34:29 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:2922	0	None	None	None	2019-10-16 06:34:39 UTC

Description Michael Gugino 2019-08-01 16:47:10 UTC

Description of problem:
In GCP machine-controller, logs indicate we'll retry after 20 seconds if delete operation fails.  However, we don't actually wait 20 seconds.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Delete a machine
2. Async delete operations result in requeue less than 20 seconds after.

Actual results:
I0801 16:42:57.186071       1 controller.go:193] Reconciling machine "mgugin-p2sqv-m-9" triggers delete
I0801 16:42:57.186081       1 actuator.go:120] mgugin-p2sqv-m-9: Deleting machine
I0801 16:42:57.718635       1 reconciler.go:254] Machine "mgugin-p2sqv-m-9" already exists
I0801 16:42:58.071462       1 reconciler.go:283] mgugin-p2sqv-m-9: machine status is exists, requeuing...
E0801 16:42:58.071528       1 actuator.go:58] mgugin-p2sqv-m-9: Machine error: requeue in: 20s
E0801 16:42:58.071540       1 controller.go:208] Failed to delete machine "mgugin-p2sqv-m-9": requeue in: 20s
I0801 16:42:59.072007       1 controller.go:129] Reconciling Machine "mgugin-p2sqv-m-9"
I0801 16:42:59.072061       1 controller.go:298] Machine "mgugin-p2sqv-m-9" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0801 16:42:59.072074       1 controller.go:193] Reconciling machine "mgugin-p2sqv-m-9" triggers delete
I0801 16:42:59.072084       1 actuator.go:120] mgugin-p2sqv-m-9: Deleting machine
I0801 16:42:59.629996       1 reconciler.go:254] Machine "mgugin-p2sqv-m-9" already exists
I0801 16:43:00.078596       1 reconciler.go:283] mgugin-p2sqv-m-9: machine status is exists, requeuing...
E0801 16:43:00.078672       1 actuator.go:58] mgugin-p2sqv-m-9: Machine error: requeue in: 20s
E0801 16:43:00.078689       1 controller.go:208] Failed to delete machine "mgugin-p2sqv-m-9": requeue in: 20s
I0801 16:43:01.359125       1 controller.go:129] Reconciling Machine "mgugin-p2sqv-m-9"
I0801 16:43:01.359177       1 controller.go:298] Machine "mgugin-p2sqv-m-9" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0801 16:43:01.359191       1 controller.go:193] Reconciling machine "mgugin-p2sqv-m-9" triggers delete
I0801 16:43:01.359200       1 actuator.go:120] mgugin-p2sqv-m-9: Deleting machine
I0801 16:43:01.903775       1 reconciler.go:254] Machine "mgugin-p2sqv-m-9" already exists
I0801 16:43:02.248526       1 reconciler.go:283] mgugin-p2sqv-m-9: machine status is exists, requeuing...
E0801 16:43:02.248609       1 actuator.go:58] mgugin-p2sqv-m-9: Machine error: requeue in: 20s
E0801 16:43:02.248623       1 controller.go:208] Failed to delete machine "mgugin-p2sqv-m-9": requeue in: 20s


Expected results:
20 second delay in timestamps between delete attempts.

Additional info:
This might impact other cloud providers/actuators as well.

Comment 1 Michael Gugino 2019-08-01 23:16:00 UTC

PR for GCP in master: https://github.com/openshift/cluster-api-provider-gcp/pull/44

Comment 2 Michael Gugino 2019-08-05 22:48:05 UTC

This has merged.  I looked into AWS, didn't seem like it was affected, but I'll need to spin up a cluster to verify for sure.

In case of GCP, how to QE:

Before fix:
Delete a node that has a pod with PDB on it (such as a master node).
Lot at logs of machine-controller

oc get pods -n openshift-machine-api
#record pod name for machine-api-controllers-*
oc logs -n openshift-machine-api $PODNAME -c machine-controller

Verify that there is less than 20 seconds between deletion attempts, despite the fact it says 'requeue in: 20s'

After fix:
Do above, verify the timestamp between delete attempts is 20 sec.

Comment 4 sunzhaohua 2019-08-09 07:27:40 UTC

Verified.
clusterversion: 4.2.0-0.nightly-2019-08-08-103722

delete a machine.
check machine-controller logs. The timestamp between delete attempts is 20 sec

I0809 07:21:19.794235       1 controller.go:205] Reconciling machine "zhsun0-nks2r-w-c-4hjtt" triggers delete
I0809 07:21:19.866608       1 info.go:16] ignoring DaemonSet-managed pods: tuned-kr5sv, dns-default-r6xwk, node-ca-h2w22, machine-config-daemon-9jjpw, node-exporter-zxn72, multus-rmmd2, ovs-wz9jp, sdn-2sgkx
I0809 07:21:19.866653       1 info.go:20] drained node "zhsun0-nks2r-w-c-4hjtt.c.openshift-gce-devel.internal"
I0809 07:21:19.866666       1 controller.go:302] drain successful for machine "zhsun0-nks2r-w-c-4hjtt"
I0809 07:21:19.866696       1 actuator.go:116] zhsun0-nks2r-w-c-4hjtt: Deleting machine
I0809 07:21:20.345029       1 reconciler.go:253] Machine "zhsun0-nks2r-w-c-4hjtt" already exists
I0809 07:21:20.969860       1 reconciler.go:281] zhsun0-nks2r-w-c-4hjtt: machine status is exists, requeuing...
E0809 07:21:20.969940       1 controller.go:220] Failed to delete machine "zhsun0-nks2r-w-c-4hjtt": requeue in: 20s
I0809 07:21:20.969949       1 controller.go:364] Actuator returned requeue-after error: requeue in: 20s
I0809 07:21:40.970274       1 controller.go:141] Reconciling Machine "zhsun0-nks2r-w-c-4hjtt"
I0809 07:21:40.970332       1 controller.go:310] Machine "zhsun0-nks2r-w-c-4hjtt" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0809 07:21:40.970351       1 controller.go:205] Reconciling machine "zhsun0-nks2r-w-c-4hjtt" triggers delete
I0809 07:21:41.051958       1 info.go:16] ignoring DaemonSet-managed pods: tuned-kr5sv, dns-default-r6xwk, node-ca-h2w22, machine-config-daemon-9jjpw, node-exporter-zxn72, multus-rmmd2, ovs-wz9jp, sdn-2sgkx
I0809 07:21:41.052261       1 info.go:20] drained node "zhsun0-nks2r-w-c-4hjtt.c.openshift-gce-devel.internal"
I0809 07:21:41.052286       1 controller.go:302] drain successful for machine "zhsun0-nks2r-w-c-4hjtt"
I0809 07:21:41.052331       1 actuator.go:116] zhsun0-nks2r-w-c-4hjtt: Deleting machine
I0809 07:21:41.434263       1 reconciler.go:253] Machine "zhsun0-nks2r-w-c-4hjtt" already exists
I0809 07:21:41.941530       1 reconciler.go:281] zhsun0-nks2r-w-c-4hjtt: machine status is exists, requeuing...
E0809 07:21:41.941872       1 controller.go:220] Failed to delete machine "zhsun0-nks2r-w-c-4hjtt": requeue in: 20s
I0809 07:21:41.942631       1 controller.go:364] Actuator returned requeue-after error: requeue in: 20s
I0809 07:22:01.943175       1 controller.go:141] Reconciling Machine "zhsun0-nks2r-w-c-4hjtt"
I0809 07:22:01.943227       1 controller.go:310] Machine "zhsun0-nks2r-w-c-4hjtt" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0809 07:22:01.943241       1 controller.go:205] Reconciling machine "zhsun0-nks2r-w-c-4hjtt" triggers delete
I0809 07:22:02.015331       1 info.go:16] ignoring DaemonSet-managed pods: tuned-kr5sv, dns-default-r6xwk, node-ca-h2w22, machine-config-daemon-9jjpw, node-exporter-zxn72, multus-rmmd2, ovs-wz9jp, sdn-2sgkx
I0809 07:22:02.015358       1 info.go:20] drained node "zhsun0-nks2r-w-c-4hjtt.c.openshift-gce-devel.internal"
I0809 07:22:02.015370       1 controller.go:302] drain successful for machine "zhsun0-nks2r-w-c-4hjtt"
I0809 07:22:02.015407       1 actuator.go:116] zhsun0-nks2r-w-c-4hjtt: Deleting machine
I0809 07:22:02.382710       1 reconciler.go:253] Machine "zhsun0-nks2r-w-c-4hjtt" already exists
I0809 07:22:02.840781       1 reconciler.go:281] zhsun0-nks2r-w-c-4hjtt: machine status is exists, requeuing...
E0809 07:22:02.840858       1 controller.go:220] Failed to delete machine "zhsun0-nks2r-w-c-4hjtt": requeue in: 20s
I0809 07:22:02.840870       1 controller.go:364] Actuator returned requeue-after error: requeue in: 20s
I0809 07:22:22.841145       1 controller.go:141] Reconciling Machine "zhsun0-nks2r-w-c-4hjtt"

Comment 5 errata-xmlrpc 2019-10-16 06:34:29 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

Note You need to log in before you can comment on or make changes to this bug.