Bug 1736293 - GCP machine-controller does not honor requeue delay
Summary: GCP machine-controller does not honor requeue delay
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
unspecified
medium
Target Milestone: ---
: 4.2.0
Assignee: Michael Gugino
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-08-01 16:47 UTC by Michael Gugino
Modified: 2019-10-16 06:34 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:34:29 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:2922 0 None None None 2019-10-16 06:34:39 UTC

Description Michael Gugino 2019-08-01 16:47:10 UTC
Description of problem:
In GCP machine-controller, logs indicate we'll retry after 20 seconds if delete operation fails.  However, we don't actually wait 20 seconds.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Delete a machine
2. Async delete operations result in requeue less than 20 seconds after.

Actual results:
I0801 16:42:57.186071       1 controller.go:193] Reconciling machine "mgugin-p2sqv-m-9" triggers delete
I0801 16:42:57.186081       1 actuator.go:120] mgugin-p2sqv-m-9: Deleting machine
I0801 16:42:57.718635       1 reconciler.go:254] Machine "mgugin-p2sqv-m-9" already exists
I0801 16:42:58.071462       1 reconciler.go:283] mgugin-p2sqv-m-9: machine status is exists, requeuing...
E0801 16:42:58.071528       1 actuator.go:58] mgugin-p2sqv-m-9: Machine error: requeue in: 20s
E0801 16:42:58.071540       1 controller.go:208] Failed to delete machine "mgugin-p2sqv-m-9": requeue in: 20s
I0801 16:42:59.072007       1 controller.go:129] Reconciling Machine "mgugin-p2sqv-m-9"
I0801 16:42:59.072061       1 controller.go:298] Machine "mgugin-p2sqv-m-9" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0801 16:42:59.072074       1 controller.go:193] Reconciling machine "mgugin-p2sqv-m-9" triggers delete
I0801 16:42:59.072084       1 actuator.go:120] mgugin-p2sqv-m-9: Deleting machine
I0801 16:42:59.629996       1 reconciler.go:254] Machine "mgugin-p2sqv-m-9" already exists
I0801 16:43:00.078596       1 reconciler.go:283] mgugin-p2sqv-m-9: machine status is exists, requeuing...
E0801 16:43:00.078672       1 actuator.go:58] mgugin-p2sqv-m-9: Machine error: requeue in: 20s
E0801 16:43:00.078689       1 controller.go:208] Failed to delete machine "mgugin-p2sqv-m-9": requeue in: 20s
I0801 16:43:01.359125       1 controller.go:129] Reconciling Machine "mgugin-p2sqv-m-9"
I0801 16:43:01.359177       1 controller.go:298] Machine "mgugin-p2sqv-m-9" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0801 16:43:01.359191       1 controller.go:193] Reconciling machine "mgugin-p2sqv-m-9" triggers delete
I0801 16:43:01.359200       1 actuator.go:120] mgugin-p2sqv-m-9: Deleting machine
I0801 16:43:01.903775       1 reconciler.go:254] Machine "mgugin-p2sqv-m-9" already exists
I0801 16:43:02.248526       1 reconciler.go:283] mgugin-p2sqv-m-9: machine status is exists, requeuing...
E0801 16:43:02.248609       1 actuator.go:58] mgugin-p2sqv-m-9: Machine error: requeue in: 20s
E0801 16:43:02.248623       1 controller.go:208] Failed to delete machine "mgugin-p2sqv-m-9": requeue in: 20s


Expected results:
20 second delay in timestamps between delete attempts.

Additional info:
This might impact other cloud providers/actuators as well.

Comment 1 Michael Gugino 2019-08-01 23:16:00 UTC
PR for GCP in master: https://github.com/openshift/cluster-api-provider-gcp/pull/44

Comment 2 Michael Gugino 2019-08-05 22:48:05 UTC
This has merged.  I looked into AWS, didn't seem like it was affected, but I'll need to spin up a cluster to verify for sure.

In case of GCP, how to QE:

Before fix:
Delete a node that has a pod with PDB on it (such as a master node).
Lot at logs of machine-controller

oc get pods -n openshift-machine-api
#record pod name for machine-api-controllers-*
oc logs -n openshift-machine-api $PODNAME -c machine-controller

Verify that there is less than 20 seconds between deletion attempts, despite the fact it says 'requeue in: 20s'

After fix:
Do above, verify the timestamp between delete attempts is 20 sec.

Comment 4 sunzhaohua 2019-08-09 07:27:40 UTC
Verified.
clusterversion: 4.2.0-0.nightly-2019-08-08-103722

delete a machine.
check machine-controller logs. The timestamp between delete attempts is 20 sec

I0809 07:21:19.794235       1 controller.go:205] Reconciling machine "zhsun0-nks2r-w-c-4hjtt" triggers delete
I0809 07:21:19.866608       1 info.go:16] ignoring DaemonSet-managed pods: tuned-kr5sv, dns-default-r6xwk, node-ca-h2w22, machine-config-daemon-9jjpw, node-exporter-zxn72, multus-rmmd2, ovs-wz9jp, sdn-2sgkx
I0809 07:21:19.866653       1 info.go:20] drained node "zhsun0-nks2r-w-c-4hjtt.c.openshift-gce-devel.internal"
I0809 07:21:19.866666       1 controller.go:302] drain successful for machine "zhsun0-nks2r-w-c-4hjtt"
I0809 07:21:19.866696       1 actuator.go:116] zhsun0-nks2r-w-c-4hjtt: Deleting machine
I0809 07:21:20.345029       1 reconciler.go:253] Machine "zhsun0-nks2r-w-c-4hjtt" already exists
I0809 07:21:20.969860       1 reconciler.go:281] zhsun0-nks2r-w-c-4hjtt: machine status is exists, requeuing...
E0809 07:21:20.969940       1 controller.go:220] Failed to delete machine "zhsun0-nks2r-w-c-4hjtt": requeue in: 20s
I0809 07:21:20.969949       1 controller.go:364] Actuator returned requeue-after error: requeue in: 20s
I0809 07:21:40.970274       1 controller.go:141] Reconciling Machine "zhsun0-nks2r-w-c-4hjtt"
I0809 07:21:40.970332       1 controller.go:310] Machine "zhsun0-nks2r-w-c-4hjtt" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0809 07:21:40.970351       1 controller.go:205] Reconciling machine "zhsun0-nks2r-w-c-4hjtt" triggers delete
I0809 07:21:41.051958       1 info.go:16] ignoring DaemonSet-managed pods: tuned-kr5sv, dns-default-r6xwk, node-ca-h2w22, machine-config-daemon-9jjpw, node-exporter-zxn72, multus-rmmd2, ovs-wz9jp, sdn-2sgkx
I0809 07:21:41.052261       1 info.go:20] drained node "zhsun0-nks2r-w-c-4hjtt.c.openshift-gce-devel.internal"
I0809 07:21:41.052286       1 controller.go:302] drain successful for machine "zhsun0-nks2r-w-c-4hjtt"
I0809 07:21:41.052331       1 actuator.go:116] zhsun0-nks2r-w-c-4hjtt: Deleting machine
I0809 07:21:41.434263       1 reconciler.go:253] Machine "zhsun0-nks2r-w-c-4hjtt" already exists
I0809 07:21:41.941530       1 reconciler.go:281] zhsun0-nks2r-w-c-4hjtt: machine status is exists, requeuing...
E0809 07:21:41.941872       1 controller.go:220] Failed to delete machine "zhsun0-nks2r-w-c-4hjtt": requeue in: 20s
I0809 07:21:41.942631       1 controller.go:364] Actuator returned requeue-after error: requeue in: 20s
I0809 07:22:01.943175       1 controller.go:141] Reconciling Machine "zhsun0-nks2r-w-c-4hjtt"
I0809 07:22:01.943227       1 controller.go:310] Machine "zhsun0-nks2r-w-c-4hjtt" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0809 07:22:01.943241       1 controller.go:205] Reconciling machine "zhsun0-nks2r-w-c-4hjtt" triggers delete
I0809 07:22:02.015331       1 info.go:16] ignoring DaemonSet-managed pods: tuned-kr5sv, dns-default-r6xwk, node-ca-h2w22, machine-config-daemon-9jjpw, node-exporter-zxn72, multus-rmmd2, ovs-wz9jp, sdn-2sgkx
I0809 07:22:02.015358       1 info.go:20] drained node "zhsun0-nks2r-w-c-4hjtt.c.openshift-gce-devel.internal"
I0809 07:22:02.015370       1 controller.go:302] drain successful for machine "zhsun0-nks2r-w-c-4hjtt"
I0809 07:22:02.015407       1 actuator.go:116] zhsun0-nks2r-w-c-4hjtt: Deleting machine
I0809 07:22:02.382710       1 reconciler.go:253] Machine "zhsun0-nks2r-w-c-4hjtt" already exists
I0809 07:22:02.840781       1 reconciler.go:281] zhsun0-nks2r-w-c-4hjtt: machine status is exists, requeuing...
E0809 07:22:02.840858       1 controller.go:220] Failed to delete machine "zhsun0-nks2r-w-c-4hjtt": requeue in: 20s
I0809 07:22:02.840870       1 controller.go:364] Actuator returned requeue-after error: requeue in: 20s
I0809 07:22:22.841145       1 controller.go:141] Reconciling Machine "zhsun0-nks2r-w-c-4hjtt"

Comment 5 errata-xmlrpc 2019-10-16 06:34:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922


Note You need to log in before you can comment on or make changes to this bug.