Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1736293

Summary: GCP machine-controller does not honor requeue delay
Product: OpenShift Container Platform Reporter: Michael Gugino <mgugino>
Component: Cloud ComputeAssignee: Michael Gugino <mgugino>
Status: CLOSED ERRATA QA Contact: Jianwei Hou <jhou>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.2.0CC: agarcial, zhsun
Target Milestone: ---   
Target Release: 4.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-10-16 06:34:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Michael Gugino 2019-08-01 16:47:10 UTC
Description of problem:
In GCP machine-controller, logs indicate we'll retry after 20 seconds if delete operation fails.  However, we don't actually wait 20 seconds.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Delete a machine
2. Async delete operations result in requeue less than 20 seconds after.

Actual results:
I0801 16:42:57.186071       1 controller.go:193] Reconciling machine "mgugin-p2sqv-m-9" triggers delete
I0801 16:42:57.186081       1 actuator.go:120] mgugin-p2sqv-m-9: Deleting machine
I0801 16:42:57.718635       1 reconciler.go:254] Machine "mgugin-p2sqv-m-9" already exists
I0801 16:42:58.071462       1 reconciler.go:283] mgugin-p2sqv-m-9: machine status is exists, requeuing...
E0801 16:42:58.071528       1 actuator.go:58] mgugin-p2sqv-m-9: Machine error: requeue in: 20s
E0801 16:42:58.071540       1 controller.go:208] Failed to delete machine "mgugin-p2sqv-m-9": requeue in: 20s
I0801 16:42:59.072007       1 controller.go:129] Reconciling Machine "mgugin-p2sqv-m-9"
I0801 16:42:59.072061       1 controller.go:298] Machine "mgugin-p2sqv-m-9" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0801 16:42:59.072074       1 controller.go:193] Reconciling machine "mgugin-p2sqv-m-9" triggers delete
I0801 16:42:59.072084       1 actuator.go:120] mgugin-p2sqv-m-9: Deleting machine
I0801 16:42:59.629996       1 reconciler.go:254] Machine "mgugin-p2sqv-m-9" already exists
I0801 16:43:00.078596       1 reconciler.go:283] mgugin-p2sqv-m-9: machine status is exists, requeuing...
E0801 16:43:00.078672       1 actuator.go:58] mgugin-p2sqv-m-9: Machine error: requeue in: 20s
E0801 16:43:00.078689       1 controller.go:208] Failed to delete machine "mgugin-p2sqv-m-9": requeue in: 20s
I0801 16:43:01.359125       1 controller.go:129] Reconciling Machine "mgugin-p2sqv-m-9"
I0801 16:43:01.359177       1 controller.go:298] Machine "mgugin-p2sqv-m-9" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0801 16:43:01.359191       1 controller.go:193] Reconciling machine "mgugin-p2sqv-m-9" triggers delete
I0801 16:43:01.359200       1 actuator.go:120] mgugin-p2sqv-m-9: Deleting machine
I0801 16:43:01.903775       1 reconciler.go:254] Machine "mgugin-p2sqv-m-9" already exists
I0801 16:43:02.248526       1 reconciler.go:283] mgugin-p2sqv-m-9: machine status is exists, requeuing...
E0801 16:43:02.248609       1 actuator.go:58] mgugin-p2sqv-m-9: Machine error: requeue in: 20s
E0801 16:43:02.248623       1 controller.go:208] Failed to delete machine "mgugin-p2sqv-m-9": requeue in: 20s


Expected results:
20 second delay in timestamps between delete attempts.

Additional info:
This might impact other cloud providers/actuators as well.

Comment 1 Michael Gugino 2019-08-01 23:16:00 UTC
PR for GCP in master: https://github.com/openshift/cluster-api-provider-gcp/pull/44

Comment 2 Michael Gugino 2019-08-05 22:48:05 UTC
This has merged.  I looked into AWS, didn't seem like it was affected, but I'll need to spin up a cluster to verify for sure.

In case of GCP, how to QE:

Before fix:
Delete a node that has a pod with PDB on it (such as a master node).
Lot at logs of machine-controller

oc get pods -n openshift-machine-api
#record pod name for machine-api-controllers-*
oc logs -n openshift-machine-api $PODNAME -c machine-controller

Verify that there is less than 20 seconds between deletion attempts, despite the fact it says 'requeue in: 20s'

After fix:
Do above, verify the timestamp between delete attempts is 20 sec.

Comment 4 sunzhaohua 2019-08-09 07:27:40 UTC
Verified.
clusterversion: 4.2.0-0.nightly-2019-08-08-103722

delete a machine.
check machine-controller logs. The timestamp between delete attempts is 20 sec

I0809 07:21:19.794235       1 controller.go:205] Reconciling machine "zhsun0-nks2r-w-c-4hjtt" triggers delete
I0809 07:21:19.866608       1 info.go:16] ignoring DaemonSet-managed pods: tuned-kr5sv, dns-default-r6xwk, node-ca-h2w22, machine-config-daemon-9jjpw, node-exporter-zxn72, multus-rmmd2, ovs-wz9jp, sdn-2sgkx
I0809 07:21:19.866653       1 info.go:20] drained node "zhsun0-nks2r-w-c-4hjtt.c.openshift-gce-devel.internal"
I0809 07:21:19.866666       1 controller.go:302] drain successful for machine "zhsun0-nks2r-w-c-4hjtt"
I0809 07:21:19.866696       1 actuator.go:116] zhsun0-nks2r-w-c-4hjtt: Deleting machine
I0809 07:21:20.345029       1 reconciler.go:253] Machine "zhsun0-nks2r-w-c-4hjtt" already exists
I0809 07:21:20.969860       1 reconciler.go:281] zhsun0-nks2r-w-c-4hjtt: machine status is exists, requeuing...
E0809 07:21:20.969940       1 controller.go:220] Failed to delete machine "zhsun0-nks2r-w-c-4hjtt": requeue in: 20s
I0809 07:21:20.969949       1 controller.go:364] Actuator returned requeue-after error: requeue in: 20s
I0809 07:21:40.970274       1 controller.go:141] Reconciling Machine "zhsun0-nks2r-w-c-4hjtt"
I0809 07:21:40.970332       1 controller.go:310] Machine "zhsun0-nks2r-w-c-4hjtt" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0809 07:21:40.970351       1 controller.go:205] Reconciling machine "zhsun0-nks2r-w-c-4hjtt" triggers delete
I0809 07:21:41.051958       1 info.go:16] ignoring DaemonSet-managed pods: tuned-kr5sv, dns-default-r6xwk, node-ca-h2w22, machine-config-daemon-9jjpw, node-exporter-zxn72, multus-rmmd2, ovs-wz9jp, sdn-2sgkx
I0809 07:21:41.052261       1 info.go:20] drained node "zhsun0-nks2r-w-c-4hjtt.c.openshift-gce-devel.internal"
I0809 07:21:41.052286       1 controller.go:302] drain successful for machine "zhsun0-nks2r-w-c-4hjtt"
I0809 07:21:41.052331       1 actuator.go:116] zhsun0-nks2r-w-c-4hjtt: Deleting machine
I0809 07:21:41.434263       1 reconciler.go:253] Machine "zhsun0-nks2r-w-c-4hjtt" already exists
I0809 07:21:41.941530       1 reconciler.go:281] zhsun0-nks2r-w-c-4hjtt: machine status is exists, requeuing...
E0809 07:21:41.941872       1 controller.go:220] Failed to delete machine "zhsun0-nks2r-w-c-4hjtt": requeue in: 20s
I0809 07:21:41.942631       1 controller.go:364] Actuator returned requeue-after error: requeue in: 20s
I0809 07:22:01.943175       1 controller.go:141] Reconciling Machine "zhsun0-nks2r-w-c-4hjtt"
I0809 07:22:01.943227       1 controller.go:310] Machine "zhsun0-nks2r-w-c-4hjtt" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0809 07:22:01.943241       1 controller.go:205] Reconciling machine "zhsun0-nks2r-w-c-4hjtt" triggers delete
I0809 07:22:02.015331       1 info.go:16] ignoring DaemonSet-managed pods: tuned-kr5sv, dns-default-r6xwk, node-ca-h2w22, machine-config-daemon-9jjpw, node-exporter-zxn72, multus-rmmd2, ovs-wz9jp, sdn-2sgkx
I0809 07:22:02.015358       1 info.go:20] drained node "zhsun0-nks2r-w-c-4hjtt.c.openshift-gce-devel.internal"
I0809 07:22:02.015370       1 controller.go:302] drain successful for machine "zhsun0-nks2r-w-c-4hjtt"
I0809 07:22:02.015407       1 actuator.go:116] zhsun0-nks2r-w-c-4hjtt: Deleting machine
I0809 07:22:02.382710       1 reconciler.go:253] Machine "zhsun0-nks2r-w-c-4hjtt" already exists
I0809 07:22:02.840781       1 reconciler.go:281] zhsun0-nks2r-w-c-4hjtt: machine status is exists, requeuing...
E0809 07:22:02.840858       1 controller.go:220] Failed to delete machine "zhsun0-nks2r-w-c-4hjtt": requeue in: 20s
I0809 07:22:02.840870       1 controller.go:364] Actuator returned requeue-after error: requeue in: 20s
I0809 07:22:22.841145       1 controller.go:141] Reconciling Machine "zhsun0-nks2r-w-c-4hjtt"

Comment 5 errata-xmlrpc 2019-10-16 06:34:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922