Description of problem: In GCP machine-controller, logs indicate we'll retry after 20 seconds if delete operation fails. However, we don't actually wait 20 seconds. Version-Release number of selected component (if applicable): How reproducible: 100% Steps to Reproduce: 1. Delete a machine 2. Async delete operations result in requeue less than 20 seconds after. Actual results: I0801 16:42:57.186071 1 controller.go:193] Reconciling machine "mgugin-p2sqv-m-9" triggers delete I0801 16:42:57.186081 1 actuator.go:120] mgugin-p2sqv-m-9: Deleting machine I0801 16:42:57.718635 1 reconciler.go:254] Machine "mgugin-p2sqv-m-9" already exists I0801 16:42:58.071462 1 reconciler.go:283] mgugin-p2sqv-m-9: machine status is exists, requeuing... E0801 16:42:58.071528 1 actuator.go:58] mgugin-p2sqv-m-9: Machine error: requeue in: 20s E0801 16:42:58.071540 1 controller.go:208] Failed to delete machine "mgugin-p2sqv-m-9": requeue in: 20s I0801 16:42:59.072007 1 controller.go:129] Reconciling Machine "mgugin-p2sqv-m-9" I0801 16:42:59.072061 1 controller.go:298] Machine "mgugin-p2sqv-m-9" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster I0801 16:42:59.072074 1 controller.go:193] Reconciling machine "mgugin-p2sqv-m-9" triggers delete I0801 16:42:59.072084 1 actuator.go:120] mgugin-p2sqv-m-9: Deleting machine I0801 16:42:59.629996 1 reconciler.go:254] Machine "mgugin-p2sqv-m-9" already exists I0801 16:43:00.078596 1 reconciler.go:283] mgugin-p2sqv-m-9: machine status is exists, requeuing... E0801 16:43:00.078672 1 actuator.go:58] mgugin-p2sqv-m-9: Machine error: requeue in: 20s E0801 16:43:00.078689 1 controller.go:208] Failed to delete machine "mgugin-p2sqv-m-9": requeue in: 20s I0801 16:43:01.359125 1 controller.go:129] Reconciling Machine "mgugin-p2sqv-m-9" I0801 16:43:01.359177 1 controller.go:298] Machine "mgugin-p2sqv-m-9" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster I0801 16:43:01.359191 1 controller.go:193] Reconciling machine "mgugin-p2sqv-m-9" triggers delete I0801 16:43:01.359200 1 actuator.go:120] mgugin-p2sqv-m-9: Deleting machine I0801 16:43:01.903775 1 reconciler.go:254] Machine "mgugin-p2sqv-m-9" already exists I0801 16:43:02.248526 1 reconciler.go:283] mgugin-p2sqv-m-9: machine status is exists, requeuing... E0801 16:43:02.248609 1 actuator.go:58] mgugin-p2sqv-m-9: Machine error: requeue in: 20s E0801 16:43:02.248623 1 controller.go:208] Failed to delete machine "mgugin-p2sqv-m-9": requeue in: 20s Expected results: 20 second delay in timestamps between delete attempts. Additional info: This might impact other cloud providers/actuators as well.
PR for GCP in master: https://github.com/openshift/cluster-api-provider-gcp/pull/44
This has merged. I looked into AWS, didn't seem like it was affected, but I'll need to spin up a cluster to verify for sure. In case of GCP, how to QE: Before fix: Delete a node that has a pod with PDB on it (such as a master node). Lot at logs of machine-controller oc get pods -n openshift-machine-api #record pod name for machine-api-controllers-* oc logs -n openshift-machine-api $PODNAME -c machine-controller Verify that there is less than 20 seconds between deletion attempts, despite the fact it says 'requeue in: 20s' After fix: Do above, verify the timestamp between delete attempts is 20 sec.
Verified. clusterversion: 4.2.0-0.nightly-2019-08-08-103722 delete a machine. check machine-controller logs. The timestamp between delete attempts is 20 sec I0809 07:21:19.794235 1 controller.go:205] Reconciling machine "zhsun0-nks2r-w-c-4hjtt" triggers delete I0809 07:21:19.866608 1 info.go:16] ignoring DaemonSet-managed pods: tuned-kr5sv, dns-default-r6xwk, node-ca-h2w22, machine-config-daemon-9jjpw, node-exporter-zxn72, multus-rmmd2, ovs-wz9jp, sdn-2sgkx I0809 07:21:19.866653 1 info.go:20] drained node "zhsun0-nks2r-w-c-4hjtt.c.openshift-gce-devel.internal" I0809 07:21:19.866666 1 controller.go:302] drain successful for machine "zhsun0-nks2r-w-c-4hjtt" I0809 07:21:19.866696 1 actuator.go:116] zhsun0-nks2r-w-c-4hjtt: Deleting machine I0809 07:21:20.345029 1 reconciler.go:253] Machine "zhsun0-nks2r-w-c-4hjtt" already exists I0809 07:21:20.969860 1 reconciler.go:281] zhsun0-nks2r-w-c-4hjtt: machine status is exists, requeuing... E0809 07:21:20.969940 1 controller.go:220] Failed to delete machine "zhsun0-nks2r-w-c-4hjtt": requeue in: 20s I0809 07:21:20.969949 1 controller.go:364] Actuator returned requeue-after error: requeue in: 20s I0809 07:21:40.970274 1 controller.go:141] Reconciling Machine "zhsun0-nks2r-w-c-4hjtt" I0809 07:21:40.970332 1 controller.go:310] Machine "zhsun0-nks2r-w-c-4hjtt" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster I0809 07:21:40.970351 1 controller.go:205] Reconciling machine "zhsun0-nks2r-w-c-4hjtt" triggers delete I0809 07:21:41.051958 1 info.go:16] ignoring DaemonSet-managed pods: tuned-kr5sv, dns-default-r6xwk, node-ca-h2w22, machine-config-daemon-9jjpw, node-exporter-zxn72, multus-rmmd2, ovs-wz9jp, sdn-2sgkx I0809 07:21:41.052261 1 info.go:20] drained node "zhsun0-nks2r-w-c-4hjtt.c.openshift-gce-devel.internal" I0809 07:21:41.052286 1 controller.go:302] drain successful for machine "zhsun0-nks2r-w-c-4hjtt" I0809 07:21:41.052331 1 actuator.go:116] zhsun0-nks2r-w-c-4hjtt: Deleting machine I0809 07:21:41.434263 1 reconciler.go:253] Machine "zhsun0-nks2r-w-c-4hjtt" already exists I0809 07:21:41.941530 1 reconciler.go:281] zhsun0-nks2r-w-c-4hjtt: machine status is exists, requeuing... E0809 07:21:41.941872 1 controller.go:220] Failed to delete machine "zhsun0-nks2r-w-c-4hjtt": requeue in: 20s I0809 07:21:41.942631 1 controller.go:364] Actuator returned requeue-after error: requeue in: 20s I0809 07:22:01.943175 1 controller.go:141] Reconciling Machine "zhsun0-nks2r-w-c-4hjtt" I0809 07:22:01.943227 1 controller.go:310] Machine "zhsun0-nks2r-w-c-4hjtt" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster I0809 07:22:01.943241 1 controller.go:205] Reconciling machine "zhsun0-nks2r-w-c-4hjtt" triggers delete I0809 07:22:02.015331 1 info.go:16] ignoring DaemonSet-managed pods: tuned-kr5sv, dns-default-r6xwk, node-ca-h2w22, machine-config-daemon-9jjpw, node-exporter-zxn72, multus-rmmd2, ovs-wz9jp, sdn-2sgkx I0809 07:22:02.015358 1 info.go:20] drained node "zhsun0-nks2r-w-c-4hjtt.c.openshift-gce-devel.internal" I0809 07:22:02.015370 1 controller.go:302] drain successful for machine "zhsun0-nks2r-w-c-4hjtt" I0809 07:22:02.015407 1 actuator.go:116] zhsun0-nks2r-w-c-4hjtt: Deleting machine I0809 07:22:02.382710 1 reconciler.go:253] Machine "zhsun0-nks2r-w-c-4hjtt" already exists I0809 07:22:02.840781 1 reconciler.go:281] zhsun0-nks2r-w-c-4hjtt: machine status is exists, requeuing... E0809 07:22:02.840858 1 controller.go:220] Failed to delete machine "zhsun0-nks2r-w-c-4hjtt": requeue in: 20s I0809 07:22:02.840870 1 controller.go:364] Actuator returned requeue-after error: requeue in: 20s I0809 07:22:22.841145 1 controller.go:141] Reconciling Machine "zhsun0-nks2r-w-c-4hjtt"
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922