Bug 1772192

Summary:	AWS machine-controller does not honor requeue delay
Product:	OpenShift Container Platform	Reporter:	W. Trevor King <wking>
Component:	Cloud Compute	Assignee:	Alberto <agarcial>
Cloud Compute sub component:	Other Providers	QA Contact:	Jianwei Hou <jhou>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	low
Priority:	medium	CC:	agarcial, jkaur
Version:	4.3.0
Target Milestone:	---
Target Release:	4.5.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-07-13 17:12:14 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description W. Trevor King 2019-11-13 21:06:56 UTC

Similar to the GCP issue fixed in bug 1736293, the machine controller is not respecting it's intended requeue delay.  For an example, see the 4.3 CI run in [1].  Another example, provided by Nate from 4.2.2, is:

$ grep ocp-m45ls-workerocs-us-east-1c-zxpfz must-gather.local.7751879504668897656/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-51162f73217480b3b5a47b392e251e7b7a22f6c10fa4e2050ca1b54558625b07/namespaces/openshift-machine-api/pods/machine-api-controllers-df6979876-tzjgk/machine-controller/machine-controller/logs/current.log
...
2019-11-05T18:45:38.848592929Z I1105 18:45:38.848388       1 controller.go:133] Reconciling Machine "ocp-m45ls-workerocs-us-east-1c-zxpfz"
2019-11-05T18:45:38.848592929Z I1105 18:45:38.848427       1 controller.go:304] Machine "ocp-m45ls-workerocs-us-east-1c-zxpfz" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2019-11-05T18:45:38.848706623Z I1105 18:45:38.848445       1 actuator.go:487] ocp-m45ls-workerocs-us-east-1c-zxpfz: Checking if machine exists
2019-11-05T18:45:38.898335622Z I1105 18:45:38.898295       1 actuator.go:495] ocp-m45ls-workerocs-us-east-1c-zxpfz: Instance does not exist
2019-11-05T18:45:38.898335622Z I1105 18:45:38.898317       1 controller.go:253] Reconciling machine object ocp-m45ls-workerocs-us-east-1c-zxpfz triggers idempotent create.
2019-11-05T18:45:38.898335622Z I1105 18:45:38.898325       1 actuator.go:113] ocp-m45ls-workerocs-us-east-1c-zxpfz: creating machine
2019-11-05T18:45:38.898512666Z E1105 18:45:38.898485       1 utils.go:191] NodeRef not found in machine ocp-m45ls-workerocs-us-east-1c-zxpfz
2019-11-05T18:45:38.934408372Z I1105 18:45:38.934370       1 instances.go:44] No stopped instances found for machine ocp-m45ls-workerocs-us-east-1c-zxpfz
2019-11-05T18:45:40.700248288Z I1105 18:45:40.700196       1 actuator.go:199] ocp-m45ls-workerocs-us-east-1c-zxpfz: ProviderID updated at machine spec: aws:///us-east-1c/i-0538e37f1107ce960
2019-11-05T18:45:40.708475471Z I1105 18:45:40.708439       1 actuator.go:585] ocp-m45ls-workerocs-us-east-1c-zxpfz: Updating status
2019-11-05T18:45:40.708590743Z I1105 18:45:40.708545       1 actuator.go:629] ocp-m45ls-workerocs-us-east-1c-zxpfz: finished calculating AWS status
2019-11-05T18:45:40.708661793Z I1105 18:45:40.708628       1 actuator.go:231] ocp-m45ls-workerocs-us-east-1c-zxpfz: machine status has changed, updating
2019-11-05T18:45:40.717126682Z I1105 18:45:40.717060       1 actuator.go:647] ocp-m45ls-workerocs-us-east-1c-zxpfz: Instance state still pending, returning an error to requeue
2019-11-05T18:45:40.717263389Z W1105 18:45:40.717227       1 controller.go:255] Failed to create machine "ocp-m45ls-workerocs-us-east-1c-zxpfz": requeue in: 20s
2019-11-05T18:45:40.71748592Z I1105 18:45:40.717460       1 controller.go:133] Reconciling Machine "ocp-m45ls-workerocs-us-east-1c-zxpfz"
2019-11-05T18:45:40.71758936Z I1105 18:45:40.717532       1 controller.go:304] Machine "ocp-m45ls-workerocs-us-east-1c-zxpfz" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2019-11-05T18:45:40.71758936Z I1105 18:45:40.717577       1 actuator.go:487] ocp-m45ls-workerocs-us-east-1c-zxpfz: Checking if machine exists
2019-11-05T18:45:40.801596164Z I1105 18:45:40.801550       1 actuator.go:495] ocp-m45ls-workerocs-us-east-1c-zxpfz: Instance does not exist
2019-11-05T18:45:40.801596164Z I1105 18:45:40.801579       1 controller.go:253] Reconciling machine object ocp-m45ls-workerocs-us-east-1c-zxpfz triggers idempotent create.
2019-11-05T18:45:40.801596164Z I1105 18:45:40.801588       1 actuator.go:113] ocp-m45ls-workerocs-us-east-1c-zxpfz: creating machine
2019-11-05T18:45:40.801784021Z E1105 18:45:40.801759       1 utils.go:191] NodeRef not found in machine ocp-m45ls-workerocs-us-east-1c-zxpfz
2019-11-05T18:45:40.83981453Z I1105 18:45:40.839771       1 instances.go:44] No stopped instances found for machine ocp-m45ls-workerocs-us-east-1c-zxpfz
2019-11-05T18:45:42.237560282Z W1105 18:45:42.237502       1 controller.go:255] Failed to create machine "ocp-m45ls-workerocs-us-east-1c-zxpfz": ocp-m45ls-workerocs-us-east-1c-zxpfz: failed to update machine object with providerID: ocp-m45ls-workerocs-us-east-1c-zxpfz: error updating machine spec ProviderID: Operation cannot be fulfilled on machines.machine.openshift.io "ocp-m45ls-workerocs-us-east-1c-zxpfz": the object has been modified; please apply your changes to the latest version and try again
...

Michael looked over AWS in [2], but I'll go through [3] and see if anything looks suspicious.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1772163#c0
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1736293#c2
[3]: https://github.com/openshift/cluster-api-provider-gcp/pull/44/files

Comment 1 W. Trevor King 2019-11-13 21:30:29 UTC

So at least the 4.3 CI thing [1] is not the GCP error wrapping thing.  Here you can see delayIfRequeueAfterError triggering:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/242/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-49m8kdi7-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-7445cdcd69-vlqjh/machine-controller/machine-controller/logs/current.log | grep -1 'Actuator returned requeue-after error' | tail -n3
2019-11-13T16:57:16.780894255Z W1113 16:57:16.780861       1 controller.go:321] Failed to create machine "ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9": requeue in: 20s
2019-11-13T16:57:16.780947512Z I1113 16:57:16.780925       1 controller.go:409] Actuator returned requeue-after error: requeue in: 20s
2019-11-13T16:57:16.781080475Z I1113 16:57:16.781054       1 controller.go:164] Reconciling Machine "ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9"

but still not stopping the subsequent reconciliation.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1772163#c0

Comment 8 Milind Yadav 2020-04-13 03:56:35 UTC

Validated on : 
[miyadav@miyadav ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-04-12-180647   True        False         5m22s   Cluster version is 4.5.0-0.nightly-2020-04-12-180647

Steps:
1.[miyadav@miyadav ~]$ oc project openshift-machine-api
Now using project "openshift-machine-api" on server "https://api.miyadav-1304.qe.devcluster.openshift.com:6443".

2.[miyadav@miyadav ~]$ oc get machines
NAME                                         PHASE     TYPE        REGION      ZONE         AGE
miyadav-1304-plrql-master-0                  Running   m4.xlarge   us-east-2   us-east-2a   35m
miyadav-1304-plrql-master-1                  Running   m4.xlarge   us-east-2   us-east-2b   35m
miyadav-1304-plrql-master-2                  Running   m4.xlarge   us-east-2   us-east-2c   35m
miyadav-1304-plrql-worker-us-east-2a-rrhzj   Running   m4.large    us-east-2   us-east-2a   5m59s
miyadav-1304-plrql-worker-us-east-2b-zdft4   Running   m4.large    us-east-2   us-east-2b   24m
miyadav-1304-plrql-worker-us-east-2c-g2kn7   Running   m4.large    us-east-2   us-east-2c   24m

3.[miyadav@miyadav ~]$ oc delete machine miyadav-1304-plrql-worker-us-east-2b-zdft4
machine.machine.openshift.io "miyadav-1304-plrql-worker-us-east-2b-zdft4" deleted

4. oc logs -f machine-api-controllers-56b68ff6d7-qczp9 -c machine-controller 
 Actual & Expected :
I0413 03:44:29.525626       1 reconciler.go:352] miyadav-1304-plrql-worker-us-east-2b-wsnrv: Instance state still pending, returning an error to requeue
I0413 03:44:29.525635       1 machine_scope.go:80] miyadav-1304-plrql-worker-us-east-2b-wsnrv: patching machine
E0413 03:44:29.538575       1 actuator.go:65] miyadav-1304-plrql-worker-us-east-2b-wsnrv error: requeue in: 20s
E0413 03:44:29.538628       1 controller.go:275] miyadav-1304-plrql-worker-us-east-2b-wsnrv: error updating machine: requeue in: 20s
I0413 03:44:29.538639       1 controller.go:401] Actuator returned requeue-after error: requeue in: 20s
I0413 03:44:29.538724       1 recorder.go:52] controller-runtime/manager/events "msg"="Warning"  "message"="requeue in: 20s" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"miyadav-1304-plrql-worker-us-east-2b-wsnrv","uid":"1e209294-8f6e-462b-a52c-849576609dad","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"27546"} "reason"="FailedCreate"
I0413 03:44:49.538922       1 controller.go:165] miyadav-1304-plrql-worker-us-east-2b-wsnrv: reconciling Machine
I0413 03:44:49.538950       1 actuator.go:97] miyadav-1304-plrql-worker-us-east-2b-wsnrv: actuator checking if machine exists
I0413 03:44:49.631983       1 reconciler.go:367] miyadav-1304-plrql-worker-us-east-2b-wsnrv: Found instance by id: i-0ad778699bb16aa1a
I0413 03:44:49.632007       1 controller.go:273] miyadav-1304-plrql-worker-us-east-2b-wsnrv: reconciling machine triggers idempotent update
I0413 03:44:49.632013       1 actuator.go:112] miyadav-1304-plrql-worker-us-east-2b-wsnrv: actuator updating machine
.
.
.

Additional Info : 20s difference is seen in logs honoring requeue time

Comment 12 errata-xmlrpc 2020-07-13 17:12:14 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Comment 13 Red Hat Bugzilla 2023-09-14 05:46:07 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days