1772192 – AWS machine-controller does not honor requeue delay

Bug 1772192 - AWS machine-controller does not honor requeue delay

Summary: AWS machine-controller does not honor requeue delay

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	low
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Alberto
QA Contact:	Jianwei Hou
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2019-11-13 21:06 UTC by W. Trevor King
Modified:	2023-10-06 18:46 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-07-13 17:12:14 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-07-13 17:12:31 UTC

Description W. Trevor King 2019-11-13 21:06:56 UTC

Similar to the GCP issue fixed in bug 1736293, the machine controller is not respecting it's intended requeue delay.  For an example, see the 4.3 CI run in [1].  Another example, provided by Nate from 4.2.2, is:

$ grep ocp-m45ls-workerocs-us-east-1c-zxpfz must-gather.local.7751879504668897656/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-51162f73217480b3b5a47b392e251e7b7a22f6c10fa4e2050ca1b54558625b07/namespaces/openshift-machine-api/pods/machine-api-controllers-df6979876-tzjgk/machine-controller/machine-controller/logs/current.log
...
2019-11-05T18:45:38.848592929Z I1105 18:45:38.848388       1 controller.go:133] Reconciling Machine "ocp-m45ls-workerocs-us-east-1c-zxpfz"
2019-11-05T18:45:38.848592929Z I1105 18:45:38.848427       1 controller.go:304] Machine "ocp-m45ls-workerocs-us-east-1c-zxpfz" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2019-11-05T18:45:38.848706623Z I1105 18:45:38.848445       1 actuator.go:487] ocp-m45ls-workerocs-us-east-1c-zxpfz: Checking if machine exists
2019-11-05T18:45:38.898335622Z I1105 18:45:38.898295       1 actuator.go:495] ocp-m45ls-workerocs-us-east-1c-zxpfz: Instance does not exist
2019-11-05T18:45:38.898335622Z I1105 18:45:38.898317       1 controller.go:253] Reconciling machine object ocp-m45ls-workerocs-us-east-1c-zxpfz triggers idempotent create.
2019-11-05T18:45:38.898335622Z I1105 18:45:38.898325       1 actuator.go:113] ocp-m45ls-workerocs-us-east-1c-zxpfz: creating machine
2019-11-05T18:45:38.898512666Z E1105 18:45:38.898485       1 utils.go:191] NodeRef not found in machine ocp-m45ls-workerocs-us-east-1c-zxpfz
2019-11-05T18:45:38.934408372Z I1105 18:45:38.934370       1 instances.go:44] No stopped instances found for machine ocp-m45ls-workerocs-us-east-1c-zxpfz
2019-11-05T18:45:40.700248288Z I1105 18:45:40.700196       1 actuator.go:199] ocp-m45ls-workerocs-us-east-1c-zxpfz: ProviderID updated at machine spec: aws:///us-east-1c/i-0538e37f1107ce960
2019-11-05T18:45:40.708475471Z I1105 18:45:40.708439       1 actuator.go:585] ocp-m45ls-workerocs-us-east-1c-zxpfz: Updating status
2019-11-05T18:45:40.708590743Z I1105 18:45:40.708545       1 actuator.go:629] ocp-m45ls-workerocs-us-east-1c-zxpfz: finished calculating AWS status
2019-11-05T18:45:40.708661793Z I1105 18:45:40.708628       1 actuator.go:231] ocp-m45ls-workerocs-us-east-1c-zxpfz: machine status has changed, updating
2019-11-05T18:45:40.717126682Z I1105 18:45:40.717060       1 actuator.go:647] ocp-m45ls-workerocs-us-east-1c-zxpfz: Instance state still pending, returning an error to requeue
2019-11-05T18:45:40.717263389Z W1105 18:45:40.717227       1 controller.go:255] Failed to create machine "ocp-m45ls-workerocs-us-east-1c-zxpfz": requeue in: 20s
2019-11-05T18:45:40.71748592Z I1105 18:45:40.717460       1 controller.go:133] Reconciling Machine "ocp-m45ls-workerocs-us-east-1c-zxpfz"
2019-11-05T18:45:40.71758936Z I1105 18:45:40.717532       1 controller.go:304] Machine "ocp-m45ls-workerocs-us-east-1c-zxpfz" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2019-11-05T18:45:40.71758936Z I1105 18:45:40.717577       1 actuator.go:487] ocp-m45ls-workerocs-us-east-1c-zxpfz: Checking if machine exists
2019-11-05T18:45:40.801596164Z I1105 18:45:40.801550       1 actuator.go:495] ocp-m45ls-workerocs-us-east-1c-zxpfz: Instance does not exist
2019-11-05T18:45:40.801596164Z I1105 18:45:40.801579       1 controller.go:253] Reconciling machine object ocp-m45ls-workerocs-us-east-1c-zxpfz triggers idempotent create.
2019-11-05T18:45:40.801596164Z I1105 18:45:40.801588       1 actuator.go:113] ocp-m45ls-workerocs-us-east-1c-zxpfz: creating machine
2019-11-05T18:45:40.801784021Z E1105 18:45:40.801759       1 utils.go:191] NodeRef not found in machine ocp-m45ls-workerocs-us-east-1c-zxpfz
2019-11-05T18:45:40.83981453Z I1105 18:45:40.839771       1 instances.go:44] No stopped instances found for machine ocp-m45ls-workerocs-us-east-1c-zxpfz
2019-11-05T18:45:42.237560282Z W1105 18:45:42.237502       1 controller.go:255] Failed to create machine "ocp-m45ls-workerocs-us-east-1c-zxpfz": ocp-m45ls-workerocs-us-east-1c-zxpfz: failed to update machine object with providerID: ocp-m45ls-workerocs-us-east-1c-zxpfz: error updating machine spec ProviderID: Operation cannot be fulfilled on machines.machine.openshift.io "ocp-m45ls-workerocs-us-east-1c-zxpfz": the object has been modified; please apply your changes to the latest version and try again
...

Michael looked over AWS in [2], but I'll go through [3] and see if anything looks suspicious.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1772163#c0
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1736293#c2
[3]: https://github.com/openshift/cluster-api-provider-gcp/pull/44/files

Comment 1 W. Trevor King 2019-11-13 21:30:29 UTC

So at least the 4.3 CI thing [1] is not the GCP error wrapping thing.  Here you can see delayIfRequeueAfterError triggering:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/242/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-49m8kdi7-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-7445cdcd69-vlqjh/machine-controller/machine-controller/logs/current.log | grep -1 'Actuator returned requeue-after error' | tail -n3
2019-11-13T16:57:16.780894255Z W1113 16:57:16.780861       1 controller.go:321] Failed to create machine "ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9": requeue in: 20s
2019-11-13T16:57:16.780947512Z I1113 16:57:16.780925       1 controller.go:409] Actuator returned requeue-after error: requeue in: 20s
2019-11-13T16:57:16.781080475Z I1113 16:57:16.781054       1 controller.go:164] Reconciling Machine "ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9"

but still not stopping the subsequent reconciliation.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1772163#c0

Comment 8 Milind Yadav 2020-04-13 03:56:35 UTC

Validated on : 
[miyadav@miyadav ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-04-12-180647   True        False         5m22s   Cluster version is 4.5.0-0.nightly-2020-04-12-180647

Steps:
1.[miyadav@miyadav ~]$ oc project openshift-machine-api
Now using project "openshift-machine-api" on server "https://api.miyadav-1304.qe.devcluster.openshift.com:6443".

2.[miyadav@miyadav ~]$ oc get machines
NAME                                         PHASE     TYPE        REGION      ZONE         AGE
miyadav-1304-plrql-master-0                  Running   m4.xlarge   us-east-2   us-east-2a   35m
miyadav-1304-plrql-master-1                  Running   m4.xlarge   us-east-2   us-east-2b   35m
miyadav-1304-plrql-master-2                  Running   m4.xlarge   us-east-2   us-east-2c   35m
miyadav-1304-plrql-worker-us-east-2a-rrhzj   Running   m4.large    us-east-2   us-east-2a   5m59s
miyadav-1304-plrql-worker-us-east-2b-zdft4   Running   m4.large    us-east-2   us-east-2b   24m
miyadav-1304-plrql-worker-us-east-2c-g2kn7   Running   m4.large    us-east-2   us-east-2c   24m

3.[miyadav@miyadav ~]$ oc delete machine miyadav-1304-plrql-worker-us-east-2b-zdft4
machine.machine.openshift.io "miyadav-1304-plrql-worker-us-east-2b-zdft4" deleted

4. oc logs -f machine-api-controllers-56b68ff6d7-qczp9 -c machine-controller 
 Actual & Expected :
I0413 03:44:29.525626       1 reconciler.go:352] miyadav-1304-plrql-worker-us-east-2b-wsnrv: Instance state still pending, returning an error to requeue
I0413 03:44:29.525635       1 machine_scope.go:80] miyadav-1304-plrql-worker-us-east-2b-wsnrv: patching machine
E0413 03:44:29.538575       1 actuator.go:65] miyadav-1304-plrql-worker-us-east-2b-wsnrv error: requeue in: 20s
E0413 03:44:29.538628       1 controller.go:275] miyadav-1304-plrql-worker-us-east-2b-wsnrv: error updating machine: requeue in: 20s
I0413 03:44:29.538639       1 controller.go:401] Actuator returned requeue-after error: requeue in: 20s
I0413 03:44:29.538724       1 recorder.go:52] controller-runtime/manager/events "msg"="Warning"  "message"="requeue in: 20s" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"miyadav-1304-plrql-worker-us-east-2b-wsnrv","uid":"1e209294-8f6e-462b-a52c-849576609dad","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"27546"} "reason"="FailedCreate"
I0413 03:44:49.538922       1 controller.go:165] miyadav-1304-plrql-worker-us-east-2b-wsnrv: reconciling Machine
I0413 03:44:49.538950       1 actuator.go:97] miyadav-1304-plrql-worker-us-east-2b-wsnrv: actuator checking if machine exists
I0413 03:44:49.631983       1 reconciler.go:367] miyadav-1304-plrql-worker-us-east-2b-wsnrv: Found instance by id: i-0ad778699bb16aa1a
I0413 03:44:49.632007       1 controller.go:273] miyadav-1304-plrql-worker-us-east-2b-wsnrv: reconciling machine triggers idempotent update
I0413 03:44:49.632013       1 actuator.go:112] miyadav-1304-plrql-worker-us-east-2b-wsnrv: actuator updating machine
.
.
.

Additional Info : 20s difference is seen in logs honoring requeue time

Comment 12 errata-xmlrpc 2020-07-13 17:12:14 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Comment 13 Red Hat Bugzilla 2023-09-14 05:46:07 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

Note You need to log in before you can comment on or make changes to this bug.