Bug 1772192 - AWS machine-controller does not honor requeue delay
Summary: AWS machine-controller does not honor requeue delay
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
medium
low
Target Milestone: ---
: 4.5.0
Assignee: Alberto
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-11-13 21:06 UTC by W. Trevor King
Modified: 2023-10-06 18:46 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:12:14 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:12:31 UTC

Description W. Trevor King 2019-11-13 21:06:56 UTC
Similar to the GCP issue fixed in bug 1736293, the machine controller is not respecting it's intended requeue delay.  For an example, see the 4.3 CI run in [1].  Another example, provided by Nate from 4.2.2, is:

$ grep ocp-m45ls-workerocs-us-east-1c-zxpfz must-gather.local.7751879504668897656/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-51162f73217480b3b5a47b392e251e7b7a22f6c10fa4e2050ca1b54558625b07/namespaces/openshift-machine-api/pods/machine-api-controllers-df6979876-tzjgk/machine-controller/machine-controller/logs/current.log
...
2019-11-05T18:45:38.848592929Z I1105 18:45:38.848388       1 controller.go:133] Reconciling Machine "ocp-m45ls-workerocs-us-east-1c-zxpfz"
2019-11-05T18:45:38.848592929Z I1105 18:45:38.848427       1 controller.go:304] Machine "ocp-m45ls-workerocs-us-east-1c-zxpfz" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2019-11-05T18:45:38.848706623Z I1105 18:45:38.848445       1 actuator.go:487] ocp-m45ls-workerocs-us-east-1c-zxpfz: Checking if machine exists
2019-11-05T18:45:38.898335622Z I1105 18:45:38.898295       1 actuator.go:495] ocp-m45ls-workerocs-us-east-1c-zxpfz: Instance does not exist
2019-11-05T18:45:38.898335622Z I1105 18:45:38.898317       1 controller.go:253] Reconciling machine object ocp-m45ls-workerocs-us-east-1c-zxpfz triggers idempotent create.
2019-11-05T18:45:38.898335622Z I1105 18:45:38.898325       1 actuator.go:113] ocp-m45ls-workerocs-us-east-1c-zxpfz: creating machine
2019-11-05T18:45:38.898512666Z E1105 18:45:38.898485       1 utils.go:191] NodeRef not found in machine ocp-m45ls-workerocs-us-east-1c-zxpfz
2019-11-05T18:45:38.934408372Z I1105 18:45:38.934370       1 instances.go:44] No stopped instances found for machine ocp-m45ls-workerocs-us-east-1c-zxpfz
2019-11-05T18:45:40.700248288Z I1105 18:45:40.700196       1 actuator.go:199] ocp-m45ls-workerocs-us-east-1c-zxpfz: ProviderID updated at machine spec: aws:///us-east-1c/i-0538e37f1107ce960
2019-11-05T18:45:40.708475471Z I1105 18:45:40.708439       1 actuator.go:585] ocp-m45ls-workerocs-us-east-1c-zxpfz: Updating status
2019-11-05T18:45:40.708590743Z I1105 18:45:40.708545       1 actuator.go:629] ocp-m45ls-workerocs-us-east-1c-zxpfz: finished calculating AWS status
2019-11-05T18:45:40.708661793Z I1105 18:45:40.708628       1 actuator.go:231] ocp-m45ls-workerocs-us-east-1c-zxpfz: machine status has changed, updating
2019-11-05T18:45:40.717126682Z I1105 18:45:40.717060       1 actuator.go:647] ocp-m45ls-workerocs-us-east-1c-zxpfz: Instance state still pending, returning an error to requeue
2019-11-05T18:45:40.717263389Z W1105 18:45:40.717227       1 controller.go:255] Failed to create machine "ocp-m45ls-workerocs-us-east-1c-zxpfz": requeue in: 20s
2019-11-05T18:45:40.71748592Z I1105 18:45:40.717460       1 controller.go:133] Reconciling Machine "ocp-m45ls-workerocs-us-east-1c-zxpfz"
2019-11-05T18:45:40.71758936Z I1105 18:45:40.717532       1 controller.go:304] Machine "ocp-m45ls-workerocs-us-east-1c-zxpfz" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2019-11-05T18:45:40.71758936Z I1105 18:45:40.717577       1 actuator.go:487] ocp-m45ls-workerocs-us-east-1c-zxpfz: Checking if machine exists
2019-11-05T18:45:40.801596164Z I1105 18:45:40.801550       1 actuator.go:495] ocp-m45ls-workerocs-us-east-1c-zxpfz: Instance does not exist
2019-11-05T18:45:40.801596164Z I1105 18:45:40.801579       1 controller.go:253] Reconciling machine object ocp-m45ls-workerocs-us-east-1c-zxpfz triggers idempotent create.
2019-11-05T18:45:40.801596164Z I1105 18:45:40.801588       1 actuator.go:113] ocp-m45ls-workerocs-us-east-1c-zxpfz: creating machine
2019-11-05T18:45:40.801784021Z E1105 18:45:40.801759       1 utils.go:191] NodeRef not found in machine ocp-m45ls-workerocs-us-east-1c-zxpfz
2019-11-05T18:45:40.83981453Z I1105 18:45:40.839771       1 instances.go:44] No stopped instances found for machine ocp-m45ls-workerocs-us-east-1c-zxpfz
2019-11-05T18:45:42.237560282Z W1105 18:45:42.237502       1 controller.go:255] Failed to create machine "ocp-m45ls-workerocs-us-east-1c-zxpfz": ocp-m45ls-workerocs-us-east-1c-zxpfz: failed to update machine object with providerID: ocp-m45ls-workerocs-us-east-1c-zxpfz: error updating machine spec ProviderID: Operation cannot be fulfilled on machines.machine.openshift.io "ocp-m45ls-workerocs-us-east-1c-zxpfz": the object has been modified; please apply your changes to the latest version and try again
...

Michael looked over AWS in [2], but I'll go through [3] and see if anything looks suspicious.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1772163#c0
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1736293#c2
[3]: https://github.com/openshift/cluster-api-provider-gcp/pull/44/files

Comment 1 W. Trevor King 2019-11-13 21:30:29 UTC
So at least the 4.3 CI thing [1] is not the GCP error wrapping thing.  Here you can see delayIfRequeueAfterError triggering:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/242/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-49m8kdi7-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-7445cdcd69-vlqjh/machine-controller/machine-controller/logs/current.log | grep -1 'Actuator returned requeue-after error' | tail -n3
2019-11-13T16:57:16.780894255Z W1113 16:57:16.780861       1 controller.go:321] Failed to create machine "ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9": requeue in: 20s
2019-11-13T16:57:16.780947512Z I1113 16:57:16.780925       1 controller.go:409] Actuator returned requeue-after error: requeue in: 20s
2019-11-13T16:57:16.781080475Z I1113 16:57:16.781054       1 controller.go:164] Reconciling Machine "ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9"

but still not stopping the subsequent reconciliation.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1772163#c0

Comment 8 Milind Yadav 2020-04-13 03:56:35 UTC
Validated on : 
[miyadav@miyadav ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-04-12-180647   True        False         5m22s   Cluster version is 4.5.0-0.nightly-2020-04-12-180647

Steps:
1.[miyadav@miyadav ~]$ oc project openshift-machine-api
Now using project "openshift-machine-api" on server "https://api.miyadav-1304.qe.devcluster.openshift.com:6443".

2.[miyadav@miyadav ~]$ oc get machines
NAME                                         PHASE     TYPE        REGION      ZONE         AGE
miyadav-1304-plrql-master-0                  Running   m4.xlarge   us-east-2   us-east-2a   35m
miyadav-1304-plrql-master-1                  Running   m4.xlarge   us-east-2   us-east-2b   35m
miyadav-1304-plrql-master-2                  Running   m4.xlarge   us-east-2   us-east-2c   35m
miyadav-1304-plrql-worker-us-east-2a-rrhzj   Running   m4.large    us-east-2   us-east-2a   5m59s
miyadav-1304-plrql-worker-us-east-2b-zdft4   Running   m4.large    us-east-2   us-east-2b   24m
miyadav-1304-plrql-worker-us-east-2c-g2kn7   Running   m4.large    us-east-2   us-east-2c   24m

3.[miyadav@miyadav ~]$ oc delete machine miyadav-1304-plrql-worker-us-east-2b-zdft4
machine.machine.openshift.io "miyadav-1304-plrql-worker-us-east-2b-zdft4" deleted

4. oc logs -f machine-api-controllers-56b68ff6d7-qczp9 -c machine-controller 
 Actual & Expected :
I0413 03:44:29.525626       1 reconciler.go:352] miyadav-1304-plrql-worker-us-east-2b-wsnrv: Instance state still pending, returning an error to requeue
I0413 03:44:29.525635       1 machine_scope.go:80] miyadav-1304-plrql-worker-us-east-2b-wsnrv: patching machine
E0413 03:44:29.538575       1 actuator.go:65] miyadav-1304-plrql-worker-us-east-2b-wsnrv error: requeue in: 20s
E0413 03:44:29.538628       1 controller.go:275] miyadav-1304-plrql-worker-us-east-2b-wsnrv: error updating machine: requeue in: 20s
I0413 03:44:29.538639       1 controller.go:401] Actuator returned requeue-after error: requeue in: 20s
I0413 03:44:29.538724       1 recorder.go:52] controller-runtime/manager/events "msg"="Warning"  "message"="requeue in: 20s" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"miyadav-1304-plrql-worker-us-east-2b-wsnrv","uid":"1e209294-8f6e-462b-a52c-849576609dad","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"27546"} "reason"="FailedCreate"
I0413 03:44:49.538922       1 controller.go:165] miyadav-1304-plrql-worker-us-east-2b-wsnrv: reconciling Machine
I0413 03:44:49.538950       1 actuator.go:97] miyadav-1304-plrql-worker-us-east-2b-wsnrv: actuator checking if machine exists
I0413 03:44:49.631983       1 reconciler.go:367] miyadav-1304-plrql-worker-us-east-2b-wsnrv: Found instance by id: i-0ad778699bb16aa1a
I0413 03:44:49.632007       1 controller.go:273] miyadav-1304-plrql-worker-us-east-2b-wsnrv: reconciling machine triggers idempotent update
I0413 03:44:49.632013       1 actuator.go:112] miyadav-1304-plrql-worker-us-east-2b-wsnrv: actuator updating machine
.
.
.

Additional Info : 20s difference is seen in logs honoring requeue time

Comment 12 errata-xmlrpc 2020-07-13 17:12:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Comment 13 Red Hat Bugzilla 2023-09-14 05:46:07 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days


Note You need to log in before you can comment on or make changes to this bug.