+++ This bug was initially created as a clone of Bug #1781339 +++ From a recent disruption job [1], where two control-plane Machines are deleted and then pushed back into the cluster. Here's the old Machine going out: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/12/artifacts/e2e-aws-disruptive/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2019-12-08-180953-sha256-c31e3068a603b8d8add473dbaaa5b933323a23dc862cb266855248e7cba5ac99/namespaces/openshift-machine-api/pods/machine-api-controllers-dfdc5f957-6fj59/machine-controller/machine-controller/logs/previous.log | grep -B1 i-0e21ada20e9b19da2 2019-12-08T22:19:53.070511591Z I1208 22:19:53.070473 1 actuator.go:336] ci-op-vkpy76m9-16905-pdt5z-master-1: deleting machine 2019-12-08T22:19:53.185377501Z I1208 22:19:53.185272 1 utils.go:218] Cleaning up extraneous instance for machine: i-0e21ada20e9b19da2, state: running, launchTime: 2019-12-08 21:43:24 +0000 UTC 2019-12-08T22:19:53.185486624Z I1208 22:19:53.185462 1 utils.go:222] Terminating i-0e21ada20e9b19da2 instance And then (in a new controller pod), the new Machine coming in and picking up that same (now terminated) i-0e21ada20e9b19da2: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/12/artifacts/e2e-aws-disruptive/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2019-12-08-180953-sha256-c31e3068a603b8d8add473dbaaa5b933323a23dc862cb266855248e7cba5ac99/namespaces/openshift-machine-api/pods/machine-api-controllers-dfdc5f957-6fj59/machine-controller/machine-controller/logs/current.log | grep ci-op-vkpy76m9-16905-pdt5z-master-1 | head -n12 2019-12-08T22:24:58.11203914Z I1208 22:24:58.110637 1 controller.go:161] Reconciling Machine "ci-op-vkpy76m9-16905-pdt5z-master-1" 2019-12-08T22:24:58.11203914Z I1208 22:24:58.110660 1 controller.go:201] Reconciling machine "ci-op-vkpy76m9-16905-pdt5z-master-1" triggers delete 2019-12-08T22:24:58.124204426Z I1208 22:24:58.124188 1 actuator.go:336] ci-op-vkpy76m9-16905-pdt5z-master-1: deleting machine 2019-12-08T22:24:58.20959814Z E1208 22:24:58.209529 1 utils.go:187] Excluding instance matching ci-op-vkpy76m9-16905-pdt5z-master-1: instance i-0e21ada20e9b19da2 state "terminated" is not in running 2019-12-08T22:24:58.20959814Z W1208 22:24:58.209569 1 actuator.go:377] ci-op-vkpy76m9-16905-pdt5z-master-1: no instances found to delete for machine 2019-12-08T22:24:58.209648523Z I1208 22:24:58.209630 1 controller.go:228] Deleting node "ip-10-0-154-177.ec2.internal" for machine "ci-op-vkpy76m9-16905-pdt5z-master-1" 2019-12-08T22:24:58.231206051Z I1208 22:24:58.231148 1 controller.go:242] Machine "ci-op-vkpy76m9-16905-pdt5z-master-1" deletion successful 2019-12-08T22:25:00.610252762Z I1208 22:25:00.610169 1 controller.go:161] Reconciling Machine "ci-op-vkpy76m9-16905-pdt5z-master-1" 2019-12-08T22:25:00.610252762Z I1208 22:25:00.610215 1 actuator.go:497] ci-op-vkpy76m9-16905-pdt5z-master-1: Checking if machine exists 2019-12-08T22:25:00.667060176Z E1208 22:25:00.666191 1 utils.go:187] Excluding instance matching ci-op-vkpy76m9-16905-pdt5z-master-1: instance i-0e21ada20e9b19da2 state "terminated" is not in running, pending, stopped, stopping, shutting-down 2019-12-08T22:25:00.667060176Z I1208 22:25:00.666222 1 actuator.go:506] ci-op-vkpy76m9-16905-pdt5z-master-1: Possible eventual-consistency discrepancy; returning an error to requeue 2019-12-08T22:25:00.667060176Z E1208 22:25:00.666241 1 controller.go:253] Failed to check if machine "ci-op-vkpy76m9-16905-pdt5z-master-1" exists: requeue in: 20s which continues for at least 30m: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/12/artifacts/e2e-aws-disruptive/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2019-12-08-180953-sha256-c31e3068a603b8d8add473dbaaa5b933323a23dc862cb266855248e7cba5ac99/namespaces/openshift-machine-api/pods/machine-api-controllers-dfdc5f957-6fj59/machine-controller/machine-controller/logs/current.log | grep ci-op-vkpy76m9-16905-pdt5z-master-1 | tail -n4 2019-12-08T22:54:58.361762418Z I1208 22:54:58.360176 1 actuator.go:497] ci-op-vkpy76m9-16905-pdt5z-master-1: Checking if machine exists 2019-12-08T22:54:58.442454699Z E1208 22:54:58.442398 1 utils.go:187] Excluding instance matching ci-op-vkpy76m9-16905-pdt5z-master-1: instance i-0e21ada20e9b19da2 state "terminated" is not in running, pending, stopped, stopping, shutting-down 2019-12-08T22:54:58.442454699Z I1208 22:54:58.442429 1 actuator.go:506] ci-op-vkpy76m9-16905-pdt5z-master-1: Possible eventual-consistency discrepancy; returning an error to requeue 2019-12-08T22:54:58.442503051Z E1208 22:54:58.442449 1 controller.go:253] Failed to check if machine "ci-op-vkpy76m9-16905-pdt5z-master-1" exists: requeue in: 20s The confused logic around eventual-consistency landed for bug 1772163. I think we may need to go back and add limits to those eventual-consistency guards as discussed in [2]. From Clayton: > mark urgent since disruption tests have caught at least 5 serious bugs that would impact customers at the worst possible time - disruptive suites get priority because of that [1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/12 [2]: https://github.com/openshift/cluster-api-provider-aws/pull/271#discussion_r348300076 --- Additional comment from Michael Gugino on 2019-12-09 22:13:48 UTC --- IMO, the fix is to revert everything that has to do with determining whether a machine is 'failed' and all the related actions. We shouldn't remove the details of the instance from the machine object, it should just be an 'instance state' field that describes the current state. --- Additional comment from W. Trevor King on 2019-12-09 22:19:52 UTC --- Are you asking for a revert of [1]? [1]: https://github.com/openshift/origin/commit/bd064a37e3ec5bf6db4b6a890096398a27be4459 --- Additional comment from Brad Ison on 2019-12-10 11:57:45 UTC --- > IMO, the fix is to revert everything that has to do with determining whether a machine is 'failed' and all the related actions. That seems pretty drastic. If you're going to suggest something like that, can you *please* be more specific? It seems to me that the existing PR should fix this: https://github.com/openshift/cluster-api-provider-aws/pull/280
According to a most recent disruptive test for 4.3, it's still failing [1]. Two machines that were deleted and recreated by the disruptive test were https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.3/74/artifacts/e2e-aws-disruptive/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-12-05-183852-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-vxh8i8vn-2770b-v8htb-master-0.yaml https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.3/74/artifacts/e2e-aws-disruptive/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-12-05-183852-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-vxh8i8vn-2770b-v8htb-master-1.yaml Machine controller seems to produce same message[2] as the bug reporting. ``` 2019-12-12T18:09:12.213366746Z I1212 18:09:12.212375 1 actuator.go:499] ci-op-vxh8i8vn-2770b-v8htb-master-0: Checking if machine exists 2019-12-12T18:09:12.355415018Z E1212 18:09:12.355370 1 utils.go:186] Excluding instance matching ci-op-vxh8i8vn-2770b-v8htb-master-0: instance i-03bc8d3394d9ac2a1 state "terminated" is not in running, pending, stopped, stopping, shutting-down 2019-12-12T18:09:12.35548575Z I1212 18:09:12.355466 1 actuator.go:508] ci-op-vxh8i8vn-2770b-v8htb-master-0: Possible eventual-consistency discrepancy; returning an error to requeue 2019-12-12T18:09:12.355588804Z E1212 18:09:12.355522 1 controller.go:279] Failed to check if machine "ci-op-vxh8i8vn-2770b-v8htb-master-0" exists: requeue in: 20s ... 2019-12-12T18:38:32.316281711Z I1212 18:38:32.316267 1 actuator.go:499] ci-op-vxh8i8vn-2770b-v8htb-master-0: Checking if machine exists 2019-12-12T18:38:32.39923898Z E1212 18:38:32.399198 1 utils.go:186] Excluding instance matching ci-op-vxh8i8vn-2770b-v8htb-master-0: instance i-03bc8d3394d9ac2a1 state "terminated" is not in running, pending, stopped, stopping, shutting-down 2019-12-12T18:38:32.39923898Z I1212 18:38:32.399221 1 actuator.go:508] ci-op-vxh8i8vn-2770b-v8htb-master-0: Possible eventual-consistency discrepancy; returning an error to requeue 2019-12-12T18:38:32.399279482Z E1212 18:38:32.399234 1 controller.go:279] Failed to check if machine "ci-op-vxh8i8vn-2770b-v8htb-master-0" exists: requeue in: 20s ``` [1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.3/73 [2] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.3/74/artifacts/e2e-aws-disruptive/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-12-05-183852-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-6c667d98d5-v4kwr/machine-controller/machine-controller/logs/current.log
I don't think either of the linked runs actually used an image that contained the fix: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.3/74/artifacts/release-images-latest/release-images-latest That shows: "name": "aws-machine-controllers", "annotations": { "io.openshift.build.commit.id": "a4be8350439b1da03902a65f1399b1269141c137", "io.openshift.build.commit.ref": "release-4.3", "io.openshift.build.source-location": "https://github.com/openshift/cluster-api-provider-aws" } That commit (a4be835) is from November 26th, and doesn't contain the fix. Can we try again with a newer release image?
Thank you Brad, sorry for the false alarm. I inspected a new disruptive test that had picked up the fix[1]. The above issue has been fixed, the eventual-consistency error had not occurred[2]. "name": "aws-machine-controllers", "annotations": { "io.openshift.build.commit.id": "21713459ae063978bac32b8ea35dc6d68998ed89", "io.openshift.build.commit.ref": "release-4.3", "io.openshift.build.source-location": "https://github.com/openshift/cluster-api-provider-aws" }, commit 21713459ae063978bac32b8ea35dc6d68998ed89 (HEAD -> release-4.3, origin/release-4.3) Merge: 1555366a ba72c04d Author: OpenShift Merge Robot <openshift-merge-robot.github.com> Date: Wed Dec 11 14:51:41 2019 +0100 Merge pull request #281 from openshift-cherrypick-robot/cherry-pick-280-to-release-4.3 [release-4.3] Bug 1782176: Ensure Spec.ProviderID is not empty string [1] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.3/77/artifacts/release-images-latest/release-images-latest [2] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.3/77/artifacts/e2e-aws-disruptive/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-12-14-042216-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-75d55f5c97-ccn9k/machine-controller/machine-controller/logs/current.log
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062