From a recent disruption job [1], where two control-plane Machines are deleted and then pushed back into the cluster. Here's the old Machine going out: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/12/artifacts/e2e-aws-disruptive/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2019-12-08-180953-sha256-c31e3068a603b8d8add473dbaaa5b933323a23dc862cb266855248e7cba5ac99/namespaces/openshift-machine-api/pods/machine-api-controllers-dfdc5f957-6fj59/machine-controller/machine-controller/logs/previous.log | grep -B1 i-0e21ada20e9b19da2 2019-12-08T22:19:53.070511591Z I1208 22:19:53.070473 1 actuator.go:336] ci-op-vkpy76m9-16905-pdt5z-master-1: deleting machine 2019-12-08T22:19:53.185377501Z I1208 22:19:53.185272 1 utils.go:218] Cleaning up extraneous instance for machine: i-0e21ada20e9b19da2, state: running, launchTime: 2019-12-08 21:43:24 +0000 UTC 2019-12-08T22:19:53.185486624Z I1208 22:19:53.185462 1 utils.go:222] Terminating i-0e21ada20e9b19da2 instance And then (in a new controller pod), the new Machine coming in and picking up that same (now terminated) i-0e21ada20e9b19da2: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/12/artifacts/e2e-aws-disruptive/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2019-12-08-180953-sha256-c31e3068a603b8d8add473dbaaa5b933323a23dc862cb266855248e7cba5ac99/namespaces/openshift-machine-api/pods/machine-api-controllers-dfdc5f957-6fj59/machine-controller/machine-controller/logs/current.log | grep ci-op-vkpy76m9-16905-pdt5z-master-1 | head -n12 2019-12-08T22:24:58.11203914Z I1208 22:24:58.110637 1 controller.go:161] Reconciling Machine "ci-op-vkpy76m9-16905-pdt5z-master-1" 2019-12-08T22:24:58.11203914Z I1208 22:24:58.110660 1 controller.go:201] Reconciling machine "ci-op-vkpy76m9-16905-pdt5z-master-1" triggers delete 2019-12-08T22:24:58.124204426Z I1208 22:24:58.124188 1 actuator.go:336] ci-op-vkpy76m9-16905-pdt5z-master-1: deleting machine 2019-12-08T22:24:58.20959814Z E1208 22:24:58.209529 1 utils.go:187] Excluding instance matching ci-op-vkpy76m9-16905-pdt5z-master-1: instance i-0e21ada20e9b19da2 state "terminated" is not in running 2019-12-08T22:24:58.20959814Z W1208 22:24:58.209569 1 actuator.go:377] ci-op-vkpy76m9-16905-pdt5z-master-1: no instances found to delete for machine 2019-12-08T22:24:58.209648523Z I1208 22:24:58.209630 1 controller.go:228] Deleting node "ip-10-0-154-177.ec2.internal" for machine "ci-op-vkpy76m9-16905-pdt5z-master-1" 2019-12-08T22:24:58.231206051Z I1208 22:24:58.231148 1 controller.go:242] Machine "ci-op-vkpy76m9-16905-pdt5z-master-1" deletion successful 2019-12-08T22:25:00.610252762Z I1208 22:25:00.610169 1 controller.go:161] Reconciling Machine "ci-op-vkpy76m9-16905-pdt5z-master-1" 2019-12-08T22:25:00.610252762Z I1208 22:25:00.610215 1 actuator.go:497] ci-op-vkpy76m9-16905-pdt5z-master-1: Checking if machine exists 2019-12-08T22:25:00.667060176Z E1208 22:25:00.666191 1 utils.go:187] Excluding instance matching ci-op-vkpy76m9-16905-pdt5z-master-1: instance i-0e21ada20e9b19da2 state "terminated" is not in running, pending, stopped, stopping, shutting-down 2019-12-08T22:25:00.667060176Z I1208 22:25:00.666222 1 actuator.go:506] ci-op-vkpy76m9-16905-pdt5z-master-1: Possible eventual-consistency discrepancy; returning an error to requeue 2019-12-08T22:25:00.667060176Z E1208 22:25:00.666241 1 controller.go:253] Failed to check if machine "ci-op-vkpy76m9-16905-pdt5z-master-1" exists: requeue in: 20s which continues for at least 30m: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/12/artifacts/e2e-aws-disruptive/must-gather/registry-svc-ci-openshift-org-ocp-4-4-2019-12-08-180953-sha256-c31e3068a603b8d8add473dbaaa5b933323a23dc862cb266855248e7cba5ac99/namespaces/openshift-machine-api/pods/machine-api-controllers-dfdc5f957-6fj59/machine-controller/machine-controller/logs/current.log | grep ci-op-vkpy76m9-16905-pdt5z-master-1 | tail -n4 2019-12-08T22:54:58.361762418Z I1208 22:54:58.360176 1 actuator.go:497] ci-op-vkpy76m9-16905-pdt5z-master-1: Checking if machine exists 2019-12-08T22:54:58.442454699Z E1208 22:54:58.442398 1 utils.go:187] Excluding instance matching ci-op-vkpy76m9-16905-pdt5z-master-1: instance i-0e21ada20e9b19da2 state "terminated" is not in running, pending, stopped, stopping, shutting-down 2019-12-08T22:54:58.442454699Z I1208 22:54:58.442429 1 actuator.go:506] ci-op-vkpy76m9-16905-pdt5z-master-1: Possible eventual-consistency discrepancy; returning an error to requeue 2019-12-08T22:54:58.442503051Z E1208 22:54:58.442449 1 controller.go:253] Failed to check if machine "ci-op-vkpy76m9-16905-pdt5z-master-1" exists: requeue in: 20s The confused logic around eventual-consistency landed for bug 1772163. I think we may need to go back and add limits to those eventual-consistency guards as discussed in [2]. From Clayton: > mark urgent since disruption tests have caught at least 5 serious bugs that would impact customers at the worst possible time - disruptive suites get priority because of that [1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/12 [2]: https://github.com/openshift/cluster-api-provider-aws/pull/271#discussion_r348300076
IMO, the fix is to revert everything that has to do with determining whether a machine is 'failed' and all the related actions. We shouldn't remove the details of the instance from the machine object, it should just be an 'instance state' field that describes the current state.
Are you asking for a revert of [1]? [1]: https://github.com/openshift/origin/commit/bd064a37e3ec5bf6db4b6a890096398a27be4459
> IMO, the fix is to revert everything that has to do with determining whether a machine is 'failed' and all the related actions. That seems pretty drastic. If you're going to suggest something like that, can you *please* be more specific? It seems to me that the existing PR should fix this: https://github.com/openshift/cluster-api-provider-aws/pull/280
Verified this has been fixed in 4.4.0-0.ci-2019-12-13-145806. Job https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/15 Version: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/15/artifacts/e2e-aws-disruptive/clusterversion.json https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-disruptive-4.4/15/artifacts/release-images-latest/release-images-latest commit 07a8fd7b9920bcc9c6e6e0ef37529f0b3969836f Merge: c22ade5b 3d1e2c69 Author: OpenShift Merge Robot <openshift-merge-robot.github.com> Date: Wed Dec 11 13:20:47 2019 +0100 Merge pull request #280 from mgugino-upstream-stage/check-empty-quote-providerspec Bug 1781339: Ensure Spec.ProviderID is not empty string