In a 4.3 CI run [1], we have a Failed machine with "Can't find created instance" [2]. Looking in the logs, here's a reconstructed timeline:

* 16:57:14.441534003Z MAO creates the machine: `creating machine` [3]
* 16:57:16.780784984Z MAO sees the machine pending: `Instance state still pending, returning an error to requeue` [3]
* 16:57:16.780894255Z MAO requeues for a later check: `requeue in: 20s` [3]
* 16:57:16.898246721Z MAO checks again milliseconds later: `Instance does not exist` [3]
* 17:00:37Z scaleup pod fails: `Container scaleup in pod e2e-aws-scaleup-rhel7 failed, exit code 2, reason Error` [4]
* ~17:02:18Z must-gather run [5]
* 17:07:04Z Installer 'destroy cluster' running in the teardown pod terminates the instance: `Terminating instance=i-0b08e65c0de9820be` [6] (confirmed in CloudTrail logs; this was AWS request a7b2be94-5f2d-4a32-9f48-2d885fe8abdd)
* 17:07:34Z Installer 'destroy cluster' running in the teardown pod sees the state move to `terminated`: `Terminated instance=i-0b08e65c0de9820be` [6]

It looks like there are two MAO controller streams checking on the same Machine in parallel, but the fundamental issue seems to be that MAO failed to see the instance in one of the "exists" states [7,8,9].
[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/242
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/242/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-49m8kdi7-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9.yaml
[3]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/242/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-49m8kdi7-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-7445cdcd69-vlqjh/machine-controller/machine-controller/logs/current.log
[4]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/242#1:build-log.txt%3A7638
[5]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/242/artifacts/e2e-aws-scaleup-rhel7/must-gather/must-gather.log
[6]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/242/artifacts/e2e-aws-scaleup-rhel7/installer/.openshift_install.log
[7]: https://github.com/openshift/cluster-api-provider-aws/blob/6814fc34f30a4c98d5edeb0a4ef73bbd03fa959b/pkg/actuators/machine/utils.go#L36-L46
[8]: https://github.com/openshift/cluster-api-provider-aws/blob/6814fc34f30a4c98d5edeb0a4ef73bbd03fa959b/pkg/actuators/machine/actuator.go#L486-L504
[9]: https://github.com/openshift/cluster-api/blob/9a3a7bbe9258f5c3117cb2aba6a560babe0d59f8/pkg/controller/machine/controller.go#L277-L313
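For reference, the "exists in one of these states" check in [7] boils down to a state-membership test. Here's a minimal Go sketch of that idea; the state list below is assumed from the EC2 instance lifecycle, not copied from the provider source:

```go
package main

import "fmt"

// existsStates approximates the set of EC2 instance states the provider
// treats as "machine exists": anything still in the launch/run/stop
// lifecycle counts, while terminated does not. Assumed list, for
// illustration only.
var existsStates = map[string]bool{
	"pending":  true,
	"running":  true,
	"stopping": true,
	"stopped":  true,
}

func instanceExists(state string) bool {
	return existsStates[state]
}

func main() {
	// In the timeline above, AWS reported the instance as pending at
	// 16:57:16 yet the exists check at 16:57:16.898 saw nothing at all,
	// so the failure is in the lookup, not in a filter like this one.
	fmt.Println(instanceExists("pending"))    // true
	fmt.Println(instanceExists("terminated")) // false
}
```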
> It looks like there are two MAO controller streams...

One side effect of this: once we know the instance is pending, we must know its ID. That is logged at 16:57:16.780784984Z, and we see the ID in the Machine object:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/242/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-49m8kdi7-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9.yaml | yaml2json | jq -r .status.providerStatus.instanceId
i-0b08e65c0de9820be

That means subsequent status lookups should use by-ID queries [1], but we see none of those entries in the logs:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/242/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-49m8kdi7-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-7445cdcd69-vlqjh/machine-controller/machine-controller/logs/current.log | grep -c 'by id: i-0b08e65c0de9820be'
0

getExistingInstances should still turn the instance up, and I'm not sure yet why it didn't in the 16:57:16.898246721Z attempt. But if the MAO controller weren't racing itself, this would have been an easier "AWS request ID ... for DescribeInstances with the instance-id filter set returned no results" or some such; a by-ID query is both more efficient and an easier case to make when reporting AWS bugs to AWS.

[1]: https://github.com/openshift/cluster-api-provider-aws/blob/6814fc34f30a4c98d5edeb0a4ef73bbd03fa959b/pkg/actuators/machine/actuator.go#L544-L557
Pulled the ignored-requeue-delay thing out into bug 1772192.
*** Bug 1771903 has been marked as a duplicate of this bug. ***
Here the time lapse in the conflicting exists() discrepancy is 2 seconds:

2019-11-13T16:57:14.38218354Z I1113 16:57:14.382147 1 actuator.go:489] ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9: Checking if machine exists
2019-11-13T16:57:14.431854727Z I1113 16:57:14.431815 1 actuator.go:497] ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9: Instance does not exist

and:

2019-11-13T16:57:16.781207156Z I1113 16:57:16.781193 1 actuator.go:489] ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9: Checking if machine exists
2019-11-13T16:57:16.898246721Z I1113 16:57:16.897946 1 actuator.go:497] ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9: Instance does not exist

In this other occurrence you can see the time lapse is 3 minutes:

I1112 21:35:29.387534 1 actuator.go:489] ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc: Checking if machine exists
I1112 21:35:29.485646 1 actuator.go:497] ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc: Instance does not exist
I1112 21:35:29.485668 1 controller.go:428] Machine "ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc" going into phase "Provisioning"

exists() successfully reports true several times in the middle, e.g.:

I1112 21:35:44.121408 1 actuator.go:489] ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc: Checking if machine exists
I1112 21:35:44.937542 1 actuator.go:553] ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc: Found instance by id: i-01f95db5aeec7c187

and eventually:

I1112 21:38:03.542779 1 actuator.go:497] ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc: Instance does not exist
I1112 21:38:03.542801 1 controller.go:428] Machine "ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc" going into phase "Failed"

Which makes me think of these possible scenarios:

- AWS is failing to bootstrap the machine and going from pending to terminated: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/troubleshooting-launch.html#troubleshooting-launch-internal
- We are not accounting for a specific AWS state, e.g. "unknown", in our exists() func

Trevor, is there any way to check the termination reason in those logs? e.g.:

"StateReason": {
    "Message": "Client.VolumeLimitExceeded: Volume limit exceeded",
    "Code": "Server.InternalError"
},

This should help in case we are missing a possible AWS state in our exists() check: https://github.com/openshift/cluster-api-provider-aws/pull/269
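For context on what a StateReason check could look like: EC2 attaches a StateReason (code plus message) to instances that leave the running lifecycle, and surfacing it in the controller logs would answer the question above. A minimal sketch with stand-in types (not the aws-sdk-go structs) and illustrative names throughout:

```go
package main

import "fmt"

// stateReason is a stand-in for the StateReason block EC2 reports for
// instances that fail to launch or are terminated; in aws-sdk-go the real
// field lives on the ec2.Instance struct.
type stateReason struct {
	Code    string
	Message string
}

type instance struct {
	ID          string
	State       string
	StateReason *stateReason
}

// terminationReason formats why an instance left the running lifecycle, so a
// missed state (or an AWS-side launch failure) would show up in the logs.
func terminationReason(i instance) string {
	if i.StateReason == nil {
		return fmt.Sprintf("%s: state %q, no StateReason reported", i.ID, i.State)
	}
	return fmt.Sprintf("%s: state %q, reason %s: %s", i.ID, i.State, i.StateReason.Code, i.StateReason.Message)
}

func main() {
	fmt.Println(terminationReason(instance{
		ID:    "i-0b08e65c0de9820be",
		State: "terminated",
		StateReason: &stateReason{
			Code:    "Server.InternalError",
			Message: "Client.VolumeLimitExceeded: Volume limit exceeded",
		},
	}))
}
```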
At the time the API reported phase 'Failed', AWS was still reporting instanceState 'pending' for the instance; at least that is what is being returned from the API:

status:
  addresses:
  - address: 10.0.151.226
    type: InternalIP
  - address: ''
    type: ExternalDNS
  - address: ip-10-0-151-226.ec2.internal
    type: InternalDNS
  errorMessage: Can't find created instance.
  lastUpdated: '2019-11-13T16:57:16Z'
  phase: Failed
  providerStatus:
    apiVersion: awsproviderconfig.openshift.io/v1beta1
    conditions:
    - lastProbeTime: '2019-11-13T16:57:16Z'
      lastTransitionTime: '2019-11-13T16:57:16Z'
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: 'True'
      type: MachineCreation
    instanceId: i-0b08e65c0de9820be
    instanceState: pending
    kind: AWSMachineProviderStatus
Another instance of this occurred in this job:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/434/pull-ci-openshift-machine-api-operator-master-e2e-aws-scaleup-rhel7/144

status:
  addresses:
  - address: 10.0.145.58
    type: InternalIP
  - address: ''
    type: ExternalDNS
  - address: ip-10-0-145-58.ec2.internal
    type: InternalDNS
  errorMessage: Can't find created instance.
  lastUpdated: '2019-11-14T16:15:20Z'
  phase: Failed
  providerStatus:
    apiVersion: awsproviderconfig.openshift.io/v1beta1
    conditions:
    - lastProbeTime: '2019-11-14T16:15:19Z'
      lastTransitionTime: '2019-11-14T16:15:19Z'
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: 'True'
      type: MachineCreation
    instanceId: i-01f493c22da1d89d9
    instanceState: pending
    kind: AWSMachineProviderStatus
(In reply to Russell Teague from comment #7)
> Another instance of this occurred in this job:
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/434/pull-ci-openshift-machine-api-operator-master-e2e-aws-scaleup-rhel7/144

That machine is ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g [1]. Checking for it in the controller logs:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/434/pull-ci-openshift-machine-api-operator-master-e2e-aws-scaleup-rhel7/144/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-h27b6mjn-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-744574c5f7-v2944/machine-controller/machine-controller/logs/current.log | grep -B2 'ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: \(Instance does not exist\|.*pending\)'
2019-11-14T16:15:17.830004619Z I1114 16:15:17.829986 1 controller.go:376] Machine "ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2019-11-14T16:15:17.830041524Z I1114 16:15:17.829998 1 actuator.go:489] ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: Checking if machine exists
2019-11-14T16:15:17.938841572Z I1114 16:15:17.938790 1 actuator.go:497] ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: Instance does not exist
--
2019-11-14T16:15:19.988382838Z I1114 16:15:19.988372 1 actuator.go:646] ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: finished calculating AWS status
2019-11-14T16:15:19.988486132Z I1114 16:15:19.988422 1 actuator.go:227] ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: machine status has changed, updating
2019-11-14T16:15:19.996200375Z I1114 16:15:19.996165 1 actuator.go:664] ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: Instance state still pending, returning an error to requeue
--
2019-11-14T16:15:19.996302084Z I1114 16:15:19.996289 1 controller.go:376] Machine "ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2019-11-14T16:15:19.996320327Z I1114 16:15:19.996305 1 actuator.go:489] ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: Checking if machine exists
2019-11-14T16:15:20.11500679Z I1114 16:15:20.114961 1 actuator.go:497] ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: Instance does not exist

That job didn't include [2]:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/434/pull-ci-openshift-machine-api-operator-master-e2e-aws-scaleup-rhel7/144/artifacts/release-latest/release-payload-latest/image-references | jq -r '.spec.tags[] | select(.name == "aws-machine-controllers").annotations["io.openshift.build.commit.id"]'
f68fb22560615bc2b556c7f5b7aff5dfcddd74ca

So future failures should have better logging, but I'm moving back to ASSIGNED because it does contain [3] and the issue is obviously not fixed.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/434/pull-ci-openshift-machine-api-operator-master-e2e-aws-scaleup-rhel7/144/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-h27b6mjn-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g.yaml
[2]: https://github.com/openshift/cluster-api-provider-aws/pull/268
[3]: https://github.com/openshift/cluster-api-provider-aws/pull/269
Seen another occurrence of "Can't find created instance." https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/255
And again, https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2659/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/2994
(In reply to Russell Teague from comment #9)
> Seen another occurrence of "Can't find created instance."
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/255

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/255/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-b24mfpd6-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-785b4687bf-wspnd/machine-controller/machine-controller/logs/current.log | grep -B5 'ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: found 0 existing instances for machine'
2019-11-18T01:08:14.686879882Z I1118 01:08:14.686811 1 actuator.go:552] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: Found instance by id: i-04e60273f38857eed
2019-11-18T01:08:14.686879882Z I1118 01:08:14.686839 1 actuator.go:502] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: Instance exists as "i-04e60273f38857eed"
2019-11-18T01:08:14.686879882Z I1118 01:08:14.686853 1 controller.go:284] Reconciling machine "ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696" triggers idempotent update
2019-11-18T01:08:14.686879882Z I1118 01:08:14.686862 1 actuator.go:406] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: updating machine
2019-11-18T01:08:14.68696548Z I1118 01:08:14.686938 1 actuator.go:414] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: obtaining EC2 client for region
2019-11-18T01:08:14.741237517Z I1118 01:08:14.741189 1 actuator.go:430] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: found 0 existing instances for machine

But [2] just landed in this space; maybe it will help.
[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/255/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-b24mfpd6-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696.yaml
[2]: https://github.com/openshift/cluster-api-provider-aws/pull/270
Certainly looks to me like it might be an eventual-consistency issue on the AWS side [1]. The logged line from [2] came after the old getExistingInstances call [3], which would have logged any state exclusions [4]. But there are no exclusion warnings in the logs:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/255/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-b24mfpd6-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-785b4687bf-wspnd/machine-controller/machine-controller/logs/current.log | grep 'Excluding instance matching'
...no hits...

I don't see a way to log request IDs or describe results to pin that down, but I'm guessing that we need to wait and retry for some reasonable delay (30s?) before marking a machine Failed. Fixing bug 1772192 would help with that, by giving us a working way to queue the retry.

[1]: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency
[2]: https://github.com/openshift/cluster-api-provider-aws/blob/0ee2017971f62a3050bf44dac4753698e27e573e/pkg/actuators/machine/actuator.go#L430
[3]: https://github.com/openshift/cluster-api-provider-aws/blob/0ee2017971f62a3050bf44dac4753698e27e573e/pkg/actuators/machine/utils.go#L90
[4]: https://github.com/openshift/cluster-api-provider-aws/blob/0ee2017971f62a3050bf44dac4753698e27e573e/pkg/actuators/machine/utils.go#L186
Ah, looks like this bug was a dup of a previously-reported issue. Closing as such. *** This bug has been marked as a duplicate of bug 1761882 ***