Bug 1772163
| Summary: | AWS: Machine API operator marks (pending) Machine Failed: Can't find created instance | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Cloud Compute | Assignee: | Alberto <agarcial> |
| Status: | CLOSED DUPLICATE | QA Contact: | Jianwei Hou <jhou> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.3.0 | CC: | rteague, spasquie |
| Target Milestone: | --- | | |
| Target Release: | 4.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-11-19 22:04:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (W. Trevor King, 2019-11-13 19:32:30 UTC)
> It looks like there are two MAO controller streams...

One side effect of this is that by the time we know the instance is pending, we must also know its ID. That is logged at 16:57:16.780784984Z, and we see the ID in the Machine object:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/242/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-49m8kdi7-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9.yaml | yaml2json | jq -r .status.providerStatus.instanceId
i-0b08e65c0de9820be
```

That means subsequent status lookups should use by-ID queries [1], but we see none of those entries in the logs:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/242/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-49m8kdi7-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-7445cdcd69-vlqjh/machine-controller/machine-controller/logs/current.log | grep -c 'by id: i-0b08e65c0de9820be'
0
```

getExistingInstances should still turn the instance up, and I'm not sure yet why it doesn't in the 16:57:16.898246721Z attempt. But if the MAO controller wasn't racing itself, this would have been an easier "AWS request ID ... for DescribeInstances with the instance-id filter set returned no results" or some such; that's both a more efficient query and an easier case to make when reporting AWS bugs to AWS.

[1]: https://github.com/openshift/cluster-api-provider-aws/blob/6814fc34f30a4c98d5edeb0a4ef73bbd03fa959b/pkg/actuators/machine/actuator.go#L544-L557
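For reference, the by-ID lookup in question would look roughly like this; a minimal sketch against the aws-sdk-go EC2 API, not the provider's actual code (the `lookupByID` helper and the hard-coded region are illustrative):

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// lookupByID sketches the by-ID query the actuator should use once
// status.providerStatus.instanceId is recorded. Using the "instance-id"
// filter (rather than the InstanceIds parameter) means an unknown or
// not-yet-visible ID yields zero reservations instead of an API error,
// matching the "returned no results" case described above.
func lookupByID(svc *ec2.EC2, instanceID string) (*ec2.Instance, error) {
	out, err := svc.DescribeInstances(&ec2.DescribeInstancesInput{
		Filters: []*ec2.Filter{{
			Name:   aws.String("instance-id"),
			Values: []*string{aws.String(instanceID)},
		}},
	})
	if err != nil {
		return nil, err
	}
	for _, r := range out.Reservations {
		for _, i := range r.Instances {
			return i, nil
		}
	}
	return nil, fmt.Errorf("no instance found by id: %s", instanceID)
}

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	svc := ec2.New(sess)
	if _, err := lookupByID(svc, "i-0b08e65c0de9820be"); err != nil {
		log.Fatal(err)
	}
}
```

A query like this fails loudly and cheaply when the ID is not yet visible, which is the "easier case to make when reporting AWS bugs to AWS" point above.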
Pulled the ignored-requeue-delay thing out into bug 1772192.

*** Bug 1771903 has been marked as a duplicate of this bug. ***

Here the time lapse between the conflicting exists() results is 2 seconds:

```
2019-11-13T16:57:14.38218154Z I1113 16:57:14.382147       1 actuator.go:489] ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9: Checking if machine exists
2019-11-13T16:57:14.431854727Z I1113 16:57:14.431815       1 actuator.go:497] ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9: Instance does not exist
```

and:

```
2019-11-13T16:57:16.781207156Z I1113 16:57:16.781193       1 actuator.go:489] ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9: Checking if machine exists
2019-11-13T16:57:16.898246721Z I1113 16:57:16.897946       1 actuator.go:497] ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9: Instance does not exist
```

In this other occurrence you can see the time lapse is 3 minutes:

```
I1112 21:35:29.387534       1 actuator.go:489] ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc: Checking if machine exists
I1112 21:35:29.485646       1 actuator.go:497] ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc: Instance does not exist
I1112 21:35:29.485668       1 controller.go:428] Machine "ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc" going into phase "Provisioning"
```

exists() successfully reports true several times in the middle, e.g.:

```
I1112 21:35:44.121408       1 actuator.go:489] ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc: Checking if machine exists
I1112 21:35:44.937542       1 actuator.go:553] ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc: Found instance by id: i-01f95db5aeec7c187
```

and eventually:

```
I1112 21:38:03.542779       1 actuator.go:497] ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc: Instance does not exist
I1112 21:38:03.542801       1 controller.go:428] Machine "ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc" going into phase "Failed"
```

That makes me think of these possible scenarios:

- AWS is failing to bootstrap the machine and the instance is going from pending to terminated: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/troubleshooting-launch.html#troubleshooting-launch-internal
- We are not accounting for a specific AWS state, e.g. "unknown", in our Exists() func (see the sketch below).

Trevor, is there any way to check the termination reason in those logs? e.g.:

```
"StateReason": {
    "Message": "Client.VolumeLimitExceeded: Volume limit exceeded",
    "Code": "Server.InternalError"
},
```

This should help in case we are missing a possible AWS state in our exists() check: https://github.com/openshift/cluster-api-provider-aws/pull/269
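To make that second scenario concrete, here is a sketch of the kind of state allow-list an exists()-style check depends on; this is hypothetical Go, not the actual provider code, and the state list and helper names are illustrative:

```go
package main

import "fmt"

// existingInstanceStates is the kind of allow-list a provider uses when
// deciding which EC2 instances count as "existing" for a machine. A state
// missing from this list (the hypothetical "unknown" above) would be
// silently excluded, so exists() would report false even though the
// instance is still there.
var existingInstanceStates = map[string]bool{
	"pending":       true,
	"running":       true,
	"stopping":      true,
	"stopped":       true,
	"shutting-down": true,
}

// instanceExists is a hypothetical stand-in for the actuator's Exists() path.
func instanceExists(machineName, state string) bool {
	if !existingInstanceStates[state] {
		// Analogous to the "Excluding instance matching ..." warning
		// discussed later in this bug.
		fmt.Printf("%s: excluding instance in state %q\n", machineName, state)
		return false
	}
	return true
}

func main() {
	fmt.Println(instanceExists("example-machine", "pending")) // true: pending counts as existing
	fmt.Println(instanceExists("example-machine", "unknown")) // false: unaccounted-for state drops the instance
}
```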
At the time the API reported phase 'Failed', AWS was still reporting instanceState 'pending' for the instance; at least, that is what is being returned from the API:

```yaml
status:
  addresses:
  - address: 10.0.151.226
    type: InternalIP
  - address: ''
    type: ExternalDNS
  - address: ip-10-0-151-226.ec2.internal
    type: InternalDNS
  errorMessage: Can't find created instance.
  lastUpdated: '2019-11-13T16:57:16Z'
  phase: Failed
  providerStatus:
    apiVersion: awsproviderconfig.openshift.io/v1beta1
    conditions:
    - lastProbeTime: '2019-11-13T16:57:16Z'
      lastTransitionTime: '2019-11-13T16:57:16Z'
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: 'True'
      type: MachineCreation
    instanceId: i-0b08e65c0de9820be
    instanceState: pending
    kind: AWSMachineProviderStatus
```

Another instance of this occurred in this job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/434/pull-ci-openshift-machine-api-operator-master-e2e-aws-scaleup-rhel7/144

```yaml
status:
  addresses:
  - address: 10.0.145.58
    type: InternalIP
  - address: ''
    type: ExternalDNS
  - address: ip-10-0-145-58.ec2.internal
    type: InternalDNS
  errorMessage: Can't find created instance.
  lastUpdated: '2019-11-14T16:15:20Z'
  phase: Failed
  providerStatus:
    apiVersion: awsproviderconfig.openshift.io/v1beta1
    conditions:
    - lastProbeTime: '2019-11-14T16:15:19Z'
      lastTransitionTime: '2019-11-14T16:15:19Z'
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: 'True'
      type: MachineCreation
    instanceId: i-01f493c22da1d89d9
    instanceState: pending
    kind: AWSMachineProviderStatus
```

(In reply to Russell Teague from comment #7)
> Another instance of this occurred in this job:
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/434/pull-ci-openshift-machine-api-operator-master-e2e-aws-scaleup-rhel7/144

That machine is ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g [1]. Checking for it in the controller logs:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/434/pull-ci-openshift-machine-api-operator-master-e2e-aws-scaleup-rhel7/144/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-h27b6mjn-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-744574c5f7-v2944/machine-controller/machine-controller/logs/current.log | grep -B2 'ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: \(Instance does not exist\|.*pending\)'
2019-11-14T16:15:17.830004619Z I1114 16:15:17.829986       1 controller.go:376] Machine "ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2019-11-14T16:15:17.830041524Z I1114 16:15:17.829998       1 actuator.go:489] ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: Checking if machine exists
2019-11-14T16:15:17.938841572Z I1114 16:15:17.938790       1 actuator.go:497] ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: Instance does not exist
--
2019-11-14T16:15:19.988382838Z I1114 16:15:19.988372       1 actuator.go:646] ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: finished calculating AWS status
2019-11-14T16:15:19.988486132Z I1114 16:15:19.988422       1 actuator.go:227] ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: machine status has changed, updating
2019-11-14T16:15:19.996200375Z I1114 16:15:19.996165       1 actuator.go:664] ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: Instance state still pending, returning an error to requeue
--
2019-11-14T16:15:19.996302084Z I1114 16:15:19.996289       1 controller.go:376] Machine "ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2019-11-14T16:15:19.996320327Z I1114 16:15:19.996305       1 actuator.go:489] ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: Checking if machine exists
2019-11-14T16:15:20.11500679Z I1114 16:15:20.114961       1 actuator.go:497] ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: Instance does not exist
```
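The actuator.go:664 line in that log corresponds to a pattern along these lines; this is a simplified sketch, and `updateStatus` and its arguments are illustrative, not the provider's real signature:

```go
package main

import (
	"fmt"
	"log"
)

// updateStatus sketches the update path visible in the log above: after the
// AWS status is recalculated and the Machine status is written, a
// still-pending instance is surfaced as an error so the generic machine
// controller requeues the Machine instead of treating reconciliation as done.
func updateStatus(machineName, instanceState string) error {
	// ... AWS status calculation and Machine status update elided ...
	if instanceState == "pending" {
		return fmt.Errorf("%s: Instance state still pending, returning an error to requeue", machineName)
	}
	return nil
}

func main() {
	if err := updateStatus("example-machine", "pending"); err != nil {
		log.Print(err) // the controller interprets this error as "try again"
	}
}
```

Judging by the log message, the controller treats this error as a retry signal, which is what keeps a pending Machine in the reconcile queue.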
That job didn't include [2]:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/434/pull-ci-openshift-machine-api-operator-master-e2e-aws-scaleup-rhel7/144/artifacts/release-latest/release-payload-latest/image-references | jq -r '.spec.tags[] | select(.name == "aws-machine-controllers").annotations["io.openshift.build.commit.id"]'
f68fb22560615bc2b556c7f5b7aff5dfcddd74ca
```

So future failures should have better logging, but I'm moving this back to ASSIGNED, because the job does contain [3] and the issue is obviously not fixed.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/434/pull-ci-openshift-machine-api-operator-master-e2e-aws-scaleup-rhel7/144/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-h27b6mjn-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g.yaml
[2]: https://github.com/openshift/cluster-api-provider-aws/pull/268
[3]: https://github.com/openshift/cluster-api-provider-aws/pull/269

Seen another occurrence of "Can't find created instance.":

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/255

(In reply to Russell Teague from comment #9)
> Seen another occurrence of "Can't find created instance."
>
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/255

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/255/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-b24mfpd6-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-785b4687bf-wspnd/machine-controller/machine-controller/logs/current.log | grep -B5 'ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: found 0 existing instances for machine'
2019-11-18T01:08:14.686879882Z I1118 01:08:14.686811       1 actuator.go:552] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: Found instance by id: i-04e60273f38857eed
2019-11-18T01:08:14.686879882Z I1118 01:08:14.686839       1 actuator.go:502] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: Instance exists as "i-04e60273f38857eed"
2019-11-18T01:08:14.686879882Z I1118 01:08:14.686853       1 controller.go:284] Reconciling machine "ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696" triggers idempotent update
2019-11-18T01:08:14.686879882Z I1118 01:08:14.686862       1 actuator.go:406] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: updating machine
2019-11-18T01:08:14.68696548Z I1118 01:08:14.686938       1 actuator.go:414] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: obtaining EC2 client for region
2019-11-18T01:08:14.741237517Z I1118 01:08:14.741189       1 actuator.go:430] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: found 0 existing instances for machine
```

So exists() found the machine's instance [1] by ID, but roughly 55 ms later the update path's lookup found no instances for it. [2] just landed in this space, though; maybe it will help.

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/255/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-b24mfpd6-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696.yaml
[2]: https://github.com/openshift/cluster-api-provider-aws/pull/270

Certainly looks to me like it might be an eventual-consistency issue on the AWS side [1]. The logged line from [2] came after the old getExistingInstances call [3], which would have logged any state exclusions [4]. But there are no exclusion warnings in the logs:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/255/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-b24mfpd6-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-785b4687bf-wspnd/machine-controller/machine-controller/logs/current.log | grep 'Excluding instance matching'
...no hits...
```

I don't see a way to log request IDs or describe results to pin that down, but I'm guessing that we need to wait and retry for some reasonable delay (30s?) before marking a machine Failed. Fixing bug 1772192 would help with that, by giving us a working way to queue the retry.

[1]: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency
[2]: https://github.com/openshift/cluster-api-provider-aws/blob/0ee2017971f62a3050bf44dac4753698e27e573e/pkg/actuators/machine/actuator.go#L430
[3]: https://github.com/openshift/cluster-api-provider-aws/blob/0ee2017971f62a3050bf44dac4753698e27e573e/pkg/actuators/machine/utils.go#L90
[4]: https://github.com/openshift/cluster-api-provider-aws/blob/0ee2017971f62a3050bf44dac4753698e27e573e/pkg/actuators/machine/utils.go#L186
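A minimal sketch of that proposed wait-and-retry behavior, assuming the requeue delay from bug 1772192 works; all names and the 30-second grace period are illustrative, not the provider's actual code:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// gracePeriod is the hypothetical delay suggested above for riding out
// DescribeInstances eventual consistency before declaring a machine Failed.
const gracePeriod = 30 * time.Second

var errRequeue = errors.New("requeue: instance not visible yet, retrying")

// checkExists sketches the proposal: if a machine that already has a
// recorded instance ID turns up empty, requeue (via a working requeue
// delay) until the grace period since creation has elapsed; only then
// mark the machine Failed.
func checkExists(found, hasInstanceID bool, created time.Time) error {
	if found || !hasInstanceID {
		return nil
	}
	if time.Since(created) < gracePeriod {
		return errRequeue // try again shortly instead of failing immediately
	}
	return fmt.Errorf("machine Failed: Can't find created instance")
}

func main() {
	created := time.Now().Add(-5 * time.Second)
	fmt.Println(checkExists(false, true, created)) // requeue: too early to mark Failed
}
```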
Ah, looks like this bug was a dup of a previously-reported issue. Closing as such.

*** This bug has been marked as a duplicate of bug 1761882 ***