Bug 1772163
| Summary: | AWS: Machine API operator marks (pending) Machine Failed: Can't find created instance | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Cloud Compute | Assignee: | Alberto <agarcial> |
| Status: | CLOSED DUPLICATE | QA Contact: | Jianwei Hou <jhou> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.3.0 | CC: | rteague, spasquie |
| Target Milestone: | --- | | |
| Target Release: | 4.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-11-19 22:04:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (W. Trevor King, 2019-11-13 19:32:30 UTC)
> It looks like there are two MAO controller streams...

One side effect of this is that by the time we know the instance is pending, we must also know its ID. That is logged at 16:57:16.780784984Z, and we see the ID in the Machine object:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/242/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-49m8kdi7-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9.yaml | yaml2json | jq -r .status.providerStatus.instanceId
i-0b08e65c0de9820be
```

That means subsequent status lookups should use by-ID queries [1], but we see none of those entries in the logs:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/242/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-49m8kdi7-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-7445cdcd69-vlqjh/machine-controller/machine-controller/logs/current.log | grep -c 'by id: i-0b08e65c0de9820be'
0
```

getExistingInstances should still turn the instance up, and I'm not sure yet why it doesn't in the 16:57:16.898246721Z attempt. But if the MAO controller wasn't racing itself, this would have been an easier "AWS request ID ... for DescribeInstances with the instance-id filter set returned no results" or some such; that's both a more efficient query and an easier case to make when reporting AWS bugs to AWS.

[1]: https://github.com/openshift/cluster-api-provider-aws/blob/6814fc34f30a4c98d5edeb0a4ef73bbd03fa959b/pkg/actuators/machine/actuator.go#L544-L557
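For reference, the by-ID lookup in question would look roughly like this; a minimal sketch against the aws-sdk-go EC2 API, not the provider's actual code (the `lookupByID` helper and the hard-coded region are illustrative):

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// lookupByID sketches the by-ID query the actuator should use once
// status.providerStatus.instanceId is recorded. Using the "instance-id"
// filter (rather than the InstanceIds parameter) means an unknown or
// not-yet-visible ID yields zero reservations instead of an API error,
// matching the "returned no results" case described above.
func lookupByID(svc *ec2.EC2, instanceID string) (*ec2.Instance, error) {
	out, err := svc.DescribeInstances(&ec2.DescribeInstancesInput{
		Filters: []*ec2.Filter{{
			Name:   aws.String("instance-id"),
			Values: []*string{aws.String(instanceID)},
		}},
	})
	if err != nil {
		return nil, err
	}
	for _, r := range out.Reservations {
		for _, i := range r.Instances {
			return i, nil
		}
	}
	return nil, fmt.Errorf("no instance found by id: %s", instanceID)
}

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	svc := ec2.New(sess)
	if _, err := lookupByID(svc, "i-0b08e65c0de9820be"); err != nil {
		log.Fatal(err)
	}
}
```

A query like this fails loudly and cheaply when the ID is not yet visible, which is the "easier case to make when reporting AWS bugs to AWS" point above.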
Pulled the ignored-requeue-delay thing out into bug 1772192.

*** Bug 1771903 has been marked as a duplicate of this bug. ***

Here the time lapse between the conflicting exists() results is 2 seconds:

```
2019-11-13T16:57:14.38218154Z I1113 16:57:14.382147       1 actuator.go:489] ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9: Checking if machine exists
2019-11-13T16:57:14.431854727Z I1113 16:57:14.431815       1 actuator.go:497] ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9: Instance does not exist
```

and:

```
2019-11-13T16:57:16.781207156Z I1113 16:57:16.781193       1 actuator.go:489] ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9: Checking if machine exists
2019-11-13T16:57:16.898246721Z I1113 16:57:16.897946       1 actuator.go:497] ci-op-49m8kdi7-6df53-9bmsm-worker-us-east-1b-centos-mwlr9: Instance does not exist
```

In this other occurrence you can see the time lapse is 3 minutes:

```
I1112 21:35:29.387534       1 actuator.go:489] ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc: Checking if machine exists
I1112 21:35:29.485646       1 actuator.go:497] ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc: Instance does not exist
I1112 21:35:29.485668       1 controller.go:428] Machine "ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc" going into phase "Provisioning"
```

exists() successfully reports true several times in the middle, e.g.:

```
I1112 21:35:44.121408       1 actuator.go:489] ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc: Checking if machine exists
I1112 21:35:44.937542       1 actuator.go:553] ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc: Found instance by id: i-01f95db5aeec7c187
```

and eventually:

```
I1112 21:38:03.542779       1 actuator.go:497] ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc: Instance does not exist
I1112 21:38:03.542801       1 controller.go:428] Machine "ci-op-lklc3wmp-2249a-v4thp-worker-us-east-1a-mcxxc" going into phase "Failed"
```

That makes me think of these possible scenarios:

- AWS is failing to bootstrap the machine and the instance is going from pending to terminated: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/troubleshooting-launch.html#troubleshooting-launch-internal
- We are not accounting for a specific AWS state, e.g. "unknown", in our Exists() func (see the sketch below).

Trevor, is there any way to check the termination reason in those logs? e.g.:

```
"StateReason": {
    "Message": "Client.VolumeLimitExceeded: Volume limit exceeded",
    "Code": "Server.InternalError"
},
```

This should help in case we are missing a possible AWS state in our exists() check: https://github.com/openshift/cluster-api-provider-aws/pull/269
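To make that second scenario concrete, here is a sketch of the kind of state allow-list an exists()-style check depends on; this is hypothetical Go, not the actual provider code, and the state list and helper names are illustrative:

```go
package main

import "fmt"

// existingInstanceStates is the kind of allow-list a provider uses when
// deciding which EC2 instances count as "existing" for a machine. A state
// missing from this list (the hypothetical "unknown" above) would be
// silently excluded, so exists() would report false even though the
// instance is still there.
var existingInstanceStates = map[string]bool{
	"pending":       true,
	"running":       true,
	"stopping":      true,
	"stopped":       true,
	"shutting-down": true,
}

// instanceExists is a hypothetical stand-in for the actuator's Exists() path.
func instanceExists(machineName, state string) bool {
	if !existingInstanceStates[state] {
		// Analogous to the "Excluding instance matching ..." warning
		// discussed later in this bug.
		fmt.Printf("%s: excluding instance in state %q\n", machineName, state)
		return false
	}
	return true
}

func main() {
	fmt.Println(instanceExists("example-machine", "pending")) // true: pending counts as existing
	fmt.Println(instanceExists("example-machine", "unknown")) // false: unaccounted-for state drops the instance
}
```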
At the time the API reported phase 'Failed', AWS was still reporting instanceState 'pending' for the instance; at least, that is what is being returned from the API:

```yaml
status:
  addresses:
  - address: 10.0.151.226
    type: InternalIP
  - address: ''
    type: ExternalDNS
  - address: ip-10-0-151-226.ec2.internal
    type: InternalDNS
  errorMessage: Can't find created instance.
  lastUpdated: '2019-11-13T16:57:16Z'
  phase: Failed
  providerStatus:
    apiVersion: awsproviderconfig.openshift.io/v1beta1
    conditions:
    - lastProbeTime: '2019-11-13T16:57:16Z'
      lastTransitionTime: '2019-11-13T16:57:16Z'
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: 'True'
      type: MachineCreation
    instanceId: i-0b08e65c0de9820be
    instanceState: pending
    kind: AWSMachineProviderStatus
```

Another instance of this occurred in this job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/434/pull-ci-openshift-machine-api-operator-master-e2e-aws-scaleup-rhel7/144

```yaml
status:
  addresses:
  - address: 10.0.145.58
    type: InternalIP
  - address: ''
    type: ExternalDNS
  - address: ip-10-0-145-58.ec2.internal
    type: InternalDNS
  errorMessage: Can't find created instance.
  lastUpdated: '2019-11-14T16:15:20Z'
  phase: Failed
  providerStatus:
    apiVersion: awsproviderconfig.openshift.io/v1beta1
    conditions:
    - lastProbeTime: '2019-11-14T16:15:19Z'
      lastTransitionTime: '2019-11-14T16:15:19Z'
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: 'True'
      type: MachineCreation
    instanceId: i-01f493c22da1d89d9
    instanceState: pending
    kind: AWSMachineProviderStatus
```

(In reply to Russell Teague from comment #7)
> Another instance of this occurred in this job:
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/434/pull-ci-openshift-machine-api-operator-master-e2e-aws-scaleup-rhel7/144

That machine is ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g [1]. Checking for it in the controller logs:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/434/pull-ci-openshift-machine-api-operator-master-e2e-aws-scaleup-rhel7/144/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-h27b6mjn-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-744574c5f7-v2944/machine-controller/machine-controller/logs/current.log | grep -B2 'ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: \(Instance does not exist\|.*pending\)'
2019-11-14T16:15:17.830004619Z I1114 16:15:17.829986       1 controller.go:376] Machine "ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2019-11-14T16:15:17.830041524Z I1114 16:15:17.829998       1 actuator.go:489] ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: Checking if machine exists
2019-11-14T16:15:17.938841572Z I1114 16:15:17.938790       1 actuator.go:497] ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: Instance does not exist
--
2019-11-14T16:15:19.988382838Z I1114 16:15:19.988372       1 actuator.go:646] ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: finished calculating AWS status
2019-11-14T16:15:19.988486132Z I1114 16:15:19.988422       1 actuator.go:227] ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: machine status has changed, updating
2019-11-14T16:15:19.996200375Z I1114 16:15:19.996165       1 actuator.go:664] ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: Instance state still pending, returning an error to requeue
--
2019-11-14T16:15:19.996302084Z I1114 16:15:19.996289       1 controller.go:376] Machine "ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
2019-11-14T16:15:19.996320327Z I1114 16:15:19.996305       1 actuator.go:489] ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: Checking if machine exists
2019-11-14T16:15:20.11500679Z I1114 16:15:20.114961       1 actuator.go:497] ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g: Instance does not exist
```
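The actuator.go:664 line in that log corresponds to a pattern along these lines; this is a simplified sketch, and `updateStatus` and its arguments are illustrative, not the provider's real signature:

```go
package main

import (
	"fmt"
	"log"
)

// updateStatus sketches the update path visible in the log above: after the
// AWS status is recalculated and the Machine status is written, a
// still-pending instance is surfaced as an error so the generic machine
// controller requeues the Machine instead of treating reconciliation as done.
func updateStatus(machineName, instanceState string) error {
	// ... AWS status calculation and Machine status update elided ...
	if instanceState == "pending" {
		return fmt.Errorf("%s: Instance state still pending, returning an error to requeue", machineName)
	}
	return nil
}

func main() {
	if err := updateStatus("example-machine", "pending"); err != nil {
		log.Print(err) // the controller interprets this error as "try again"
	}
}
```

Judging by the log message, the controller treats this error as a retry signal, which is what keeps a pending Machine in the reconcile queue.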
That job didn't include [2]:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/434/pull-ci-openshift-machine-api-operator-master-e2e-aws-scaleup-rhel7/144/artifacts/release-latest/release-payload-latest/image-references | jq -r '.spec.tags[] | select(.name == "aws-machine-controllers").annotations["io.openshift.build.commit.id"]'
f68fb22560615bc2b556c7f5b7aff5dfcddd74ca
```

So future failures should have better logging, but I'm moving this back to ASSIGNED, because the job does contain [3] and the issue is obviously not fixed.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/434/pull-ci-openshift-machine-api-operator-master-e2e-aws-scaleup-rhel7/144/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-h27b6mjn-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-h27b6mjn-bbd80-bjvpg-worker-us-east-1b-centos-fw89g.yaml
[2]: https://github.com/openshift/cluster-api-provider-aws/pull/268
[3]: https://github.com/openshift/cluster-api-provider-aws/pull/269

Seen another occurrence of "Can't find created instance.":

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/255

(In reply to Russell Teague from comment #9)
> Seen another occurrence of "Can't find created instance."
>
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/255

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/255/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-b24mfpd6-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-785b4687bf-wspnd/machine-controller/machine-controller/logs/current.log | grep -B5 'ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: found 0 existing instances for machine'
2019-11-18T01:08:14.686879882Z I1118 01:08:14.686811       1 actuator.go:552] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: Found instance by id: i-04e60273f38857eed
2019-11-18T01:08:14.686879882Z I1118 01:08:14.686839       1 actuator.go:502] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: Instance exists as "i-04e60273f38857eed"
2019-11-18T01:08:14.686879882Z I1118 01:08:14.686853       1 controller.go:284] Reconciling machine "ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696" triggers idempotent update
2019-11-18T01:08:14.686879882Z I1118 01:08:14.686862       1 actuator.go:406] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: updating machine
2019-11-18T01:08:14.68696548Z I1118 01:08:14.686938       1 actuator.go:414] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: obtaining EC2 client for region
2019-11-18T01:08:14.741237517Z I1118 01:08:14.741189       1 actuator.go:430] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: found 0 existing instances for machine
```

So exists() found the machine's instance [1] by ID, but roughly 55 ms later the update path's lookup found no instances for it. [2] just landed in this space, though; maybe it will help.

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/255/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-b24mfpd6-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/machine.openshift.io/machines/ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696.yaml
[2]: https://github.com/openshift/cluster-api-provider-aws/pull/270

Certainly looks to me like it might be an eventual-consistency issue on the AWS side [1]. The logged line from [2] came after the old getExistingInstances call [3], which would have logged any state exclusions [4]. But there are no exclusion warnings in the logs:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/255/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-b24mfpd6-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-785b4687bf-wspnd/machine-controller/machine-controller/logs/current.log | grep 'Excluding instance matching'
...no hits...
```

I don't see a way to log request IDs or describe results to pin that down, but I'm guessing that we need to wait and retry for some reasonable delay (30s?) before marking a machine Failed. Fixing bug 1772192 would help with that, by giving us a working way to queue the retry.

[1]: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency
[2]: https://github.com/openshift/cluster-api-provider-aws/blob/0ee2017971f62a3050bf44dac4753698e27e573e/pkg/actuators/machine/actuator.go#L430
[3]: https://github.com/openshift/cluster-api-provider-aws/blob/0ee2017971f62a3050bf44dac4753698e27e573e/pkg/actuators/machine/utils.go#L90
[4]: https://github.com/openshift/cluster-api-provider-aws/blob/0ee2017971f62a3050bf44dac4753698e27e573e/pkg/actuators/machine/utils.go#L186
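A minimal sketch of that proposed wait-and-retry behavior, assuming the requeue delay from bug 1772192 works; all names and the 30-second grace period are illustrative, not the provider's actual code:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// gracePeriod is the hypothetical delay suggested above for riding out
// DescribeInstances eventual consistency before declaring a machine Failed.
const gracePeriod = 30 * time.Second

var errRequeue = errors.New("requeue: instance not visible yet, retrying")

// checkExists sketches the proposal: if a machine that already has a
// recorded instance ID turns up empty, requeue (via a working requeue
// delay) until the grace period since creation has elapsed; only then
// mark the machine Failed.
func checkExists(found, hasInstanceID bool, created time.Time) error {
	if found || !hasInstanceID {
		return nil
	}
	if time.Since(created) < gracePeriod {
		return errRequeue // try again shortly instead of failing immediately
	}
	return fmt.Errorf("machine Failed: Can't find created instance")
}

func main() {
	created := time.Now().Add(-5 * time.Second)
	fmt.Println(checkExists(false, true, created)) // requeue: too early to mark Failed
}
```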
Ah, looks like this bug was a dup of a previously-reported issue. Closing as such.

*** This bug has been marked as a duplicate of bug 1761882 ***