Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2043080

Summary: openshift-installer intermittent failure on AWS with Error: InvalidVpcID.NotFound: The vpc ID 'vpc-123456789' does not exist
Product: OpenShift Container Platform Reporter: Greg Sheremeta <gshereme>
Component: Installer    Assignee: Nobody <nobody>
Installer sub component: openshift-installer QA Contact: Yunfei Jiang <yunjiang>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: cblecker, nstielau, padillon, wking, yunjiang
Version: 4.9    Keywords: ServiceDeliveryBlocker
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: there was an eventual consistency issue in the aws-terraform-provider when trying to update newly created VPCs.
Consequence: installs would fail when trying to access VPCs.
Fix: the installer was updated to an upstream terraform-provider release that respects eventual consistency.
Result: installs no longer fail.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-10 10:43:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2047390    

Description Greg Sheremeta 2022-01-20 15:25:58 UTC
openshift-installer intermittent failure on AWS with Error: InvalidVpcID.NotFound: The vpc ID 'vpc-123456789' does not exist

I believe this is a variation of Bug 2033256 and Bug 2032521

$ openshift-install version
4.9.x

Platform: AWS -- OSD and ROSA, specifically

Please specify:
IPI

What happened?

time="2022-01-19T09:39:08Z" level=debug msg="module.vpc.aws_vpc_dhcp_options.main[0]: Creation complete after 0s [id=dopt-09ca2034f7ea9d11d]"
time="2022-01-19T09:39:08Z" level=error
time="2022-01-19T09:39:08Z" level=error msg="Error: InvalidVpcID.NotFound: The vpc ID 'vpc-0c9b3c27047567519' does not exist"
time="2022-01-19T09:39:08Z" level=error msg="\tstatus code: 400, request id: 93713120-b081-4fdf-b7f9-35754aea8d31"


What did you expect to happen?
The installer creates the VPC itself, so it should certainly be able to find what it just created. --> Successful install

How to reproduce it (as minimally and precisely as possible)?
It is random and rare

Flow seems to be:
1. The installer creates a resource (here, a VPC)
2. AWS creates it
3. AWS reports that it does not exist
4. Terraform dies
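The flow above fails at step 3 because EC2 reads are eventually consistent: a resource created a moment ago may not yet be visible to a subsequent describe call. A minimal sketch of the retry-on-NotFound pattern that addresses this (illustrative only, not the actual installer or provider code; `describe_vpc` is a hypothetical stand-in for the AWS SDK's DescribeVpcs call):

```shell
#!/bin/sh
# Sketch: after VPC creation succeeds, poll until the new VPC ID is
# visible instead of failing on the first InvalidVpcID.NotFound.

# Hypothetical stand-in lookup; the real provider calls DescribeVpcs
# via the AWS SDK.
describe_vpc() {
  aws ec2 describe-vpcs --vpc-ids "$1" >/dev/null 2>&1
}

wait_for_vpc() {
  vpc_id=$1
  max_attempts=${2:-5}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if describe_vpc "$vpc_id"; then
      echo "vpc $vpc_id visible after $attempt attempt(s)"
      return 0
    fi
    sleep 1   # the real waiter uses a longer timeout with backoff
    attempt=$((attempt + 1))
  done
  echo "InvalidVpcID.NotFound: The vpc ID '$vpc_id' does not exist" >&2
  return 1
}
```

Without a waiter like this, step 3 surfaces immediately as the 400 error shown in the log above.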

Comment 1 Matthew Staebler 2022-01-20 15:55:19 UTC
> I believe this is a variation of Bug 2033256 and Bug 2032521

Yes, this is another case of eventual consistency issues with the AWS terraform provider. This will be addressed in 4.11 with the upgrade to the latest terraform provider.

Comment 2 Nick Stielau 2022-01-24 20:06:48 UTC
Can we get any more specifics on the failure rate here? "It is random and rare" -- rare sounds good, it's an edge case, but more concrete data (x out of y, bursty or not) would be helpful.

Comment 3 W. Trevor King 2022-01-24 20:19:42 UTC
Looking for "InvalidVpcID.NotFound" in CI over the past 2 days, it is very rare, with only a single one of our many, many runs hitting this issue:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=build-log&search=InvalidVpcID.NotFound' | grep 'failures match'
pull-ci-openshift-elasticsearch-operator-release-5.2-e2e-upgrade (all) - 35 runs, 100% failed, 3% of failures match = 3% impact
$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=build-log&search=InvalidVpcID.NotFound' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_elasticsearch-operator/829/pull-ci-openshift-elasticsearch-operator-release-5.2-e2e-upgrade/1485563070087434240

That might be rare enough that we can drop severity below high.  Although my impression is that frequency will depend on how fast AWS is able to reconcile eventual consistency on their end, which can vary by day and by region/zone, so "try a new install right now" might keep failing until AWS recovers from whatever is causing their elevated reconciliation delay.

Stretching back to 6 days:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=144h&type=build-log&search=InvalidVpcID.NotFound' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-driver-toolkit-release-4.8-e2e-aws-driver-toolkit/1483831736272949248
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-knative-eventing-kafka-release-v1.0-47-e2e-aws-ocp-47-continuous/1483590265674403840
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.6-e2e-aws-serial/1484017049284907008
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-fips/1484462827933536256
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-rh-ecosystem-edge-ci-artifacts-master-4.9-gpu-operator-e2e-17x/1484662237464367104
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/codeready-toolchain_member-operator/327/pull-ci-codeready-toolchain-member-operator-master-e2e/1484552929967869952
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/1721/pull-ci-kubevirt-hyperconverged-cluster-operator-main-okd-hco-e2e-upgrade-index-aws/1483828661768425472
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1264/pull-ci-openshift-cluster-network-operator-release-4.9-e2e-aws-sdn-multi/1483928751006814208
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_elasticsearch-operator/829/pull-ci-openshift-elasticsearch-operator-release-5.2-e2e-upgrade/1485563070087434240
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/917/pull-ci-openshift-ovn-kubernetes-master-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1483908332463853568
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/25388/rehearse-25388-pull-ci-openshift-windows-machine-config-operator-release-4.9-aws-e2e-upgrade/1483895059630788608
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/redhat-openshift-ecosystem_community-operators-prod/642/pull-ci-redhat-openshift-ecosystem-community-operators-prod-main-4.9-deploy-operator-on-openshift/1483550101199654912

I haven't dug in to see if that increased prevalence (which is still rare vs. our overall job volume) is clustered around a specific time, or if it is just that we run more jobs during the work week than we do on weekends.

Comment 4 Nick Stielau 2022-01-24 22:19:14 UTC
Yeah, I'm seeing that at ~0.2% failure rate for our CI runs over the past week.  It is probably spiky, but doesn't seem all that high from this data point.
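For reference, the rough arithmetic behind a figure like that: the 6-day search above returned 12 matching job URLs, and the total run count is an assumption here (the actual volume wasn't published in this bug). A figure in the neighborhood of 6,000 AWS CI runs would yield the quoted rate:

```shell
# 12 matches is the count of URLs from the 6-day search above;
# 6000 total runs is an illustrative assumption, not published data.
matches=12
total_runs=6000
awk -v m="$matches" -v t="$total_runs" 'BEGIN { printf "%.2f%%\n", 100 * m / t }'
```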

Comment 5 Matthew Staebler 2022-01-31 18:03:58 UTC
This was fixed upstream with https://github.com/hashicorp/terraform-provider-aws/commit/ba949c9b7c72d9ebccd1357ca0683ab8636a538e.

Comment 7 Patrick Dillon 2022-05-03 17:50:43 UTC
This bug seems to have been fixed indirectly by the terraform-aws-provider bump in https://github.com/openshift/installer/pull/5666

Confirmed that the upstream fix has been included with the current terraform-aws-provider version. CI search confirmed no results for 2 days but longer searches are timing out.

Moving to MODIFIED for QE verification.

Comment 9 Yunfei Jiang 2022-05-09 03:27:49 UTC
No errors found in ci logs for 7 days.

Comment 12 errata-xmlrpc 2022-08-10 10:43:12 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069