Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2043080

Summary: openshift-installer intermittent failure on AWS with Error: InvalidVpcID.NotFound: The vpc ID 'vpc-123456789' does not exist
Product: OpenShift Container Platform Reporter: Greg Sheremeta <gshereme>
Component: Installer    Assignee: Nobody <nobody>
Installer sub component: openshift-installer QA Contact: Yunfei Jiang <yunjiang>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: cblecker, nstielau, padillon, wking, yunjiang
Version: 4.9    Keywords: ServiceDeliveryBlocker
Target Milestone: ---   
Target Release: 4.11.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: there was an eventual consistency issue in the aws-terraform-provider when trying to update newly created VPCs.
Consequence: installs would fail when trying to access VPCs.
Fix: the installer was updated to an upstream terraform-provider release that respects eventual consistency.
Result: installs no longer fail.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-08-10 10:43:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2047390    

Description Greg Sheremeta 2022-01-20 15:25:58 UTC
openshift-installer intermittent failure on AWS with Error: InvalidVpcID.NotFound: The vpc ID 'vpc-123456789' does not exist

I believe this is a variation of Bug 2033256 and Bug 2032521

$ openshift-install version
4.9.x

Platform: AWS -- OSD and ROSA, specifically

Please specify:
IPI

What happened?

time="2022-01-19T09:39:08Z" level=debug msg="module.vpc.aws_vpc_dhcp_options.main[0]: Creation complete after 0s [id=dopt-09ca2034f7ea9d11d]"
time="2022-01-19T09:39:08Z" level=error
time="2022-01-19T09:39:08Z" level=error msg="Error: InvalidVpcID.NotFound: The vpc ID 'vpc-0c9b3c27047567519' does not exist"
time="2022-01-19T09:39:08Z" level=error msg="\tstatus code: 400, request id: 93713120-b081-4fdf-b7f9-35754aea8d31"


What did you expect to happen?
The installer creates the VPC itself, so it should certainly be able to find what it just created. --> Successful install

How to reproduce it (as minimally and precisely as possible)?
It is random and rare

Flow seems to be:
1. The installer creates a resource (here, a VPC)
2. AWS creates it
3. AWS reports that it does not exist
4. Terraform dies
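The flow above fails at step 3 because EC2 reads are eventually consistent: a resource created a moment ago may not yet be visible to a subsequent describe call. A minimal sketch of the retry-on-NotFound pattern that addresses this (illustrative only, not the actual installer or provider code; `describe_vpc` is a hypothetical stand-in for the AWS SDK's DescribeVpcs call):

```shell
#!/bin/sh
# Sketch: after VPC creation succeeds, poll until the new VPC ID is
# visible instead of failing on the first InvalidVpcID.NotFound.

# Hypothetical stand-in lookup; the real provider calls DescribeVpcs
# via the AWS SDK.
describe_vpc() {
  aws ec2 describe-vpcs --vpc-ids "$1" >/dev/null 2>&1
}

wait_for_vpc() {
  vpc_id=$1
  max_attempts=${2:-5}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if describe_vpc "$vpc_id"; then
      echo "vpc $vpc_id visible after $attempt attempt(s)"
      return 0
    fi
    sleep 1   # the real waiter uses a longer timeout with backoff
    attempt=$((attempt + 1))
  done
  echo "InvalidVpcID.NotFound: The vpc ID '$vpc_id' does not exist" >&2
  return 1
}
```

Without a waiter like this, step 3 surfaces immediately as the 400 error shown in the log above.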

Comment 1 Matthew Staebler 2022-01-20 15:55:19 UTC
> I believe this is a variation of Bug 2033256 and Bug 2032521

Yes, this is another case of eventual consistency issues with the AWS terraform provider. This will be addressed in 4.11 with the upgrade to the latest terraform provider.

Comment 2 Nick Stielau 2022-01-24 20:06:48 UTC
Can we get any more specifics on the failure rate here? "It is random and rare" -- rare sounds good, it's an edge case, but more concrete data (x out of y, bursty or not) would be helpful.

Comment 3 W. Trevor King 2022-01-24 20:19:42 UTC
Looking for "InvalidVpcID.NotFound" in CI over the past 2 days, it is very rare, with only a single one of our many, many runs hitting this issue:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=build-log&search=InvalidVpcID.NotFound' | grep 'failures match'
pull-ci-openshift-elasticsearch-operator-release-5.2-e2e-upgrade (all) - 35 runs, 100% failed, 3% of failures match = 3% impact
$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=build-log&search=InvalidVpcID.NotFound' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_elasticsearch-operator/829/pull-ci-openshift-elasticsearch-operator-release-5.2-e2e-upgrade/1485563070087434240

That might be rare enough that we can drop severity below high.  Although my impression is that frequency will depend on how fast AWS is able to reconcile eventual consistency on their end, which can vary by day and by region/zone, so "try a new install right now" might keep failing until AWS recovers from whatever is causing their elevated reconciliation delay.

Stretching back to 6 days:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=144h&type=build-log&search=InvalidVpcID.NotFound' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-driver-toolkit-release-4.8-e2e-aws-driver-toolkit/1483831736272949248
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-knative-eventing-kafka-release-v1.0-47-e2e-aws-ocp-47-continuous/1483590265674403840
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.6-e2e-aws-serial/1484017049284907008
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-fips/1484462827933536256
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-rh-ecosystem-edge-ci-artifacts-master-4.9-gpu-operator-e2e-17x/1484662237464367104
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/codeready-toolchain_member-operator/327/pull-ci-codeready-toolchain-member-operator-master-e2e/1484552929967869952
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/1721/pull-ci-kubevirt-hyperconverged-cluster-operator-main-okd-hco-e2e-upgrade-index-aws/1483828661768425472
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1264/pull-ci-openshift-cluster-network-operator-release-4.9-e2e-aws-sdn-multi/1483928751006814208
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_elasticsearch-operator/829/pull-ci-openshift-elasticsearch-operator-release-5.2-e2e-upgrade/1485563070087434240
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/917/pull-ci-openshift-ovn-kubernetes-master-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1483908332463853568
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/25388/rehearse-25388-pull-ci-openshift-windows-machine-config-operator-release-4.9-aws-e2e-upgrade/1483895059630788608
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/redhat-openshift-ecosystem_community-operators-prod/642/pull-ci-redhat-openshift-ecosystem-community-operators-prod-main-4.9-deploy-operator-on-openshift/1483550101199654912

I haven't dug in to see if that increased prevalence (which is still rare vs. our overall job volume) is clustered around a specific time, or if it is just that we run more jobs during the work week than we do on weekends.

Comment 4 Nick Stielau 2022-01-24 22:19:14 UTC
Yeah, I'm seeing that at ~0.2% failure rate for our CI runs over the past week.  It is probably spiky, but doesn't seem all that high from this data point.
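For reference, the rough arithmetic behind a figure like that: the 6-day search above returned 12 matching job URLs, and the total run count is an assumption here (the actual volume wasn't published in this bug). A figure in the neighborhood of 6,000 AWS CI runs would yield the quoted rate:

```shell
# 12 matches is the count of URLs from the 6-day search above;
# 6000 total runs is an illustrative assumption, not published data.
matches=12
total_runs=6000
awk -v m="$matches" -v t="$total_runs" 'BEGIN { printf "%.2f%%\n", 100 * m / t }'
```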

Comment 5 Matthew Staebler 2022-01-31 18:03:58 UTC
This was fixed upstream with https://github.com/hashicorp/terraform-provider-aws/commit/ba949c9b7c72d9ebccd1357ca0683ab8636a538e.

Comment 7 Patrick Dillon 2022-05-03 17:50:43 UTC
This bug seems to have been fixed indirectly by the terraform-aws-provider bump in https://github.com/openshift/installer/pull/5666

Confirmed that the upstream fix has been included with the current terraform-aws-provider version. CI search confirmed no results for 2 days but longer searches are timing out.

Moving to MODIFIED for QE verification.

Comment 9 Yunfei Jiang 2022-05-09 03:27:49 UTC
No errors found in ci logs for 7 days.

Comment 12 errata-xmlrpc 2022-08-10 10:43:12 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069