openshift-installer intermittent failure on AWS with Error: InvalidVpcID.NotFound: The vpc ID 'vpc-123456789' does not exist

I believe this is a variation of Bug 2033256 and Bug 2032521

$ openshift-install version
4.9.x

Platform: AWS -- OSD and ROSA, specifically

Please specify: IPI

What happened?

time="2022-01-19T09:39:08Z" level=debug msg="module.vpc.aws_vpc_dhcp_options.main[0]: Creation complete after 0s [id=dopt-09ca2034f7ea9d11d]"
time="2022-01-19T09:39:08Z" level=error
time="2022-01-19T09:39:08Z" level=error msg="Error: InvalidVpcID.NotFound: The vpc ID 'vpc-0c9b3c27047567519' does not exist"
time="2022-01-19T09:39:08Z" level=error msg="\tstatus code: 400, request id: 93713120-b081-4fdf-b7f9-35754aea8d31"

What did you expect to happen?

Installer creates the VPC. It should certainly be able to find what it itself just created. --> Successful install

How to reproduce it (as minimally and precisely as possible)?

It is random and rare. The flow seems to be:

1. Installer creates a thing
2. AWS creates it
3. AWS says it doesn't exist
4. Terraform dies
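For anyone who wants to poke at the same read-after-write window directly, a minimal sketch with the AWS CLI (these exact commands are an illustration, not how the installer does it; assumes configured credentials, and since the race is rare the immediate describe will usually succeed):

$ VPC_ID=$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 --query 'Vpc.VpcId' --output text)
$ aws ec2 describe-vpcs --vpc-ids "$VPC_ID"   # an immediate read-after-write can occasionally fail with InvalidVpcID.NotFound
$ aws ec2 delete-vpc --vpc-id "$VPC_ID"       # clean up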
> I believe this is a variation of Bug 2033256 and Bug 2032521 Yes, this is another case of eventual consistency issues with the AWS terraform provider. This will be addressed in 4.11 with the upgrade to the latest terraform provider.
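In the meantime, the general mitigation pattern is to wait for the new ID to become describable before creating dependent resources, which is presumably roughly what the provider fix does internally with retries. A sketch using the AWS CLI's built-in EC2 waiter ($VPC_ID is a placeholder):

$ aws ec2 wait vpc-exists --vpc-ids "$VPC_ID"   # polls DescribeVpcs until the ID resolves, or times out
# ...only then proceed with dependent resources (DHCP options, subnets, ...)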
Can we get any more specifics on the failure rate here? "It is random and rare"... rare sounds good, since that suggests an edge case, but more concrete data (x out of y runs, bursty or not) would be helpful.
Looking for "InvalidVpcID.NotFound" in CI over the past 2 days, it is very rare, with only a single one of our many, many runs hitting this issue:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=build-log&search=InvalidVpcID.NotFound' | grep 'failures match'
pull-ci-openshift-elasticsearch-operator-release-5.2-e2e-upgrade (all) - 35 runs, 100% failed, 3% of failures match = 3% impact
$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=build-log&search=InvalidVpcID.NotFound' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_elasticsearch-operator/829/pull-ci-openshift-elasticsearch-operator-release-5.2-e2e-upgrade/1485563070087434240

That might be rare enough that we can drop severity below high. Although my impression is that frequency will depend on how fast AWS is able to reconcile eventual consistency on their end, which can vary by day and by region/zone, so "try a new install right now" might keep failing until AWS recovers from whatever is causing their elevated reconciliation delay.

Stretching back to 6 days:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=144h&type=build-log&search=InvalidVpcID.NotFound' | jq -r 'keys[]'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-driver-toolkit-release-4.8-e2e-aws-driver-toolkit/1483831736272949248
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-knative-eventing-kafka-release-v1.0-47-e2e-aws-ocp-47-continuous/1483590265674403840
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.6-e2e-aws-serial/1484017049284907008
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-fips/1484462827933536256
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-rh-ecosystem-edge-ci-artifacts-master-4.9-gpu-operator-e2e-17x/1484662237464367104
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/codeready-toolchain_member-operator/327/pull-ci-codeready-toolchain-member-operator-master-e2e/1484552929967869952
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/kubevirt_hyperconverged-cluster-operator/1721/pull-ci-kubevirt-hyperconverged-cluster-operator-main-okd-hco-e2e-upgrade-index-aws/1483828661768425472
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/1264/pull-ci-openshift-cluster-network-operator-release-4.9-e2e-aws-sdn-multi/1483928751006814208
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_elasticsearch-operator/829/pull-ci-openshift-elasticsearch-operator-release-5.2-e2e-upgrade/1485563070087434240
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/917/pull-ci-openshift-ovn-kubernetes-master-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1483908332463853568
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/25388/rehearse-25388-pull-ci-openshift-windows-machine-config-operator-release-4.9-aws-e2e-upgrade/1483895059630788608
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/redhat-openshift-ecosystem_community-operators-prod/642/pull-ci-redhat-openshift-ecosystem-community-operators-prod-main-4.9-deploy-operator-on-openshift/1483550101199654912

I haven't dug in to see if that increased prevalence (which is still rare vs. our overall job volume) is clustered around a specific time, or if it was just that we run more jobs during the work week than we do on weekends.
Yeah, I'm seeing that at a ~0.2% failure rate for our CI runs over the past week. It is probably spiky, but doesn't seem all that high from this data point.
This was fixed upstream with https://github.com/hashicorp/terraform-provider-aws/commit/ba949c9b7c72d9ebccd1357ca0683ab8636a538e.
This bug seems to have been fixed indirectly by the terraform-provider-aws bump in https://github.com/openshift/installer/pull/5666. Confirmed that the upstream fix is included in the current terraform-provider-aws version. CI search shows no results for the past 2 days, but longer searches are timing out. Moving to MODIFIED for QE verification.
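For the record, one way to confirm the upstream commit is contained in a given provider version, assuming a local clone of terraform-provider-aws ($PROVIDER_TAG is a placeholder for whatever version the installer now vendors):

$ git -C terraform-provider-aws fetch --tags
$ git -C terraform-provider-aws merge-base --is-ancestor ba949c9b7c72d9ebccd1357ca0683ab8636a538e "$PROVIDER_TAG" && echo 'fix included'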
No errors found in CI logs for the past 7 days.
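For reference, that corresponds to an empty result from a search like the ones above (the exact query used is an assumption; 168h covers 7 days):

$ curl -s 'https://search.ci.openshift.org/search?maxAge=168h&type=build-log&search=InvalidVpcID.NotFound' | jq -r 'keys[]'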
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069