Bug 2032521
Summary: | openshift-installer intermittent failure on AWS with "Error: Provider produced inconsistent result after apply" when creating the aws_vpc_dhcp_options_association resource | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Greg Sheremeta <gshereme> | |
Component: | Installer | Assignee: | Matthew Staebler <mstaeble> | |
Installer sub component: | openshift-installer | QA Contact: | Yunfei Jiang <yunjiang> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | high | |||
Priority: | high | CC: | calfonso, cblecker, julim, mmasters, mstaeble, nstielau, wking, yunjiang | |
Version: | 4.9 | Keywords: | ServiceDeliveryBlocker | |
Target Milestone: | --- | |||
Target Release: | 4.10.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: |
Cause: After successfully creating an aws_vpc_dhcp_options_association resource, AWS may still report that the resource does not exist. In that case, the AWS Terraform provider balks and fails the installation.
Consequence: Potential failed installation.
Fix: Retry the query of the aws_vpc_dhcp_options_association resource for a period of time after creation until AWS reports that the resource exists.
Result: Successful installations despite AWS reporting that the aws_vpc_dhcp_options_association resource does not exist for a period of time after it was created.
|
Story Points: | --- | |
Clone Of: | ||||
: | 2043590 (view as bug list) | Environment: | ||
Last Closed: | 2022-03-10 16:33:57 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 2043590, 2047390 |
Description
Greg Sheremeta
2021-12-14 15:58:10 UTC
Since the upgrade of the AWS Terraform provider used by the installer to v3.31.0, the percentage of CI runs that have failed due to inconsistent results for aws_vpc_dhcp_options_association resources has increased from <1% to 8-10%. A fix for this was added to the AWS Terraform provider in version v3.35.0 [1]. Unfortunately, we cannot update the installer to use any version beyond v3.31.0 because the installer is limited to v1 of the Terraform plugin SDK.
[1] https://github.com/hashicorp/terraform-provider-aws/commit/8e0e9c74c82026876c27bded761ae626b5d05cbf
> Since the upgrade of the AWS Terraform provider used by the installer to v3.31.0, the percentage of CI runs that have failed due to inconsistent results for aws_vpc_dhcp_options_association resources has increased from <1% to 8-10%.
We usually see a different resource in OSD failures:
level=info msg=Creating infrastructure resources...
level=error
level=error msg=Error: Provider produced inconsistent result after apply
level=error
level=error msg=When applying changes to module.vpc.aws_route_table.private_routes[2],
level=error msg=provider "registry.terraform.io/-/aws" produced an unexpected new value for
level=error msg=was present, but now absent.
I think I've seen other ones, too. Is this the same bug?
Greg, no, this is not the same bug as the consistency problem with aws_route_table resources. Unfortunately, every resource needs its own separate fix. Since you did not specify in the title or description of the BZ which resource you were seeing failures for, I commandeered this BZ for the aws_vpc_dhcp_options_association resource, which is the most pressing issue for 4.10.
OK, I spawned Bug 2033256 for module.vpc.aws_route_table.private_routes. If I see any others, I'll open individual bugs for each resource.
Yanming, anything blocking reviewing this, or is it just lower in the queue? Let us know what questions you have about verifying.
Is this something that could be verified by OSD/ROSA if QE is unable to reproduce?
Checked both QE's CI pipeline and Prow CI [1]; no aws_vpc_dhcp_options_association resource creation errors were found. @gshereme, @mstaeble I have some questions:
1. Per the discussion in comment 1, I do not see any aws_vpc_dhcp_options_association related error. Why was it reported in this bug? Did I miss any information?
2. Per the bug description, `Platform: AWS -- OSD and ROSA, specifically`: does this mean it occurs more often on OSD/ROSA, or only on OSD/ROSA? Compared to the general installation method, are there any special deployment configurations in OSD?
3. Per the discussion in comment 1, module.vpc.aws_subnet.private_subnet creation errors were mentioned. Searching in CI [2], I still see lots of such errors, e.g. [3]. I am not sure whether this PR is trying to fix that issue.
[1] https://search.ci.openshift.org/?search=aws_vpc_dhcp_options_association&maxAge=168h&context=7&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
[2] https://search.ci.openshift.org/?search=module.vpc.aws_subnet.private_subnet&maxAge=168h&context=7&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
[3] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-csi-4.10/1486071548644167680
We OCM QEs created 20+ OSD/ROSA clusters but could NOT reproduce the bug.
> Is this something that could be verified by OSD/ROSA if QE is unable to reproduce?
No, it's impossible for us to reproduce as well. I think AWS needs to be having a bad day for it to happen.
Greg, many thanks, very helpful information. Per comment 12 and comment 13, the issue has not been found in OCP CI or OSD/ROSA recently, and it occurs very rarely (comment 0, comment 14). Also, no regression was found after this PR merged, so setting to VERIFIED now. Feel free to re-open it if the error occurs again.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:0056