Bug 2032521 - openshift-installer intermittent failure on AWS with "Error: Provider produced inconsistent result after apply" when creating the aws_vpc_dhcp_options_association resource
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Matthew Staebler
QA Contact: Yunfei Jiang
URL:
Whiteboard:
Depends On:
Blocks: 2043590 2047390
Reported: 2021-12-14 15:58 UTC by Greg Sheremeta
Modified: 2022-03-10 16:34 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: After successfully creating an aws_vpc_dhcp_options_association resource, AWS may still report that the resource does not exist. In that case, the AWS terraform provider balks and fails the installation.
Consequence: Potential failed installation.
Fix: Retry the query of the aws_vpc_dhcp_options_association resource for a period of time after creation until AWS reports that the resource exists.
Result: Successful installations despite AWS reporting that the aws_vpc_dhcp_options_association resource does not exist for a period of time after it was created.
Clone Of:
: 2043590 (view as bug list)
Environment:
Last Closed: 2022-03-10 16:33:57 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift installer pull 5488 | 0 | None | Merged | Bug 2032521: vendor: address eventually consistency creating aws dhcp options associations | 2022-02-01 06:17:54 UTC
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-10 16:34:19 UTC

Description Greg Sheremeta 2021-12-14 15:58:10 UTC
$ openshift-install version
4.9.x

Platform: AWS -- OSD and ROSA, specifically

Please specify:
IPI

What happened?
Error: Provider produced inconsistent result after apply

What did you expect to happen?
Successful install

How to reproduce it (as minimally and precisely as possible)?
It is random and rare

Flow seems to be:
1 Installer creates a thing
2 AWS creates it
3 AWS says it doesn't exist
4 Terraform dies

Comment 2 Matthew Staebler 2021-12-15 02:52:36 UTC
Since the upgrade of the aws terraform provider used by the installer to v3.31.0, the percentage of CI runs that have failed due to inconsistent results for aws_vpc_dhcp_options_association resources has increased from <1% to 8-10%.

A fix for this was added to the aws terraform provider in version v3.35.0 [1]. Unfortunately, we cannot update the installer to use any version beyond v3.31.0 due to being limited to using v1 of the terraform plugin sdk.

[1] https://github.com/hashicorp/terraform-provider-aws/commit/8e0e9c74c82026876c27bded761ae626b5d05cbf

Comment 4 Greg Sheremeta 2021-12-15 23:49:59 UTC
> Since the upgrade of the aws terraform provider used by the installer to v3.31.0, the percentage of CI runs that have failed due to inconsistent results for aws_vpc_dhcp_options_association resources has increased from <1% to 8-10%.

We usually see a different resource in OSD failures:

      level=info msg=Creating infrastructure resources...
      level=error
      level=error msg=Error: Provider produced inconsistent result after apply
      level=error
      level=error msg=When applying changes to module.vpc.aws_route_table.private_routes[2],
      level=error msg=provider "registry.terraform.io/-/aws" produced an unexpected new value for
      level=error msg=was present, but now absent.

I think I've seen other ones, too. Is this the same bug?

Comment 5 Matthew Staebler 2021-12-16 01:09:17 UTC
Greg, no, this is not the same bug as the consistency problem with aws_route_table resources. Unfortunately, every resource needs its own separate fix. Since you did not specify which resource you were experiencing in the title or description of the BZ, I commandeered this BZ for the aws_vpc_dhcp_options_association resource, which is the most pressing issue for 4.10.

Comment 6 Greg Sheremeta 2021-12-16 11:31:27 UTC
ok, I spawned Bug 2033256 for module.vpc.aws_route_table.private_routes

If I see any others, I'll open individual bugs for each resource.

Comment 9 Nick Stielau 2022-01-24 22:22:56 UTC
Yanming, anything blocking reviewing this, or is it just lower in the queue?  Let us know what questions you have about verifying.

Comment 11 Scott Dodson 2022-01-26 03:31:58 UTC
Is this something that could be verified by OSD/ROSA if QE is unable to reproduce?

Comment 12 Yunfei Jiang 2022-01-26 06:02:00 UTC
I checked both QE's CI pipeline and Prow CI [1] and found no aws_vpc_dhcp_options_association resource creation errors.

@gshereme, @mstaeble I have some questions:
1. Per the discussion in comment 1, I do not see any error related to aws_vpc_dhcp_options_association. Why was it reported in this bug? Did I miss any information?
2. Per the bug description, `Platform: AWS -- OSD and ROSA, specifically`: does this mean it occurs more often on OSD/ROSA, or only on OSD/ROSA? Compared to the general installation method, is there any special deployment configuration in OSD?
3. Per the discussion in comment 1, module.vpc.aws_subnet.private_subnet creation errors were mentioned. Searching in CI [2], I still see lots of such errors, e.g. [3]. I am not sure whether this PR is trying to fix this issue.

[1] https://search.ci.openshift.org/?search=aws_vpc_dhcp_options_association&maxAge=168h&context=7&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
[2] https://search.ci.openshift.org/?search=module.vpc.aws_subnet.private_subnet&maxAge=168h&context=7&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
[3] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-csi-4.10/1486071548644167680

Comment 13 yasun 2022-01-26 07:56:58 UTC
We OCM QEs created 20+ OSD/ROSA clusters but could NOT reproduce the bug.

Comment 14 Greg Sheremeta 2022-01-26 13:44:53 UTC
> Is this something that could be verified by OSD/ROSA if QE is unable to reproduce?

No, it's impossible for us to reproduce as well. I think AWS needs to be having a bad day for it to happen.

Comment 17 Yunfei Jiang 2022-01-27 02:07:18 UTC
Greg, many thanks, very helpful information. 

Per comment 12 and comment 13, the issue has not been seen recently in OCP CI or OSD/ROSA, and it occurs very rarely (comment 0, comment 14). Also, no regression was found after this PR merged, so I am setting this to VERIFIED now. Feel free to re-open it if the error occurs again.

Comment 20 errata-xmlrpc 2022-03-10 16:33:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

