Bug 1745196 - AWS installs occasionally fail due to S3 bucket race: [bucket] produced an unexpected new value for was present, but now absent
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.5.0
Assignee: W. Trevor King
QA Contact: Johnny Liu
URL:
Whiteboard:
Duplicates: 1752355 1776423
Depends On:
Blocks: 1752313
Reported: 2019-08-23 20:56 UTC by W. Trevor King
Modified: 2020-07-13 17:11 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The AWS Terraform provider vendored by the installer would occasionally race S3's eventual consistency and misread a newly created bucket as absent.
Consequence: Installation would fail with: "When applying changes to module.bootstrap.aws_s3_bucket.ignition, provider "aws" produced an unexpected new value for was present, but now absent."
Fix: The installer has vendored improved AWS Terraform provider code, which now robustly handles S3 eventual consistency.
Result: Installer-provisioned AWS no longer flakes on "unexpected new value for was present, but now absent".
Clone Of:
Environment:
Last Closed: 2020-07-13 17:11:28 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github terraform-providers terraform-provider-aws issues 9725 'None' open aws_s3_bucket_object inconsistently returning "Provider produced inconsistent result after apply" 2020-07-06 20:51:57 UTC
Github terraform-providers terraform-provider-aws pull 11894 None closed resource/aws_s3_bucket: Retry read after creation for 404 status code 2020-07-06 20:51:57 UTC
Red Hat Knowledge Base (Solution) 4562991 None None None 2019-11-07 21:34:49 UTC
Red Hat Product Errata RHBA-2020:2409 None None None 2020-07-13 17:11:52 UTC
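
The upstream pull request listed above is titled "Retry read after creation for 404 status code". The following is a minimal Python sketch of that general pattern, not the provider's actual Go code: `NotFound`, `read_after_create`, and `FlakyStore` are hypothetical names invented for illustration, and the toy store only imitates an eventually consistent backend whose first few reads miss.

```python
import time


class NotFound(Exception):
    """Hypothetical stand-in for a 404 response from the storage API."""


def read_after_create(read, attempts=5, delay=0.0):
    """Call read() until it stops raising NotFound or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return read()
        except NotFound:
            if attempt == attempts - 1:
                raise  # still absent after the final attempt
            time.sleep(delay)  # give the backend time to converge


class FlakyStore:
    """Toy eventually consistent store: the first `lag` reads miss."""

    def __init__(self, lag):
        self.lag = lag
        self.reads = 0

    def read(self):
        self.reads += 1
        if self.reads <= self.lag:
            raise NotFound("bucket not visible yet")
        return {"bucket": "ignition"}


store = FlakyStore(lag=2)
result = read_after_create(store.read)
```

Without the retry, the first `NotFound` would surface immediately, which is roughly the shape of the "was present, but now absent" failure: the create succeeded, but the follow-up read did not see the bucket yet.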

Description W. Trevor King 2019-08-23 20:56:43 UTC
We have an ~0.3% (3 of the ~1000 .*aws.* jobs we've run in the past 24 hours [1]) rate of hitting errors like [2]:

level=error msg="Error: Provider produced inconsistent result after apply"
level=error
level=error msg="When applying changes to module.bootstrap.aws_s3_bucket.ignition, provider"
level=error msg="\"aws\" produced an unexpected new value for was present, but now absent."
level=error
level=error msg="This is a bug in the provider, which should be reported in the provider's own"
level=error msg="issue tracker."
level=error
level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply using Terraform" 

It's being tracked upstream in [3], and I have a ticket open with AWS to explain the inconsistency.

[1]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?name=aws&search=produced%20an%20unexpected%20new%20value%20for%20was%20present,%20but%20now%20absent
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6097
[3]: https://github.com/terraform-providers/terraform-provider-aws/issues/9725
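
As a rough illustration of how the rate above can be estimated, the sketch below flags a build log by the same signature string used in the CI search link [1] and computes the fraction of affected logs. The function names and the in-memory list of logs are assumptions for the example; the real numbers came from the CI search service, not from code like this.

```python
# Signature string taken from the search query in [1].
SIGNATURE = "produced an unexpected new value for was present, but now absent"


def hit_race(log_text):
    """Return True if the installer log contains the S3 race signature."""
    return SIGNATURE in log_text


def failure_rate(logs):
    """Fraction of logs that contain the signature (0.0 for no logs)."""
    logs = list(logs)
    if not logs:
        return 0.0
    return sum(hit_race(log) for log in logs) / len(logs)


failing = (
    'level=error msg="\\"aws\\" produced an unexpected new value for '
    'was present, but now absent."'
)
passing = 'level=info msg="Install complete!"'

# 3 affected jobs out of 1000, mirroring the figures in the description.
rate = failure_rate([failing] * 3 + [passing] * 997)
```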

Comment 1 Abhinav Dahiya 2019-09-16 15:58:32 UTC
*** Bug 1752355 has been marked as a duplicate of this bug. ***

Comment 9 Abhinav Dahiya 2019-11-25 16:42:47 UTC
*** Bug 1776423 has been marked as a duplicate of this bug. ***

Comment 10 Lokesh Mandvekar 2019-11-25 18:10:40 UTC
Some SSH-related issues in the latest aws-fips-4.3 run at https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.3/588

Lease acquired, installing...
Installing from release registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-11-25-153929
level=warning msg="Found override for release image. Please be warned, this is not advised"
level=info msg="Consuming Install Config from target directory"
level=info msg="Creating infrastructure resources..."
level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-00k7xrfx-3fb9c.origin-ci-int-aws.dev.rhcloud.com:6443..."
level=error msg="Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get https://api.ci-op-00k7xrfx-3fb9c.origin-ci-int-aws.dev.rhcloud.com:6443/apis/config.openshift.io/v1/clusteroperators: dial tcp 3.214.115.89:6443: connect: connection refused"
level=info msg="Pulling debug logs from the bootstrap machine"
level=error msg="Attempted to gather debug logs after installation failure: failed to create SSH client, ensure the proper ssh key is in your keyring or specify with --key: dial tcp 18.207.205.70:22: connect: connection refused"
level=fatal msg="Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded"

Comment 12 Abhinav Dahiya 2020-02-24 21:52:14 UTC
This was fixed in 4.5 when we bumped the provider version to 2.49.0 in https://github.com/openshift/installer/pull/3140, which was a fix for https://bugzilla.redhat.com/show_bug.cgi?id=1766691

Comment 16 Johnny Liu 2020-03-02 10:23:45 UTC
Ignore comment 15; it was a copy/paste mistake.

I searched the past 7 days of logs at https://search.svc.ci.openshift.org/?search=produced+an+unexpected+new+value+for+was+present&maxAge=168h&context=1&type=all and found no similar errors. Moving this bug to verified.
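
For reference, the search URL in the comment above can be rebuilt from its parts. This sketch only reproduces the query parameters visible in that URL (`search`, `maxAge`, `context`, `type`); reading `maxAge=168h` as "7 days in hours" is an inference from the comment, and `search_url` is a hypothetical helper, not part of any CI tooling.

```python
from urllib.parse import urlencode


def search_url(query, days=7, base="https://search.svc.ci.openshift.org/"):
    """Build a CI search URL with the parameters seen in the comment."""
    params = {
        "search": query,
        "maxAge": f"{days * 24}h",  # assumed: window expressed in hours
        "context": 1,
        "type": "all",
    }
    return base + "?" + urlencode(params)


url = search_url("produced an unexpected new value for was present")
```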

Comment 18 Abhinav Dahiya 2020-03-27 15:56:10 UTC
As you can see from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23308#1:build-log.txt%3A49

`Installing from initial release registry.svc.ci.openshift.org/ocp/release:4.4.0-rc.4`

the installer being used in that job is 4.4.0-rc.4, which doesn't have the fix. We merged the fix only to 4.5 (master).

So I do not think this bug should be re-opened.
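
The check described in this comment can be sketched as follows: pull the major.minor version out of an "Installing from ... release:" log line and compare it with 4.5, the first release stated to carry the fix. The helper names and the regular expression are assumptions for illustration, and the comparison is deliberately coarse: a 4.5 nightly built before the fix merged would still be reported as fixed.

```python
import re

# First release stated in this bug to contain the fix.
FIXED_IN = (4, 5)


def release_minor(line):
    """Extract (major, minor) from a 'release:X.Y.Z...' tag, or None."""
    match = re.search(r"release:(\d+)\.(\d+)\.", line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def has_fix(line):
    """True if the installer release on this log line is 4.5 or later."""
    version = release_minor(line)
    return version is not None and version >= FIXED_IN


line = (
    "Installing from initial release "
    "registry.svc.ci.openshift.org/ocp/release:4.4.0-rc.4"
)
fixed = has_fix(line)
```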

Comment 19 Johnny Liu 2020-03-30 02:13:11 UTC
Thanks for Abhinav's explanation.

Comment 21 errata-xmlrpc 2020-07-13 17:11:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

