Bug 1745196

Summary: AWS installs occasionally fail due to S3 bucket race: [bucket] produced an unexpected new value for was present, but now absent
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Installer
Assignee: W. Trevor King <wking>
Installer sub component: openshift-installer
QA Contact: Johnny Liu <jialiu>
Status: CLOSED ERRATA
Docs Contact:
Severity: low
Priority: unspecified
CC: adahiya, hongkliu, jialiu, jlebon, kgarriso, lsm5, obulatov, pmuller, shlao, surbania, wwurzbac
Version: 4.2.0
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The AWS Terraform provider vendored by the installer would occasionally race S3's eventual consistency and get confused.
Consequence: Installation would fail with: When applying changes to module.bootstrap.aws_s3_bucket.ignition, provider "aws" produced an unexpected new value for was present, but now absent.
Fix: The installer has vendored improved AWS Terraform provider code, which now handles S3 eventual consistency robustly.
Result: Installer-provisioned AWS installs no longer flake on "unexpected new value for was present, but now absent".
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-07-13 17:11:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1752313

Description W. Trevor King 2019-08-23 20:56:43 UTC
We have an ~0.3% (3 of the ~1000 .*aws.* jobs we've run in the past 24 hours [1]) rate of hitting errors like [2]:

level=error msg="Error: Provider produced inconsistent result after apply"
level=error
level=error msg="When applying changes to module.bootstrap.aws_s3_bucket.ignition, provider"
level=error msg="\"aws\" produced an unexpected new value for was present, but now absent."
level=error
level=error msg="This is a bug in the provider, which should be reported in the provider's own"
level=error msg="issue tracker."
level=error
level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply using Terraform" 

It's being tracked upstream in [3], and I have a ticket open with AWS to explain the inconsistency.

[1]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?name=aws&search=produced%20an%20unexpected%20new%20value%20for%20was%20present,%20but%20now%20absent
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6097
[3]: https://github.com/terraform-providers/terraform-provider-aws/issues/9725
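
For illustration only (this is not installer or provider code, and the bucket name is a placeholder): a minimal AWS SDK for Go v1 sketch that makes the race observable. A HeadBucket issued right after CreateBucket can still return 404 for a short window, which the vendored provider misreads as the bucket having disappeared.

// Standalone sketch, not installer or provider code (AWS SDK for Go v1): create a
// bucket, then immediately poll HeadBucket and report how long it takes to become
// visible. A read in that window can return 404, which the old provider reported
// as "was present, but now absent".
package main

import (
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Placeholder name; the installer's real ignition bucket naming is not shown here.
	bucket := "example-ignition-bucket"

	svc := s3.New(session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1"))))

	if _, err := svc.CreateBucket(&s3.CreateBucketInput{Bucket: aws.String(bucket)}); err != nil {
		log.Fatalf("create bucket: %v", err)
	}
	start := time.Now()

	for {
		_, err := svc.HeadBucket(&s3.HeadBucketInput{Bucket: aws.String(bucket)})
		if err == nil {
			log.Printf("bucket visible after %s", time.Since(start))
			return
		}
		if reqErr, ok := err.(awserr.RequestFailure); ok && reqErr.StatusCode() == 404 {
			// Eventual consistency: the bucket exists, but this read can't see it yet.
			time.Sleep(500 * time.Millisecond)
			continue
		}
		log.Fatalf("head bucket: %v", err)
	}
}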

Comment 1 Abhinav Dahiya 2019-09-16 15:58:32 UTC
*** Bug 1752355 has been marked as a duplicate of this bug. ***

Comment 9 Abhinav Dahiya 2019-11-25 16:42:47 UTC
*** Bug 1776423 has been marked as a duplicate of this bug. ***

Comment 10 Lokesh Mandvekar 2019-11-25 18:10:40 UTC
Some SSH-related issues in the latest aws-fips-4.3 run at https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.3/588:

Lease acquired, installing...
Installing from release registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-11-25-153929
level=warning msg="Found override for release image. Please be warned, this is not advised"
level=info msg="Consuming Install Config from target directory"
level=info msg="Creating infrastructure resources..."
level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-00k7xrfx-3fb9c.origin-ci-int-aws.dev.rhcloud.com:6443..."
level=error msg="Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get https://api.ci-op-00k7xrfx-3fb9c.origin-ci-int-aws.dev.rhcloud.com:6443/apis/config.openshift.io/v1/clusteroperators: dial tcp 3.214.115.89:6443: connect: connection refused"
level=info msg="Pulling debug logs from the bootstrap machine"
level=error msg="Attempted to gather debug logs after installation failure: failed to create SSH client, ensure the proper ssh key is in your keyring or specify with --key: dial tcp 18.207.205.70:22: connect: connection refused"
level=fatal msg="Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded"

Comment 12 Abhinav Dahiya 2020-02-24 21:52:14 UTC
This was fixed in 4.5 when we bumped the provider version to 2.49.0 in https://github.com/openshift/installer/pull/3140, which was a fix for https://bugzilla.redhat.com/show_bug.cgi?id=1766691.
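
For context only (this is not the actual upstream change in provider 2.49.0, and waitForBucket is an illustrative name): the fix amounts to retrying transient post-create 404s instead of treating them as the resource vanishing. A rough sketch of that pattern with the Terraform plugin SDK's resource.Retry helper (package path as of plugin SDK v1):

// Sketch of the retry-on-404 shape around post-create S3 reads; not the upstream diff.
package sketch

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/hashicorp/terraform-plugin-sdk/helper/resource"
)

// waitForBucket blocks until the bucket is readable, retrying the transient 404s
// that S3's eventual consistency can return right after CreateBucket.
func waitForBucket(conn *s3.S3, name string) error {
	return resource.Retry(2*time.Minute, func() *resource.RetryError {
		_, err := conn.HeadBucket(&s3.HeadBucketInput{Bucket: aws.String(name)})
		if reqErr, ok := err.(awserr.RequestFailure); ok && reqErr.StatusCode() == 404 {
			return resource.RetryableError(err) // not visible yet; keep polling
		}
		if err != nil {
			return resource.NonRetryableError(err) // real failure; stop
		}
		return nil // bucket is consistently readable
	})
}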

Comment 16 Johnny Liu 2020-03-02 10:23:45 UTC
Ignore comment 15; it was a copy/paste mistake.

Searched the past 7 days' logs (https://search.svc.ci.openshift.org/?search=produced+an+unexpected+new+value+for+was+present&maxAge=168h&context=1&type=all) and found no similar errors. Moving this bug to verified.

Comment 18 Abhinav Dahiya 2020-03-27 15:56:10 UTC
As you can see from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23308#1:build-log.txt%3A49

` Installing from initial release registry.svc.ci.openshift.org/ocp/release:4.4.0-rc.4`

the installer being used in that job is 4.4.0-rc.4, which does not have the fix; we merged the fix only into 4.5 (master).

So I do not think this bug should be re-opened.

Comment 19 Johnny Liu 2020-03-30 02:13:11 UTC
Thanks to Abhinav for the explanation.

Comment 21 errata-xmlrpc 2020-07-13 17:11:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409