Bug 1745196

Summary: AWS installs occasionally fail due to S3 bucket race: [bucket] produced an unexpected new value for was present, but now absent
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Installer
Assignee: W. Trevor King <wking>
Installer sub component: openshift-installer
QA Contact: Johnny Liu <jialiu>
Status: CLOSED ERRATA
Docs Contact:
Severity: low
Priority: unspecified
CC: adahiya, hongkliu, jialiu, jlebon, kgarriso, lsm5, obulatov, pmuller, shlao, surbania, wwurzbac
Version: 4.2.0
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The AWS Terraform provider vendored by the installer would occasionally race S3's eventual consistency and get confused.
Consequence: Installation would fail with: When applying changes to module.bootstrap.aws_s3_bucket.ignition, provider "aws" produced an unexpected new value for was present, but now absent.
Fix: The installer has vendored improved AWS Terraform provider code, which now handles S3 eventual consistency robustly.
Result: Installer-provisioned AWS installs no longer flake on "unexpected new value for was present, but now absent".
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-07-13 17:11:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1752313

Description W. Trevor King 2019-08-23 20:56:43 UTC
We have an ~0.3% (3 of the ~1000 .*aws.* jobs we've run in the past 24 hours [1]) rate of hitting errors like [2]:

level=error msg="Error: Provider produced inconsistent result after apply"
level=error
level=error msg="When applying changes to module.bootstrap.aws_s3_bucket.ignition, provider"
level=error msg="\"aws\" produced an unexpected new value for was present, but now absent."
level=error
level=error msg="This is a bug in the provider, which should be reported in the provider's own"
level=error msg="issue tracker."
level=error
level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply using Terraform" 

It's being tracked upstream in [3], and I have a ticket open with AWS to explain the inconsistency.

[1]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?name=aws&search=produced%20an%20unexpected%20new%20value%20for%20was%20present,%20but%20now%20absent
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6097
[3]: https://github.com/terraform-providers/terraform-provider-aws/issues/9725
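
For illustration only (this is not installer or provider code, and the bucket name is a placeholder): a minimal AWS SDK for Go v1 sketch that makes the race observable. A HeadBucket issued right after CreateBucket can still return 404 for a short window, which the vendored provider misreads as the bucket having disappeared.

// Standalone sketch, not installer or provider code (AWS SDK for Go v1): create a
// bucket, then immediately poll HeadBucket and report how long it takes to become
// visible. A read in that window can return 404, which the old provider reported
// as "was present, but now absent".
package main

import (
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Placeholder name; the installer's real ignition bucket naming is not shown here.
	bucket := "example-ignition-bucket"

	svc := s3.New(session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1"))))

	if _, err := svc.CreateBucket(&s3.CreateBucketInput{Bucket: aws.String(bucket)}); err != nil {
		log.Fatalf("create bucket: %v", err)
	}
	start := time.Now()

	for {
		_, err := svc.HeadBucket(&s3.HeadBucketInput{Bucket: aws.String(bucket)})
		if err == nil {
			log.Printf("bucket visible after %s", time.Since(start))
			return
		}
		if reqErr, ok := err.(awserr.RequestFailure); ok && reqErr.StatusCode() == 404 {
			// Eventual consistency: the bucket exists, but this read can't see it yet.
			time.Sleep(500 * time.Millisecond)
			continue
		}
		log.Fatalf("head bucket: %v", err)
	}
}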

Comment 1 Abhinav Dahiya 2019-09-16 15:58:32 UTC
*** Bug 1752355 has been marked as a duplicate of this bug. ***

Comment 9 Abhinav Dahiya 2019-11-25 16:42:47 UTC
*** Bug 1776423 has been marked as a duplicate of this bug. ***

Comment 10 Lokesh Mandvekar 2019-11-25 18:10:40 UTC
Some SSH-related issues in the latest aws-fips-4.3 run at https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.3/588:

Lease acquired, installing...
Installing from release registry.svc.ci.openshift.org/ocp/release:4.3.0-0.nightly-2019-11-25-153929
level=warning msg="Found override for release image. Please be warned, this is not advised"
level=info msg="Consuming Install Config from target directory"
level=info msg="Creating infrastructure resources..."
level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-00k7xrfx-3fb9c.origin-ci-int-aws.dev.rhcloud.com:6443..."
level=error msg="Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get https://api.ci-op-00k7xrfx-3fb9c.origin-ci-int-aws.dev.rhcloud.com:6443/apis/config.openshift.io/v1/clusteroperators: dial tcp 3.214.115.89:6443: connect: connection refused"
level=info msg="Pulling debug logs from the bootstrap machine"
level=error msg="Attempted to gather debug logs after installation failure: failed to create SSH client, ensure the proper ssh key is in your keyring or specify with --key: dial tcp 18.207.205.70:22: connect: connection refused"
level=fatal msg="Bootstrap failed to complete: waiting for Kubernetes API: context deadline exceeded"

Comment 12 Abhinav Dahiya 2020-02-24 21:52:14 UTC
This was fixed in 4.5 when we bumped the provider version to 2.49.0 in https://github.com/openshift/installer/pull/3140, which was a fix for https://bugzilla.redhat.com/show_bug.cgi?id=1766691.
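
For context only (this is not the actual upstream change in provider 2.49.0, and waitForBucket is an illustrative name): the fix amounts to retrying transient post-create 404s instead of treating them as the resource vanishing. A rough sketch of that pattern with the Terraform plugin SDK's resource.Retry helper (package path as of plugin SDK v1):

// Sketch of the retry-on-404 shape around post-create S3 reads; not the upstream diff.
package sketch

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/hashicorp/terraform-plugin-sdk/helper/resource"
)

// waitForBucket blocks until the bucket is readable, retrying the transient 404s
// that S3's eventual consistency can return right after CreateBucket.
func waitForBucket(conn *s3.S3, name string) error {
	return resource.Retry(2*time.Minute, func() *resource.RetryError {
		_, err := conn.HeadBucket(&s3.HeadBucketInput{Bucket: aws.String(name)})
		if reqErr, ok := err.(awserr.RequestFailure); ok && reqErr.StatusCode() == 404 {
			return resource.RetryableError(err) // not visible yet; keep polling
		}
		if err != nil {
			return resource.NonRetryableError(err) // real failure; stop
		}
		return nil // bucket is consistently readable
	})
}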

Comment 16 Johnny Liu 2020-03-02 10:23:45 UTC
Ignore comment 15; it was a copy/paste mistake.

Searched the past 7 days' logs (https://search.svc.ci.openshift.org/?search=produced+an+unexpected+new+value+for+was+present&maxAge=168h&context=1&type=all) and found no similar errors. Moving this bug to verified.

Comment 18 Abhinav Dahiya 2020-03-27 15:56:10 UTC
As you can see from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/23308#1:build-log.txt%3A49

` Installing from initial release registry.svc.ci.openshift.org/ocp/release:4.4.0-rc.4`

the installer being used in that job is 4.4.0-rc.4, which does not have the fix; we merged the fix only into 4.5 (master).

So I do not think this bug should be re-opened.

Comment 19 Johnny Liu 2020-03-30 02:13:11 UTC
Thanks to Abhinav for the explanation.

Comment 21 errata-xmlrpc 2020-07-13 17:11:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409