Version:
$ openshift-install version
OpenShift Installer 4.5.18
Built from commit a1f43445e365d186c3359c43961fa8974251edc0

Platform: aws (IPI)

What happened?
The Terraform installer fails while updating the load balancer Target Group with this error:

level=error msg="Error: error updating LB Target Group (arn:aws:elasticloadbalancing:ap-south-1:295635262768:targetgroup/vmyiameockhwbaybnkpx-wtqx6-aint/d7ed6ed34fc3410e) tags: error tagging resource (arn:aws:elasticloadbalancing:ap-south-1:295635262768:targetgroup/vmyiameockhwbaybnkpx-wtqx6-aint/d7ed6ed34fc3410e): TargetGroupNotFound: One or more target groups not found"
level=error msg="\tstatus code: 400, request id: fc13a53b-efb9-4bd5-b31a-bf21c627a7d0"

It consistently happens in a fresh installation scenario. I'm attaching the installation logs.

What did you expect to happen?
The installation to succeed.
This is continuing to happen in CI, including on v4.8:
https://search.ci.openshift.org/?search=error+updating+LB+Target+Group+&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Sample recent job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1388317338285117440

level=info msg=Credentials loaded from the "default" profile in file "/var/run/secrets/ci.openshift.io/cluster-profile/.awscred"
level=info msg=Creating infrastructure resources...
level=error
level=error msg=Error: error updating LB Target Group (arn:aws:elasticloadbalancing:us-east-1:460538899914:targetgroup/ci-op-dvl7sy9d-799a1-8l4xl-aext/b5810b441a33e0ea) tags: error tagging resource (arn:aws:elasticloadbalancing:us-east-1:460538899914:targetgroup/ci-op-dvl7sy9d-799a1-8l4xl-aext/b5810b441a33e0ea): TargetGroupNotFound: Target groups 'arn:aws:elasticloadbalancing:us-east-1:460538899914:targetgroup/ci-op-dvl7sy9d-799a1-8l4xl-aext/b5810b441a33e0ea' not found
level=error msg= status code: 400, request id: 9d18e641-9b76-4684-9e68-023b3369545f
level=error
level=error msg= on ../tmp/openshift-install-073131869/vpc/master-elb.tf line 71, in resource "aws_lb_target_group" "api_external":
level=error msg= 71: resource "aws_lb_target_group" "api_external" {
I believe this is a bug in the upstream terraform-provider-aws: resources are created and then acted upon before AWS has fully propagated them. This appears to be fixed in provider version 3.22.0 (December 18, 2020) by PR 16808. The solution here is to upgrade our Terraform provider to that version or newer. This work is in progress, tracked in CORS-1511.

https://github.com/hashicorp/terraform-provider-aws/blob/master/CHANGELOG.md#3220-december-18-2020
https://github.com/hashicorp/terraform-provider-aws/pull/16808
https://issues.redhat.com/browse/CORS-1511
Still waiting for the Terraform upgrade.
Bug confirmed in OCP 4.9.5 too:

level=info msg=Credentials loaded from the "default" profile in file "/home/ec2-user/.aws/credentials"
level=info msg=Creating infrastructure resources...
level=error
level=error msg=Error: error updating LB Target Group (arn:aws:elasticloadbalancing:us-east-2:304692911362:targetgroup/myocp-jr9c8-sint/45fcd9d192da8731) tags: error tagging resource (arn:aws:elasticloadbalancing:us-east-2:304692911362:targetgroup/myocp-jr9c8-sint/45fcd9d192da8731): TargetGroupNotFound: Target groups 'arn:aws:elasticloadbalancing:us-east-2:304692911362:targetgroup/myocp-jr9c8-sint/45fcd9d192da8731' not found
level=error msg= status code: 400, request id: e43dcf16-3402-4078-8225-fcf465c7953e
level=error
level=error msg= on ../../tmp/openshift-install-cluster-014210554/vpc/master-elb.tf line 99, in resource "aws_lb_target_group" "services":
level=error msg= 99: resource "aws_lb_target_group" "services" {
level=error
level=error
level=fatal msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change
Created attachment 1843703 [details]
Spike in this error from CI over the last 2 weeks

Major increase in occurrences of this in CI starting the afternoon of Wed Nov 24. Presumably it must be on the AWS side, since we're hitting it in multiple releases rather than after anything we merged. The screenshot is from https://search.ci.openshift.org/chart?search=TargetGroupNotFound&maxAge=168h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job but that page will lose the data in a few weeks and won't look the same, so attaching the screenshot for posterity and comparison.
Clarification of my comment above: I misread the graph a bit, and it looks like we may be missing some data, so I cannot say for sure this picked up on Nov 24; the data before that date appears to be partially missing. The problem is, however, happening quite often, and is capable of taking out expensive 10x aggregated jobs.
This will be resolved when the aws terraform provider is separated from the installer and updated.
There is no upstream fix for this. If this issue persists after we upgrade to the latest terraform provider, then we will need to contribute a fix upstream.
We believe this has been fixed by the recent upgrade of the AWS Terraform provider. We will attempt to verify that this error is no longer occurring in master CI runs. Once we determine that it is no longer occurring in master, we will close this BZ.
Looking through CI, this BZ appears to be fixed. There are no occurrences in master, all occurrences are in earlier branches.
While this BZ may have been fixed earlier than https://github.com/openshift/installer/pull/5666, that PR introduced the AWS provider in the "pattern" we use now (embedding it locally rather than pulling from the public registry). The AWS Terraform provider has also been updated in subsequent follow-up PRs. The upstream issue is https://github.com/hashicorp/terraform-provider-aws/issues/16860.
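For context, in a standalone Terraform configuration (as opposed to the installer's locally embedded provider), picking up the fixed release is just a matter of pinning the provider version. This is illustrative only, not how the installer itself consumes the provider:

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      # 3.22.0 is the first release containing the retry fix from PR 16808
      version = ">= 3.22.0"
    }
  }
}
```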
Searched the CI logs [1]; no errors found in 4.11.

[1] https://search.ci.openshift.org/?search=Error%3A+error+updating+LB+Target+Group&maxAge=168h&context=10&type=build-log&name=.%2B4%5C.11.%2Baws.%2B&excludeName=.%2Bupgrade.%2B&maxMatches=5&maxBytes=20971520&groupBy=job
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069