Version:
$ openshift-install version
OpenShift Installer 4.5.18
Built from commit a1f43445e365d186c3359c43961fa8974251edc0

Platform: aws (IPI)

What happened?
The Terraform installer fails while updating the load balancer Target Group with this error:

level=error msg="Error: error updating LB Target Group (arn:aws:elasticloadbalancing:ap-south-1:295635262768:targetgroup/vmyiameockhwbaybnkpx-wtqx6-aint/d7ed6ed34fc3410e) tags: error tagging resource (arn:aws:elasticloadbalancing:ap-south-1:295635262768:targetgroup/vmyiameockhwbaybnkpx-wtqx6-aint/d7ed6ed34fc3410e): TargetGroupNotFound: One or more target groups not found"
level=error msg="\tstatus code: 400, request id: fc13a53b-efb9-4bd5-b31a-bf21c627a7d0"

It consistently happens in a fresh installation scenario. I'm attaching the installation logs.

What did you expect to happen?
The installation to succeed.
This is continuing to happen in CI, including on v4.8:
https://search.ci.openshift.org/?search=error+updating+LB+Target+Group+&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Sample recent job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1388317338285117440

level=info msg=Credentials loaded from the "default" profile in file "/var/run/secrets/ci.openshift.io/cluster-profile/.awscred"
level=info msg=Creating infrastructure resources...
level=error
level=error msg=Error: error updating LB Target Group (arn:aws:elasticloadbalancing:us-east-1:460538899914:targetgroup/ci-op-dvl7sy9d-799a1-8l4xl-aext/b5810b441a33e0ea) tags: error tagging resource (arn:aws:elasticloadbalancing:us-east-1:460538899914:targetgroup/ci-op-dvl7sy9d-799a1-8l4xl-aext/b5810b441a33e0ea): TargetGroupNotFound: Target groups 'arn:aws:elasticloadbalancing:us-east-1:460538899914:targetgroup/ci-op-dvl7sy9d-799a1-8l4xl-aext/b5810b441a33e0ea' not found
level=error msg= status code: 400, request id: 9d18e641-9b76-4684-9e68-023b3369545f
level=error
level=error msg= on ../tmp/openshift-install-073131869/vpc/master-elb.tf line 71, in resource "aws_lb_target_group" "api_external":
level=error msg= 71: resource "aws_lb_target_group" "api_external" {
I believe this is a bug in the upstream terraform-provider-aws: resources are created and then acted upon before AWS has fully propagated them. This appears to be fixed in provider version 3.22.0 (December 18, 2020) by PR 16808. The solution here is to upgrade our Terraform provider to that version or newer. This work is in progress, tracked in CORS-1511.

https://github.com/hashicorp/terraform-provider-aws/blob/master/CHANGELOG.md#3220-december-18-2020
https://github.com/hashicorp/terraform-provider-aws/pull/16808
https://issues.redhat.com/browse/CORS-1511
Still waiting for the Terraform upgrade.
Bug confirmed in OCP 4.9.5 too:

level=info msg=Credentials loaded from the "default" profile in file "/home/ec2-user/.aws/credentials"
level=info msg=Creating infrastructure resources...
level=error
level=error msg=Error: error updating LB Target Group (arn:aws:elasticloadbalancing:us-east-2:304692911362:targetgroup/myocp-jr9c8-sint/45fcd9d192da8731) tags: error tagging resource (arn:aws:elasticloadbalancing:us-east-2:304692911362:targetgroup/myocp-jr9c8-sint/45fcd9d192da8731): TargetGroupNotFound: Target groups 'arn:aws:elasticloadbalancing:us-east-2:304692911362:targetgroup/myocp-jr9c8-sint/45fcd9d192da8731' not found
level=error msg= status code: 400, request id: e43dcf16-3402-4078-8225-fcf465c7953e
level=error
level=error msg= on ../../tmp/openshift-install-cluster-014210554/vpc/master-elb.tf line 99, in resource "aws_lb_target_group" "services":
level=error msg= 99: resource "aws_lb_target_group" "services" {
level=error
level=error
level=fatal msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply Terraform: failed to complete the change
Created attachment 1843703 [details]
Spike in this error from CI over the last 2 weeks

Major increase in occurrences of this in CI starting the afternoon of Wed Nov 24. Presumably it must be on the AWS side, since we're hitting it in multiple releases rather than after anything we merged. The screenshot is from https://search.ci.openshift.org/chart?search=TargetGroupNotFound&maxAge=168h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job but that page will lose the data in a few weeks and won't look the same, so attaching the screenshot for posterity and comparison.
Clarification of my comment above: I misread the graph a bit, and it looks like we may be missing some data, so I cannot say for sure this picked up on Nov 24; the data before that date appears to be partially missing. The problem is, however, happening quite often, and is capable of taking out expensive 10x aggregated jobs.
This will be resolved when the aws terraform provider is separated from the installer and updated.
There is no upstream fix for this. If this issue persists after we upgrade to the latest terraform provider, then we will need to contribute a fix upstream.
We believe this has been fixed by the recent upgrade of the AWS Terraform provider. We will attempt to verify that this error is no longer occurring in master CI runs. Once we determine that it is no longer occurring in master, we will close this BZ.
Looking through CI, this BZ appears to be fixed. There are no occurrences in master, all occurrences are in earlier branches.
While this BZ may have been fixed earlier than https://github.com/openshift/installer/pull/5666, that PR introduced the AWS provider in the "pattern" we use now (embedding it locally rather than pulling from the public registry). The AWS Terraform provider has also been updated in subsequent follow-up PRs. The upstream issue is https://github.com/hashicorp/terraform-provider-aws/issues/16860.
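For context, in a standalone Terraform configuration (as opposed to the installer's locally embedded provider), picking up the fixed release is just a matter of pinning the provider version. This is illustrative only, not how the installer itself consumes the provider:

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      # 3.22.0 is the first release containing the retry fix from PR 16808
      version = ">= 3.22.0"
    }
  }
}
```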
Searched the CI logs [1]; no errors found in 4.11.

[1] https://search.ci.openshift.org/?search=Error%3A+error+updating+LB+Target+Group&maxAge=168h&context=10&type=build-log&name=.%2B4%5C.11.%2Baws.%2B&excludeName=.%2Bupgrade.%2B&maxMatches=5&maxBytes=20971520&groupBy=job
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069