Description of problem:

After installing a 4.7.11 cluster on AWS with STS enabled, I noticed a lot of failing AWS calls that appear to originate from the machine-api-operator. Looking in the CloudTrail logs I see that, every 3-4 minutes, there are 10+ calls to `DescribeInternetGateways` in the VPC of the deployed cluster. Half of these calls assume the role provided for the operator, something like:

"arn:aws:sts::<account_num>:assumed-role/delegate-admin-openshift-machine-api-aws-cloud-credentials/..."

These calls fail with the following response:

    "eventSource": "ec2.amazonaws.com",
    "eventName": "DescribeInternetGateways",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "elasticloadbalancing.amazonaws.com",
    "userAgent": "elasticloadbalancing.amazonaws.com",
    "errorCode": "Client.UnauthorizedOperation",
    "errorMessage": "You are not authorized to perform this operation."

The other half of these 10 or so calls make the same `DescribeInternetGateways` call for the same VPC using a service-linked role, and they succeed. In these instances the service-linked role making the request has an ARN like:

"arn": "arn:aws:sts::<account_num>:assumed-role/AWSServiceRoleForElasticLoadBalancing/ELASTICLOADBALANCING"

Version-Release number of selected component (if applicable):
OCP 4.7.11 with STS enabled

How reproducible:
100%

Steps to Reproduce:
1. Install an STS cluster on AWS using the credentials extracted with "oc adm release extract quay.io/openshift-release-dev/ocp-release:${version}.0-x86_64 --credentials-requests --cloud=aws"
2. Wait for the cluster installation to complete
3. Look in the AWS CloudTrail event history for events named "DescribeInternetGateways"

Actual results:
There are regularly occurring calls using the machine-api-operator's assumed credentials that are unauthorized and fail, each followed by an equivalent, successful call using a service-linked role.

Expected results:
There aren't any failing requests.

Additional info:
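For anyone repeating step 3 of the reproducer, the offending events can be picked out of exported CloudTrail records by their errorCode. A minimal sketch (field names taken from the event quoted above; the helper function name is mine, not part of any real tooling):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cloudTrailEvent captures only the record fields needed to
// classify a call; the JSON keys match the event quoted above.
type cloudTrailEvent struct {
	EventSource string `json:"eventSource"`
	EventName   string `json:"eventName"`
	ErrorCode   string `json:"errorCode"`
	UserAgent   string `json:"userAgent"`
}

// isFailedGatewayLookup reports whether a raw CloudTrail record is one of
// the unauthorized DescribeInternetGateways calls described in this bug.
// Successful calls (e.g. via the service-linked role) have no errorCode.
func isFailedGatewayLookup(raw []byte) (bool, error) {
	var ev cloudTrailEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return false, err
	}
	return ev.EventName == "DescribeInternetGateways" &&
		ev.ErrorCode == "Client.UnauthorizedOperation", nil
}

func main() {
	// Sample record built from the fields quoted in the description.
	rec := []byte(`{
		"eventSource": "ec2.amazonaws.com",
		"eventName": "DescribeInternetGateways",
		"errorCode": "Client.UnauthorizedOperation",
		"userAgent": "elasticloadbalancing.amazonaws.com"
	}`)
	failed, err := isFailedGatewayLookup(rec)
	fmt.Println(failed, err) // prints: true <nil>
}
```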
This seems to be a side effect of the RegisterTargets call we are making. While the call itself doesn't fail, it appears that internally AWS first attempts the VPC lookup using the provided STS credentials, and only then retries with its own internal credentials. This doesn't happen in the normal (non-STS) case, where the internal credentials are always used. There are two fixes I would like to make here. We will have to add the DescribeInternetGateways permission to prevent the failures. But in addition, we can check whether the health check targets are already registered and skip the RegisterTargets call when they are; that second change only makes the problem invisible and doesn't actually mitigate the underlying issue.
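The second fix (skip RegisterTargets when the targets are already attached) amounts to a set difference between desired and currently registered targets. A minimal sketch of that logic; the function names are hypothetical and "registered" would in practice be built from an elbv2 DescribeTargetHealth response, not hard-coded:

```go
package main

import "fmt"

// targetsToRegister returns the subset of desired target IDs that are not
// already registered with the target group. In the real controller the
// "registered" set would come from an elbv2 DescribeTargetHealth call.
func targetsToRegister(desired []string, registered map[string]bool) []string {
	var missing []string
	for _, id := range desired {
		if !registered[id] {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	registered := map[string]bool{"i-aaa": true}
	missing := targetsToRegister([]string{"i-aaa", "i-bbb"}, registered)
	if len(missing) == 0 {
		// Nothing to do: skipping RegisterTargets entirely avoids the
		// DescribeInternetGateways side effect described in this bug.
		return
	}
	// Otherwise a single RegisterTargets call would be issued for "missing".
	fmt.Println(missing) // prints: [i-bbb]
}
```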
Failed to verify.

Tried on clusterversion:
4.9.0-0.nightly-2021-06-20-015701
4.8.0-0.nightly-2021-06-19-005119

Cannot get a successfully installed STS cluster. After the installation fails, checking the cluster shows many pods in CrashLoopBackOff state; after waiting about 30 minutes, everything becomes normal.

$ oc get pods -n openshift-machine-api
NAME                                           READY   STATUS             RESTARTS   AGE
cluster-autoscaler-operator-686dbf5678-wk9fg   2/2     Running            14         60m
cluster-baremetal-operator-6cfd45886f-n85zb    2/2     Running            1          60m
machine-api-controllers-646456ff4-pwbzv        4/7     CrashLoopBackOff   53         51m
machine-api-operator-7c8d757d58-942zm          2/2     Running            1          60m

Machine-controller log:

2021-06-21T05:08:10.699599194Z F0621 05:08:10.699565       1 main.go:131] Error creating manager: Unauthorized
2021-06-21T05:08:10.699714879Z goroutine 1 [running]:
2021-06-21T05:08:10.699714879Z k8s.io/klog/v2.stacks(0xc000128001, 0xc0001b62a0, 0x50, 0xda)
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:1021 +0xb9
2021-06-21T05:08:10.699714879Z k8s.io/klog/v2.(*loggingT).output(0x35abbc0, 0xc000000003, 0x0, 0x0, 0xc000450cb0, 0x2b0a4ae, 0x7, 0x83, 0x0)
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:970 +0x191
2021-06-21T05:08:10.699714879Z k8s.io/klog/v2.(*loggingT).printf(0x35abbc0, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x227d29c, 0x1a, 0xc00048f020, 0x1, ...)
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:751 +0x191
2021-06-21T05:08:10.699714879Z k8s.io/klog/v2.Fatalf(...)
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:1509
2021-06-21T05:08:10.699714879Z main.main()
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/cmd/manager/main.go:131 +0x6ad
2021-06-21T05:08:10.699714879Z
2021-06-21T05:08:10.699714879Z goroutine 18 [chan receive]:
2021-06-21T05:08:10.699714879Z k8s.io/klog/v2.(*loggingT).flushDaemon(0x35abbc0)
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:1164 +0x8b
2021-06-21T05:08:10.699714879Z created by k8s.io/klog/v2.init.0
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:418 +0xdf
2021-06-21T05:08:10.699714879Z
2021-06-21T05:08:10.699714879Z goroutine 19 [sleep]:
2021-06-21T05:08:10.699714879Z time.Sleep(0x6fc23ac00)
2021-06-21T05:08:10.699714879Z 	/usr/lib/golang/src/runtime/time.go:193 +0xd2
2021-06-21T05:08:10.699714879Z sigs.k8s.io/controller-runtime/pkg/log.init.0.func1()
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/log/log.go:63 +0x45
2021-06-21T05:08:10.699714879Z created by sigs.k8s.io/controller-runtime/pkg/log.init.0
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/log/log.go:62 +0x35
2021-06-21T05:08:10.699714879Z
2021-06-21T05:08:10.699714879Z goroutine 70 [IO wait]:
2021-06-21T05:08:10.699714879Z internal/poll.runtime_pollWait(0x7f719a2c6658, 0x72, 0xffffffffffffffff)
2021-06-21T05:08:10.699714879Z 	/usr/lib/golang/src/runtime/netpoll.go:222 +0x55
2021-06-21T05:08:10.699714879Z internal/poll.(*pollDesc).wait(0xc0001a1698, 0x72, 0x1200, 0x12ad, 0xffffffffffffffff)
2021-06-21T05:08:10.699714879Z 	/usr/lib/golang/src/internal/poll/fd_poll_runtime.go:87 +0x45
2021-06-21T05:08:10.699714879Z internal/poll.(*pollDesc).waitRead(...)
must-gather:
4.9.0-0.nightly-2021-06-20-015701: http://file.rdu.redhat.com/~zhsun/must-gather.local.70017523856438490.tar.gz
4.8.0-0.nightly-2021-06-19-005119: http://file.rdu.redhat.com/~zhsun/must-gather.local.1102752771192400939.tar.gz
None of the changes to the code in this PR are anywhere near where this failure happens. I think this is a coincidence and not something related to this change. I will investigate this morning to see if I can work out why it isn't working initially
I've tried to reproduce this today but was unable to successfully stand up an STS cluster. Having tried this out on a regular cluster, it appears to work as expected when not running in STS. I don't think STS specifically does anything different in this case, so I'm not sure this change is the source of the STS-related failure.
Move to verified.

clusterversion: 4.9.0-0.nightly-2021-06-24-123715

Set up an STS environment; two hours later, checking the CloudTrail logs, I see only 1 call to `DescribeInternetGateways` in the VPC of the deployed cluster.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759