Description of problem:

After installing a 4.7.11 cluster on AWS with STS enabled, I noticed a lot of failing AWS calls that appear to originate from the machine-api-operator. Looking in the CloudTrail logs I see that, every 3-4 minutes, there are 10+ calls to `DescribeInternetGateways` in the VPC of the deployed cluster. Half of these calls assume the role provided for the operator, something like:

"arn:aws:sts::<account_num>:assumed-role/delegate-admin-openshift-machine-api-aws-cloud-credentials/..."

These calls fail with the following response:

    "eventSource": "ec2.amazonaws.com",
    "eventName": "DescribeInternetGateways",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "elasticloadbalancing.amazonaws.com",
    "userAgent": "elasticloadbalancing.amazonaws.com",
    "errorCode": "Client.UnauthorizedOperation",
    "errorMessage": "You are not authorized to perform this operation."

The other half of these 10 or so calls make the same `DescribeInternetGateways` call for the same VPC using a service-linked role, and they succeed. In these instances the service-linked role making the request has an ARN like:

"arn": "arn:aws:sts::<account_num>:assumed-role/AWSServiceRoleForElasticLoadBalancing/ELASTICLOADBALANCING"

Version-Release number of selected component (if applicable):
OCP 4.7.11 with STS enabled

How reproducible:
100%

Steps to Reproduce:
1. Install an STS cluster on AWS using the credentials extracted with "oc adm release extract quay.io/openshift-release-dev/ocp-release:${version}.0-x86_64 --credentials-requests --cloud=aws"
2. Wait for the cluster installation to complete
3. Look in the AWS CloudTrail event history for events named "DescribeInternetGateways"

Actual results:
There are regularly occurring calls using the machine-api-operator's assumed credentials that are unauthorized and fail, each followed by an equivalent, successful call using a service-linked role.

Expected results:
There aren't any failing requests.

Additional info:
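For anyone repeating step 3 of the reproducer, the offending events can be picked out of exported CloudTrail records by their errorCode. A minimal sketch (field names taken from the event quoted above; the helper function name is mine, not part of any real tooling):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cloudTrailEvent captures only the record fields needed to
// classify a call; the JSON keys match the event quoted above.
type cloudTrailEvent struct {
	EventSource string `json:"eventSource"`
	EventName   string `json:"eventName"`
	ErrorCode   string `json:"errorCode"`
	UserAgent   string `json:"userAgent"`
}

// isFailedGatewayLookup reports whether a raw CloudTrail record is one of
// the unauthorized DescribeInternetGateways calls described in this bug.
// Successful calls (e.g. via the service-linked role) have no errorCode.
func isFailedGatewayLookup(raw []byte) (bool, error) {
	var ev cloudTrailEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return false, err
	}
	return ev.EventName == "DescribeInternetGateways" &&
		ev.ErrorCode == "Client.UnauthorizedOperation", nil
}

func main() {
	// Sample record built from the fields quoted in the description.
	rec := []byte(`{
		"eventSource": "ec2.amazonaws.com",
		"eventName": "DescribeInternetGateways",
		"errorCode": "Client.UnauthorizedOperation",
		"userAgent": "elasticloadbalancing.amazonaws.com"
	}`)
	failed, err := isFailedGatewayLookup(rec)
	fmt.Println(failed, err) // prints: true <nil>
}
```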
This seems to be a side effect of the RegisterTargets call we are making. While the call itself doesn't fail, it appears that internally AWS first attempts the VPC lookup using the provided STS credentials, and only then retries with its own internal credentials. This doesn't happen in the normal (non-STS) case, where the internal credentials are always used. There are two fixes I would like to make here. We will have to add the DescribeInternetGateways permission to prevent the failures. But in addition, we can check whether the health check targets are already registered and skip the RegisterTargets call when they are; that second change only makes the problem invisible and doesn't actually mitigate the underlying issue.
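The second fix (skip RegisterTargets when the targets are already attached) amounts to a set difference between desired and currently registered targets. A minimal sketch of that logic; the function names are hypothetical and "registered" would in practice be built from an elbv2 DescribeTargetHealth response, not hard-coded:

```go
package main

import "fmt"

// targetsToRegister returns the subset of desired target IDs that are not
// already registered with the target group. In the real controller the
// "registered" set would come from an elbv2 DescribeTargetHealth call.
func targetsToRegister(desired []string, registered map[string]bool) []string {
	var missing []string
	for _, id := range desired {
		if !registered[id] {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	registered := map[string]bool{"i-aaa": true}
	missing := targetsToRegister([]string{"i-aaa", "i-bbb"}, registered)
	if len(missing) == 0 {
		// Nothing to do: skipping RegisterTargets entirely avoids the
		// DescribeInternetGateways side effect described in this bug.
		return
	}
	// Otherwise a single RegisterTargets call would be issued for "missing".
	fmt.Println(missing) // prints: [i-bbb]
}
```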
Failed to verify.

Tried on clusterversion:
4.9.0-0.nightly-2021-06-20-015701
4.8.0-0.nightly-2021-06-19-005119

Cannot get a successfully installed STS cluster. After the installation fails, checking the cluster shows many pods in CrashLoopBackOff state; after waiting about 30 minutes, everything becomes normal.

$ oc get pods -n openshift-machine-api
NAME                                           READY   STATUS             RESTARTS   AGE
cluster-autoscaler-operator-686dbf5678-wk9fg   2/2     Running            14         60m
cluster-baremetal-operator-6cfd45886f-n85zb    2/2     Running            1          60m
machine-api-controllers-646456ff4-pwbzv        4/7     CrashLoopBackOff   53         51m
machine-api-operator-7c8d757d58-942zm          2/2     Running            1          60m

Machine-controller log:

2021-06-21T05:08:10.699599194Z F0621 05:08:10.699565       1 main.go:131] Error creating manager: Unauthorized
2021-06-21T05:08:10.699714879Z goroutine 1 [running]:
2021-06-21T05:08:10.699714879Z k8s.io/klog/v2.stacks(0xc000128001, 0xc0001b62a0, 0x50, 0xda)
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:1021 +0xb9
2021-06-21T05:08:10.699714879Z k8s.io/klog/v2.(*loggingT).output(0x35abbc0, 0xc000000003, 0x0, 0x0, 0xc000450cb0, 0x2b0a4ae, 0x7, 0x83, 0x0)
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:970 +0x191
2021-06-21T05:08:10.699714879Z k8s.io/klog/v2.(*loggingT).printf(0x35abbc0, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x227d29c, 0x1a, 0xc00048f020, 0x1, ...)
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:751 +0x191
2021-06-21T05:08:10.699714879Z k8s.io/klog/v2.Fatalf(...)
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:1509
2021-06-21T05:08:10.699714879Z main.main()
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/cmd/manager/main.go:131 +0x6ad
2021-06-21T05:08:10.699714879Z
2021-06-21T05:08:10.699714879Z goroutine 18 [chan receive]:
2021-06-21T05:08:10.699714879Z k8s.io/klog/v2.(*loggingT).flushDaemon(0x35abbc0)
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:1164 +0x8b
2021-06-21T05:08:10.699714879Z created by k8s.io/klog/v2.init.0
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:418 +0xdf
2021-06-21T05:08:10.699714879Z
2021-06-21T05:08:10.699714879Z goroutine 19 [sleep]:
2021-06-21T05:08:10.699714879Z time.Sleep(0x6fc23ac00)
2021-06-21T05:08:10.699714879Z 	/usr/lib/golang/src/runtime/time.go:193 +0xd2
2021-06-21T05:08:10.699714879Z sigs.k8s.io/controller-runtime/pkg/log.init.0.func1()
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/log/log.go:63 +0x45
2021-06-21T05:08:10.699714879Z created by sigs.k8s.io/controller-runtime/pkg/log.init.0
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/log/log.go:62 +0x35
2021-06-21T05:08:10.699714879Z
2021-06-21T05:08:10.699714879Z goroutine 70 [IO wait]:
2021-06-21T05:08:10.699714879Z internal/poll.runtime_pollWait(0x7f719a2c6658, 0x72, 0xffffffffffffffff)
2021-06-21T05:08:10.699714879Z 	/usr/lib/golang/src/runtime/netpoll.go:222 +0x55
2021-06-21T05:08:10.699714879Z internal/poll.(*pollDesc).wait(0xc0001a1698, 0x72, 0x1200, 0x12ad, 0xffffffffffffffff)
2021-06-21T05:08:10.699714879Z 	/usr/lib/golang/src/internal/poll/fd_poll_runtime.go:87 +0x45
2021-06-21T05:08:10.699714879Z internal/poll.(*pollDesc).waitRead(...)
must-gather:
4.9.0-0.nightly-2021-06-20-015701: http://file.rdu.redhat.com/~zhsun/must-gather.local.70017523856438490.tar.gz
4.8.0-0.nightly-2021-06-19-005119: http://file.rdu.redhat.com/~zhsun/must-gather.local.1102752771192400939.tar.gz
None of the changes to the code in this PR are anywhere near where this failure happens. I think this is a coincidence and not something related to this change. I will investigate this morning to see if I can work out why it isn't working initially
I've tried to reproduce this today but was unable to successfully stand up an STS cluster. Having tried this out on a regular cluster, it appears to work as expected when not running in STS. I don't think STS specifically does anything different in this case, so I'm not sure this change is the source of the STS-related failure.
Move to verified.

clusterversion: 4.9.0-0.nightly-2021-06-24-123715

Set up an STS environment; two hours later, checking the CloudTrail logs, I see only 1 call to `DescribeInternetGateways` in the VPC of the deployed cluster.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759