Bug 1965080
Summary: | machine-api-operator constantly makes unauthorized AWS calls to DescribeInternetGateways | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | bmorriso
Component: | Cloud Compute | Assignee: | Joel Speed <jspeed>
Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | medium | |
Priority: | unspecified | CC: | aos-bugs, gshereme, mimccune
Version: | 4.7 | Keywords: | ServiceDeliveryImpact
Target Milestone: | --- | |
Target Release: | 4.9.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2021-10-18 17:31:45 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |

Doc Text:

Cause: Rather than determining whether load balancer attachment was a required operation, the operator would constantly attach new targets even if they were already attached.
Consequence: Excessive calls to the AWS API result in a high number of errors.
Fix: Check whether a load balancer attachment is required before attempting the attachment process.
Result: Failed API request frequency has been reduced.
Description bmorriso 2021-05-26 19:08:04 UTC
This seems to be a side effect of the RegisterTargets call we are making. While the call itself doesn't fail, it seems that internally AWS first attempts the VPC list call using the provided STS credentials, and then falls back to some internal credentials. This doesn't seem to happen in the normal case, where it always uses the internal credentials.

There are two fixes I would like to make here. First, we will have to add the DescribeInternetGateways permission to prevent this from happening. Second, we can check that the health check targets are already registered and avoid registering them again if they are (see the sketches at the end of this report). The latter makes the problem invisible but doesn't actually mitigate the issue.

Failed to verify.

Tried on clusterversion 4.9.0-0.nightly-2021-06-20-015701 and 4.8.0-0.nightly-2021-06-19-005119; cannot get a successfully installed STS cluster. After the installation fails, checking the cluster shows many pods in CrashLoopBackOff state; after about 30 minutes everything becomes normal.

```
$ oc get pods -n openshift-machine-api
NAME                                           READY   STATUS             RESTARTS   AGE
cluster-autoscaler-operator-686dbf5678-wk9fg   2/2     Running            14         60m
cluster-baremetal-operator-6cfd45886f-n85zb    2/2     Running            1          60m
machine-api-controllers-646456ff4-pwbzv        4/7     CrashLoopBackOff   53         51m
machine-api-operator-7c8d757d58-942zm          2/2     Running            1          60m
```

Machine-controller log:

```
2021-06-21T05:08:10.699599194Z F0621 05:08:10.699565 1 main.go:131] Error creating manager: Unauthorized
2021-06-21T05:08:10.699714879Z goroutine 1 [running]:
2021-06-21T05:08:10.699714879Z k8s.io/klog/v2.stacks(0xc000128001, 0xc0001b62a0, 0x50, 0xda)
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:1021 +0xb9
2021-06-21T05:08:10.699714879Z k8s.io/klog/v2.(*loggingT).output(0x35abbc0, 0xc000000003, 0x0, 0x0, 0xc000450cb0, 0x2b0a4ae, 0x7, 0x83, 0x0)
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:970 +0x191
2021-06-21T05:08:10.699714879Z k8s.io/klog/v2.(*loggingT).printf(0x35abbc0, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x227d29c, 0x1a, 0xc00048f020, 0x1, ...)
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:751 +0x191
2021-06-21T05:08:10.699714879Z k8s.io/klog/v2.Fatalf(...)
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:1509
2021-06-21T05:08:10.699714879Z main.main()
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/cmd/manager/main.go:131 +0x6ad
2021-06-21T05:08:10.699714879Z
2021-06-21T05:08:10.699714879Z goroutine 18 [chan receive]:
2021-06-21T05:08:10.699714879Z k8s.io/klog/v2.(*loggingT).flushDaemon(0x35abbc0)
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:1164 +0x8b
2021-06-21T05:08:10.699714879Z created by k8s.io/klog/v2.init.0
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:418 +0xdf
2021-06-21T05:08:10.699714879Z
2021-06-21T05:08:10.699714879Z goroutine 19 [sleep]:
2021-06-21T05:08:10.699714879Z time.Sleep(0x6fc23ac00)
2021-06-21T05:08:10.699714879Z 	/usr/lib/golang/src/runtime/time.go:193 +0xd2
2021-06-21T05:08:10.699714879Z sigs.k8s.io/controller-runtime/pkg/log.init.0.func1()
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/log/log.go:63 +0x45
2021-06-21T05:08:10.699714879Z created by sigs.k8s.io/controller-runtime/pkg/log.init.0
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/log/log.go:62 +0x35
2021-06-21T05:08:10.699714879Z
2021-06-21T05:08:10.699714879Z goroutine 70 [IO wait]:
2021-06-21T05:08:10.699714879Z internal/poll.runtime_pollWait(0x7f719a2c6658, 0x72, 0xffffffffffffffff)
2021-06-21T05:08:10.699714879Z 	/usr/lib/golang/src/runtime/netpoll.go:222 +0x55
2021-06-21T05:08:10.699714879Z internal/poll.(*pollDesc).wait(0xc0001a1698, 0x72, 0x1200, 0x12ad, 0xffffffffffffffff)
2021-06-21T05:08:10.699714879Z 	/usr/lib/golang/src/internal/poll/fd_poll_runtime.go:87 +0x45
2021-06-21T05:08:10.699714879Z internal/poll.(*pollDesc).waitRead(...)
```

must-gather:
4.9.0-0.nightly-2021-06-20-015701: http://file.rdu.redhat.com/~zhsun/must-gather.local.70017523856438490.tar.gz
4.8.0-0.nightly-2021-06-19-005119: http://file.rdu.redhat.com/~zhsun/must-gather.local.1102752771192400939.tar.gz

None of the changes to the code in this PR are anywhere near where this failure happens. I think this is a coincidence and not something related to this change. I will investigate this morning to see if I can work out why it isn't working initially.

I've tried today to reproduce this but was unable to successfully stand up an STS cluster. Having tried this out on a regular cluster, it appears to work as expected when not running in STS. I don't think STS is specifically doing anything different in this case, so I'm not sure this change is the source of the STS issue.

Moved to verified.

clusterversion: 4.9.0-0.nightly-2021-06-24-123715

Set up an STS environment; two hours later, checking the CloudTrail logs, I see only 1 call to `DescribeInternetGateways` in the VPC of the deployed cluster.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
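The fix described in the Doc Text, checking whether a load balancer attachment is actually required before attempting it, amounts to querying the target group for its current targets and skipping RegisterTargets when the instance is already present. The following is a minimal sketch of that pattern, assuming the AWS SDK for Go v1 elbv2 API; the function name `ensureTargetRegistered`, the target group ARN, and the instance ID are illustrative and are not taken from the actual operator code.

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elbv2"
)

// ensureTargetRegistered registers instanceID with the target group only when it is
// not already registered, so repeated reconciles do not issue redundant RegisterTargets
// calls (and the AWS-internal calls they trigger).
func ensureTargetRegistered(client *elbv2.ELBV2, targetGroupARN, instanceID string) error {
	// Ask the target group which targets it already knows about.
	out, err := client.DescribeTargetHealth(&elbv2.DescribeTargetHealthInput{
		TargetGroupArn: aws.String(targetGroupARN),
	})
	if err != nil {
		return fmt.Errorf("describing target health: %w", err)
	}

	// If the instance is already a target, there is nothing to do.
	for _, desc := range out.TargetHealthDescriptions {
		if desc.Target != nil && aws.StringValue(desc.Target.Id) == instanceID {
			return nil
		}
	}

	// Otherwise register the instance as a new target.
	_, err = client.RegisterTargets(&elbv2.RegisterTargetsInput{
		TargetGroupArn: aws.String(targetGroupARN),
		Targets:        []*elbv2.TargetDescription{{Id: aws.String(instanceID)}},
	})
	if err != nil {
		return fmt.Errorf("registering target %s: %w", instanceID, err)
	}
	return nil
}

func main() {
	sess := session.Must(session.NewSession())
	client := elbv2.New(sess)
	// Placeholder ARN and instance ID, for illustration only.
	if err := ensureTargetRegistered(client, "arn:aws:elasticloadbalancing:...:targetgroup/example", "i-0123456789abcdef0"); err != nil {
		fmt.Println(err)
	}
}
```

Because registration is skipped whenever the target is already attached, the RegisterTargets traffic (and the side-effect calls this bug describes) drops to the rare cases where a new attachment is genuinely needed.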
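The verification step above (counting DescribeInternetGateways events in CloudTrail over a time window) can also be scripted. A rough sketch, assuming the AWS SDK for Go v1 cloudtrail API and credentials allowed to call cloudtrail:LookupEvents; the helper name and the two-hour window are illustrative only.

```go
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudtrail"
)

// countDescribeInternetGatewaysCalls returns how many DescribeInternetGateways events
// CloudTrail recorded since the given time, paging through all results.
func countDescribeInternetGatewaysCalls(client *cloudtrail.CloudTrail, since time.Time) (int, error) {
	count := 0
	input := &cloudtrail.LookupEventsInput{
		StartTime: aws.Time(since),
		LookupAttributes: []*cloudtrail.LookupAttribute{{
			AttributeKey:   aws.String("EventName"),
			AttributeValue: aws.String("DescribeInternetGateways"),
		}},
	}
	for {
		out, err := client.LookupEvents(input)
		if err != nil {
			return 0, err
		}
		count += len(out.Events)
		// CloudTrail returns a NextToken while more pages remain.
		if out.NextToken == nil || *out.NextToken == "" {
			return count, nil
		}
		input.NextToken = out.NextToken
	}
}

func main() {
	sess := session.Must(session.NewSession())
	client := cloudtrail.New(sess)
	n, err := countDescribeInternetGatewaysCalls(client, time.Now().Add(-2*time.Hour))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("DescribeInternetGateways calls in the last 2 hours: %d\n", n)
}
```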