Bug 1965080
Summary: | machine-api-operator constantly makes unauthorized AWS calls to DescribeInternetGateways | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | bmorriso
Component: | Cloud Compute | Assignee: | Joel Speed <jspeed>
Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | medium | |
Priority: | unspecified | CC: | aos-bugs, gshereme, mimccune
Version: | 4.7 | Keywords: | ServiceDeliveryImpact
Target Milestone: | --- | |
Target Release: | 4.9.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2021-10-18 17:31:45 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |

Doc Text:

Cause: Rather than determining whether load balancer attachment was a required operation, the operator would constantly attach new targets even if they were already attached.
Consequence: Excessive calls to the AWS API result in a high number of errors.
Fix: Check whether a load balancer attachment is required before attempting the attachment process.
Result: Failed API request frequency has been reduced.
Description bmorriso 2021-05-26 19:08:04 UTC
This seems to be a side effect of the RegisterTargets call we are making. While the call itself doesn't fail, it seems that internally AWS first attempts the VPC list call using the provided STS credentials, and then falls back to some internal credentials. This doesn't seem to happen in the normal case, where it always uses the internal credentials.

There are two fixes I would like to make here. First, we will have to add the DescribeInternetGateways permission to prevent this from happening. Second, we can check that the health check targets are already registered and avoid registering them again if they are (see the sketches at the end of this report). The latter makes the problem invisible but doesn't actually mitigate the issue.

Failed to verify.

Tried on clusterversion 4.9.0-0.nightly-2021-06-20-015701 and 4.8.0-0.nightly-2021-06-19-005119; cannot get a successfully installed STS cluster. After the installation fails, checking the cluster shows many pods in CrashLoopBackOff state; after about 30 minutes everything becomes normal.

```
$ oc get pods -n openshift-machine-api
NAME                                           READY   STATUS             RESTARTS   AGE
cluster-autoscaler-operator-686dbf5678-wk9fg   2/2     Running            14         60m
cluster-baremetal-operator-6cfd45886f-n85zb    2/2     Running            1          60m
machine-api-controllers-646456ff4-pwbzv        4/7     CrashLoopBackOff   53         51m
machine-api-operator-7c8d757d58-942zm          2/2     Running            1          60m
```

Machine-controller log:

```
2021-06-21T05:08:10.699599194Z F0621 05:08:10.699565 1 main.go:131] Error creating manager: Unauthorized
2021-06-21T05:08:10.699714879Z goroutine 1 [running]:
2021-06-21T05:08:10.699714879Z k8s.io/klog/v2.stacks(0xc000128001, 0xc0001b62a0, 0x50, 0xda)
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:1021 +0xb9
2021-06-21T05:08:10.699714879Z k8s.io/klog/v2.(*loggingT).output(0x35abbc0, 0xc000000003, 0x0, 0x0, 0xc000450cb0, 0x2b0a4ae, 0x7, 0x83, 0x0)
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:970 +0x191
2021-06-21T05:08:10.699714879Z k8s.io/klog/v2.(*loggingT).printf(0x35abbc0, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x227d29c, 0x1a, 0xc00048f020, 0x1, ...)
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:751 +0x191
2021-06-21T05:08:10.699714879Z k8s.io/klog/v2.Fatalf(...)
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:1509
2021-06-21T05:08:10.699714879Z main.main()
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/cmd/manager/main.go:131 +0x6ad
2021-06-21T05:08:10.699714879Z
2021-06-21T05:08:10.699714879Z goroutine 18 [chan receive]:
2021-06-21T05:08:10.699714879Z k8s.io/klog/v2.(*loggingT).flushDaemon(0x35abbc0)
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:1164 +0x8b
2021-06-21T05:08:10.699714879Z created by k8s.io/klog/v2.init.0
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/k8s.io/klog/v2/klog.go:418 +0xdf
2021-06-21T05:08:10.699714879Z
2021-06-21T05:08:10.699714879Z goroutine 19 [sleep]:
2021-06-21T05:08:10.699714879Z time.Sleep(0x6fc23ac00)
2021-06-21T05:08:10.699714879Z 	/usr/lib/golang/src/runtime/time.go:193 +0xd2
2021-06-21T05:08:10.699714879Z sigs.k8s.io/controller-runtime/pkg/log.init.0.func1()
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/log/log.go:63 +0x45
2021-06-21T05:08:10.699714879Z created by sigs.k8s.io/controller-runtime/pkg/log.init.0
2021-06-21T05:08:10.699714879Z 	/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/log/log.go:62 +0x35
2021-06-21T05:08:10.699714879Z
2021-06-21T05:08:10.699714879Z goroutine 70 [IO wait]:
2021-06-21T05:08:10.699714879Z internal/poll.runtime_pollWait(0x7f719a2c6658, 0x72, 0xffffffffffffffff)
2021-06-21T05:08:10.699714879Z 	/usr/lib/golang/src/runtime/netpoll.go:222 +0x55
2021-06-21T05:08:10.699714879Z internal/poll.(*pollDesc).wait(0xc0001a1698, 0x72, 0x1200, 0x12ad, 0xffffffffffffffff)
2021-06-21T05:08:10.699714879Z 	/usr/lib/golang/src/internal/poll/fd_poll_runtime.go:87 +0x45
2021-06-21T05:08:10.699714879Z internal/poll.(*pollDesc).waitRead(...)
```

must-gather:
4.9.0-0.nightly-2021-06-20-015701: http://file.rdu.redhat.com/~zhsun/must-gather.local.70017523856438490.tar.gz
4.8.0-0.nightly-2021-06-19-005119: http://file.rdu.redhat.com/~zhsun/must-gather.local.1102752771192400939.tar.gz

None of the changes to the code in this PR are anywhere near where this failure happens. I think this is a coincidence and not something related to this change. I will investigate this morning to see if I can work out why it isn't working initially.

I've tried today to reproduce this but was unable to successfully stand up an STS cluster. Having tried this out on a regular cluster, it appears to work as expected when not running in STS. I don't think STS is specifically doing anything different in this case, so I'm not sure this change is the source of the STS issue.

Moved to verified.

clusterversion: 4.9.0-0.nightly-2021-06-24-123715

Set up an STS environment; two hours later, checking the CloudTrail logs, I see only 1 call to `DescribeInternetGateways` in the VPC of the deployed cluster.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
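The fix described in the Doc Text, checking whether a load balancer attachment is actually required before attempting it, amounts to querying the target group for its current targets and skipping RegisterTargets when the instance is already present. The following is a minimal sketch of that pattern, assuming the AWS SDK for Go v1 elbv2 API; the function name `ensureTargetRegistered`, the target group ARN, and the instance ID are illustrative and are not taken from the actual operator code.

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elbv2"
)

// ensureTargetRegistered registers instanceID with the target group only when it is
// not already registered, so repeated reconciles do not issue redundant RegisterTargets
// calls (and the AWS-internal calls they trigger).
func ensureTargetRegistered(client *elbv2.ELBV2, targetGroupARN, instanceID string) error {
	// Ask the target group which targets it already knows about.
	out, err := client.DescribeTargetHealth(&elbv2.DescribeTargetHealthInput{
		TargetGroupArn: aws.String(targetGroupARN),
	})
	if err != nil {
		return fmt.Errorf("describing target health: %w", err)
	}

	// If the instance is already a target, there is nothing to do.
	for _, desc := range out.TargetHealthDescriptions {
		if desc.Target != nil && aws.StringValue(desc.Target.Id) == instanceID {
			return nil
		}
	}

	// Otherwise register the instance as a new target.
	_, err = client.RegisterTargets(&elbv2.RegisterTargetsInput{
		TargetGroupArn: aws.String(targetGroupARN),
		Targets:        []*elbv2.TargetDescription{{Id: aws.String(instanceID)}},
	})
	if err != nil {
		return fmt.Errorf("registering target %s: %w", instanceID, err)
	}
	return nil
}

func main() {
	sess := session.Must(session.NewSession())
	client := elbv2.New(sess)
	// Placeholder ARN and instance ID, for illustration only.
	if err := ensureTargetRegistered(client, "arn:aws:elasticloadbalancing:...:targetgroup/example", "i-0123456789abcdef0"); err != nil {
		fmt.Println(err)
	}
}
```

Because registration is skipped whenever the target is already attached, the RegisterTargets traffic (and the side-effect calls this bug describes) drops to the rare cases where a new attachment is genuinely needed.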
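The verification step above (counting DescribeInternetGateways events in CloudTrail over a time window) can also be scripted. A rough sketch, assuming the AWS SDK for Go v1 cloudtrail API and credentials allowed to call cloudtrail:LookupEvents; the helper name and the two-hour window are illustrative only.

```go
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudtrail"
)

// countDescribeInternetGatewaysCalls returns how many DescribeInternetGateways events
// CloudTrail recorded since the given time, paging through all results.
func countDescribeInternetGatewaysCalls(client *cloudtrail.CloudTrail, since time.Time) (int, error) {
	count := 0
	input := &cloudtrail.LookupEventsInput{
		StartTime: aws.Time(since),
		LookupAttributes: []*cloudtrail.LookupAttribute{{
			AttributeKey:   aws.String("EventName"),
			AttributeValue: aws.String("DescribeInternetGateways"),
		}},
	}
	for {
		out, err := client.LookupEvents(input)
		if err != nil {
			return 0, err
		}
		count += len(out.Events)
		// CloudTrail returns a NextToken while more pages remain.
		if out.NextToken == nil || *out.NextToken == "" {
			return count, nil
		}
		input.NextToken = out.NextToken
	}
}

func main() {
	sess := session.Must(session.NewSession())
	client := cloudtrail.New(sess)
	n, err := countDescribeInternetGatewaysCalls(client, time.Now().Add(-2*time.Hour))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("DescribeInternetGateways calls in the last 2 hours: %d\n", n)
}
```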