Bug 1910396
Summary: | 4.6.9 cred operator should back-off when provisioning fails on throttling | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
Component: | Cloud Credential Operator | Assignee: | Devan Goodwin <dgoodwin> |
Status: | CLOSED ERRATA | QA Contact: | wang lin <lwan> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4.6 | CC: | arane, jdiaz, lwan |
Target Milestone: | --- | ||
Target Release: | 4.8.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause: Error message stored in CredentialsRequest condition status message.
Consequence: Infinite status update / reconcile loop in controller in some error scenarios (AWS rate limiting), leading to excessive CPU use and logging.
Fix: Scrubbed request ID coming back from cloud provider to more safely store error message in conditions where users can more easily find them.
Result: No infinite loop.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2021-07-27 22:35:38 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
W. Trevor King
2020-12-23 18:02:34 UTC
A must gather would have helped here to see the CredentialsRequests, but looking at the code, this appears to be a pretty standard infinite loop. We're putting an AWS error into a condition, checking if status has changed, updating if so. The AWS error has a request ID that changes everytime. We update status, reconcile again, update status, etc, which prevents our normal error backoff from kicking in. Discussing with team how to solve as we hit this often when we try to give users good error messages from the cloud provider. Verified on 4.8.0-0.nightly-2021-05-19-123944 test steps: 1. Change cloudcredential mode to Mint 2. Modified root secret aws-creds in kube-system namespace to an invalid one 3. remove one component secret, such as secret installer-cloud-credentials in openshift-image-registry namespace 4. Check credentialsrequest openshift-image-registry status For 4.8.0-fc.3 without this fix, the message of condition=CredentialsProvisionFailure will contain "request id" which will trigger loop status update $ oc get credentialsrequest openshift-image-registry -o json | jq -r ".status" { "conditions": [ { "lastProbeTime": "2021-05-21T11:14:07Z", "lastTransitionTime": "2021-05-21T10:50:31Z", "message": "failed to grant creds: error syncing creds in mint-mode: AWS Error: SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.\n\tstatus code: 403, request id: b46a1197-178e-4c2f-9916-5034e17546e6", "reason": "CredentialsProvisionFailure", "status": "True", "type": "CredentialsProvisionFailure" } ], "lastSyncCloudCredsSecretResourceVersion": "2511", "lastSyncGeneration": 1, "lastSyncTimestamp": "2021-05-21T09:28:18Z", "providerStatus": { "apiVersion": "cloudcredential.openshift.io/v1", "kind": "AWSProviderStatus", "policy": "lwanaws0521-3-4k597-openshift-image-registry-bfvnc-policy", "user": "lwanaws0521-3-4k597-openshift-image-registry-bfvnc" }, "provisioned": false } For payload 4.8.0-0.nightly-2021-05-19-123944 with this fix, the request id is removed oc get credentialsrequest openshift-image-registry -o json | jq -r ".status" { "conditions": [ { "lastProbeTime": "2021-05-21T11:16:08Z", "lastTransitionTime": "2021-05-21T08:35:10Z", "message": "failed to grant creds: error syncing creds in mint-mode: AWS Error: SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details., status code: 403", "reason": "CredentialsProvisionFailure", "status": "True", "type": "CredentialsProvisionFailure" } ], "lastSyncCloudCredsSecretResourceVersion": "2476", "lastSyncGeneration": 1, "lastSyncTimestamp": "2021-05-21T07:11:28Z", "providerStatus": { "apiVersion": "cloudcredential.openshift.io/v1", "kind": "AWSProviderStatus", "policy": "lwanaws0521-2-t7fhj-openshift-image-registry-cnj76-policy", "user": "lwanaws0521-2-t7fhj-openshift-image-registry-cnj76" }, "provisioned": false } Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438 |