Bug 1910396 - 4.6.9 cred operator should back-off when provisioning fails on throttling
Summary: 4.6.9 cred operator should back-off when provisioning fails on throttling
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Credential Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Devan Goodwin
QA Contact: wang lin
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-23 18:02 UTC by W. Trevor King
Modified: 2021-07-27 22:36 UTC (History)
3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: An error message from the cloud provider, including a per-request ID, was stored in the CredentialsRequest condition status message.
Consequence: In some error scenarios (for example AWS rate limiting), the changing request ID caused an infinite status update / reconcile loop in the controller, leading to excessive CPU use and logging.
Fix: The request ID returned from the cloud provider is scrubbed so the error message can be safely stored in conditions, where users can more easily find it.
Result: No infinite loop.
Clone Of:
Environment:
Last Closed: 2021-07-27 22:35:38 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift cloud-credential-operator pull 342 (open): Bug 1910396: Add ErrorScrub utility to prevent infinite update/reconc… (last updated 2021-05-18 13:29:51 UTC)
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 22:36:04 UTC)

Description W. Trevor King 2020-12-23 18:02:34 UTC
Spinning off of bug 1910318, where a 4.6.9 install failed due to missing AWS creds:

$ yaml2json <namespaces/openshift-cloud-credential-operator/cloudcredential.openshift.io/credentialsrequests/openshift-machine-api-aws.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2020-12-23T14:23:28Z CredentialsProvisionFailure=True CredentialsProvisionFailure: failed to grant creds: error syncing creds in mint-mode: AWS Error: Throttling: Rate exceeded
        status code: 400, request id: ...

Looks like the cred operator was banging away at AWS and never backing off enough to let the throttling calm down and a request succeed.  That doesn't mean the cred operator was the only source of throttle-inducing load, but it would still be nice if there were some exponential backoff (a rough sketch follows the log excerpt below):

 $ grep 'AWS error creating user' namespaces/openshift-cloud-credential-operator/pods/cloud-credential-operator-7f8fb8dcb6-xkx85/cloud-credential-operator/cloud-credential-operator/logs/current.log | head | sed 's/request id: [0-9a-f-]*/request id: .../'
2020-12-23T14:36:56.556652671Z time="2020-12-23T14:36:56Z" level=error msg="AWS error creating user" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry error="AWS Error: Throttling: Rate exceeded\n\tstatus code: 400, request id: ..." userName=nmanos-cluster-a-ktb-openshift-image-registry-68hlr
2020-12-23T14:37:05.069413901Z time="2020-12-23T14:37:05Z" level=error msg="AWS error creating user" actuator=aws cr=openshift-cloud-credential-operator/openshift-ingress error="AWS Error: Throttling: Rate exceeded\n\tstatus code: 400, request id: ..." userName=nmanos-cluster-a-ktb-openshift-ingress-g98qq
2020-12-23T14:37:13.524134841Z time="2020-12-23T14:37:13Z" level=error msg="AWS error creating user" actuator=aws cr=openshift-cloud-credential-operator/aws-ebs-csi-driver-operator error="AWS Error: Throttling: Rate exceeded\n\tstatus code: 400, request id: ..." userName=nmanos-cluster-a-ktb-aws-ebs-csi-driver-operator-jlwkm
2020-12-23T14:37:20.943219630Z time="2020-12-23T14:37:20Z" level=error msg="AWS error creating user" actuator=aws cr=openshift-cloud-credential-operator/openshift-machine-api-aws error="AWS Error: Throttling: Rate exceeded\n\tstatus code: 400, request id: ..." userName=nmanos-cluster-a-ktb-openshift-machine-api-aws-7g5kf
2020-12-23T14:37:28.314451986Z time="2020-12-23T14:37:28Z" level=error msg="AWS error creating user" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry error="AWS Error: Throttling: Rate exceeded\n\tstatus code: 400, request id: ..." userName=nmanos-cluster-a-ktb-openshift-image-registry-68hlr
2020-12-23T14:37:35.495548067Z time="2020-12-23T14:37:35Z" level=error msg="AWS error creating user" actuator=aws cr=openshift-cloud-credential-operator/openshift-ingress error="AWS Error: Throttling: Rate exceeded\n\tstatus code: 400, request id: ..." userName=nmanos-cluster-a-ktb-openshift-ingress-g98qq
2020-12-23T14:37:43.087697817Z time="2020-12-23T14:37:43Z" level=error msg="AWS error creating user" actuator=aws cr=openshift-cloud-credential-operator/aws-ebs-csi-driver-operator error="AWS Error: Throttling: Rate exceeded\n\tstatus code: 400, request id: ..." userName=nmanos-cluster-a-ktb-aws-ebs-csi-driver-operator-jlwkm
2020-12-23T14:37:51.523816073Z time="2020-12-23T14:37:51Z" level=error msg="AWS error creating user" actuator=aws cr=openshift-cloud-credential-operator/openshift-machine-api-aws error="AWS Error: Throttling: Rate exceeded\n\tstatus code: 400, request id: ..." userName=nmanos-cluster-a-ktb-openshift-machine-api-aws-7g5kf
2020-12-23T14:37:59.470420843Z time="2020-12-23T14:37:59Z" level=error msg="AWS error creating user" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry error="AWS Error: Throttling: Rate exceeded\n\tstatus code: 400, request id: ..." userName=nmanos-cluster-a-ktb-openshift-image-registry-68hlr
2020-12-23T14:38:07.399429728Z time="2020-12-23T14:38:07Z" level=error msg="AWS error creating user" actuator=aws cr=openshift-cloud-credential-operator/openshift-ingress error="AWS Error: Throttling: Rate exceeded\n\tstatus code: 400, request id: ..." userName=nmanos-cluster-a-ktb-openshift-ingress-g98qq
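
The retries above land roughly every eight seconds with no growth in the interval. For illustration only (this is not the change that eventually landed, see comment 5 and the linked PR): controllers built on client-go's workqueue normally get per-item exponential backoff for free when a sync returns an error. A minimal, self-contained Go sketch of that rate limiter, with hypothetical delay values:

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Hypothetical values for illustration: 5s base delay, capped at 10 minutes.
	// Each consecutive failure for the same item waits twice as long as the last.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Second, 10*time.Minute)

	item := "openshift-cloud-credential-operator/openshift-image-registry"
	for i := 1; i <= 6; i++ {
		fmt.Printf("failure %d: retry after %v\n", i, limiter.When(item))
	}

	// A successful sync would reset the backoff for that item.
	limiter.Forget(item)
}

As comment 5 below explains, the loop actually observed here came from status updates retriggering reconciles, which sidesteps this error backoff entirely.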

Comment 5 Devan Goodwin 2021-05-17 13:59:09 UTC
A must-gather would have helped here to see the CredentialsRequests, but looking at the code, this appears to be a pretty standard infinite loop. We're putting an AWS error into a condition, checking if status has changed, and updating if so. The AWS error has a request ID that changes every time, so we update status, reconcile again, update status, and so on, which prevents our normal error backoff from kicking in.

Discussing with the team how to solve this, as we hit it often when we try to give users good error messages from the cloud provider.
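
To make that concrete: because the only part of the message that changes from retry to retry is the request ID, scrubbing it means consecutive failures produce identical condition messages, the "has status changed?" check stops firing, and the normal error backoff can kick in. A rough, hypothetical sketch of the idea behind the ErrorScrub utility in the linked PR (not the operator's actual code), assuming the AWS SDK error string format shown in the logs above:

package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// requestIDRe matches the ", request id: <id>" suffix AWS appends to error
// messages. Hypothetical sketch only, not the operator's actual ErrorScrub code.
var requestIDRe = regexp.MustCompile(`,? request id: [0-9a-fA-F-]+`)

// errorScrub drops the volatile request ID and flattens the newline/tab
// formatting so the same underlying failure always yields the same message,
// keeping the CredentialsRequest condition stable across retries.
func errorScrub(err error) string {
	msg := requestIDRe.ReplaceAllString(err.Error(), "")
	msg = strings.ReplaceAll(msg, "\n\t", ", ")
	return strings.TrimSpace(msg)
}

func main() {
	// Example request id made up for illustration.
	err := errors.New("AWS Error: Throttling: Rate exceeded\n\tstatus code: 400, request id: 0123abcd-0000-0000-0000-000000000000")
	fmt.Println(errorScrub(err))
	// Prints: AWS Error: Throttling: Rate exceeded, status code: 400
}

The scrubbed form matches the shape of the condition message shown in the verification output in comment 9 ("..., status code: 403" with no request id).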

Comment 9 wang lin 2021-05-24 03:01:50 UTC
Verified on 4.8.0-0.nightly-2021-05-19-123944


Test steps:
1. Change the cloudcredential mode to Mint
2. Modify the root secret aws-creds in the kube-system namespace to an invalid one
3. Remove one component secret, such as the installer-cloud-credentials secret in the openshift-image-registry namespace
4. Check the status of the credentialsrequest openshift-image-registry


For 4.8.0-fc.3 without this fix, the message of the CredentialsProvisionFailure condition contains a "request id", which triggers the status update loop:
$ oc get credentialsrequest openshift-image-registry -o json | jq -r ".status"
{
  "conditions": [
    {
      "lastProbeTime": "2021-05-21T11:14:07Z",
      "lastTransitionTime": "2021-05-21T10:50:31Z",
      "message": "failed to grant creds: error syncing creds in mint-mode: AWS Error: SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.\n\tstatus code: 403, request id: b46a1197-178e-4c2f-9916-5034e17546e6",
      "reason": "CredentialsProvisionFailure",
      "status": "True",
      "type": "CredentialsProvisionFailure"
    }
  ],
  "lastSyncCloudCredsSecretResourceVersion": "2511",
  "lastSyncGeneration": 1,
  "lastSyncTimestamp": "2021-05-21T09:28:18Z",
  "providerStatus": {
    "apiVersion": "cloudcredential.openshift.io/v1",
    "kind": "AWSProviderStatus",
    "policy": "lwanaws0521-3-4k597-openshift-image-registry-bfvnc-policy",
    "user": "lwanaws0521-3-4k597-openshift-image-registry-bfvnc"
  },
  "provisioned": false
}


For payload 4.8.0-0.nightly-2021-05-19-123944 with this fix, the request id is removed:
$ oc get credentialsrequest openshift-image-registry -o json | jq -r ".status"
{
  "conditions": [
    {
      "lastProbeTime": "2021-05-21T11:16:08Z",
      "lastTransitionTime": "2021-05-21T08:35:10Z",
      "message": "failed to grant creds: error syncing creds in mint-mode: AWS Error: SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details., status code: 403",
      "reason": "CredentialsProvisionFailure",
      "status": "True",
      "type": "CredentialsProvisionFailure"
    }
  ],
  "lastSyncCloudCredsSecretResourceVersion": "2476",
  "lastSyncGeneration": 1,
  "lastSyncTimestamp": "2021-05-21T07:11:28Z",
  "providerStatus": {
    "apiVersion": "cloudcredential.openshift.io/v1",
    "kind": "AWSProviderStatus",
    "policy": "lwanaws0521-2-t7fhj-openshift-image-registry-cnj76-policy",
    "user": "lwanaws0521-2-t7fhj-openshift-image-registry-cnj76"
  },
  "provisioned": false
}

Comment 12 errata-xmlrpc 2021-07-27 22:35:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

