Bug 1779390

Summary: failed to grant creds: error syncing creds in mint-mode: secrets "installer-cloud-credentials" already exists
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Cloud Credential Operator
Assignee: Joel Diaz <jdiaz>
Status: CLOSED ERRATA
QA Contact: Johnny Liu <jialiu>
Severity: low
Docs Contact:
Priority: unspecified
Version: 4.4
CC: jdiaz, jialiu
Target Milestone: ---
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Some CredentialsRequests would take a long time to be re-reconciled even though they had error conditions set.
Consequence: The retry interval between attempts was long.
Fix: Always reconcile a CredentialsRequest that has conditions set on it, or that is marked as not yet provisioned.
Result: CredentialsRequests with conditions set are retried for full reconciliation at a more regular rate.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-05-04 11:18:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description W. Trevor King 2019-12-03 21:46:16 UTC
In CI [1]:

level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-wild7jfc-d89b3.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
level=error msg="Cluster operator cloud-credential Degraded is True with CredentialsFailing: 1 of 4 credentials requests are failing to sync."
level=info msg="Cluster operator cloud-credential Progressing is True with Reconciling: 3 of 4 credentials requests provisioned, 1 reporting errors."
level=info msg="Cluster operator image-registry Available is False with DeploymentNotFound: The deployment does not exist"
level=info msg="Cluster operator image-registry Progressing is True with Error: Unable to apply resources: unable to sync storage configuration: InvalidAccessKeyId: The AWS Access Key Id you provided does not exist in our records.\n\tstatus code: 403, request id: 2FBC89FCFA529CC8, host id: jM0bV0q16Z7Xn4OE3iMO73jmh4ETbFQF8hCCQnDtz9PWVrBtosEzvNvOFk7xQaW161LV2yIaUzo="
level=info msg="Cluster operator insights Disabled is False with : "
level=fatal msg="failed to initialize the cluster: Cluster operator image-registry is still updating"

apparently due to a failure to create the registry creds:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-sdn-multitenant-4.3/148/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ocp4-3-2019-11-22-122829-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-cloud-credential-operator/cloudcredential.openshift.io/credentialsrequests/openshift-image-registry.yaml | yaml2json | jq -r '{name: .metadata.name, status}' | json2yaml 
name: openshift-image-registry
status:
  conditions:
  - lastProbeTime: '2019-12-03T11:50:53Z'
    lastTransitionTime: '2019-12-03T11:50:53Z'
    message: 'failed to grant creds: error syncing creds in mint-mode: secrets "installer-cloud-credentials" already exists'
    reason: CredentialsProvisionFailure
    status: 'True'
    type: CredentialsProvisionFailure
  lastSyncGeneration: '1'
  lastSyncTimestamp: '2019-12-03T11:50:53Z'
  providerStatus:
    apiVersion: cloudcredential.openshift.io/v1
    kind: AWSProviderStatus
    policy: ci-op-wild7jfc-d89b3-openshift-image-registry-kg65h-policy
    user: ci-op-wild7jfc-d89b3-openshift-image-registry-kg65h
  provisioned: 'false'

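On a live cluster the same fields can be checked directly with oc instead of fetching the must-gather artifact (a sketch assuming the usual resource name and namespace; not part of the original report):

$ oc -n openshift-cloud-credential-operator get credentialsrequest openshift-image-registry -o yaml
# inspect status.provisioned and status.conditions in the output
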
Failure seems rare, with 10 jobs hitting this over the past 24h [3].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-sdn-multitenant-4.3/148
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-sdn-multitenant-4.3/148/artifacts/e2e-aws/must-gather/registry-svc-ci-openshift-org-ocp-4-3-2019-11-22-122829-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-cloud-credential-operator/cloudcredential.openshift.io/credentialsrequests/openshift-image-registry.yaml
[3]: https://search.svc.ci.openshift.org/chart?search=CredentialsFailing:%20.*credentials%20requests%20are%20failing%20to%20sync

Comment 2 Devan Goodwin 2020-01-08 19:19:53 UTC
Spotted by Clayton again in: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_node_exporter/48/pull-ci-openshift-node_exporter-master-e2e-aws/45

Appears to be about a 1% failure rate: https://search.svc.ci.openshift.org/chart?search=CredentialsFailing:%20.*credentials%20requests%20are%20failing%20to%20sync

From the cco logs: 

time="2020-01-08T17:55:34Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-01-08T17:55:34Z" level=debug msg="found secret namespace" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry secret=openshift-image-registry/installer-cloud-credentials
time="2020-01-08T17:55:34Z" level=debug msg="lastsyncgeneration is current and lastsynctimestamp was less than an hour ago, so no need to sync" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry secret=openshift-image-registry/installer-cloud-credentials
time="2020-01-08T17:55:34Z" level=debug msg="syncing cluster operator status" controller=credreq_status
time="2020-01-08T17:55:34Z" level=debug msg="4 cred requests" controller=credreq_status
time="2020-01-08T17:55:34Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message="1 of 4 credentials requests are failing to sync." reason=CredentialsFailing status=True type=Degraded
time="2020-01-08T17:55:34Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message="3 of 4 credentials requests provisioned, 1 reporting errors." reason=Reconciling status=True type=Progressing

This appears to be the broken cred, but it's not reconciling due to the skip logic above. The CredentialsRequest itself does show a failing condition:

        {
            "apiVersion": "cloudcredential.openshift.io/v1",
            "kind": "CredentialsRequest",
            "metadata": {
                "creationTimestamp": "2020-01-08T17:39:34Z",
                "finalizers": [
                    "cloudcredential.openshift.io/deprovision"
                ],
                "generation": 1,
                "labels": {
                    "controller-tools.k8s.io": "1.0"
                },
                "name": "openshift-image-registry",
                "namespace": "openshift-cloud-credential-operator",
                "resourceVersion": "1526",
                "selfLink": "/apis/cloudcredential.openshift.io/v1/namespaces/openshift-cloud-credential-operator/credentialsrequests/openshift-image-registry",
                "uid": "726e65ec-3577-42d7-acd8-f2427a45422a"
            },
            "spec": {
                "providerSpec": {
                    "apiVersion": "cloudcredential.openshift.io/v1",
                    "kind": "AWSProviderSpec",
                    "statementEntries": [
                        {
                            "action": [
                                "s3:CreateBucket",
                                "s3:DeleteBucket",
                                "s3:PutBucketTagging",
                                "s3:GetBucketTagging",
                                "s3:PutBucketPublicAccessBlock",
                                "s3:GetBucketPublicAccessBlock",
                                "s3:PutEncryptionConfiguration",
                                "s3:GetEncryptionConfiguration",
                                "s3:PutLifecycleConfiguration",
                                "s3:GetLifecycleConfiguration",
                                "s3:GetBucketLocation",
                                "s3:ListBucket",
                                "s3:HeadBucket",
                                "s3:GetObject",
                                "s3:PutObject",
                                "s3:DeleteObject",
                                "s3:ListBucketMultipartUploads",
                                "s3:AbortMultipartUpload"
                            ],
                            "effect": "Allow",
                            "resource": "*"
                        }
                    ]
                },
                "secretRef": {
                    "name": "installer-cloud-credentials",
                    "namespace": "openshift-image-registry"
                }
            },
            "status": {
                "conditions": [
                    {
                        "lastProbeTime": "2020-01-08T17:39:35Z",
                        "lastTransitionTime": "2020-01-08T17:39:35Z",
                        "message": "failed to grant creds: error syncing creds in mint-mode: secrets \"installer-cloud-credentials\" already exists",
                        "reason": "CredentialsProvisionFailure",
                        "status": "True",
                        "type": "CredentialsProvisionFailure"
                    }
                ],
                "lastSyncGeneration": 1,
                "lastSyncTimestamp": "2020-01-08T17:39:35Z",
                "providerStatus": {
                    "apiVersion": "cloudcredential.openshift.io/v1",
                    "kind": "AWSProviderStatus",
                    "policy": "ci-op-jr96gzdy-37df0-openshift-image-registry-fnf4r-policy",
                    "user": "ci-op-jr96gzdy-37df0-openshift-image-registry-fnf4r"
                },
                "provisioned": false
            }
        },


And yet we're not reconciling. It feels like a weird race kicks this off, compounded by bad logic in how we set and interpret the last sync generation and timestamp in the status.
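
For illustration, here is a minimal Go sketch (hypothetical type and function names, not the actual cloud-credential-operator code) of the decision the eventual fix changes, per the Doc Text above: a CredentialsRequest with any condition set, or not yet provisioned, is always re-synced instead of being skipped by the "generation is current and last sync was less than an hour ago" shortcut seen in the log above.

package main

import (
	"fmt"
	"time"
)

// credRequestStatus mirrors the CredentialsRequest.status fields used here.
type credRequestStatus struct {
	Provisioned        bool
	Conditions         []string // non-empty when e.g. CredentialsProvisionFailure is set
	LastSyncGeneration int64
	LastSyncTimestamp  time.Time
}

// needsSync sketches the fixed decision: error conditions or an unprovisioned
// request force a re-sync before the generation/timestamp shortcut is applied.
func needsSync(generation int64, status credRequestStatus) bool {
	if len(status.Conditions) > 0 || !status.Provisioned {
		return true // the pre-fix logic skipped this case, leaving the error stuck
	}
	if status.LastSyncGeneration != generation {
		return true
	}
	return time.Since(status.LastSyncTimestamp) > time.Hour
}

func main() {
	// The failing openshift-image-registry request from this comment:
	// generation 1 == lastSyncGeneration, synced minutes ago, but a
	// CredentialsProvisionFailure condition is set and provisioned is false.
	st := credRequestStatus{
		Provisioned:        false,
		Conditions:         []string{"CredentialsProvisionFailure"},
		LastSyncGeneration: 1,
		LastSyncTimestamp:  time.Now().Add(-15 * time.Minute),
	}
	fmt.Println("needsSync:", needsSync(1, st)) // true with the fix; skipped before
}

Run against the failing status above, the pre-fix shortcut would skip the request for up to an hour, while this version re-syncs it immediately.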

Comment 5 Johnny Liu 2020-01-17 08:43:38 UTC
From the description, this issue is not easy to reproduce.

I checked recent CI failure jobs and did not find a similar issue, so here I only did some testing to make sure there is no regression.

Verified with 4.4.0-0.nightly-2020-01-17-003449; installation completed successfully.

If anyone can provide more detailed steps for validation, please let me know.

Comment 7 errata-xmlrpc 2020-05-04 11:18:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581