Description of problem:
When cco is in Mint mode, cco creates credentials (users) in the cloud provider for components to use. If such a user is removed from the cloud provider for any reason (for example, by accident), cco should recreate the credential for the component. Currently it only logs the error below, and the operator goes Degraded.

### log from cco pod
time="2021-05-11T03:23:32Z" level=error msg="error getting user: {\n\n}" actuator=aws cr=openshift-cloud-credential-operator/openshift-machine-api-aws error="NoSuchEntity: The user with name lwanaws510-rmltf-openshift-machine-api-aws-m6nz5 cannot be found.\n\tstatus code: 404, request id: 4f700b60-9ef4-4b9a-9095-3d5eca81ac3d"
time="2021-05-11T03:23:32Z" level=error msg="error determining whether a credentials update is needed" actuator=aws cr=openshift-cloud-credential-operator/openshift-machine-api-aws error="unable to read info for username {\n\n}: NoSuchEntity: The user with name lwanaws510-rmltf-openshift-machine-api-aws-m6nz5 cannot be found.\n\tstatus code: 404, request id: 4f700b60-9ef4-4b9a-9095-3d5eca81ac3d"
time="2021-05-11T03:23:32Z" level=error msg="error syncing credentials: error determining whether a credentials update is needed" controller=credreq cr=openshift-cloud-credential-operator/openshift-machine-api-aws secret=openshift-machine-api/aws-cloud-credentials

Version: 4.8.0-fc.3-x86_64

Steps to Reproduce:
1. Install a cluster with cco in Mint mode
2. Delete a user from the cloud provider manually
3. Wait for cco's next reconcile (about 1 hour) and check whether cco recreates a user for the component

Actual results:
cco does not recreate a user for the component, and the operator goes Degraded.

Expected results:
cco detects that the user no longer exists and recreates a new user for the component.
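For the AWS case, step 2 can be sketched as below. This is only a sketch: the user name comes from the cco logs above (substitute the one minted in your cluster), and IAM refuses to delete a user that still has access keys or inline policies, so those are removed first.

```shell
# Delete a cco-minted IAM user so the next reconcile has to recreate it.
# Assumes the AWS CLI is configured with credentials allowed to manage IAM.
delete_minted_user() {
  user="$1"
  # IAM requires access keys to be deleted before the user can be removed.
  for key in $(aws iam list-access-keys --user-name "$user" \
                 --query 'AccessKeyMetadata[].AccessKeyId' --output text); do
    aws iam delete-access-key --user-name "$user" --access-key-id "$key"
  done
  # Likewise for any inline policies attached to the user.
  for pol in $(aws iam list-user-policies --user-name "$user" \
                 --query 'PolicyNames[]' --output text); do
    aws iam delete-user-policy --user-name "$user" --policy-name "$pol"
  done
  aws iam delete-user --user-name "$user"
}

# Example user name taken from the logs above:
# delete_minted_user "lwanaws510-rmltf-openshift-machine-api-aws-m6nz5"
```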
I don't have permission to delete the SP in the Azure cluster either; I will research which permission it needs. But I tested on GCP, and it didn't work: I removed the service account for openshift-image-registry-gcs, and after an hour I checked the cco logs and saw the messages below:

###
time="2021-05-21T06:27:15Z" level=error msg="error syncing credentials: error determining whether a credentials update is needed" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry-gcs secret=openshift-image-registry/installer-cloud-credentials
time="2021-05-21T06:27:15Z" level=error msg="errored with condition: CredentialsProvisionFailure" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry-gcs secret=openshift-image-registry/installer-cloud-credentials
time="2021-05-21T06:27:47Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry-gcs
time="2021-05-21T06:27:48Z" level=error msg="error checking whether service account keys exists" actuator=gcp cr=openshift-cloud-credential-operator/openshift-image-registry-gcs error="error getting list of keys for service account: rpc error: code = NotFound desc = Account deleted: 642429426269"
###

It shows the service account is not found, and cloud-credential is Degraded.

$ oc get co cloud-credential
NAME               VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
cloud-credential   4.8.0-0.nightly-2021-05-19-123944   True        True          True       98m

Tested on payload 4.8.0-0.nightly-2021-05-19-123944.
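The GCP step of removing the minted service account can be sketched as below. The account email is an assumption (list first and pick the one matching the credentialsrequest); `gcloud` must be authenticated against the cluster's project.

```shell
# Delete a cco-minted GCP service account so cco has to recreate it.
delete_minted_sa() {
  sa_email="$1"
  # --quiet skips the interactive confirmation prompt.
  gcloud iam service-accounts delete "$sa_email" --quiet
}

# Find the minted account first, then delete it (placeholders, not real values):
# gcloud iam service-accounts list
# delete_minted_sa "<minted-sa>@<project-id>.iam.gserviceaccount.com"
```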
Hi Akhil,
I know how to delete the service principal. I log in to az using the root service principal I used to create the cluster:

$ az login --service-principal -u http://lwan-installer -p XXXXXXXXXXXX --tenant 6047c7e9-b2ad-488d-XXXXXXXX

Then I can delete the component's service principal:

$ az ad sp delete --id <clientid>

But I find that when cco reconciles the credentialsrequest, it doesn't create a new SP; it doesn't even notice the SP is gone. Everything looks fine in the cco logs and the cloud-credential operator status, but in fact the SPs are no longer in Azure.
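The client id to pass to `az ad sp delete --id` can be read from the component's secret, where it is stored base64-encoded. A small sketch (the secret name and namespace are the image-registry ones; adjust for other components):

```shell
# Read the minted SP's client id out of the component secret.
minted_client_id() {
  oc get secret installer-cloud-credentials -n openshift-image-registry \
    -o jsonpath='{.data.azure_client_id}' | base64 -d
}

# Then delete the SP with it:
# az ad sp delete --id "$(minted_client_id)"
```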
Hi Lin,
You were right about both GCP and Azure. I have updated the PR with fixes for both of them. Waiting for it to be reviewed. Thanks for testing it for me.
Verified on 4.8.0-0.nightly-2021-05-29-114625.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-29-114625   True        False         3h24m   Cluster version is 4.8.0-0.nightly-2021-05-29-114625

Tested on AWS/GCP/Azure: after I deleted the user/SP/SA, cco recreated a new one on its next regular reconcile of the credentialsrequests.

But @arane, I found a small prompt issue in a warning message on GCP: should it be s/AWS/GCP/g in https://github.com/openshift/cloud-credential-operator/blob/master/pkg/gcp/actuator/actuator.go#L564 ?
Akhil, there is another issue. On Azure, cco does recreate the service principal when it detects the SP doesn't exist, but when it updates the secret it only updates the azure_client_id field; azure_client_secret is not updated. We can see this from the image-registry operator: after cco recreates the SP and updates the secret, image-registry still hits the issue below.

## image-registry clusteroperator status:
{
    "message": "Progressing: Unable to apply resources: unable to sync storage configuration: failed to get keys for the storage account imageregistrylwanaztpfzp: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/lwanaz0531-3-qbxz4-rg/providers/Microsoft.Storage/storageAccounts/imageregistrylwanaztpfzp/listKeys?api-version=2019-04-01: StatusCode=401 -- Original Error: adal: Refresh request failed. Status Code = '401'. Response body: {\"error\":\"invalid_client\",\"error_description\":\"AADSTS7000215: Invalid client secret is provided.\\r\\nTrace ID: f2c7e302-069a-411b-91ee-78a54d24c102\\r\\nCorrelation ID: 4fe025d6-f9d0-411d-aa22-b22cb72bb38b\\r\\nTimestamp: 2021-05-31 11:23:03Z\",\"error_codes\":[7000215],\"timestamp\":\"2021-05-31 11:23:03Z\",\"trace_id\":\"f2c7e302-069a-411b-91ee-78a54d24c102\",\"correlation_id\":\"4fe025d6-f9d0-411d-aa22-b22cb72bb38b\",\"error_uri\":\"https://login.microsoftonline.com/error?code=7000215\"}",
    "reason": "Error",
    "status": "True",
    "type": "Progressing"
},

Moving this one to FailedQA.
Thanks for reporting the above issues @lwan . I have opened a new PR to fix them https://github.com/openshift/cloud-credential-operator/pull/349
Verified on 4.8.0-0.nightly-2021-06-06-164529.

Test steps:

1. Check the image-registry secret before deleting the service principal:

$ oc get secret installer-cloud-credentials -n openshift-image-registry -o json | jq -r .data
{
  "azure_client_id": "MGNlMzBmZDItYWRjZi00MDQ1LThjYWItMjk0MDZhMWIxOTFm",
  "azure_client_secret": "YjY5Y2UzNTUtZDcyNi00NjI1LWI0NjctMGQ1YWI1OTljZmUz",
  "azure_region": "XXXXXX",
  "azure_resource_prefix": "XXXXXX",
  "azure_resourcegroup": "XXXXXX",
  "azure_subscription_id": "XXXXXX",
  "azure_tenant_id": "XXXXXX"
}

2. Remove image-registry's SP and check that co image-registry is in an unhealthy status:

$ oc get co image-registry
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
image-registry   4.8.0-0.nightly-2021-06-06-164529   True        True          False      43m

3. Wait for cco's next reconcile (about an hour), then check the image-registry secret again; cco recreates a new SP and updates both the azure_client_id and azure_client_secret fields:

$ oc get secret installer-cloud-credentials -n openshift-image-registry -o json | jq -r .data
{
  "azure_client_id": "MmEwMjMwNmUtNGFhZi00ODhmLWJhOWUtNTVlNmVkYTNjYTk2",
  "azure_client_secret": "N2M2OTUyYjUtNDM3OC00YWVjLWE3ZTMtZGQxYmZiMzI1OTVk",
  "azure_region": "XXXXXX",
  "azure_resource_prefix": "XXXXXX",
  "azure_resourcegroup": "XXXXXX",
  "azure_subscription_id": "XXXXXX",
  "azure_tenant_id": "XXXXXX"
}

4. Check that co image-registry is back to a normal status:

$ oc get co image-registry
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
image-registry   4.8.0-0.nightly-2021-06-06-164529   True        False         False      86m
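A quick check on the dumps above: secret data is base64-encoded, so decoding azure_client_id from steps 1 and 3 confirms a different SP was minted after the reconcile.

```shell
# azure_client_id values copied from the secret dumps in steps 1 and 3.
before=MGNlMzBmZDItYWRjZi00MDQ1LThjYWItMjk0MDZhMWIxOTFm
after=MmEwMjMwNmUtNGFhZi00ODhmLWJhOWUtNTVlNmVkYTNjYTk2

# Decode both; the two client ids (SP application ids) differ.
echo "$before" | base64 -d; echo
echo "$after"  | base64 -d; echo
```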
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438