1960176 – CCO should recreate a user for the component when it was removed from the cloud providers

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Credential Operator
Sub Component:
Version:	4.8
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Akhil Rane
QA Contact:	wang lin
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-05-13 08:44 UTC by wang lin
Modified:	2021-07-27 23:08 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: User is deleted from the cloud provider. Consequence: User is never recreated. Fix: Result: CCO ensures that the user exists and recreates it if deleted.
Clone Of:
Environment:
Last Closed:	2021-07-27 23:08:19 UTC
Target Upstream Version:
Embargoed:
Flags:	lwan: needinfo-

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cloud-credential-operator pull 345	None	open	Bug 1960176: Recreate user when deleted in aws cloud provider	2021-05-20 17:33:56 UTC
Github	openshift cloud-credential-operator pull 349	None	open	Bug 1960176: Make sure credentials have newly generated azure client secret	2021-06-02 22:38:17 UTC
Red Hat Product Errata	RHSA-2021:2438	None	None	None	2021-07-27 23:08:35 UTC

Description wang lin 2021-05-13 08:44:00 UTC

Description of problem:
When CCO is in Mint mode, it creates credentials (users) in the cloud provider for components to use. If a user is removed from the cloud provider for any reason, such as by accident, CCO should recreate a credential for the component. Currently CCO only logs the error messages below, and the operator then goes into Degraded status.
###log from cco pod
time="2021-05-11T03:23:32Z" level=error msg="error getting user: {\n\n}" actuator=aws cr=openshift-cloud-credential-operator/openshift-machine-api-aws error="NoSuchEntity: The user with name lwanaws510-rmltf-openshift-machine-api-aws-m6nz5 cannot be found.\n\tstatus code: 404, request id: 4f700b60-9ef4-4b9a-9095-3d5eca81ac3d"
time="2021-05-11T03:23:32Z" level=error msg="error determining whether a credentials update is needed" actuator=aws cr=openshift-cloud-credential-operator/openshift-machine-api-aws error="unable to read info for username {\n\n}: NoSuchEntity: The user with name lwanaws510-rmltf-openshift-machine-api-aws-m6nz5 cannot be found.\n\tstatus code: 404, request id: 4f700b60-9ef4-4b9a-9095-3d5eca81ac3d"
time="2021-05-11T03:23:32Z" level=error msg="error syncing credentials: error determining whether a credentials update is needed" controller=credreq cr=openshift-cloud-credential-operator/openshift-machine-api-aws secret=openshift-machine-api/aws-cloud-credentials

Version:
4.8.0-fc.3-x86_64

Steps to Reproduce:
1. Install a cluster with CCO in Mint mode.
2. Manually delete a user from the cloud provider.
3. Wait for the next CCO reconcile (about 1 hour), then check whether CCO recreates a user for the component.

Actual Results:
CCO does not recreate a user for the component, and the operator becomes Degraded.

Expected Results:
CCO detects that the user doesn't exist and recreates a user for the component.
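
The expected behavior can be sketched as a reconcile step that treats a missing user as a signal to recreate it rather than as a fatal error. This is only an illustration in Python, not CCO's actual Go implementation; `FakeCloud`, `UserNotFound`, and `ensure_user` are hypothetical names, and the in-memory client stands in for a real cloud provider API.

```python
class UserNotFound(Exception):
    """Hypothetical stand-in for a provider 'not found' error such as AWS NoSuchEntity."""


class FakeCloud:
    """In-memory stand-in for a cloud provider's identity API (an assumption, not a real SDK)."""

    def __init__(self):
        self.users = set()

    def get_user(self, name):
        if name not in self.users:
            raise UserNotFound(name)
        return name

    def create_user(self, name):
        self.users.add(name)
        return name


def ensure_user(cloud, name):
    """Return (user, recreated); recreate the user when the provider reports it missing."""
    try:
        return cloud.get_user(name), False
    except UserNotFound:
        # The user was deleted out of band: recreate it instead of failing the sync.
        return cloud.create_user(name), True
```

Only the not-found case triggers recreation in this sketch; any other error raised by `get_user` still propagates to the caller.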

Comment 2 wang lin 2021-05-21 06:46:37 UTC

I also don't have permission to delete the SP in the Azure cluster; I will research which permission it needs.

But I tested on GCP and it didn't work: I removed the service account for openshift-image-registry-gcs, and after an hour I checked the CCO logs and saw the messages below:

###
time="2021-05-21T06:27:15Z" level=error msg="error syncing credentials: error determining whether a credentials update is needed" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry-gcs secret=openshift-image-registry/installer-cloud-credentials
time="2021-05-21T06:27:15Z" level=error msg="errored with condition: CredentialsProvisionFailure" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry-gcs secret=openshift-image-registry/installer-cloud-credentials
time="2021-05-21T06:27:47Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry-gcs
time="2021-05-21T06:27:48Z" level=error msg="error checking whether service account keys exists" actuator=gcp cr=openshift-cloud-credential-operator/openshift-image-registry-gcs error="error getting list of keys for service account: rpc error: code = NotFound desc = Account deleted: 642429426269"
###


It shows that the service account was not found, and cloud-credential is degraded.
$ oc get co cloud-credential
NAME               VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
cloud-credential   4.8.0-0.nightly-2021-05-19-123944   True        True          True       98m

I tested on payload: 4.8.0-0.nightly-2021-05-19-123944

Comment 3 wang lin 2021-05-21 11:25:09 UTC

Hi, Akhil

I now know how to delete the service principal. I log in to az with the root service principal that I used to create the cluster.
$ az login --service-principal -u http://lwan-installer -p XXXXXXXXXXXX --tenant 6047c7e9-b2ad-488d-XXXXXXXX

Then I can delete the component's service principal using:
$ az ad sp delete --id clientid

But I found that when CCO reconciles the CredentialsRequest, it doesn't create a new SP, and it doesn't even know the SP is gone. Everything looks fine in the CCO logs and the cloud-credential operator status, but in fact the SP no longer exists in Azure.

Comment 4 Akhil Rane 2021-05-24 13:26:23 UTC

Hi Lin,

You were right about both GCP and Azure. I have updated the PR with a fix for both of them and am waiting for it to be reviewed. Thanks for testing it for me.

Comment 6 wang lin 2021-05-31 06:24:29 UTC

Verified on 4.8.0-0.nightly-2021-05-29-114625 

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-29-114625   True        False         3h24m   Cluster version is 4.8.0-0.nightly-2021-05-29-114625

Tested on AWS/GCP/Azure: after I deleted the user/SP/SA, CCO recreated a new one during its regular reconcile of the CredentialsRequests.

But @arane, I found a minor issue with a warning message on GCP. Should it be s/AWS/GCP/g in
https://github.com/openshift/cloud-credential-operator/blob/master/pkg/gcp/actuator/actuator.go#L564 ?

Comment 7 wang lin 2021-05-31 11:36:23 UTC

Akhil, there is another issue. On Azure, CCO does recreate the service principal when it detects that the SP doesn't exist, but when it updates the secret, it only updates the azure_client_id field; the azure_client_secret field is not updated. We can see this from the image-registry operator: after CCO recreates the SP and updates the secret, image-registry still hits the issue below.

##image registry clusteroperator status:
"message": "Progressing: Unable to apply resources: unable to sync storage configuration: failed to get keys for the storage account imageregistrylwanaztpfzp: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/lwanaz0531-3-qbxz4-rg/providers/Microsoft.Storage/storageAccounts/imageregistrylwanaztpfzp/listKeys?api-version=2019-04-01: StatusCode=401 -- Original Error: adal: Refresh request failed. Status Code = '401'. Response body: {\"error\":\"invalid_client\",\"error_description\":\"AADSTS7000215: Invalid client secret is provided.\\r\\nTrace ID: f2c7e302-069a-411b-91ee-78a54d24c102\\r\\nCorrelation ID: 4fe025d6-f9d0-411d-aa22-b22cb72bb38b\\r\\nTimestamp: 2021-05-31 11:23:03Z\",\"error_codes\":[7000215],\"timestamp\":\"2021-05-31 11:23:03Z\",\"trace_id\":\"f2c7e302-069a-411b-91ee-78a54d24c102\",\"correlation_id\":\"4fe025d6-f9d0-411d-aa22-b22cb72bb38b\",\"error_uri\":\"https://login.microsoftonline.com/error?code=7000215\"}",
      "reason": "Error",
      "status": "True",
      "type": "Progressing"
    },



Moving this one to failedQA.
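
The failure mode above can be illustrated with a small Python sketch, assuming the secret is modeled as a plain dict. When a service principal is recreated, the client ID and the client secret have to be written together; otherwise the new ID is paired with the stale secret and authentication fails, as in the "Invalid client secret" error. `update_azure_secret` is a hypothetical helper for illustration, not a CCO function.

```python
def update_azure_secret(secret_data, client_id, client_secret):
    """Return a copy of secret_data with both credential fields replaced together."""
    updated = dict(secret_data)  # leave the caller's mapping untouched
    updated["azure_client_id"] = client_id
    # Writing only the ID above and skipping this line reproduces the reported bug.
    updated["azure_client_secret"] = client_secret
    return updated
```

All other keys (region, resource group, and so on) are carried over unchanged.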

Comment 8 Akhil Rane 2021-06-02 22:36:28 UTC

Thanks for reporting the above issues, @lwan. I have opened a new PR to fix them: https://github.com/openshift/cloud-credential-operator/pull/349

Comment 10 wang lin 2021-06-07 03:47:21 UTC

Verified on 4.8.0-0.nightly-2021-06-06-164529.

Test steps:
1. Check the image-registry secret before deleting the service principal.

$ oc get secret installer-cloud-credentials -n openshift-image-registry -o json | jq -r .data
{
  "azure_client_id": "MGNlMzBmZDItYWRjZi00MDQ1LThjYWItMjk0MDZhMWIxOTFm",
  "azure_client_secret": "YjY5Y2UzNTUtZDcyNi00NjI1LWI0NjctMGQ1YWI1OTljZmUz",
  "azure_region": "XXXXXX",
  "azure_resource_prefix": "XXXXXX",
  "azure_resourcegroup": "XXXXXX",
  "azure_subscription_id": "XXXXXX",
  "azure_tenant_id": "XXXXXX"
}


2. Remove image-registry's SP, then check that co image-registry is in an unhealthy status.
$ oc get co image-registry
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
image-registry   4.8.0-0.nightly-2021-06-06-164529   True        True          False      43m


3. Wait for the next CCO reconcile (about an hour), then check the image-registry secret again. CCO recreates the SP and updates both the azure_client_id and azure_client_secret fields of the secret.
$ oc get secret installer-cloud-credentials -n openshift-image-registry -o json | jq -r .data
{
  "azure_client_id": "MmEwMjMwNmUtNGFhZi00ODhmLWJhOWUtNTVlNmVkYTNjYTk2",
  "azure_client_secret": "N2M2OTUyYjUtNDM3OC00YWVjLWE3ZTMtZGQxYmZiMzI1OTVk",
  "azure_region": "XXXXXX",
  "azure_resource_prefix": "XXXXXX",
  "azure_resourcegroup": "XXXXXX",
  "azure_subscription_id": "XXXXXX",
  "azure_tenant_id": "XXXXXX"
}

4. Check that co image-registry is back to normal status.
$ oc get co image-registry
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
image-registry   4.8.0-0.nightly-2021-06-06-164529   True        False         False      86m
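
The before/after comparison in steps 1 and 3 can also be done programmatically. The following Python sketch takes the `.data` mappings printed by `jq` and reports which fields changed; the helper names `decode_secret_data` and `changed_fields` are illustrative assumptions. Only the two credential fields are included, because the redacted `XXXXXX` placeholders are not valid base64.

```python
import base64


def decode_secret_data(data):
    """Base64-decode every value of a Kubernetes Secret's .data mapping."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


def changed_fields(before, after):
    """Return the sorted keys whose values differ between two .data mappings."""
    return sorted(key for key in before.keys() | after.keys() if before.get(key) != after.get(key))


# Values copied from the two `oc get secret` outputs above.
before = {
    "azure_client_id": "MGNlMzBmZDItYWRjZi00MDQ1LThjYWItMjk0MDZhMWIxOTFm",
    "azure_client_secret": "YjY5Y2UzNTUtZDcyNi00NjI1LWI0NjctMGQ1YWI1OTljZmUz",
}
after = {
    "azure_client_id": "MmEwMjMwNmUtNGFhZi00ODhmLWJhOWUtNTVlNmVkYTNjYTk2",
    "azure_client_secret": "N2M2OTUyYjUtNDM3OC00YWVjLWE3ZTMtZGQxYmZiMzI1OTVk",
}
print(changed_fields(before, after))
```

For the values above, both `azure_client_id` and `azure_client_secret` are reported as changed, which is the result the fix is expected to produce.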

Comment 13 errata-xmlrpc 2021-07-27 23:08:19 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
