Bug 1856500 - Bad UX when investigating degraded CCO operator
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Credential Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Devan Goodwin
QA Contact: wang lin
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-07-13 19:09 UTC by Michael Gugino
Modified: 2021-09-13 14:01 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-04 13:03:33 UTC
Target Upstream Version:
Embargoed:




Links
SYSTEM                  ID          PRIVATE   PRIORITY   STATUS   SUMMARY   LAST UPDATED
Red Hat Issue Tracker   HIVE-1311   0         None       None     None      2021-09-13 14:01:24 UTC

Description Michael Gugino 2020-07-13 19:09:58 UTC
Description of problem:

My install was reported as failed due to a degraded CCO.

"ERROR Cluster operator cloud-credential Degraded is True with CredentialsFailing: 2 of 4 credentials requests are failing to sync."

So, I check the operator:

./oc get clusteroperators cloud-credential -oyaml

That pretty much just repeats the same information, though there is a list of related objects which points me in the right direction.

So, I see I have a bunch of objects of kind CredentialsRequest in namespace openshift-cloud-credential-operator.

So, I take a look at those:

./oc get CredentialsRequest -A
NAMESPACE                             NAME                                 AGE
openshift-cloud-credential-operator   cloud-credential-operator-iam-ro     33m
openshift-cloud-credential-operator   openshift-image-registry             45m
openshift-cloud-credential-operator   openshift-image-registry-azure       46m
openshift-cloud-credential-operator   openshift-image-registry-gcs         45m
openshift-cloud-credential-operator   openshift-image-registry-openstack   45m
openshift-cloud-credential-operator   openshift-ingress                    46m
openshift-cloud-credential-operator   openshift-ingress-azure              45m
openshift-cloud-credential-operator   openshift-ingress-gcp                45m
openshift-cloud-credential-operator   openshift-machine-api-aws            46m
openshift-cloud-credential-operator   openshift-machine-api-azure          46m
openshift-cloud-credential-operator   openshift-machine-api-gcp            45m
openshift-cloud-credential-operator   openshift-machine-api-openstack      45m
openshift-cloud-credential-operator   openshift-machine-api-ovirt          45m
openshift-cloud-credential-operator   openshift-machine-api-vsphere        45m
openshift-cloud-credential-operator   openshift-network                    46m

Okay, so this was useless. First, it's telling me that 2/4 are failing. Obviously, I have many more than 4. Maybe more pertinent info is included with -owide.

No, more of the same (I got sidetracked for a while doing what I set out to do as the cluster was working for my purposes):

./oc get CredentialsRequest -n openshift-cloud-credential-operator -owide
NAME                                 AGE
cloud-credential-operator-iam-ro     3h11m
openshift-image-registry             3h23m
openshift-image-registry-azure       3h24m
openshift-image-registry-gcs         3h23m
openshift-image-registry-openstack   3h23m
openshift-ingress                    3h24m
openshift-ingress-azure              3h23m
openshift-ingress-gcp                3h23m
openshift-machine-api-aws            3h24m
openshift-machine-api-azure          3h24m
openshift-machine-api-gcp            3h23m
openshift-machine-api-openstack      3h23m
openshift-machine-api-ovirt          3h23m
openshift-machine-api-vsphere        3h23m
openshift-network                    3h24m


Version-Release number of selected component (if applicable):
Latest 4.5 nightly

$ ./oc version
Client Version: 4.5.0-0.nightly-2020-05-11-123802
Server Version: 4.5.0-0.nightly-2020-05-11-123802
Kubernetes Version: v1.18.0-rc.1

How reproducible:
100%. The root cause was determined to be an exhausted cloud quota for IAM users on AWS (UsersPerAccount).

message: "failed to grant creds: error syncing creds in mint-mode: AWS Error:
        LimitExceeded - LimitExceeded: Cannot exceed quota for UsersPerAccount: 5000\n\tstatus
        code: 409, request id: xxx"


Steps to Reproduce:
1.  Install cluster
2.  Installer fails due to exceeding UsersPerAccount

Actual results:
The operator message contains nothing useful to guide troubleshooting.


Expected results:
Either list the failing CredentialsRequests and their namespace, or include a generic hint in the message, such as "Look at oc get CredentialsRequest -n ... for more info".
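
As a rough illustration of the first option, here is a minimal Python sketch of a message builder that names the failing requests and points at the follow-up command. It is not how the operator builds its message; degraded_message and CCO_NAMESPACE are hypothetical names, and the request names used in the example call are placeholders, since this report does not say which two requests failed.

```python
from typing import List, Tuple

# Namespace taken from the commands in this report.
CCO_NAMESPACE = "openshift-cloud-credential-operator"


def degraded_message(failing: List[Tuple[str, str]], total: int) -> str:
    """Build a Degraded message that names the failing requests.

    failing: (namespace, name) pairs of CredentialsRequests that failed to sync.
    total:   number of requests the operator actually tried to sync.
    """
    names = ", ".join(f"{ns}/{name}" for ns, name in sorted(failing))
    return (
        f"{len(failing)} of {total} credentials requests are failing to sync: "
        f"{names}. Run 'oc get CredentialsRequest -n {CCO_NAMESPACE} -o yaml' "
        "for details."
    )


# Placeholder names for illustration only.
example = degraded_message(
    [(CCO_NAMESPACE, "request-b"), (CCO_NAMESPACE, "request-a")], total=4
)
print(example)
```

Sorting the names keeps the message stable between syncs, so the condition does not flap just because the list order changed.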

Also, the output of './oc get CredentialsRequest -n openshift-cloud-credential-operator' should yield useful information about the status of each request.  Most are just 'ignored' so we should indicate that.  We should indicate success or failure for each item respectively as well.
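
Until the listing shows this itself, a per-request status can be derived from the JSON form of the same command. The following is a sketch under stated assumptions: it expects the output of 'oc get CredentialsRequest -n openshift-cloud-credential-operator -o json', and it assumes each item carries status.provisioned plus status.conditions entries with type, status and message fields. The condition type names in IGNORED and FAILURE_TYPES are assumptions and should be checked against the CCO version in use; classify and summarize are hypothetical helper names.

```python
import json
from typing import Any, Dict, List, Tuple

# Assumed condition type names; verify against the CCO version in use.
IGNORED = "Ignored"
FAILURE_TYPES = {"CredentialsProvisionFailure", "InsufficientCloudCreds"}


def classify(request: Dict[str, Any]) -> Tuple[str, str]:
    """Return (state, message) for one CredentialsRequest object."""
    status = request.get("status", {})
    # Keep only conditions that are currently true, keyed by type.
    active = {
        c.get("type"): c.get("message", "")
        for c in status.get("conditions", [])
        if c.get("status") == "True"
    }
    if IGNORED in active:
        return "Ignored", active[IGNORED]
    for cond_type in sorted(FAILURE_TYPES):
        if cond_type in active:
            return "Failed", active[cond_type]
    if status.get("provisioned"):
        return "Provisioned", ""
    return "Pending", ""


def summarize(doc: str) -> List[Tuple[str, str, str]]:
    """Turn the JSON list output into (name, state, message) rows."""
    items = json.loads(doc).get("items", [])
    return [(r["metadata"]["name"],) + classify(r) for r in items]
```

Passing the command's JSON output to summarize() yields one row per request, which is roughly the extra column the plain listing lacks.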

Additional info:

Comment 1 Devan Goodwin 2020-07-16 14:19:59 UTC
Some great ideas in here, thanks.

Will try to get into next sprint (188).

Comment 2 Greg Sheremeta 2020-08-22 12:00:49 UTC
> Will try to get into next sprint (188).

Didn't make it. Will investigate next sprint.

Comment 3 Devan Goodwin 2020-09-10 12:10:40 UTC
Will try to investigate somewhere in the 4.7 timeframe as this is a good candidate for the focus of that release.

Comment 6 Devan Goodwin 2020-12-04 13:03:33 UTC
This is borderline RFE and is now moved to Jira: https://issues.redhat.com/browse/CO-1311

We like the ideas; we just can't find the time yet.

