Bug 1794536 - CloudCredentialOperatorProvisioningFailed alert fires when CCO is disabled
Summary: CloudCredentialOperatorProvisioningFailed alert fires when CCO is disabled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Credential Operator
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.4.0
Assignee: Joel Diaz
QA Contact: wang lin
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-01-23 19:25 UTC by Seth Jennings
Modified: 2020-05-04 11:27 UTC
CC: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The cloud credential operator reported conditions on CredentialsRequests even when the CCO had been disabled.
Consequence: Alerts would fire even when the operator was explicitly configured as disabled.
Fix: Do not report conditions when the CCO is set to disabled.
Result: No alerts fire for a component that has been explicitly disabled.
Clone Of:
Environment:
Last Closed: 2020-05-04 11:26:35 UTC
Target Upstream Version:


Attachments: none


Links:
- GitHub openshift/cloud-credential-operator pull 154 (closed): Bug 1794536: don't report conditions when CCO is disabled (last updated 2020-09-02 05:18:45 UTC)
- Red Hat Product Errata RHBA-2020:0581 (last updated 2020-05-04 11:27:08 UTC)

Description Seth Jennings 2020-01-23 19:25:40 UTC
Description of problem:

I disabled the CCO via the configmap

$ oc edit cm -n openshift-cloud-credential-operator cloud-credential-operator-config
data:
  disabled: "true"
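
The same flag can also be set non-interactively; a minimal sketch using oc patch (assuming the default configmap name and namespace shown above):

```shell
# Set the CCO's disabled flag without opening an editor.
oc patch configmap cloud-credential-operator-config \
  -n openshift-cloud-credential-operator \
  --type merge -p '{"data":{"disabled":"true"}}'
```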

Operator is reporting Available=True and Degraded=False

$ oc get clusteroperator cloud-credential
NAME               VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
cloud-credential   4.3.0     True        False         False      3h14m

However the CloudCredentialOperatorProvisioningFailed alert is firing.

The alerting rule is:

cco_credentials_requests_conditions{condition="CredentialsProvisionFailure"} > 0

This alert should only fire when the CCO is enabled.
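
For context, the rule is presumably shipped as part of a PrometheusRule manifest roughly of this shape (only the alert name and expression come from this report; the other fields are illustrative assumptions):

```yaml
- alert: CloudCredentialOperatorProvisioningFailed
  expr: cco_credentials_requests_conditions{condition="CredentialsProvisionFailure"} > 0
  # The duration and severity below are assumptions for illustration.
  for: 5m
  labels:
    severity: warning
```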

Version-Release number of selected component (if applicable):
4.3.0

How reproducible:
always

Steps to Reproduce:
1. Install cluster with kube-system/aws-creds removed
2. Disable the CCO via configmap
3. Observe alert

Actual results:
cco_credentials_requests_conditions{condition="CredentialsProvisionFailure"} = 3
(ingress, image-registry, machine-api)

  status:
    conditions:
    - lastProbeTime: "2020-01-23T15:59:12Z"
      lastTransitionTime: "2020-01-23T15:59:12Z"
      message: 'failed to grant creds: unable to fetch root cloud cred secret: Secret
        "aws-creds" not found'
      reason: CredentialsProvisionFailure
      status: "True"
      type: CredentialsProvisionFailure
    lastSyncGeneration: 0
    provisioned: false

Expected results:

The alert should be conditioned on whether the CCO is enabled.
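
The fix described in the Doc Text gates metric reporting on the disabled flag. A minimal sketch of that gating logic (in Python for illustration; the operator itself is written in Go, and all names here are hypothetical):

```python
def report_conditions(credentials_requests, cco_disabled):
    """Return condition counts to export as metrics.

    When the operator is disabled, report nothing, so that alerts
    such as CloudCredentialOperatorProvisioningFailed cannot fire.
    """
    if cco_disabled:
        return {}
    counts = {}
    for cr in credentials_requests:
        for cond in cr.get("conditions", []):
            # Only count conditions that are currently asserted.
            if cond.get("status") == "True":
                name = cond["type"]
                counts[name] = counts.get(name, 0) + 1
    return counts
```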

Additional info:

Comment 1 wang lin 2020-02-04 02:08:19 UTC
Joel, I have a question for you.
In step 1 of the reproduction, is kube-system/aws-creds removed before installing the cluster or after?
I tried to reproduce it following your steps, but without success; the alert does not fire.
Could you please share more detailed steps?

Comment 2 wang lin 2020-02-04 02:25:32 UTC
Sorry, I asked the wrong person.
Seth, could you please help answer the question above? Thanks in advance.

Comment 3 Seth Jennings 2020-02-04 13:52:48 UTC
The way I did it was to remove the secret from the manifests that the installer uses

https://gist.github.com/sjenning/b22468a02a7fce57a914b09569409ee0#modify-manifests

The key is that there have to be CredentialsRequest CRs that the CCO cannot process because the secret is missing.

Comment 5 wang lin 2020-02-07 11:35:15 UTC
Hello, Seth


Sorry to disturb you again. I understand what you pointed out, but I don't know how to do it. I am a new hire, and I don't know much about this yet.
----------------------------------------------------------------------------------------------------
I tried the following steps:

1. Generate manifests and remove the aws-creds resources
   $ ./openshift-install create manifests
   $ rm openshift/99_cloud-creds-secret.yaml openshift/99_role-cloud-creds-secret-reader.yaml
2. Deploy the cluster
   $ ./openshift-install create cluster

But the cluster failed to deploy.

-----------------------------------------------------------------------------------------------------

So I tried the following steps again:

1. Generate manifests and remove the aws-creds resources
   $ ./openshift-install create manifests
   $ rm openshift/99_cloud-creds-secret.yaml openshift/99_role-cloud-creds-secret-reader.yaml
2. Create secret manifests for ingress, machine-api, and image-registry in the openshift dir
3. Deploy the cluster
   $ ./openshift-install create cluster

This time the cluster deployed successfully. But I don't know which commands I should use to get the status you wrote in the description. I tried $ oc get clusteroperator cloud-credential -o yaml and commands against the openshift-cloud-credential-operator namespace, but I can't get the status you mentioned:

status:
    conditions:
    - lastProbeTime: "2020-01-23T15:59:12Z"
      lastTransitionTime: "2020-01-23T15:59:12Z"
      message: 'failed to grant creds: unable to fetch root cloud cred secret: Secret
        "aws-creds" not found'
      reason: CredentialsProvisionFailure
      status: "True"
      type: CredentialsProvisionFailure
    lastSyncGeneration: 0
    provisioned: false


I can only see the alert through the prometheus-k8s web console. Then I tried to apply a CredentialsRequest CR via $ oc create -f test.yaml, but I still can't get the CredentialsProvisionFailure info.

The yaml file is below:

apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: test1
  namespace: openshift-cloud-credential-operator
spec:
  secretRef:
    name: test1
    namespace: default
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: AWSProviderSpec
    statementEntries:
    - effect: Allow
      action:
      - s3:CreateBucket
      - s3:DeleteBucket
      resource: "*"


-----------------------------------------------------------------------------------------


I don't know if my test steps are right, so I sincerely ask for your help. Thanks in advance.

Comment 6 Joel Diaz 2020-02-07 13:04:59 UTC
Let me suggest an alternative to repro this. Perform a regular cluster installation.

Then add a bad CredentialsRequest (this CredentialsRequest will have the NamespaceMissing condition):

---
apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  name: my-cred-request
  namespace: openshift-cloud-credential-operator
spec:
  secretRef:
    name: my-cred-request-secret
    namespace: not-a-real-namespace
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: AWSProviderSpec
    statementEntries:
    - effect: Allow
      action:
      - s3:CreateBucket
      - s3:DeleteBucket
      resource: "*"

Give it a moment so that Prometheus is showing the alert. Now edit the cloud-credential-operator configmap so that 'disabled' is set to 'true'.

Now the next time cloud-credential-operator calculates metrics (after 2 minutes), and Prometheus scrapes the updated metrics, the NamespaceMissing alert should no longer be firing.
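
The steps above can be sketched as shell commands (assuming the CredentialsRequest manifest above is saved as cr.yaml):

```shell
# Apply the bad CredentialsRequest (saved from the manifest above).
oc create -f cr.yaml

# Wait until Prometheus shows the alert, then disable the CCO.
oc patch configmap cloud-credential-operator-config \
  -n openshift-cloud-credential-operator \
  --type merge -p '{"data":{"disabled":"true"}}'

# After the operator recalculates metrics (~2 minutes) and Prometheus
# scrapes again, the NamespaceMissing alert should stop firing.
```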

Comment 7 wang lin 2020-02-07 13:29:51 UTC
Thanks Joel, it's all clear to me now.

Comment 8 wang lin 2020-02-07 14:15:39 UTC
The bug has been fixed.
Verified with test payload: registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-02-07-001901

Comment 9 Seth Jennings 2020-02-07 14:22:29 UTC
Sounds like you got it figured out.

Comment 11 errata-xmlrpc 2020-05-04 11:26:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

