Bug 1794536
| Summary: | CloudCredentialOperatorProvisioningFailed alert fires when CCO is disabled | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Seth Jennings <sjenning> |
| Component: | Cloud Credential Operator | Assignee: | Joel Diaz <jdiaz> |
| Status: | CLOSED ERRATA | QA Contact: | wang lin <lwan> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.3.0 | CC: | awestbro, jdiaz, lwan |
| Target Milestone: | --- | ||
| Target Release: | 4.4.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: |
Cause: Cloud credential operator would report on CredentialsRequests with conditions even when the CCO has been disabled.
Consequence: Alerts would show even when the operator has been configured to be disabled.
Fix: Do not report conditions when CCO is set to disabled.
Result: No alerts for a component that has been explicitly disabled.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-05-04 11:26:35 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Joel, I have a question for you. In step 1 of the reproduction steps, is kube-system/aws-creds removed before or after installing the cluster? I tried to reproduce the bug following your steps but could not; the alert did not fire. Could you please give more detailed steps?

Sorry, I asked the wrong person. Seth, could you please answer the question above? Thanks in advance.

The way I did it was to remove the secret from the manifests that the installer uses: https://gist.github.com/sjenning/b22468a02a7fce57a914b09569409ee0#modify-manifests
The key is that there have to be CredentialsRequest CRs that the CCO cannot process because the secret is missing.

Hello, Seth
Sorry to disturb you again. I understand what you pointed out, but I did not know how to carry it out. I am a new hire and do not know much about this yet.
----------------------------------------------------------------------------------------------------
I tried the following steps:
1. Generate manifests and remove the aws-creds resource
$ ./openshift-install create manifests
$ rm openshift/99_cloud-creds-secret.yaml 99_role-cloud-creds-secret-reader.yaml
2. Deploy the cluster
$ ./openshift-install create cluster
But the cluster could not be deployed successfully.
-----------------------------------------------------------------------------------------------------
So I tried again with the following steps:
1. Generate manifests and remove the aws-creds resource
$ ./openshift-install create manifests
$ rm openshift/99_cloud-creds-secret.yaml 99_role-cloud-creds-secret-reader.yaml
2. Create secret files for ingress, machine-api and image-registry in the openshift dir
3. Deploy the cluster
$ ./openshift-install create cluster
This time the cluster deployed successfully. But I don't know which commands to use to get the status you included in the description. I tried $ oc get clusteroperator cloud-credential -o yaml and $ openshift-cloud-credential-operator, but I could not get the status you mentioned:
status:
  conditions:
  - lastProbeTime: "2020-01-23T15:59:12Z"
    lastTransitionTime: "2020-01-23T15:59:12Z"
    message: 'failed to grant creds: unable to fetch root cloud cred secret: Secret
      "aws-creds" not found'
    reason: CredentialsProvisionFailure
    status: "True"
    type: CredentialsProvisionFailure
  lastSyncGeneration: 0
  provisioned: false
I can only see the alert in the prometheus-k8s web console. I then tried to create a CredentialsRequest CR via $ oc create -f test.yaml, but I still could not get the CredentialsProvisionFailure info.
The yaml file is below:
apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: test1
  namespace: openshift-cloud-credential-operator
spec:
  secretRef:
    name: test1
    namespace: default
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: AWSProviderSpec
    statementEntries:
    - effect: Allow
      action:
      - s3:CreateBucket
      - s3:DeleteBucket
      resource: "*"
-----------------------------------------------------------------------------------------
I don't know whether my test steps are right, so I would sincerely like to ask for your help. Thanks in advance.
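The status block quoted in this report (with lastSyncGeneration and provisioned) is the status of a CredentialsRequest object, not of the cloud-credential clusteroperator, which is likely why the clusteroperator output did not show it. As a minimal sketch, assuming the object has already been fetched as JSON (for example with something like `oc get credentialsrequest <name> -n openshift-cloud-credential-operator -o json`) and parsed into a dict, the active conditions could be listed as follows. The helper name `failing_conditions` and the sample object are hypothetical.

```python
from typing import Any, Dict, List


def failing_conditions(cred_request: Dict[str, Any]) -> List[str]:
    """Return the types of all conditions whose status is "True".

    Hypothetical helper; it only inspects an already-parsed object.
    """
    conditions = cred_request.get("status", {}).get("conditions") or []
    return [c["type"] for c in conditions if c.get("status") == "True"]


# Sample object mirroring the status block quoted in this report.
example = {
    "kind": "CredentialsRequest",
    "status": {
        "conditions": [
            {
                "type": "CredentialsProvisionFailure",
                "status": "True",
                "reason": "CredentialsProvisionFailure",
            }
        ],
        "lastSyncGeneration": 0,
        "provisioned": False,
    },
}

print(failing_conditions(example))
```

An object without a status section yields an empty list, so the helper can be run over every CredentialsRequest in the namespace.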
Let me suggest an alternative to repro this. Perform a regular cluster installation.
Then add a bad CredentialsRequest (this CredentialsRequest will have the NamespaceMissing condition):
---
apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  name: my-cred-request
  namespace: openshift-cloud-credential-operator
spec:
  secretRef:
    name: my-cred-request-secret
    namespace: not-a-real-namespace
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: AWSProviderSpec
    statementEntries:
    - effect: Allow
      action:
      - s3:CreateBucket
      - s3:DeleteBucket
      resource: "*"
Give it a moment until Prometheus shows the alert. Now edit the cloud-credential-operator configmap so that 'disabled' is set to 'true'.
The next time cloud-credential-operator calculates metrics (after 2 minutes) and Prometheus scrapes the updated metrics, the NamespaceMissing alert should no longer be firing.
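The behaviour described here can be illustrated with a short Python sketch. It is not the operator's actual code (the operator is written in Go); the function names, the dict layout, and the choice to return an empty result when disabled are all assumptions made for illustration. It shows only the idea from the Doc Text: when the configmap has disabled set to "true", no conditions are reported.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def is_disabled(configmap_data: Dict[str, str]) -> bool:
    """The configmap stores the flag as a string, e.g. disabled: "true"."""
    return configmap_data.get("disabled", "false").strip().lower() == "true"


def conditions_metric(
    cred_requests: Iterable[Dict[str, Any]],
    configmap_data: Dict[str, str],
) -> Dict[str, int]:
    """Count active conditions per type; report nothing when disabled.

    Hypothetical stand-in for the periodic metrics calculation.
    """
    if is_disabled(configmap_data):
        return {}
    counts: Counter = Counter()
    for cr in cred_requests:
        for cond in cr.get("status", {}).get("conditions") or []:
            if cond.get("status") == "True":
                counts[cond["type"]] += 1
    return dict(counts)


# A request like my-cred-request above, carrying NamespaceMissing.
bad_request = {
    "status": {"conditions": [{"type": "NamespaceMissing", "status": "True"}]}
}

print(conditions_metric([bad_request], {"disabled": "false"}))
print(conditions_metric([bad_request], {"disabled": "true"}))
```

With the flag unset or "false" the NamespaceMissing condition is counted once; with "true" the result is empty, which is what lets the alert stop firing after the next scrape.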
Thanks Joel, it is very clear to me this time. The bug has been fixed. Test payload: registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-02-07-001901

sounds like you got it figured out

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581
Description of problem:
I disabled the CCO via the configmap
$ oc edit cm -n openshift-cloud-credential-operator cloud-credential-operator-config
data:
  disabled: "true"

Operator is reporting Available=True and Degraded=False
$ oc get clusteroperator cloud-credential
NAME               VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
cloud-credential   4.3.0     True        False         False      3h14m

However the CloudCredentialOperatorProvisioningFailed alert is firing.
Alerting rule is
cco_credentials_requests_conditions{condition="CredentialsProvisionFailure"} > 0
This alert should only fire when the CCO is enabled.

Version-Release number of selected component (if applicable):
4.3.0

How reproducible:
always

Steps to Reproduce:
1. Install cluster with kube-system/aws-creds removed
2. Disable the CCO via configmap
3. Observe alert

Actual results:
cco_credentials_requests_conditions{condition="CredentialsProvisionFailure"} = 3 (ingress, image-registry, machine-api)
status:
  conditions:
  - lastProbeTime: "2020-01-23T15:59:12Z"
    lastTransitionTime: "2020-01-23T15:59:12Z"
    message: 'failed to grant creds: unable to fetch root cloud cred secret: Secret "aws-creds" not found'
    reason: CredentialsProvisionFailure
    status: "True"
    type: CredentialsProvisionFailure
  lastSyncGeneration: 0
  provisioned: false

Expected results:
Alert should be conditioned on if the CCO is enabled or not

Additional info:
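To make the alerting rule in the description concrete, here is a small Python sketch of how a threshold expression of that shape selects series. It is only an approximation of Prometheus semantics (label match, then keep samples above zero); the `alert_fires` helper and the sample data are hypothetical, with the value 3 taken from the Actual results.

```python
from typing import Dict, List, Tuple

# One sample is (labels, value), loosely modelling a scraped series.
Sample = Tuple[Dict[str, str], float]


def alert_fires(
    samples: List[Sample],
    condition: str = "CredentialsProvisionFailure",
) -> bool:
    """True if any series with the given condition label has a value > 0."""
    return any(
        labels.get("condition") == condition and value > 0
        for labels, value in samples
    )


# As reported: three failing requests (ingress, image-registry, machine-api).
reported = [({"condition": "CredentialsProvisionFailure"}, 3.0)]

# Assumed state once a disabled CCO stops reporting conditions.
not_reported: List[Sample] = []

print(alert_fires(reported))
print(alert_fires(not_reported))
```

Because the rule looks only at the metric, the alert can stop firing for a disabled CCO only if the metric itself is no longer reported or drops to zero.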