Bug 1776700 - CCOProvisioningFailed alert is found in a fresh cluster
Summary: CCOProvisioningFailed alert is found in a fresh cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Credential Operator
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Joel Diaz
QA Contact: Xiaoli Tian
URL:
Whiteboard:
Duplicates: 1783963 (view as bug list)
Depends On: 1781109 1783963
Blocks:
 
Reported: 2019-11-26 07:44 UTC by Junqi Zhao
Modified: 2020-01-23 11:14 UTC (History)
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:14:34 UTC
Target Upstream Version:


Attachments (Terms of Use)
cloud-credential-operator pod logs (161.37 KB, text/plain)
2019-11-26 07:44 UTC, Junqi Zhao
no flags Details


Links
System ID Priority Status Summary Last Updated
Github openshift cloud-credential-operator pull 147 None None None 2019-12-11 16:32:00 UTC
Red Hat Product Errata RHBA-2020:0062 None None None 2020-01-23 11:14:48 UTC

Description Junqi Zhao 2019-11-26 07:44:05 UTC
Created attachment 1639715 [details]
cloud-credential-operator pod logs

Description of problem:
4.3.0-0.nightly-2019-11-25-153929 fresh cluster, CCOProvisioningFailed alert is found

# oc -n openshift-monitoring get ep | grep alertmanager-main
NAME                          ENDPOINTS                                                          AGE
alertmanager-main             10.128.2.10:9095,10.129.2.13:9095,10.131.0.12:9095                 6h31m

# token=`oc -n openshift-monitoring sa get-token prometheus-k8s`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-1  -- curl -k -H "Authorization: Bearer $token" 'https://10.128.2.10:9095/api/v1/alerts' | jq
...
{
      "labels": {
        "alertname": "CCOProvisioningFailed",
        "condition": "CredentialsProvisionFailure",
        "endpoint": "cco-metrics",
        "instance": "10.130.0.2:2112",
        "job": "cco-metrics",
        "namespace": "openshift-cloud-credential-operator",
        "pod": "cloud-credential-operator-7b4fd65dc5-z5z5q",
        "prometheus": "openshift-monitoring/k8s",
        "service": "cco-metrics",
        "severity": "warning"
      },
      "annotations": {
        "summary": "CredentialsRequest(s) unable to be fulfilled"
      },
      "startsAt": "2019-11-26T01:13:42.851606264Z",
      "endsAt": "2019-11-26T07:40:42.851606264Z",
      "generatorURL": "https://prometheus-k8s-openshift-monitoring.apps.juzhao-11-26.qe.devcluster.openshift.com/graph?g0.expr=cco_credentials_requests_conditions%7Bcondition%3D%22CredentialsProvisionFailure%22%7D+%3E+0&g0.tab=1",
      "status": {
        "state": "active",
        "silencedBy": [],
        "inhibitedBy": []
      },
      "receivers": [
        "null"
      ],
      "fingerprint": "554807430686d598"
    }
...
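For scripted checks, the alerts payload above can be filtered without piping through jq. A minimal Python sketch (the embedded sample is trimmed from the excerpt above to the fields used; the field names follow the Alertmanager v1 `/api/v1/alerts` response shape, and loading from a string rather than the live endpoint is just for illustration):

```python
import json

# Sample payload in the shape returned by the Alertmanager v1 API,
# trimmed to the fields used below; values taken from the excerpt above.
payload = json.loads("""
{
  "status": "success",
  "data": [
    {
      "labels": {
        "alertname": "CCOProvisioningFailed",
        "condition": "CredentialsProvisionFailure",
        "namespace": "openshift-cloud-credential-operator",
        "severity": "warning"
      },
      "annotations": {"summary": "CredentialsRequest(s) unable to be fulfilled"},
      "status": {"state": "active", "silencedBy": [], "inhibitedBy": []}
    }
  ]
}
""")

def active_alerts(payload, alertname):
    """Return alerts with the given name whose state is 'active' (firing)."""
    return [
        a for a in payload["data"]
        if a["labels"].get("alertname") == alertname
        and a["status"]["state"] == "active"
    ]

for a in active_alerts(payload, "CCOProvisioningFailed"):
    print(a["labels"]["alertname"], "-", a["annotations"]["summary"])
```

The same filter applied against the real endpoint would show whether the alert is still firing after the bad CredentialsRequest is removed.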

CCOProvisioningFailed detail
*******************************
alert: CCOProvisioningFailed
expr: cco_credentials_requests_conditions{condition="CredentialsProvisionFailure"}
  > 0
for: 5m
labels:
  severity: warning
annotations:
  summary: CredentialsRequest(s) unable to be fulfilled
*******************************
cco_credentials_requests_conditions{condition="CredentialsProvisionFailure"} > 0
Element	Value
cco_credentials_requests_conditions{condition="CredentialsProvisionFailure",endpoint="cco-metrics",instance="10.130.0.2:2112",job="cco-metrics",namespace="openshift-cloud-credential-operator",pod="cloud-credential-operator-7b4fd65dc5-z5z5q",service="cco-metrics"}	1

Logs: see the attached file.
Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2019-11-25-153929

How reproducible:
Intermittent

Steps to Reproduce:
1. See the description

Actual results:
CCOProvisioningFailed alert is found in a fresh cluster

Expected results:
No such alert fires.

Additional info:

Comment 1 Scott Dodson 2019-12-09 19:53:29 UTC
This is the same as Bug 1781109, setting up dependency on that one as the 4.4 bug.

Comment 3 Joel Diaz 2019-12-10 15:57:06 UTC
The issue is intermittent. Fundamentally, once the alert fires (which does not happen on every installation), it never clears.
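The "never clears" behavior can be illustrated with a plain-Python sketch. This is not the operator's actual Go code, just a model of the failure mode: a gauge that is only written when failures are present keeps its old value after the failing CredentialsRequest goes away, so the alert expression `cco_credentials_requests_conditions{...} > 0` keeps firing.

```python
class ConditionMetrics:
    """Models per-condition gauge values like cco_credentials_requests_conditions."""

    def __init__(self):
        self.gauges = {}  # condition name -> count of CredentialsRequests in that state

    def sync_buggy(self, cred_requests):
        failures = sum(1 for cr in cred_requests if cr["failed"])
        if failures > 0:
            self.gauges["CredentialsProvisionFailure"] = failures
        # Bug: when failures drops to 0 (request fixed or deleted) the gauge is
        # never updated, so the stale value keeps the alert firing.

    def sync_fixed(self, cred_requests):
        # Fix: recompute the condition count from scratch on every sync, so a
        # deleted or recovered CredentialsRequest zeroes the metric.
        failures = sum(1 for cr in cred_requests if cr["failed"])
        self.gauges["CredentialsProvisionFailure"] = failures


m = ConditionMetrics()
m.sync_buggy([{"name": "my-cred-request", "failed": True}])  # gauge -> 1, alert fires
m.sync_buggy([])   # CredentialsRequest deleted, but the gauge stays at 1
print(m.gauges)    # {'CredentialsProvisionFailure': 1}
```

With `sync_fixed`, the second sync would write 0 and the alert expression would stop matching.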

You can force an alert by adding a CredentialsRequest object that points to a namespace that doesn't exist.

apiVersion: cloudcredential.openshift.io/v1
kind: CredentialsRequest
metadata:
  name: my-cred-request
  namespace: openshift-cloud-credential-operator
spec:
  secretRef:
    name: my-cred-request-secret
    namespace: namespace-does-not-exist
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: AWSProviderSpec
    statementEntries:
    - effect: Allow
      action:
      - s3:CreateBucket
      - s3:DeleteBucket
      resource: "*"

After a few minutes the alert should fire. To resolve the bad state you can either create the missing namespace or delete the CredentialsRequest; in either case you would expect the alert to clear, but it never does (at least not without the changes in the PR).
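The reproduction above can be sketched as a short shell script. This assumes the YAML is saved as bad-cred-request.yaml (a hypothetical filename) and that `oc` is logged in to a live cluster as cluster-admin; the guard makes the cluster steps a no-op elsewhere:

```shell
NS=openshift-cloud-credential-operator
CR=my-cred-request            # name from the YAML in this comment
FILE=bad-cred-request.yaml    # assumed filename for the YAML above

if command -v oc >/dev/null 2>&1; then
  # 1. Create the CredentialsRequest that points at a nonexistent namespace.
  oc apply -f "$FILE"

  # 2. After ~5 minutes (the rule's "for: 5m"), check the failure condition:
  oc -n "$NS" get credentialsrequest "$CR" -o jsonpath='{.status.conditions}'

  # 3. Remove the bad state; without the PR's fix the alert still never clears.
  oc -n "$NS" delete credentialsrequest "$CR"
else
  echo "oc not found; skipping cluster steps"
fi
```

Step 3 could equally be `oc create namespace namespace-does-not-exist`, which clears the failure by satisfying the secretRef instead of deleting the request.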

Comment 9 Vadim Rutkovsky 2019-12-16 10:51:31 UTC
*** Bug 1783963 has been marked as a duplicate of this bug. ***

Comment 12 errata-xmlrpc 2020-01-23 11:14:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

