Bug 1898199 - [4.5 upgrade][alert]CloudCredentialOperatorDown
Summary: [4.5 upgrade][alert]CloudCredentialOperatorDown
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Credential Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.5.z
Assignee: Devan Goodwin
QA Contact: wang lin
URL:
Whiteboard:
Depends On: 1896230
Blocks:
 
Reported: 2020-11-16 16:08 UTC by W. Trevor King
Modified: 2020-12-15 20:29 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1896230
Environment:
Last Closed: 2020-12-15 20:28:44 UTC
Target Upstream Version:
Embargoed:




Links
* Github openshift cloud-credential-operator pull 269 (closed): Bug 1898199: manifests: Allow 'for: 20m' for CloudCredentialOperatorDown (last updated 2020-12-15 08:20:11 UTC)
* Red Hat Product Errata RHSA-2020:5359 (last updated 2020-12-15 20:29:12 UTC)

Description W. Trevor King 2020-11-16 16:08:43 UTC
+++ This bug was initially created as a clone of Bug #1896230 +++

+++ This bug was initially created as a clone of Bug #1889540 +++

The alert fired during the upgrade of build01 (a cluster in the CI build farm) from 4.5.14 to 4.6.0-rc.4.

https://coreos.slack.com/archives/CHY2E1BL4/p1603136712226000
[FIRING:1] CloudCredentialOperatorDown (openshift-monitoring/k8s critical)
cloud-credential-operator pod not running


The upgrade was eventually successful.

Even though the pods in the openshift-cloud-credential-operator namespace were running, the alert was still active:

oc --context build01 get pod -n openshift-cloud-credential-operator
NAME                                         READY   STATUS    RESTARTS   AGE
cloud-credential-operator-5fc784b4c5-dmptg   2/2     Running   0          15m
pod-identity-webhook-75fc9d4d96-68dk5        1/1     Running   0          6m22s

Not sure what a cluster admin should do in this case.
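
One reasonable first step is to confirm whether the alert is still active and whether the operator's metrics target (cco-metrics) has come back, since that is what the alert tracks. A rough sketch against the Prometheus route in openshift-monitoring; the label selector in the query is an assumption:

  token=$(oc --context build01 -n openshift-monitoring sa get-token prometheus-k8s)
  route=$(oc --context build01 -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')

  # Is CloudCredentialOperatorDown still pending/firing?
  curl -sk -H "Authorization: Bearer ${token}" "https://${route}/api/v1/alerts" \
    | jq '.data.alerts[] | select(.labels.alertname == "CloudCredentialOperatorDown")'

  # Has the operator's metrics target come back up? (label selector is a guess)
  curl -skG -H "Authorization: Bearer ${token}" "https://${route}/api/v1/query" \
    --data-urlencode 'query=up{namespace="openshift-cloud-credential-operator"}' \
    | jq '.data.result'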

--- Additional comment from wking on 2020-10-24 04:03:08 UTC ---

Ahh, an 8m issue perfectly explains build01's 4.5->4.6 update from the 13th, which looked like:

* pre-update, cloud-credential-operator-9bf9464dc-6nxxj exists
* 16:13Z, ClusterOperator goes Progressing=True
* 16:13Z, cco-metrics for cloud-credential-operator-9bf9464dc-6nxxj goes down
* 16:14Z, cloud-credential-operator-7d84697896-xnjdv created
* 16:16Z, cloud-credential-operator-7d84697896-xnjdv up to two cloud-credential-operator containers
* 16:18Z, cloud-credential-operator-9bf9464dc-6nxxj exits
* 16:19Z, cloud-credential-operator-7d84697896-xnjdv back to one cloud-credential-operator container
* 16:21Z, CloudCredentialOperatorDown starts firing
* 16:26Z, ClusterOperator goes Progressing=False
* 17:16Z, cloud-credential-operator-7d84697896-xnjdv exits
* 17:16Z, cloud-credential-operator-7d84697896-cvvs8 created
* 17:20Z, cloud-credential-operator-7d84697896-cvvs8 exits
* 17:20Z, cloud-credential-operator-7d84697896-hm5zl created
* 17:29Z, CloudCredentialOperatorDown clears, 9m after hm5zl was created
* 17:32Z, CloudCredentialOperatorDown starts firing again
* 17:36Z, cloud-credential-operator-7d84697896-hm5zl exits
* 17:36Z, cloud-credential-operator-7d84697896-zdb7s created
* 17:44Z, cco-metrics for cloud-credential-operator-7d84697896-zdb7s comes up
* 17:44Z, CloudCredentialOperatorDown clears again, this time 8 min after zdb7s was created

One thing that might be easier to port to 4.5 would be cranking the alert's 'for' up to 15m [1].  In a real outage the additional 10m delay doesn't seem like a big deal, and it would help avoid the leader-hiccup noise, which is nice for a critical alert; no need to backport any leader-hiccup fixes.

If 4.6 keeps 'for' at 5m, then by the time the new operator is out, the level-90 alert will come in and reset 'for' to 5m.  So that restricts the 15m bump to 4.5 and the sensitive part of 4.5 -> 4.6 updates.  But it doesn't seem like a 5m cred-operator outage is critical anyway; who churns their creds so fast that that kind of latency is worth waking admins at midnight?  Maybe we want 15m in 4.6 too so things like brief scheduler hiccups and similar don't trip the alarm?

[1]: https://github.com/openshift/cloud-credential-operator/blob/7f83dd90df2a7e91682ca5d13aca152e09d64174/manifests/0000_90_cloud-credential-operator_04_alertrules.yaml#L48
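
For reference, the fix that merged for this bug (PR 269, linked above) ended up bumping 'for' to 20m rather than 15m. A minimal sketch of what the rule in 0000_90_cloud-credential-operator_04_alertrules.yaml looks like with that change; the expr, group name, and object metadata here are assumptions, only the alert name, severity, message, and the 20m 'for' come from this bug:

  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    name: cloud-credential-operator-alerts   # hypothetical name
    namespace: openshift-cloud-credential-operator
  spec:
    groups:
    - name: CloudCredentialOperator
      rules:
      - alert: CloudCredentialOperatorDown
        # assumed expression: fire when the operator's 'up' metric is absent
        expr: absent(up{job="cloud-credential-operator"} == 1)
        for: 20m    # was 5m; long enough to ride out leader-election hiccups during upgrades
        labels:
          severity: critical
        annotations:
          message: cloud-credential-operator pod not running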

Comment 2 wang lin 2020-12-04 05:19:52 UTC
Verified on registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-12-03-141413

$ ./test.sh 
deployment.apps/cluster-version-operator scaled
deployment.apps/cloud-credential-operator scaled
no alert ,current time: 12:56:45
no alert ,current time: 12:57:47
no alert ,current time: 12:58:49
no alert ,current time: 12:59:53
no alert ,current time: 13:01:04
no alert ,current time: 13:02:08
no alert ,current time: 13:03:11
no alert ,current time: 13:04:18
no alert ,current time: 13:05:23
no alert ,current time: 13:06:26
no alert ,current time: 13:07:28
no alert ,current time: 13:08:30
no alert ,current time: 13:09:35
no alert ,current time: 13:10:37
no alert ,current time: 13:11:41
no alert ,current time: 13:12:43
no alert ,current time: 13:13:46
no alert ,current time: 13:14:48
no alert ,current time: 13:15:51
no alert ,current time: 13:16:53
alert fire time: 13:17:55
{
  "labels": {
    "alertname": "CloudCredentialOperatorDown",
    "severity": "critical"
  },
  "annotations": {
    "message": "cloud-credential-operator pod not running"
  },
  "state": "firing",
  "activeAt": "2020-12-04T04:57:42.851606264Z",
  "value": "1e+00"
}
deployment.apps/cloud-credential-operator scaled
deployment.apps/cluster-version-operator scaled
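
For anyone reproducing this, here is a rough reconstruction of a verification script that would produce output like the above; the token/route handling and the exact polling query are assumptions, not the original test.sh:

  #!/bin/bash
  set -euo pipefail

  # Stop the CVO so it doesn't restore the operator, then take the operator down.
  oc -n openshift-cluster-version scale deployment/cluster-version-operator --replicas=0
  oc -n openshift-cloud-credential-operator scale deployment/cloud-credential-operator --replicas=0

  token=$(oc -n openshift-monitoring sa get-token prometheus-k8s)
  route=$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')

  # Poll roughly once a minute until CloudCredentialOperatorDown starts firing.
  while true; do
    alert=$(curl -sk -H "Authorization: Bearer ${token}" "https://${route}/api/v1/alerts" \
      | jq '.data.alerts[] | select(.labels.alertname == "CloudCredentialOperatorDown" and .state == "firing")')
    if [ -n "${alert}" ]; then
      echo "alert fire time: $(date +%T)"
      echo "${alert}"
      break
    fi
    echo "no alert, current time: $(date +%T)"
    sleep 60
  done

  # Restore both deployments.
  oc -n openshift-cloud-credential-operator scale deployment/cloud-credential-operator --replicas=1
  oc -n openshift-cluster-version scale deployment/cluster-version-operator --replicas=1

With the 20m 'for', the alert fires roughly 20 minutes after the operator is scaled down, which matches the activeAt of 04:57:42Z and the 13:17:55 (UTC+8) fire time in the output above.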

Comment 5 errata-xmlrpc 2020-12-15 20:28:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.5.23 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5359

