Bug 1889540

Summary:	[4.5 upgrade][alert]CloudCredentialOperatorDown
Product:	OpenShift Container Platform	Reporter:	Hongkai Liu <hongkliu>
Component:	Cloud Credential Operator	Assignee:	Devan Goodwin <dgoodwin>
Status:	CLOSED ERRATA	QA Contact:	liujia <jiajliu>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	4.5	CC:	jdiaz, jiajliu, lwan, wking
Target Milestone:	---	Keywords:	Upgrades
Target Release:	4.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-02-24 15:26:50 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1896230

Description Hongkai Liu 2020-10-19 23:10:30 UTC

The alert was fired during the upgrade of build01 (a cluster in CI build farm) from 4.5.14 to 4.6.0-rc.4.

https://coreos.slack.com/archives/CHY2E1BL4/p1603136712226000
[FIRING:1] CloudCredentialOperatorDown (openshift-monitoring/k8s critical)
cloud-credential-operator pod not running


The upgrade was eventually successful.

Even though the pods in openshift-cloud-credential-operator namespace are running, the alert was still active.

oc --context build01 get pod -n openshift-cloud-credential-operator
NAME                                         READY   STATUS    RESTARTS   AGE
cloud-credential-operator-5fc784b4c5-dmptg   2/2     Running   0          15m
pod-identity-webhook-75fc9d4d96-68dk5        1/1     Running   0          6m22s

Not sure what a cluster admin should do in this case.

Comment 2 wang lin 2020-10-20 06:31:30 UTC

This one may not a 100% failure, when I reproduced the issue, the alert disappear after cco pod recreated during upgrade.  Just a reference for issue analysis.

Comment 3 Hongkai Liu 2020-10-21 18:38:48 UTC

It happened again while upgrading another cluster (build02).
https://coreos.slack.com/archives/CHY2E1BL4/p1603220262272100

oc --context build02 get pod -n openshift-cloud-credential-operator
NAME                                         READY   STATUS    RESTARTS   AGE
cloud-credential-operator-5fc784b4c5-n69rl   2/2     Running   0          23m

It is 2 out of 2 upgrades.

Comment 4 Devan Goodwin 2020-10-22 12:02:32 UTC

I think we can explain what this is:

2020-10-19T19:34:40.908033447Z Copying system trust bundle
2020-10-19T19:34:40.978859994Z time="2020-10-19T19:34:40Z" level=info msg="setting up client for manager"
2020-10-19T19:34:40.979637999Z time="2020-10-19T19:34:40Z" level=info msg="generated leader election ID" id=cceb953b-9d55-44f3-a18a-f67109ef97ba
2020-10-19T19:34:40.99689294Z I1019 19:34:40.996832       1 leaderelection.go:243] attempting to acquire leader lease  openshift-cloud-credential-operator/cloud-credential-operator-leader...
2020-10-19T19:34:41.014443032Z time="2020-10-19T19:34:41Z" level=info msg="current leader: cloud-credential-operator-65b79b464b-r8prf_a85c84f4-e12f-4eaf-b7ec-e3b21178ea19"
2020-10-19T19:43:09.700806018Z I1019 19:43:09.700454       1 leaderelection.go:253] successfully acquired lease openshift-cloud-credential-operator/cloud-credential-operator-leader
2020-10-19T19:43:09.700851554Z time="2020-10-19T19:43:09Z" level=info msg="became leader" id=cceb953b-9d55-44f3-a18a-f67109ef97ba
2020-10-19T19:43:09.700924973Z time="2020-10-19T19:43:09Z" level=info msg="setting up manager"

In 4.6 new leader election code was added to stop writing to etcd every 2 seconds. This election code includes properly releasing the lock when the pod terminates normally, so the next pod is not delayed on startup. Thresholds were then adjusted to stop the frequent writes and the delay we'll wait to acquire the lock becomes very long, around 8 minutes.

In the above logs you can see the pod started, detected it was not the leader (because 4.5 does *not* have the code to properly release the lock on shutdown), and then waits for the 8 minute timeout before resuming control. During this time it appears the alert will fire and then resolve itself.

This should be the one and only time this will occur, on a 4.5 to 4.6 upgrade, subsequent upgrades should no longer see this error.

To fix, we would need to backport *only* the code to release the lock properly to 4.5, if we backport the whole change the issue would still surface on a 4.5.z upgrade one time.

Discussed with Scott Dodson who saw this yesterday and at present we are thinking it may not be worth the effort to backport, as it should be a one time complication as the new better leader election code rolls out, however we are open to thoughts on this.

Comment 5 W. Trevor King 2020-10-24 04:03:08 UTC

Ahh, an 8m issue perfectly explains build01's 4.5->4.6 update from the 13th, which looked like:

* pre-update, cloud-credential-operator-9bf9464dc-6nxxj exists
* 16:13Z, ClusterOperator goes Progressing=True
* 16:13Z, cco-metrics for cloud-credential-operator-9bf9464dc-6nxxj goes down
* 16:14Z, cloud-credential-operator-7d84697896-xnjdv created
* 16:16Z, cloud-credential-operator-7d84697896-xnjdv up to two cloud-credential-operator containers
* 16:18Z, cloud-credential-operator-9bf9464dc-6nxxj exits
* 16:19Z, cloud-credential-operator-7d84697896-xnjdv back to one cloud-credential-operator container
* 16:21Z, CloudCredentialOperatorDown starts firing
* 16:26Z, ClusterOperator goes Progressing=False
* 17:16Z, cloud-credential-operator-7d84697896-xnjdv exits
* 17:16Z, cloud-credential-operator-7d84697896-cvvs8 created
* 17:20Z, cloud-credential-operator-7d84697896-cvvs8 exits
* 17:20Z, cloud-credential-operator-7d84697896-hm5zl created
* 17:29Z, CloudCredentialOperatorDown clears, 9m after hm5zl was created
* 17:32Z, CloudCredentialOperatorDown starts firing again
* 17:36Z, cloud-credential-operator-7d84697896-hm5zl exits
* 17:36Z, cloud-credential-operator-7d84697896-zdb7s created
* 17:44Z, cco-metrics for cloud-credential-operator-7d84697896-zdb7s comes up
* 17:44Z, CloudCredentialOperatorDown clears again, this time 8 min after zdb7s was created

One thing that might be easier to port to 4.5 would be something that cranked 'for' for the alert up to 15m [1].  When there was a real outage, the additional 10m delay doesn't seem like it would be a big deal, and it would help avoid the leader-hiccup noise, which is nice for a critical alert.  No need to backport any leader-hiccup fixes.

If 4.6 keeps 'for' at 5m, then by the time the new operator is out, the level-90 alert will come in and reset 'for' to 5m.  So that restricts the 15m bump to 4.5 and the sensitive part of 4.5 -> 4.6 updates.  But it doesn't seem like a 5m cred-operator outage is critical anyway; who churns their creds so fast that that kind of latency is worth waking admins at midnight?  Maybe we want 15m in 4.6 too so things like brief scheduler hiccups and similar don't trip the alarm?

[1]: https://github.com/openshift/cloud-credential-operator/blob/7f83dd90df2a7e91682ca5d13aca152e09d64174/manifests/0000_90_cloud-credential-operator_04_alertrules.yaml#L48

Comment 6 W. Trevor King 2020-10-24 04:05:52 UTC

err, that^ might have been app.ci.  All the CI clusters kind of run together in my mind.

Comment 7 W. Trevor King 2020-10-24 04:31:38 UTC

I liked my "bump 'for'" approach so much I made a PR ;).  Feel free to close if folks want to move in a different direction.

Comment 9 W. Trevor King 2020-11-10 20:01:43 UTC

Verfication for 4.7 will probably look like:

1. Install a 4.7 cluster.
2. Scale down the cluster-version operator:

     $ oc -n openshift-cluster-version scale --replicas 0 deployment/cluster-version-operator

3. Scale down the cred operator:

     $ oc -n openshift-cloud-credential-operator scale --replicas 0 deployment/cloud-credential-operator

4. Wait 20 minutes.
5. See CloudCredentialOperatorDown firing (but no earlier).

Comment 10 liujia 2020-11-11 09:21:58 UTC

verify on 4.7.0-0.nightly-2020-11-11-012235(the alert active time is not in 20mins, but about 1min)

# oc adm release info registry.svc.ci.openshift.org/ocp/release:4.7.0-0.nightly-2020-11-11-012235 --commits|grep cloud-credential
  cloud-credential-operator                      https://github.com/openshift/cloud-credential-operator                      c2592b366efa9b769aa2ba1adbef41c664a166fa

# git log --date local --pretty="%h %an %cd - %s"|head -n4
850faf1a OpenShift Merge Robot Wed Nov 11 07:00:15 2020 - Merge pull request #264 from joelddiaz/new-credentials-field
c2592b36 OpenShift Merge Robot Tue Nov 10 19:34:18 2020 - Merge pull request #266 from staebler/aws_c2s_support
fe4a5428 staebler Tue Nov 10 14:00:23 2020 - aws: support custom CA bundle for AWS API
ae4f77ee OpenShift Merge Robot Tue Nov 10 05:40:33 2020 - Merge pull request #262 from wking/allow-more-time-before-critical-alert

According to comment 9, monitoring alert after scale down the cred operator. Following result shows that the alert will fire in about 1min after cred operator down.
# ./test.sh 
no alert,current time: 16:55:18
no alert,current time: 16:55:24
no alert,current time: 16:55:29
no alert,current time: 16:55:34
no alert,current time: 16:55:39
no alert,current time: 16:55:44
no alert,current time: 16:55:49
no alert,current time: 16:55:54
no alert,current time: 16:55:59
no alert,current time: 16:56:04
no alert,current time: 16:56:09
no alert,current time: 16:56:15
alert fire time: 16:56:20
alert: { "labels": { "alertname": "CloudCredentialOperatorDown", "severity": "critical" }, "annotations": { "message": "cloud-credential-operator pod not running" }, "state": "pending", "activeAt": "2020-11-11T08:56:12.851606264Z", "value": "1e+00" }

From alert info, the alert active time between "CloudCredentialOperatorDown" and "ClusterVersionOperatorDown" is about 30s.
# curl -s -k -H "Authorization: Bearer $token"  https://prometheus-k8s-openshift-monitoring.apps.jliu47.qe.devcluster.openshift.com/api/v1/alerts |jq -r '.data.alerts[]| select(.labels.severity == "critical")'
{
  "labels": {
    "alertname": "CloudCredentialOperatorDown",
    "severity": "critical"
  },
  "annotations": {
    "message": "cloud-credential-operator pod not running"
  },
  "state": "pending",
  "activeAt": "2020-11-11T08:56:12.851606264Z",
  "value": "1e+00"
}
{
  "labels": {
    "alertname": "ClusterVersionOperatorDown",
    "severity": "critical"
  },
  "annotations": {
    "message": "Cluster version operator has disappeared from Prometheus target discovery. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible."
  },
  "state": "pending",
  "activeAt": "2020-11-11T08:55:46.457044785Z",
  "value": "1e+00"
}

Another issue found(out of this bug's scope) is that the ClusterVersionOperatorDown alert is also active in 1m after scale down cvo, is this expected? @W. Trevor King
# ./test.sh 
no alert,current time: 17:17:24
no alert,current time: 17:17:29
no alert,current time: 17:17:35
no alert,current time: 17:17:40
no alert,current time: 17:17:45
no alert,current time: 17:17:50
no alert,current time: 17:17:55
no alert,current time: 17:18:00
no alert,current time: 17:18:05
no alert,current time: 17:18:10
no alert,current time: 17:18:15
alert fire time: 17:18:20
alert: { "labels": { "alertname": "ClusterVersionOperatorDown", "severity": "critical" }, "annotations": { "message": "Cluster version operator has disappeared from Prometheus target discovery. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible." }, "state": "pending", "activeAt": "2020-11-11T09:18:16.457044785Z", "value": "1e+00" }

Comment 11 W. Trevor King 2020-11-11 14:52:17 UTC

(In reply to liujia from comment #10)
> alert fire time: 16:56:20
> alert: { "labels": { "alertname": "CloudCredentialOperatorDown", "severity": "critical" }, "annotations": { "message": "cloud-credential-operator pod not running" }, "state": "pending", "activeAt": "2020-11-11T08:56:12.851606264Z", "value": "1e+00" }

This alert is pending, not yet firing.  Can you try again and wait until you see "state": "firing"?  Or configure an Alertmanager notification sink, and see that it takes 20m before Alertmanager writes to that sink.

Comment 12 liujia 2020-11-12 01:04:19 UTC

(In reply to W. Trevor King from comment #11) 
> This alert is pending, not yet firing.  Can you try again and wait until you
> see "state": "firing"?  Or configure an Alertmanager notification sink, and
> see that it takes 20m before Alertmanager writes to that sink.
Ahh, thx for pointing it out. i thought only firing alters would be generated, so i miss the state info "pending". I will give a try now.

Comment 13 liujia 2020-11-12 02:43:45 UTC

Verified on 4.7.0-0.nightly-2020-11-11-012235

CloudCredentialOperatorDown alert will be firing in 20 mins after the alert is active(pending), and it will take about 1min to active the alert(pending) after scale down the CloudCredentialOperator.
# ./test.sh 
no alert firing,current time: 10:16:44
no alert firing,current time: 10:16:54
...
no alert firing,current time: 10:37:07
no alert firing,current time: 10:37:17
no alert firing,current time: 10:37:27
no alert firing,current time: 10:37:37
alert fire time: 10:37:47
alert: { "labels": { "alertname": "CloudCredentialOperatorDown", "severity": "critical" }, "annotations": { "message": "cloud-credential-operator pod not running" }, "state": "firing", "activeAt": "2020-11-12T02:17:42.851606264Z", "value": "1e+00" }

Comment 16 errata-xmlrpc 2021-02-24 15:26:50 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633