Bug 1834551 - ClusterOperatorDown fires when operator is only degraded; states will block upgrades
Summary: ClusterOperatorDown fires when operator is only degraded; states will block upgrades
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.8.0
Assignee: W. Trevor King
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks: 1991010
 
Reported: 2020-05-11 22:49 UTC by Christoph Blecker
Modified: 2023-09-15 00:31 UTC (History)
CC List: 15 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The cluster-version operator used to consider both Available and Degraded when setting the cluster_operator_up metric which feeds the ClusterOperatorDown alert. Consequence: The ClusterOperatorDown alert would fire for Available=True Degraded=True operators, although Available=True doesn't match the alert's "has not been available" description. Fix: The cluster-version operator now ignores Degraded when setting cluster_operator_up. Result: ClusterOperatorDown no longer fires for Available=True operators, even if they are Degraded=True. Degraded unset and !=False cases are still covered by the ClusterOperatorDegraded alert.
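For verification, a minimal way to observe the fixed behavior (a sketch, assuming access to the in-cluster Prometheus; $token and ${PROMETHEUS_ROUTE} are placeholders, following the conventions of the curl commands in the verification comments):

  # cluster_operator_up feeds ClusterOperatorDown; with the fix it should stay at 1
  # for an Available=True operator even while Degraded=True.
  curl -s -k -H "Authorization: Bearer $token" \
    "https://${PROMETHEUS_ROUTE}/api/v1/query" \
    --data-urlencode 'query=cluster_operator_up{name="authentication"}'

  # The raw condition series still exposes Degraded, which continues to feed
  # the ClusterOperatorDegraded alert.
  curl -s -k -H "Authorization: Bearer $token" \
    "https://${PROMETHEUS_ROUTE}/api/v1/query" \
    --data-urlencode 'query=cluster_operator_conditions{name="authentication",condition=~"Available|Degraded"}'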
Clone Of:
Environment:
Last Closed: 2021-07-27 22:32:23 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 550 0 None open Bug 1834551: pkg/cvo/metrics: Ignore Degraded for cluster_operator_up 2021-04-26 20:39:55 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:33:11 UTC

Description Christoph Blecker 2020-05-11 22:49:32 UTC
Description of problem:
The ClusterOperatorDown alerting rule incorrectly states that this alert will block upgrades.

This alert fires when an operator is only in a degraded state (ClusterOperatorDegraded also fires for that), but its message states that upgrades will not proceed while it is firing. This is inaccurate.


Version-Release number of selected component (if applicable):
4.3.18


How reproducible:
Very consistent


Steps to Reproduce:
1. Degrade an operator (I tested by degrading `authentication` by providing it a bad OpenID IDP certificate)
2. Confirm alert fires after 10 mins
3. Attempt an upgrade

Actual results:
Cluster will upgrade successfully, despite alert


Expected results:
Alert wording should accurately reflect that this won't block upgrades, or alert should not fire at all if the operator is only degraded. The latter I see being preferable, as there is already a separate alert for degraded operators.


Additional info:

Comment 1 Pawel Krupa 2020-05-12 07:40:51 UTC
If you were able to upgrade with a degraded operator, then this is something the CVO team should look into. Monitoring here is just a messenger.

Reassigning to CVO team for further investigation.

Comment 2 Lalatendu Mohanty 2020-05-26 13:20:06 UTC
Setting severity to low as it does not impact cluster availability.

Comment 3 Lalatendu Mohanty 2020-05-26 13:23:03 UTC
We are working on higher priority bugs and feature developments and hence moving this to next sprint.

Comment 4 Lalatendu Mohanty 2020-05-26 13:37:07 UTC
Can you confirm if you are seeing alerts like this "message: Cluster operator {{ $labels.name }} has not been available for 10 mins. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible."?
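That message comes from the ClusterOperatorDown alerting rule shipped by the cluster-version operator. If useful, the active definition can be read back from the in-cluster Prometheus rules API; a sketch, with $token and ${PROMETHEUS_ROUTE} as placeholders:

  curl -s -k -H "Authorization: Bearer $token" \
    "https://${PROMETHEUS_ROUTE}/api/v1/rules" \
    | jq '.data.groups[].rules[] | select(.name? == "ClusterOperatorDown") | {query, duration, annotations}'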

Comment 5 Christoph Blecker 2020-05-26 16:44:40 UTC
@lmohanty: Yes, that is the specific alert message we are seeing.

Comment 6 W. Trevor King 2020-05-26 17:53:19 UTC
Degraded operators should be blocking update success, per [1,2].  If that's not happening, we'll probably need CVO logs, and possibly a must-gather, to understand why not.

[1]: https://github.com/openshift/cluster-version-operator/blob/e318b86d674ccc850266b74169a0a4403d3b633b/docs/user/reconciliation.md#clusteroperator
[2]: https://github.com/openshift/cluster-version-operator/blob/07e65a31745675bae7d33d19777042636633bdac/pkg/cvo/internal/operatorstatus.go#L204-L209
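
If someone does hit the non-blocking behavior again, a sketch of capturing that data (the deployment name below is the standard one in the openshift-cluster-version namespace):

     # CVO logs from the running deployment
     $ oc -n openshift-cluster-version logs deployment/cluster-version-operator > cvo.log
     # full must-gather for the cluster
     $ oc adm must-gather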

Comment 8 W. Trevor King 2020-07-06 23:56:42 UTC
> 1. Degrade an operator (I tested by degrading `authentication` by providing it a bad OpenID IDP certificate)

In the absence of a must-gather, it's nice to be explicit about the reproducer, instead of assuming we're all familiar with how to do this^ ;).  Trying to reconstruct via [1], I have:

1. Used @cluster-bot to launch a cluster: launch 4.3.18 gcp
2. Set a channel, because cluster-bot clusters now come without channels configured:

     $ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "stable-4.3"}]'

3. Copied down the "Standard OpenID Connect CR" from [1] with some placeholders:

     $ cat <<EOF >oauth.yaml
     apiVersion: config.openshift.io/v1
     kind: OAuth
     metadata:
       name: cluster
     spec:
       identityProviders:
       - name: oidcidp 
         mappingMethod: claim 
         type: OpenID
         openID:
           clientID: does-not-exist
           clientSecret: 
             name: does-not-exist
           claims: 
             preferredUsername:
             - preferred_username
             name:
             - name
             email:
             - email
           issuer: https://www.idp-issuer.example.com 
     EOF
4. Push it into my cluster: $ oc apply -f oauth.yaml
5. Confirm that the auth operator is degraded:

     $ oc get -o json clusteroperator authentication | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
     2020-07-06T23:18:00Z Degraded=False AsExpected: IdentityProviderConfigDegraded: failed to apply IDP oidcidp config: dial tcp: lookup www.idp-issuer.example.com on 172.30.0.10:53: no such host
     2020-07-06T23:18:00Z Progressing=False AsExpected: -
     2020-07-06T23:18:00Z Available=True AsExpected: -
     2020-07-06T23:05:25Z Upgradeable=True AsExpected: -

   Hmm, I guess the auth operator is giving things some time to see if they resolve on their own.  After waiting a minute or few, Degraded flips to True:

     $ oc get -o json clusteroperator authentication | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
     2020-07-06T23:22:10Z Degraded=True IdentityProviderConfigDegradedError: IdentityProviderConfigDegraded: failed to apply IDP oidcidp config: dial tcp: lookup www.idp-issuer.example.com on 172.30.0.10:53: no such host
     2020-07-06T23:18:00Z Progressing=False AsExpected: -
     2020-07-06T23:18:00Z Available=True AsExpected: -
     2020-07-06T23:05:25Z Upgradeable=True AsExpected: -

6. Wait 10m, see both ClusterOperatorDegraded and ClusterOperatorDown complaining about authentication.
7. Trigger an update:

     $ oc adm upgrade --to 4.3.19

8. Watch the update:

     $ oc get --watch clusterversion version
     NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
     version   4.3.18    True        True          60s     Working towards 4.3.19: 13% complete
     version   4.3.18    True        True          105s    Working towards 4.3.19: 16% complete
     version   4.3.18    True        True          3m15s   Working towards 4.3.19: 23% complete
     version   4.3.18    True        True          5m17s   Working towards 4.3.19: 24% complete
     ...
     version   4.3.18    True        True          12m     Unable to apply 4.3.19: the cluster operator authentication is degraded

   So good, stuck on the still-degraded operator.  I dunno how to square that with comment 0's "Cluster will upgrade successfully, despite alert".

I'm also personally fine allowing updates to not block on ClusterOperator Degraded=True, and only use the operator's claimed Available=True and versions to gate the ClusterOperator manifest.

> This alert will fire if an operator is in a degraded state (ClusterOperatorDegraded also fires for that)...

Hmm.  Looks like cluster_operator_up is checking both Available=True and Degraded=False since it landed [2].  Clayton, can you describe why we have cluster_operator_up instead of relying on cluster_operator_conditions?

And maybe Abhinav can talk about ClusterOperatorDown [3]?  Seems like it could pivot to just be about:

  cluster_operator_conditions{job="cluster-version-operator", condition="Available"} == 0

or some such.

[1]: https://docs.openshift.com/container-platform/4.3/authentication/identity_providers/configuring-oidc-identity-provider.html
[2]: https://github.com/openshift/cluster-version-operator/pull/45/commits/8b9118992dfcb43f9c7629b21c8d299f352f52fe#diff-b301ca75c2e59524b215636eb7a4586aR93
[3]: https://github.com/openshift/cluster-version-operator/pull/232
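
To help evaluate that pivot, the candidate expression can be run directly against the in-cluster Prometheus and compared with the current gauge (a sketch; $token and ${PROMETHEUS_ROUTE} are placeholders):

     # proposed: alert purely on the Available condition
     $ curl -s -k -H "Authorization: Bearer $token" \
         "https://${PROMETHEUS_ROUTE}/api/v1/query" \
         --data-urlencode 'query=cluster_operator_conditions{job="cluster-version-operator", condition="Available"} == 0'

     # current: the combined Available/Degraded gauge behind ClusterOperatorDown
     $ curl -s -k -H "Authorization: Bearer $token" \
         "https://${PROMETHEUS_ROUTE}/api/v1/query" \
         --data-urlencode 'query=cluster_operator_up == 0'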

Comment 9 Lalatendu Mohanty 2020-07-09 14:51:38 UTC
We do not have time to fix the bug in this sprint as we are working on higher priority bugs and features.  Hence we are adding UpcomingSprint now, and we'll revisit this in the next sprint.

Comment 10 Jack Ottofaro 2020-07-30 20:05:29 UTC
We do not have time to fix the bug in this sprint as we are working on higher priority bugs and features.  Hence we are adding UpcomingSprint now, and we'll revisit this in the next sprint.

Comment 11 Lalatendu Mohanty 2020-08-21 10:22:52 UTC
Moving this to the next sprint as we are working on higher priority bugs and features.

Comment 12 Lalatendu Mohanty 2020-08-21 10:23:36 UTC
Changing the target release to 4.7 as this is not critical for 4.6.

Comment 13 W. Trevor King 2020-09-12 21:00:31 UTC
Punting to the next sprint again, when I'll try to hound Clayton and Abhinav for answers to the questions from comment 8.

Comment 14 W. Trevor King 2020-10-04 02:40:44 UTC
Comment 13 is still current.

Comment 15 Jack Ottofaro 2020-10-23 19:15:57 UTC
Moving this to the next sprint as we are working on higher priority bugs and features.

Comment 18 W. Trevor King 2021-04-30 15:39:03 UTC
Simplest test is probably:

$ cat <<EOF >co.yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  name: bug-testing
status:
  conditions:
  - lastTransitionTime: '2021-01-01T01:01:01Z'
    type: Available
    status: 'True'
    reason: AsExpected
    message: All good
  - lastTransitionTime: '2021-01-01T01:01:01Z'
    type: Degraded
    status: 'True'
    reason: BadStuff
    message: Help me Obi-wan Kenobi, you're my only hope
EOF
$ oc apply -f co.yaml

And then wait 10 minutes to see that ClusterOperatorDegraded is going off, but ClusterOperatorDown is no longer going off.  If you have a favorite mechanism to make an existing ClusterOperator Available=True Degraded=False, that would work too.
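
After the wait, the alert states can be checked through the Prometheus alerts API, and the synthetic ClusterOperator cleaned up afterwards (a sketch; $token and ${PROMETHEUS_ROUTE} are placeholders):

     # expect ClusterOperatorDegraded firing and ClusterOperatorDown absent
     $ curl -s -k -H "Authorization: Bearer $token" \
         "https://${PROMETHEUS_ROUTE}/api/v1/alerts" \
         | jq -r '.data.alerts[] | select(.labels.alertname == "ClusterOperatorDown" or .labels.alertname == "ClusterOperatorDegraded") | .labels.alertname + " " + .state'
     # remove the fake ClusterOperator when done
     $ oc delete clusteroperator bug-testing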

Comment 19 liujia 2021-05-08 03:47:13 UTC
1. Install a v4.3.18 cluster, check that all operators are running well, and that no ClusterOperatorDegraded or ClusterOperatorDown alerts are firing.
2. Degrade the authentication operator using the method in comment 8, and wait 10 minutes, then check the operator status and alerts:
# ./oc get co authentication
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.3.18    True        False         True       18h
# ./oc get co authentication
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.3.18    True        False         True       16h
# curl -s -k -H "Authorization: Bearer $token"  https://prometheus-k8s-openshift-monitoring.apps.jliu-43.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDown").annotations.message'
Cluster operator authentication has not been available for 10 mins. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.

# curl -s -k -H "Authorization: Bearer $token"  https://prometheus-k8s-openshift-monitoring.apps.jliu-43.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded").annotations.message'
Cluster operator authentication has been degraded for 10 mins. Operator is degraded because IdentityProviderConfigDegradedError and cluster upgrades will be unstable.

3. Trigger an upgrade to v4.3.19; the upgrade gets stuck on the authentication operator.
# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.18    True        True          20m     Unable to apply 4.3.19: the cluster operator authentication is degraded

Based on the above test, I cannot reproduce "upgrade completes successfully despite the alert saying it's not possible", but I can reproduce both the ClusterOperatorDegraded and ClusterOperatorDown alerts firing.

So the following verification will focus on the reproduced part.

Comment 20 liujia 2021-05-08 03:52:16 UTC
We don't currently have a test case covering both the ClusterOperatorDegraded and ClusterOperatorDown alerts, so adding "NeedsTestCase".

Comment 21 liujia 2021-05-10 02:42:03 UTC
Degrade the authentication operator and wait 10 minutes, then check the operator status and alerts:
# ./oc get co authentication 
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-2021-05-08-025039   True        False         True       62m

# curl -s -k -H "Authorization: Bearer $token"  https://prometheus-k8s-openshift-monitoring.apps.jliu-48.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDown")|.state + "\n" + .annotations.message'

# curl -s -k -H "Authorization: Bearer $token"  https://prometheus-k8s-openshift-monitoring.apps.jliu-48.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")|.state + "\n" + .annotations.message'
firing
Cluster operator authentication has been degraded for 10 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable.

Only the ClusterOperatorDegraded alert fired. Verified on 4.8.0-0.nightly-2021-05-08-025039.

Comment 25 errata-xmlrpc 2021-07-27 22:32:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 26 Red Hat Bugzilla 2023-09-15 00:31:45 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

