Description of problem:

The ClusterOperatorDown alert incorrectly states that it will block upgrades. The alert fires when an operator is in a degraded state (ClusterOperatorDegraded also fires for that), but its message claims that while it is firing, upgrades will not be possible. This is inaccurate.

Version-Release number of selected component (if applicable): 4.3.18

How reproducible: Very consistent

Steps to Reproduce:
1. Degrade an operator (I tested by degrading `authentication` by providing it a bad OpenID IDP certificate)
2. Confirm the alert fires after 10 minutes
3. Attempt an upgrade

Actual results: The cluster upgrades successfully, despite the alert.

Expected results: The alert wording should accurately reflect that it won't block upgrades, or the alert should not fire at all if the operator is only degraded. I see the latter as preferable, since there is already a separate alert for degraded operators.

Additional info:
If you were able to upgrade with a degraded operator, then this is something the CVO team should look into. Monitoring here is just the messenger. Reassigning to the CVO team for further investigation.
Setting severity to low as it does not impact cluster availability.
We are working on higher-priority bugs and feature development, and hence are moving this to the next sprint.
Can you confirm if you are seeing alerts like this: "message: Cluster operator {{ $labels.name }} has not been available for 10 mins. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible."?
@lmohanty: Yes, that is the specific alert message we are seeing.
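For reference, a minimal sketch of what that rule plausibly looks like, assuming it is driven by the cluster_operator_up metric discussed in comment 8 below; only the alert name, the 10-minute window, and the message text are confirmed in this thread, the expression and severity are reconstruction:

- alert: ClusterOperatorDown
  # Assumed expression: cluster_operator_up drops to 0 when the operator
  # is not Available or is Degraded (see comment 8).
  expr: cluster_operator_up{job="cluster-version-operator"} == 0
  for: 10m
  labels:
    severity: critical  # placeholder
  annotations:
    message: Cluster operator {{ $labels.name }} has not been available for 10 mins. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.

With an expression like that, the alert fires for Degraded=True operators even while they remain Available=True, which is exactly the mismatch this bug describes.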
Degraded operators should be blocking update success, per [1,2]. If that's not happening, we'll probably need CVO logs, and possibly a must-gather, to understand why not. [1]: https://github.com/openshift/cluster-version-operator/blob/e318b86d674ccc850266b74169a0a4403d3b633b/docs/user/reconciliation.md#clusteroperator [2]: https://github.com/openshift/cluster-version-operator/blob/07e65a31745675bae7d33d19777042636633bdac/pkg/cvo/internal/operatorstatus.go#L204-L209
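If someone does reproduce comment 0's successful-update-while-degraded behavior, something along these lines should capture what we'd need (a sketch; it just uses the standard CVO namespace and deployment name plus stock must-gather):

$ oc -n openshift-cluster-version logs deployment/cluster-version-operator > cvo.log
$ oc adm must-gather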
> 1. Degrade an operator (I tested by degrading `authentication` by providing it a bad OpenID IDP certificate)

In the absence of a must-gather, it's nice to be explicit about the reproducer, instead of assuming we're all familiar with how to do this^ ;). Trying to reconstruct via [1], I have:

1. Used @cluster-bot to launch a cluster:

launch 4.3.18 gcp

2. Set a channel, because cluster-bot clusters now come without channels configured:

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/channel", "value": "stable-4.3"}]'

3. Copied down the "Standard OpenID Connect CR" from [1] with some placeholders:

$ cat <<EOF >oauth.yaml
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: oidcidp
    mappingMethod: claim
    type: OpenID
    openID:
      clientID: does-not-exist
      clientSecret:
        name: does-not-exist
      claims:
        preferredUsername:
        - preferred_username
        name:
        - name
        email:
        - email
      issuer: https://www.idp-issuer.example.com
EOF

4. Pushed it into my cluster:

$ oc apply -f oauth.yaml

5. Confirmed that the auth operator is degraded:

$ oc get -o json clusteroperator authentication | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2020-07-06T23:18:00Z Degraded=False AsExpected: IdentityProviderConfigDegraded: failed to apply IDP oidcidp config: dial tcp: lookup www.idp-issuer.example.com on 172.30.0.10:53: no such host
2020-07-06T23:18:00Z Progressing=False AsExpected: -
2020-07-06T23:18:00Z Available=True AsExpected: -
2020-07-06T23:05:25Z Upgradeable=True AsExpected: -

Hmm, I guess the auth operator is giving things some time to see if they resolve on their own. After waiting a minute or few, Degraded flips to True:

$ oc get -o json clusteroperator authentication | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2020-07-06T23:22:10Z Degraded=True IdentityProviderConfigDegradedError: IdentityProviderConfigDegraded: failed to apply IDP oidcidp config: dial tcp: lookup www.idp-issuer.example.com on 172.30.0.10:53: no such host
2020-07-06T23:18:00Z Progressing=False AsExpected: -
2020-07-06T23:18:00Z Available=True AsExpected: -
2020-07-06T23:05:25Z Upgradeable=True AsExpected: -

6. Waited 10m, saw both ClusterOperatorDegraded and ClusterOperatorDown complaining about authentication.

7. Triggered an update:

$ oc adm upgrade --to 4.3.19

8. Watched the update:

$ oc get --watch clusterversion version
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.18    True        True          60s     Working towards 4.3.19: 13% complete
version   4.3.18    True        True          105s    Working towards 4.3.19: 16% complete
version   4.3.18    True        True          3m15s   Working towards 4.3.19: 23% complete
version   4.3.18    True        True          5m17s   Working towards 4.3.19: 24% complete
...
version   4.3.18    True        True          12m     Unable to apply 4.3.19: the cluster operator authentication is degraded

So good, stuck on the still-degraded operator. I dunno how to square that with comment 0's "Cluster will upgrade successfully, despite alert". I'm also personally fine allowing updates to not block on ClusterOperator Degraded=True, and only using the operator's claimed Available=True and versions to gate the ClusterOperator manifest.

> This alert will fire if an operator is in a degraded state (ClusterOperatorDegraded also fires for that)...

Hmm. Looks like cluster_operator_up has been checking both Available=True and Degraded=False since it landed [2].
Clayton, can you describe why we have cluster_operator_up instead of relying on cluster_operator_conditions? And maybe Abhinav can talk about ClusterOperatorDown [3]? Seems like it could pivot to just be about:

cluster_operator_conditions{job="cluster-version-operator", condition="Available"} == 0

or some such.

[1]: https://docs.openshift.com/container-platform/4.3/authentication/identity_providers/configuring-oidc-identity-provider.html
[2]: https://github.com/openshift/cluster-version-operator/pull/45/commits/8b9118992dfcb43f9c7629b21c8d299f352f52fe#diff-b301ca75c2e59524b215636eb7a4586aR93
[3]: https://github.com/openshift/cluster-version-operator/pull/232
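For concreteness, the pivoted rule might look something like this (purely a sketch; only the expression above is from this thread, and the severity and annotation wording are placeholders):

- alert: ClusterOperatorDown
  # Fire only on Available=False, leaving Degraded=True to ClusterOperatorDegraded.
  expr: cluster_operator_conditions{job="cluster-version-operator", condition="Available"} == 0
  for: 10m
  labels:
    severity: critical  # placeholder
  annotations:
    message: Cluster operator {{ $labels.name }} has not been available for 10 minutes.  # placeholder wording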
We do not have time to fix this bug in the current sprint, as we are working on higher-priority bugs and features. Hence we are adding UpcomingSprint now, and we'll revisit this in the next sprint.
Moving this to the next sprint as we are working on higher priority bugs and features.
Changing the target release to 4.7 as this is not critical for 4.6.
Punting to the next sprint again, when I'll try to hound Clayton and Abhinav for answers to the questions from comment 8.
Comment 13 is still current.
Simplest test is probably:

$ cat <<EOF >co.yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  name: bug-testing
status:
  conditions:
  - lastTransitionTime: '2021-01-01T01:01:01Z'
    type: Available
    status: 'True'
    reason: AsExpected
    message: All good
  - lastTransitionTime: '2021-01-01T01:01:01Z'
    type: Degraded
    status: 'True'
    reason: BadStuff
    message: Help me Obi-wan Kenobi, you're my only hope
EOF
$ oc apply -f co.yaml

And then wait 10 minutes to see that ClusterOperatorDegraded is going off, but ClusterOperatorDown is no longer going off. If you have a favorite mechanism to make an existing ClusterOperator Available=True Degraded=True, that would work too.
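If you'd rather exercise an existing operator, the bad-IDP trick from comment 8 leaves authentication Available=True Degraded=True; afterwards it can be reverted with something like this (a sketch, assuming spec.identityProviders was unset before the test):

$ oc patch oauth cluster --type json -p '[{"op": "remove", "path": "/spec/identityProviders"}]'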
1. Launch a v4.3.18 cluster, check that all operators are running well and that no ClusterOperatorDegraded or ClusterOperatorDown alerts are firing.

2. Degrade the authentication operator following comment 8, wait 10 minutes, then check the operator status and alerts:

# ./oc get co authentication
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.3.18    True        False         True       18h

# ./oc get co authentication
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.3.18    True        False         True       16h

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.jliu-43.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDown").annotations.message'
Cluster operator authentication has not been available for 10 mins. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible.

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.jliu-43.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded").annotations.message'
Cluster operator authentication has been degraded for 10 mins. Operator is degraded because IdentityProviderConfigDegradedError and cluster upgrades will be unstable.

3. Trigger an upgrade to v4.3.19; the upgrade gets stuck on the authentication operator:

# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.18    True        True          20m     Unable to apply 4.3.19: the cluster operator authentication is degraded

According to the above test, we cannot reproduce "upgrade successfully with the alert saying it's not possible", but we can reproduce both the ClusterOperatorDegraded and ClusterOperatorDown alerts firing. So the following verification will focus on the reproduced part.
We don't currently have test cases covering both the ClusterOperatorDegraded and ClusterOperatorDown alerts, so adding "NeedsTestCase".
Degrade the authentication operator, wait 10 minutes, then check the operator status and alerts:

# ./oc get co authentication
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication   4.8.0-0.nightly-2021-05-08-025039   True        False         True       62m

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.jliu-48.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDown")|.state + "\n" + .annotations.message'

# curl -s -k -H "Authorization: Bearer $token" https://prometheus-k8s-openshift-monitoring.apps.jliu-48.qe.gcp.devcluster.openshift.com/api/v1/alerts | jq -r '.data.alerts[]| select(.labels.alertname == "ClusterOperatorDegraded")|.state + "\n" + .annotations.message'
firing
Cluster operator authentication has been degraded for 10 minutes. Operator is degraded because OAuthServerConfigObservation_Error and cluster upgrades will be unstable.

Only the ClusterOperatorDegraded alert fired. Verified on 4.8.0-0.nightly-2021-05-08-025039.
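As an extra check on the raw data behind the alerts, one could also query the condition metric directly (illustrative; <prometheus-route> is a placeholder for the cluster's Prometheus route, and this assumes cluster_operator_conditions carries a name label like cluster_operator_up does):

# curl -s -k -H "Authorization: Bearer $token" 'https://<prometheus-route>/api/v1/query?query=cluster_operator_conditions{name="authentication",condition="Available"}' | jq -r '.data.result[].value[1]'

A value of 1 there, alongside the firing ClusterOperatorDegraded alert, matches the intended post-fix behavior: Down keys off Available, Degraded keys off Degraded.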
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.