Bug 1951835 - CVO should propagate ClusterOperator's Degraded to ClusterVersion's Failing during install
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.12.0
Assignee: Over the Air Updates
QA Contact: Yang Yang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-04-20 23:52 UTC by Clayton Coleman
Modified: 2023-01-17 19:46 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-17 19:46:24 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 662 0 None Merged Bug 1951835: Propagate Degraded to update status 2022-10-10 21:49:13 UTC
Github openshift cluster-version-operator pull 837 0 None Merged Bug 1951835: Handle report only sync errors 2022-10-10 21:49:11 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:46:32 UTC

Description Clayton Coleman 2021-04-20 23:52:17 UTC
EDITED: clarifying based on more details of history

During early CVO development we switched to adding an "initializing" mode that bypassed the degraded state. This leads to the CVO reporting Available=true the first time all operators go available (regardless of degraded state).  On subsequent reconciles, if any operator is degraded, we set the failed condition after a wait.

This masks significant problems that can be caused by an operator going degraded. In general, there is no scenario where degraded is expected at install-success time for automated flows, and degraded may indicate that not all user-provided config has been rolled out.

The CVO code that introduced this bug was, at the time, dealing with COs that had significant issues, and a case was made that we could tolerate the failures (we added logic to the test suite to wait until things settled).  Looking back (with recent examples) I'm no longer convinced that we should allow the CVO to continue, because we are handing off responsibility without adequately conveying to the user the impact of the degraded state on their post-install behavior.

I think it is reasonable to avoid overeager "degraded" summarization on the CVO Failed condition during install (as the current logic does), but critically, at the moment the CVO goes available if any operator is degraded the Failed condition must be set to true, and install must exit non-zero, because a degraded operator during install is neither normal, nor is there a need to wait.  Put another way, install bypasses the "wait 40m to go failed" inertia because there was never any "known good state" to have inertia from.  A possible minimal approach is to assess when we will transition from initializing -> reconciling at the end of a payload sync and to summarize state such that we go available=true and failed=true at the same time, while resetting the "inertia start" for all of those operators.
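
As a rough sketch of that summarization, with made-up types rather than the actual CVO code (the real thing works with the ClusterOperator/ClusterVersion types from openshift/api), the end-of-initializing check could look something like:

package main

import "fmt"

// Operator and Summary are hypothetical, trimmed-down stand-ins for the real
// API types; this only illustrates the proposed condition summarization.
type Operator struct {
	Name      string
	Available bool
	Degraded  bool
}

type Summary struct {
	Available bool
	Failing   bool
	Message   string
}

// summarizeAtInitComplete: the moment every operator is Available, report
// Available=true, but if any of them is still Degraded, set Failing=true at
// the same time instead of waiting out the usual inertia.
func summarizeAtInitComplete(ops []Operator) Summary {
	for _, op := range ops {
		if !op.Available {
			return Summary{Message: fmt.Sprintf("cluster operator %s is not available", op.Name)}
		}
	}
	s := Summary{Available: true}
	for _, op := range ops {
		if op.Degraded {
			s.Failing = true
			s.Message = fmt.Sprintf("cluster operator %s is degraded", op.Name)
			break
		}
	}
	return s
}

func main() {
	fmt.Printf("%+v\n", summarizeAtInitComplete([]Operator{
		{Name: "kube-apiserver", Available: true, Degraded: true},
		{Name: "network", Available: true},
	}))
	// {Available:true Failing:true Message:cluster operator kube-apiserver is degraded}
	// i.e. install can end, but should end non-zero.
}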

As a subsequent issue, install should exit when Available=true, but return non-zero when Failed=true. The combination of these two fixes would prevent handoff to admin in automated settings (like CI, hive, or customer automation) that is expensive or complex to catch after the fact.

Comment 1 Clayton Coleman 2021-04-20 23:55:24 UTC
Triggered by discussion on https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/1073/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-serial/1384229570948894720

TL;DR:

Suppressing the CVO Failed condition during install, up until Available=true, is acceptable; but if operators are degraded at that time, the Available false => true transition must be accompanied by Failed false => true, and install must exit non-zero.

Comment 2 Clayton Coleman 2021-04-21 00:03:35 UTC
Going back and reviewing https://github.com/openshift/cluster-version-operator/commit/94c4e576d1e10b568e0c44648904637e12776927#diff-a35ab582a8b2cae2bee4525f7863488327e93f398424d6fa1407b77561c4960cL182, I also think this change missed a subtle and important point.  It's not that degraded shouldn't block the success of install.  It's that degraded shouldn't block progress through the payload.  At the very end of the payload reconcile, a degraded operator should still result in a failing state.  The payload transition from initializing -> reconciling should reflect the current state of the sync, and the sync was not happy (it's bad to go from initializing "good" to reconciling "broken" 40m after install - we already know, so we should surface it right away).

So this may not be a regression on the surface, but it's a violation of the intent of both Available and Degraded - the core CVO sync loop should not exit without error if something is degraded.

Comment 3 Clayton Coleman 2021-04-21 00:08:23 UTC
A second follow-up bug (once (available=true, failed=true) is set) is for the installer to exit non-zero on Failed.  I don't think an additional wait is necessary; degraded has been tightened sufficiently in operators that a user should instantly know that they must investigate something.

Comment 4 Clayton Coleman 2021-04-21 00:36:19 UTC
Also, the 40m inertia for failed should start when we transition to reconciling, which after the changes recommended here would then leave us in failed state, and could potentially clear immediately.  We shouldn't have to wait until 40m from the time the operator was created to report failing, we should use the transition to indicate that.  This should not have any impact on reconciling or upgrade behavior, which are different modes.

Comment 5 Clayton Coleman 2021-04-21 00:49:00 UTC
Some review of existing jobs 

https://search.ci.openshift.org/?search=Some+cluster+operators+are+not+ready&maxAge=336h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

This is searching for the early e2e test that currently protects us from the error.  It looks like we have a 1-4% failure rate on install where we exit "good" while the cluster is in reality "bad" - just a quick scan of those results included some pretty concerning failures - things that could be infrastructure (failed master starts), race conditions in our code that might lead to lockups (rollouts), potential kubelet issues.  All of those imply failure modes that I would prefer we catch and signal to a customer rather than continuing with post-install automation.  They may heal, but may not.  It's our responsibility to summarize the state of the install correctly, and in the current state I do not think we are doing that.

Comment 6 W. Trevor King 2021-04-21 04:54:58 UTC
> Also, the 40m inertia for failed should start when we transition to reconciling...

In master, the only way to get the UpdateEffectFailAfterInterval effect is Degraded=True when the mode is not InitializingMode [1].  This continues a long tradition of ignoring Degraded (previously Failing) during install [2,3].  And the installer has only ever looked at Available since it started watching ClusterVersion [4].  So I think CVO inertia is completely orthogonal to the InitializingMode logic.
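
Paraphrasing that in a tiny sketch with made-up types (this is not the code in [1], just the behavior it implies):

package main

import "fmt"

type Mode int

const (
	InitializingMode Mode = iota
	ReconcilingMode
)

// UpdateEffect values echo the names discussed above, but this is a
// hypothetical simplification, not the real implementation.
type UpdateEffect string

const (
	UpdateEffectNone              UpdateEffect = "None"
	UpdateEffectFail              UpdateEffect = "Fail"
	UpdateEffectFailAfterInterval UpdateEffect = "FailAfterInterval"
)

// effectFor: Available=False always produces a hard failure effect, while
// Degraded=True only produces the fail-after-interval effect outside of
// InitializingMode, i.e. Degraded is effectively ignored during install.
func effectFor(mode Mode, available, degraded bool) UpdateEffect {
	if !available {
		return UpdateEffectFail
	}
	if degraded && mode != InitializingMode {
		return UpdateEffectFailAfterInterval
	}
	return UpdateEffectNone
}

func main() {
	fmt.Println(effectFor(InitializingMode, true, true)) // None: install ignores Degraded
	fmt.Println(effectFor(ReconcilingMode, true, true))  // FailAfterInterval
}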

> ...and degraded may indicate not all user provided config has been rolled out.

This is a fairly tenuous connection.  Users can push a whole bunch of config manifests via the installer.  cluster-bootstrap will happily push those into the cluster, but I don't think it has any CVO-style "is the in-cluster resource sufficiently happy?" logic.

> Put another way, install bypasses the "wait 40m to go failed" inertia because there was never any "known good state" to have inertia from.

If any component is break-the-install sad right off the bat, it should not go Available=True.  I expect a number of these cases are from operators which were initially Available=True Degraded=False, but then subsequently went Degraded=True before the install completed.

> It's that degraded shouldn't block progress through the payload.

Huzzah.  LGTM's on [5] welcome ;).

> At the very end of the payload reconcile, a degraded operator should still result in a failing state.

In the context of [5], I'm not all that worried about this, because ClusterOperatorDegraded is a critical alert.  The only time folks care about the non-alert ClusterVersion Failing condition is the installer and other folks for whom querying alerts is awkward.

> ...it's bad to go from initializing "good" to reconciling "broken" 40m after install...

It's also bad to say "probably rip this cluster down and try a fresh install" a minute before everything is happy, which is what happened in this compact 4.5 job [6]: install completed at 2021-04-11T08:36:48Z [7] and kube-scheduler recovered to Degraded=False at 2021-04-11T08:37:14Z [8].

> ...degraded has been tightened sufficiently in operators that a user should instantly know that they must investigate something...

I am not convinced.  I will try and assemble more statistics to ground our divergent assessment of Degraded=True importance in reality.

> All of those imply failure modes that I would prefer we catch and signal to a customer vs continuing after install automation.

I am 100% in favor of failing CI on 'Managed cluster should start all core operators' to drive out both "something irrecoverably bad is happening" and "operator is overly jumpy or we have a quickly-self-recovering bug" issues before we ship releases.  I am less clear on whether we want to fail installs in the wild solely on the grounds of an Available=True Degraded=True operator, because I expect the bulk of these to be recoverable noise.  Not great.  And I'm fine if the installer wants to wait a bit to see if operators settle themselves down.  But again, probably good to try and ground our assumptions in statistics.

> It's our responsibility to summarize the state of the install correctly and in the current state I do not think we are doing that.

I have no problem with the installer mentioning "hey, these ClusterOperators are Degraded=True" in logs and/or machine-readable output.  I'm just not yet sold on "enough of those are bad enough that we want to scrap the whole install".

[1]: https://github.com/openshift/cluster-version-operator/blob/6fdd1e0f313f9c67ddf93037a0d4e17ce62e89ab/pkg/cvo/internal/operatorstatus.go#L215-L221
[2]: https://github.com/openshift/cluster-version-operator/commit/94c4e576d1e10b568e0c44648904637e12776927#diff-a35ab582a8b2cae2bee4525f7863488327e93f398424d6fa1407b77561c4960cL182
[3]: https://github.com/openshift/cluster-version-operator/pull/136/commits/b0b4902fce3add235d1abff6c269b0e39b06f1e9
[4]: https://github.com/openshift/installer/pull/1132/commits/e17ba3c571ad80f27b39a6f53d44bcbd9401684a#diff-56276d5381d618d46ec8d35d93210662c8fdd4c9bcd90fd36afbc6a59227eb0bR316-R318
[5]: https://github.com/openshift/cluster-version-operator/pull/482
[6]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-compact-4.5/1381156351165599744
[7]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-compact-4.5/1381156351165599744/artifacts/e2e-gcp/clusterversion.json
[8]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-compact-4.5/1381156351165599744/artifacts/e2e-gcp/clusteroperators.json

Comment 7 W. Trevor King 2021-05-07 16:30:55 UTC
David pointed out that in [1], CVO doesn't set Failing=True until 2021-04-19T20:46:40Z, but kube-apiserver went Degraded=True at 2021-04-19T20:19:39Z.  We should be propagating ClusterOperator's Degraded into ClusterVersion's Failing during install, like we do when in reconciliation mode, even as we continue to not block graph synchronization on that Degraded state.

If folks want the installer to watch for ClusterVersion's Failing as a condition of declaring install complete, that would be a separate installer bug about touching the code near [2].

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/1073/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-serial/1384229570948894720
[2]: https://github.com/openshift/installer/blob/35b738bdb71fad9ecb01db320500a787dfc60d92/cmd/openshift-install/create.go#L455-L457
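
For the installer-side piece, the shape could be something like the following sketch (made-up condition struct; the real code near [2] reads configv1.ClusterVersion and currently only waits on Available):

package main

import (
	"errors"
	"fmt"
)

// condition is a hypothetical, trimmed-down stand-in for a ClusterVersion
// status condition; it only illustrates the exit-code idea.
type condition struct {
	Type   string
	Status string
}

// installResult: declare the install complete once Available=True, but return
// an error (so the binary exits non-zero) if Failing=True at that moment.
func installResult(conds []condition) error {
	var available, failing bool
	for _, c := range conds {
		switch c.Type {
		case "Available":
			available = c.Status == "True"
		case "Failing":
			failing = c.Status == "True"
		}
	}
	if !available {
		return errors.New("cluster version is not yet available")
	}
	if failing {
		return errors.New("install completed, but ClusterVersion is Failing=True")
	}
	return nil
}

func main() {
	err := installResult([]condition{
		{Type: "Available", Status: "True"},
		{Type: "Failing", Status: "True"},
	})
	fmt.Println(err) // non-nil: the installer should exit non-zero
}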

Comment 8 Scott Dodson 2021-05-26 13:56:30 UTC
I think it's reasonable for the installer to pay attention to the CVO Failing condition; that can be done now regardless of whether or not the CVO has started setting Failing per the semantics described in this bug. https://issues.redhat.com/browse/CORS-1721 tracks that.

Comment 9 Lalatendu Mohanty 2021-06-07 15:37:51 UTC
Removing 4.8 from target release as we do not have enough time to fix this in 4.8. Also this is not a blocker for 4.8 release.

Comment 10 W. Trevor King 2021-07-23 23:36:07 UTC
I haven't had time to pick this up.  Anyone who wants it can take it.

Comment 11 Johnny Liu 2021-10-21 08:07:30 UTC
Does anyone know how to reproduce this issue from QE perspective?

Comment 12 W. Trevor King 2021-10-21 23:52:37 UTC
Ideally we could make an operator go Available=True Degraded=True during the install.  Perhaps by cordoning all but one compute node and killing off a router pod?  Reviewing the CI-search query linked in comment 5, killing oauth-openshift.openshift-authentication pods might work too ( OAuthServerDeployment_UnavailablePod ).  Or figuring out what this job does [1] to make kube-apiserver mad about FeatureGates_RestrictedFeatureGates_TechPreviewNoUpgrade.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-openstack-techpreview-parallel/1451130495969529856

Comment 14 Brenton Leanhardt 2022-01-24 17:39:58 UTC
We're still planning to work on this but it's not going to make 4.10.

Comment 17 Yang Yang 2022-08-23 09:17:13 UTC
Reproducing with 4.11.0

# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.11.0    False       False         True       7m42s   OAuthServerRouteEndpointAccessibleControllerAvailable: failed to retrieve route from cache: route.route.openshift.io "oauth-openshift" not found...
baremetal                                  4.11.0    True        False         False      6m34s   
cloud-controller-manager                   4.11.0    True        False         False      8m15s   
cloud-credential                                     True        False         False      8m39s   
cluster-autoscaler                         4.11.0    True        False         False      6m16s   
config-operator                            4.11.0    True        False         False      7m39s   
console                                                                                           
csi-snapshot-controller                    4.11.0    True        False         False      7m13s   
dns                                        4.11.0    True        False         False      6m17s   
etcd                                       4.11.0    True        True          False      5m30s   NodeInstallerProgressing: 1 nodes are at revision 4; 1 nodes are at revision 6; 1 nodes are at revision 7
image-registry                                       False       True          False      4s      Available: The deployment does not have available replicas...
ingress                                              False       True          True       38s     The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
insights                                   4.11.0    True        False         False      39s     
kube-apiserver                             4.11.0    True        True          True       2m15s   GuardControllerDegraded: Missing operand on node yanyang-0823b-gx79n-master-0.c.openshift-qe.internal
kube-controller-manager                    4.11.0    True        True          False      4m27s   NodeInstallerProgressing: 3 nodes are at revision 5; 0 nodes have achieved new revision 6
kube-scheduler                             4.11.0    True        True          False      4m25s   NodeInstallerProgressing: 3 nodes are at revision 5; 0 nodes have achieved new revision 6
kube-storage-version-migrator              4.11.0    True        False         False      7m29s   
machine-api                                4.11.0    True        False         False      3m8s    
machine-approver                           4.11.0    True        False         False      6m21s   
machine-config                             4.11.0    True        False         False      5m51s   
marketplace                                4.11.0    True        False         False      6m24s   
monitoring                                           Unknown     True          Unknown    6m38s   Rolling out the stack.
network                                    4.11.0    True        True          False      8m41s   Deployment "/openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
node-tuning                                4.11.0    True        False         False      6m17s   
openshift-apiserver                        4.11.0    True        True          False      38s     APIServerDeploymentProgressing: deployment/apiserver.openshift-apiserver: observed generation is 3, desired generation is 4.
openshift-controller-manager               4.11.0    True        False         False      3m31s   
openshift-samples                                                                                 
operator-lifecycle-manager                 4.11.0    True        False         False      7m1s    
operator-lifecycle-manager-catalog         4.11.0    True        False         False      7m5s    
operator-lifecycle-manager-packageserver   4.11.0    True        False         False      37s     
service-ca                                 4.11.0    True        False         False      7m33s   
storage                                    4.11.0    True        False         False      6m51s   

# oc get clusterversion -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2022-08-23T07:27:37Z"
    generation: 2
    name: version
    resourceVersion: "16559"
    uid: 70518ad5-914f-4d21-ac15-aa815e3f5f24
  spec:
    channel: stable-4.11
    clusterID: 5fae025d-8adf-42ef-b0a5-7afa2be20752
  status:
    availableUpdates: null
    capabilities:
      enabledCapabilities:
      - baremetal
      - marketplace
      - openshift-samples
      knownCapabilities:
      - baremetal
      - marketplace
      - openshift-samples
    conditions:
    - lastTransitionTime: "2022-08-23T07:28:00Z"
      status: "True"
      type: RetrievedUpdates
    - lastTransitionTime: "2022-08-23T07:28:00Z"
      message: Disabling ownership via cluster version overrides prevents upgrades.
        Please remove overrides before continuing.
      reason: ClusterVersionOverridesSet
      status: "False"
      type: Upgradeable
    - lastTransitionTime: "2022-08-23T07:28:00Z"
      message: Capabilities match configured spec
      reason: AsExpected
      status: "False"
      type: ImplicitlyEnabledCapabilities
    - lastTransitionTime: "2022-08-23T07:28:00Z"
      message: Payload loaded version="4.11.0" image="quay.io/openshift-release-dev/ocp-release@sha256:300bce8246cf880e792e106607925de0a404484637627edf5f517375517d54a4"
      reason: PayloadLoaded
      status: "True"
      type: ReleaseAccepted
    - lastTransitionTime: "2022-08-23T07:28:00Z"
      status: "False"
      type: Available
    - lastTransitionTime: "2022-08-23T07:34:45Z"
      message: |-
        Multiple errors are preventing progress:
        * Could not update imagestream "openshift/driver-toolkit" (551 of 802): the server is down or not responding
        * Could not update oauthclient "console" (498 of 802): the server does not recognize this resource, check extension API servers
        * Could not update role "openshift-console-operator/prometheus-k8s" (722 of 802): resource may have been deleted
        * Could not update role "openshift-console/prometheus-k8s" (725 of 802): resource may have been deleted
      reason: MultipleErrors
      status: "True"
      type: Failing
    - lastTransitionTime: "2022-08-23T07:28:00Z"
      message: 'Unable to apply 4.11.0: an unknown error has occurred: MultipleErrors'
      reason: MultipleErrors
      status: "True"
      type: Progressing
    desired:
      channels:
      - candidate-4.11
      - fast-4.11
      - stable-4.11
      image: quay.io/openshift-release-dev/ocp-release@sha256:300bce8246cf880e792e106607925de0a404484637627edf5f517375517d54a4
      url: https://access.redhat.com/errata/RHSA-2022:5069
      version: 4.11.0
    history:
    - completionTime: null
      image: quay.io/openshift-release-dev/ocp-release@sha256:300bce8246cf880e792e106607925de0a404484637627edf5f517375517d54a4
      startedTime: "2022-08-23T07:28:00Z"
      state: Partial
      verified: false
      version: 4.11.0
    observedGeneration: 1
    versionHash: jtC771QuuKI=
kind: List
metadata:
  resourceVersion: ""

# grep "transitioning from Initializing" cvo.log 
I0823 07:48:29.077535       1 sync_worker.go:608] Sync succeeded, transitioning from Initializing to Reconciling

In the Initializing stage, the CVO Failing condition doesn't report the degraded-but-available COs: kube-controller-manager, kube-scheduler, network. Seems like the issue is reproduced. Will verify it with 4.12.

Comment 18 Yang Yang 2022-08-23 10:02:04 UTC
Verifying with 4.12.0-0.nightly-2022-08-22-143022

During cluster install, make network degraded by

# oc patch proxy cluster --type json -p '[{"op": "replace", "path": "/spec/trustedCA/name", "value": "osus-ca"}]'

# oc logs pod/cluster-version-operator-77f86bfd66-rjxbr -n openshift-cluster-version | grep "transitioning from Initializing to Reconciling"

Okay, the cluster is in the Initializing stage.

# oc get co; oc get clusterversion/version -ojson | jq .status.conditions
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.0-0.nightly-2022-08-22-143022   True        False         False      51m     
baremetal                                  4.12.0-0.nightly-2022-08-22-143022   True        False         False      73m     
cloud-controller-manager                   4.12.0-0.nightly-2022-08-22-143022   True        False         False      75m     
cloud-credential                           4.12.0-0.nightly-2022-08-22-143022   True        False         False      78m     
cluster-autoscaler                         4.12.0-0.nightly-2022-08-22-143022   True        False         False      73m     
config-operator                            4.12.0-0.nightly-2022-08-22-143022   True        False         False      74m     
console                                    4.12.0-0.nightly-2022-08-22-143022   True        False         False      60m     
csi-snapshot-controller                    4.12.0-0.nightly-2022-08-22-143022   True        False         False      74m     
dns                                        4.12.0-0.nightly-2022-08-22-143022   True        False         False      73m     
etcd                                       4.12.0-0.nightly-2022-08-22-143022   True        False         False      72m     
image-registry                             4.12.0-0.nightly-2022-08-22-143022   True        False         False      67m     
ingress                                    4.12.0-0.nightly-2022-08-22-143022   True        False         False      67m     
insights                                   4.12.0-0.nightly-2022-08-22-143022   True        False         False      67m     
kube-apiserver                             4.12.0-0.nightly-2022-08-22-143022   True        False         False      70m     
kube-controller-manager                    4.12.0-0.nightly-2022-08-22-143022   True        False         False      71m     
kube-scheduler                             4.12.0-0.nightly-2022-08-22-143022   True        False         False      70m     
kube-storage-version-migrator              4.12.0-0.nightly-2022-08-22-143022   True        False         False      74m     
machine-api                                4.12.0-0.nightly-2022-08-22-143022   True        False         False      68m     
machine-approver                           4.12.0-0.nightly-2022-08-22-143022   True        False         False      73m     
machine-config                             4.12.0-0.nightly-2022-08-22-143022   False       False         True       60m     Cluster not available for [{operator 4.12.0-0.nightly-2022-08-22-143022}]
marketplace                                4.12.0-0.nightly-2022-08-22-143022   True        False         False      73m     
monitoring                                 4.12.0-0.nightly-2022-08-22-143022   True        False         False      62m     
network                                    4.12.0-0.nightly-2022-08-22-143022   True        False         True       75m     The configuration is invalid for proxy 'cluster' (failed to validate configmap reference for proxy trustedCA 'osus-ca': failed to get trustedCA configmap for proxy cluster: configmaps "osus-ca" not found). Use 'oc edit proxy.config.openshift.io cluster' to fix.
node-tuning                                4.12.0-0.nightly-2022-08-22-143022   True        False         False      73m     
openshift-apiserver                        4.12.0-0.nightly-2022-08-22-143022   True        False         False      68m     
openshift-controller-manager               4.12.0-0.nightly-2022-08-22-143022   True        False         False      70m     
openshift-samples                          4.12.0-0.nightly-2022-08-22-143022   True        False         False      68m     
operator-lifecycle-manager                 4.12.0-0.nightly-2022-08-22-143022   True        False         False      73m     
operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-08-22-143022   True        False         False      73m     
operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-08-22-143022   True        False         False      68m     
service-ca                                 4.12.0-0.nightly-2022-08-22-143022   True        False         False      74m     
storage                                    4.12.0-0.nightly-2022-08-22-143022   True        False         False      73m     
[
  {
    "lastTransitionTime": "2022-08-23T08:36:07Z",
    "message": "Unable to retrieve available updates: currently reconciling cluster version 4.12.0-0.nightly-2022-08-22-143022 not found in the \"stable-4.11\" channel",
    "reason": "VersionNotFound",
    "status": "False",
    "type": "RetrievedUpdates"
  },
  {
    "lastTransitionTime": "2022-08-23T08:36:07Z",
    "message": "Capabilities match configured spec",
    "reason": "AsExpected",
    "status": "False",
    "type": "ImplicitlyEnabledCapabilities"
  },
  {
    "lastTransitionTime": "2022-08-23T08:36:07Z",
    "message": "Payload loaded version=\"4.12.0-0.nightly-2022-08-22-143022\" image=\"registry.ci.openshift.org/ocp/release@sha256:9e56a2ce8110d06bc1cbc212339834e3b12925cd2bfe4e9a0755e88e5619854d\" architecture=\"amd64\"",
    "reason": "PayloadLoaded",
    "status": "True",
    "type": "ReleaseAccepted"
  },
  {
    "lastTransitionTime": "2022-08-23T08:36:07Z",
    "status": "False",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2022-08-23T09:06:07Z",
    "message": "Cluster operator machine-config is not available",
    "reason": "ClusterOperatorNotAvailable",
    "status": "True",
    "type": "Failing"
  },
  {
    "lastTransitionTime": "2022-08-23T08:36:07Z",
    "message": "Unable to apply 4.12.0-0.nightly-2022-08-22-143022: the cluster operator machine-config is not available",
    "reason": "ClusterOperatorNotAvailable",
    "status": "True",
    "type": "Progressing"
  }
]

The Failing condition doesn't mention the degraded-but-available CO: network. That doesn't look right.

Jack, could you please take a look^?

Comment 19 W. Trevor King 2022-08-23 22:29:33 UTC
'Failing' is basically our only way to report both Degraded=True and Available=False ClusterOperators (and all of the other issues we can have at reconcile time).  We attempt to prioritize by severity in [1].  It seems reasonable that Available=False would win, although I'm not immediately clear on why we don't also complain about 'network's Degraded=True.  CVO logs might clarify.

[1]: https://github.com/openshift/cluster-version-operator/pull/662/files#diff-2010b5bb18e3579c7c8a1c79ab439955a723894f85549689a4401790b0315f00R1181-R1203
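
The rough idea of that prioritization, as a sketch with made-up types (not the code behind [1]):

package main

import (
	"fmt"
	"sort"
)

// issue is a hypothetical stand-in for the per-operator problems the CVO
// folds into the single ClusterVersion Failing condition.
type issue struct {
	severity int // higher is worse
	message  string
}

// failingMessage: when several operators have problems, the worst one wins
// the Failing message, so Available=False beats Degraded=True.
func failingMessage(issues []issue) string {
	if len(issues) == 0 {
		return ""
	}
	sort.Slice(issues, func(i, j int) bool { return issues[i].severity > issues[j].severity })
	return issues[0].message
}

func main() {
	fmt.Println(failingMessage([]issue{
		{severity: 1, message: "Cluster operator network is degraded"},
		{severity: 2, message: "Cluster operator machine-config is not available"},
	}))
	// Cluster operator machine-config is not available
}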

Comment 20 Yang Yang 2022-08-24 12:34:30 UTC
To avoid the impact of Available=False operators, verify it instead by making authentication degraded.

During the install, apply an OAuth identity provider that points at an unreachable issuer.

# cat oauth.yaml 
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: oidcidp 
    mappingMethod: claim 
    type: OpenID
    openID:
      clientID: test
      clientSecret: 
        name: test
      claims: 
        preferredUsername:
        - preferred_username
        name:
        - name
        email:
        - email
      issuer: https://www.idp-issuer.example.com 

# oc apply -f oauth.yaml

# oc get co authentication
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.12.0-0.nightly-2022-08-23-223922   True        False         True       49m     OAuthServerConfigObservationDegraded: failed to apply IDP oidcidp config: dial tcp: lookup www.idp-issuer.example.com on 172.30.0.10:53: no such host

Authentication is degraded.

# oc get clusterversion/version -ojson | jq .status.conditions
[
  {
    "lastTransitionTime": "2022-08-24T11:23:16Z",
    "message": "Unable to retrieve available updates: currently reconciling cluster version 4.12.0-0.nightly-2022-08-23-223922 not found in the \"stable-4.11\" channel",
    "reason": "VersionNotFound",
    "status": "False",
    "type": "RetrievedUpdates"
  },
  {
    "lastTransitionTime": "2022-08-24T11:23:16Z",
    "message": "Capabilities match configured spec",
    "reason": "AsExpected",
    "status": "False",
    "type": "ImplicitlyEnabledCapabilities"
  },
  {
    "lastTransitionTime": "2022-08-24T11:23:16Z",
    "message": "Payload loaded version=\"4.12.0-0.nightly-2022-08-23-223922\" image=\"registry.ci.openshift.org/ocp/release@sha256:e1dc2ab7a69de1d3f5ed6801cadc72dde7081c6c83ad4e6327678498cf1c5e52\" architecture=\"amd64\"",
    "reason": "PayloadLoaded",
    "status": "True",
    "type": "ReleaseAccepted"
  },
  {
    "lastTransitionTime": "2022-08-24T11:40:06Z",
    "message": "Done applying 4.12.0-0.nightly-2022-08-23-223922",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2022-08-24T11:40:06Z",
    "status": "False",
    "type": "Failing"
  },
  {
    "lastTransitionTime": "2022-08-24T12:29:36Z",
    "message": "Cluster version is 4.12.0-0.nightly-2022-08-23-223922",
    "status": "False",
    "type": "Progressing"
  }
]

# oc logs pod/cluster-version-operator-669594b4cd-f62fv -n openshift-cluster-version | egrep 'Initializing|clusteroperator "authentication"'
......
I0824 12:07:15.422187       1 sync_worker.go:870] apply: 4.12.0-0.nightly-2022-08-23-223922 on generation 2 in state Initializing at attempt 10
I0824 12:07:15.517695       1 sync_worker.go:971] Precreated resource clusteroperator "authentication" (269 of 810)
I0824 12:07:28.458253       1 sync_worker.go:981] Running sync for clusteroperator "authentication" (269 of 810)
E0824 12:07:28.458497       1 task.go:117] error running apply for clusteroperator "authentication" (269 of 810): Cluster operator authentication is degraded
I0824 12:07:28.458521       1 sync_worker.go:1001] Done syncing for clusteroperator "authentication" (269 of 810)
I0824 12:12:07.217593       1 sync_worker.go:870] apply: 4.12.0-0.nightly-2022-08-23-223922 on generation 2 in state Initializing at attempt 11
I0824 12:12:07.372287       1 sync_worker.go:971] Precreated resource clusteroperator "authentication" (269 of 810)
I0824 12:12:20.255877       1 sync_worker.go:981] Running sync for clusteroperator "authentication" (269 of 810)
E0824 12:12:20.256131       1 task.go:117] error running apply for clusteroperator "authentication" (269 of 810): Cluster operator authentication is degraded
I0824 12:12:20.256156       1 sync_worker.go:1001] Done syncing for clusteroperator "authentication" (269 of 810)

The Failing condition doesn't complain about authentication.

Comment 22 Yang Yang 2022-08-25 02:36:02 UTC
Based on comment #20, moving this back to ASSIGNED.

Comment 23 W. Trevor King 2022-08-25 06:14:08 UTC
Hrm.  With the CVO logs from comment 21 (sorry external folks):

$ grep 'ClusterVersionOperator\|in state\|Result of work\|is degraded' cvo2.log 
I0824 11:26:13.767695       1 start.go:23] ClusterVersionOperator 4.12.0-202208230336.p0.gf7f9b8d.assembly.stream-f7f9b8d
I0824 11:32:35.314306       1 cvo.go:358] Starting ClusterVersionOperator with minimum reconcile period 3m48.045459287s
I0824 11:32:36.158965       1 sync_worker.go:515] Propagating initial target version {4.12.0-0.nightly-2022-08-23-223922 registry.ci.openshift.org/ocp/release@sha256:e1dc2ab7a69de1d3f5ed6801cadc72dde7081c6c83ad4e6327678498cf1c5e52 false} to sync worker loop in state Initializing.
I0824 11:32:36.159485       1 sync_worker.go:870] apply: 4.12.0-0.nightly-2022-08-23-223922 on generation 2 in state Initializing at attempt 0
...
I0824 12:07:15.422187       1 sync_worker.go:870] apply: 4.12.0-0.nightly-2022-08-23-223922 on generation 2 in state Initializing at attempt 10
E0824 12:07:28.458497       1 task.go:117] error running apply for clusteroperator "authentication" (269 of 810): Cluster operator authentication is degraded
I0824 12:07:35.479573       1 task_graph.go:546] Result of work: []
E0824 12:07:35.479605       1 sync_worker.go:635] unable to synchronize image (waiting 3m48.045459287s): Cluster operator authentication is degraded
I0824 12:12:07.217593       1 sync_worker.go:870] apply: 4.12.0-0.nightly-2022-08-23-223922 on generation 2 in state Initializing at attempt 11
E0824 12:12:20.256131       1 task.go:117] error running apply for clusteroperator "authentication" (269 of 810): Cluster operator authentication is degraded
I0824 12:12:27.278818       1 task_graph.go:546] Result of work: []
E0824 12:12:27.278876       1 sync_worker.go:635] unable to synchronize image (waiting 3m48.045459287s): Cluster operator authentication is degraded

So it appears that we are failing to complete the install (and transition from Initializing to Reconciling), and that we are also failing to set Failing=True (based on the comment 20 ClusterVersion output).
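
One rough way to express the behavior we want here, as a sketch with made-up error types (not the actual CVO change): treat degraded-only sync errors as report-only, so they still surface on Failing but no longer block the Initializing -> Reconciling transition.

package main

import (
	"errors"
	"fmt"
)

// reportOnlyError is a hypothetical wrapper for sync errors that should be
// surfaced on the Failing condition but should not block the payload sync.
type reportOnlyError struct{ err error }

func (e reportOnlyError) Error() string { return e.err.Error() }

// splitSyncErrors separates errors that must block the Initializing ->
// Reconciling transition from those that should only be reported.
func splitSyncErrors(errs []error) (blocking, reportOnly []error) {
	for _, err := range errs {
		var ro reportOnlyError
		if errors.As(err, &ro) {
			reportOnly = append(reportOnly, err)
			continue
		}
		blocking = append(blocking, err)
	}
	return blocking, reportOnly
}

func main() {
	blocking, reportOnly := splitSyncErrors([]error{
		reportOnlyError{errors.New("Cluster operator authentication is degraded")},
	})
	fmt.Printf("blocking=%d reportOnly=%d\n", len(blocking), len(reportOnly))
	// With no blocking errors, the sync can complete while Failing=True still
	// reflects the degraded operator.
}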

Comment 26 Yang Yang 2022-09-29 07:49:25 UTC
Verifying with 4.12.0-0.nightly-2022-09-28-204419
1. Install a cluster
2. During installation, make authentication degraded
# cat oauth.yaml 
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: oidcidp 
    mappingMethod: claim 
    type: OpenID
    openID:
      clientID: test
      clientSecret: 
        name: test
      claims: 
        preferredUsername:
        - preferred_username
        name:
        - name
        email:
        - email
      issuer: https://www.idp-issuer.example.com 

# oc apply -f oauth.yaml

# oc get co authentication
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.12.0-0.nightly-2022-09-28-204419   True        False         True       49m     OAuthServerConfigObservationDegraded: failed to apply IDP oidcidp config: dial tcp: lookup www.idp-issuer.example.com on 172.30.0.10:53: no such host

We can see the install exits non-zero.

INFO Creating infrastructure resources...         
INFO Waiting up to 20m0s (until 2:47AM) for the Kubernetes API at https://api.yangyang-0929c.qe.gcp.devcluster.openshift.com:6443... 
INFO API v1.24.0+8c7c967 up                       
INFO Waiting up to 30m0s (until 3:01AM) for bootstrapping to complete... 
INFO Destroying the bootstrap resources...        
INFO Waiting up to 40m0s (until 3:26AM) for the cluster at https://api.yangyang-0929c.qe.gcp.devcluster.openshift.com:6443 to initialize... 
ERROR Cluster operator authentication Degraded is True with OAuthServerConfigObservation_Error: OAuthServerConfigObservationDegraded: failed to apply IDP oidcidp config: dial tcp: lookup www.idp-issuer.example.com on 172.30.0.10:53: no such host 
INFO Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform 
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected 
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected 
INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected 
INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected 
INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required 
INFO Cluster operator insights ClusterTransferAvailable is False with NoClusterTransfer: no available cluster transfer 
INFO Cluster operator insights Disabled is False with AsExpected:  
INFO Cluster operator insights SCAAvailable is False with NotFound: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 404: {"code":"ACCT-MGMT-7","href":"/api/accounts_mgmt/v1/errors/7","id":"7","kind":"Error","operation_id":"c0f83ef0-a637-4ece-9c3f-7e7281a908b4","reason":"The organization (id= 1TbBDPjQPqtajYl6z5u5LwpiYMo) does not have any certificate of type sca. Enable SCA at https://access.redhat.com/management."} 
INFO Cluster operator network ManagementStateDegraded is False with :  
ERROR Cluster initialization failed because one or more operators are not functioning properly. 
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below, 
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html 
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation 
ERROR failed to initialize the cluster: Cluster operator authentication is degraded 

# oc logs pod/cluster-version-operator-78dc9df974-jxz7k -n openshift-cluster-version | grep -i "transitioning from Initializing to Reconciling"
I0929 06:58:12.376061       1 sync_worker.go:636] Sync succeeded, transitioning from Initializing to Reconciling


# oc get clusterversion/version -ojson | jq .status.conditions
[
  {
    "lastTransitionTime": "2022-09-29T06:30:40Z",
    "message": "Unable to retrieve available updates: currently reconciling cluster version 4.12.0-0.nightly-2022-09-28-204419 not found in the \"stable-4.12\" channel",
    "reason": "VersionNotFound",
    "status": "False",
    "type": "RetrievedUpdates"
  },
  {
    "lastTransitionTime": "2022-09-29T06:30:40Z",
    "message": "Capabilities match configured spec",
    "reason": "AsExpected",
    "status": "False",
    "type": "ImplicitlyEnabledCapabilities"
  },
  {
    "lastTransitionTime": "2022-09-29T06:30:40Z",
    "message": "Payload loaded version=\"4.12.0-0.nightly-2022-09-28-204419\" image=\"registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc\" architecture=\"amd64\"",
    "reason": "PayloadLoaded",
    "status": "True",
    "type": "ReleaseAccepted"
  },
  {
    "lastTransitionTime": "2022-09-29T06:30:40Z",
    "status": "False",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2022-09-29T06:48:32Z",
    "message": "Cluster operator authentication is degraded",
    "reason": "ClusterOperatorDegraded",
    "status": "True",
    "type": "Failing"
  },
  {
    "lastTransitionTime": "2022-09-29T07:01:47Z",
    "message": "Error while reconciling 4.12.0-0.nightly-2022-09-28-204419: the cluster operator authentication is degraded",
    "reason": "ClusterOperatorDegraded",
    "status": "False",
    "type": "Progressing"
  }
]

We can see the CVO Failing condition complains about the degraded authentication CO before the CVO transitions to Reconciling.

Looks good to me. Moving it to verified state.

Comment 27 Yang Yang 2022-10-08 03:30:58 UTC
Hi Jack,

do we plan to backport this to earlier versions?

Comment 28 Jack Ottofaro 2022-10-13 12:58:24 UTC
(In reply to Yang Yang from comment #27)
> Hi Jack,
> 
> do we plan to backport this to earlier versions?

Per https://coreos.slack.com/archives/CEGKQ43CP/p1665438870880799, no backport deemed necessary.

Comment 32 errata-xmlrpc 2023-01-17 19:46:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

