Bug 1951835
| Summary: | CVO should propagate ClusterOperator's Degraded to ClusterVersion's Failing during install | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Cluster Version Operator | Assignee: | Over the Air Updates <aos-team-ota> |
| Status: | CLOSED ERRATA | QA Contact: | Yang Yang <yanyang> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.8 | CC: | aos-bugs, bleanhar, lmohanty, wking, yanyang |
| Target Milestone: | --- | | |
| Target Release: | 4.12.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-01-17 19:46:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Clayton Coleman
2021-04-20 23:52:17 UTC
Triggered by discussion on https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/1073/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-serial/1384229570948894720

TL;DR: Suppressing the CVO Failing condition during install up until Available=true is acceptable, but if operators are degraded at that time, the Available false => true transition must be accompanied by Failing false => true, and install must exit non-zero.

Going back and reviewing https://github.com/openshift/cluster-version-operator/commit/94c4e576d1e10b568e0c44648904637e12776927#diff-a35ab582a8b2cae2bee4525f7863488327e93f398424d6fa1407b77561c4960cL182, I also think this change missed a subtle and important point. It's not that degraded shouldn't block the success of install; it's that degraded shouldn't block progress through the payload. At the very end of the payload reconcile, a degraded operator should still result in a failing state. The payload transition from initializing -> reconciling should reflect the current state of the sync, and the sync was not happy (it's bad to go from initializing "good" to reconciling "broken" 40m after install - we already know, so we should surface it right away). So this may not be a regression on the surface, but it's a violation of the intent of both available and degraded - the core CVO sync loop should not exit without error if something is degraded.

A second follow-up bug (once (available=true,failed=true) is set) is for the installer to exit non-zero on failed. I don't think an additional wait is necessary, but degraded has been tightened sufficiently in operators that a user should instantly know that they must investigate something.

Also, the 40m inertia for failed should start when we transition to reconciling, which after the changes recommended here would then leave us in a failed state, and could potentially clear immediately. We shouldn't have to wait until 40m from the time the operator was created to report failing; we should use the transition to indicate that. This should not have any impact on reconciling or upgrade behavior, which are different modes.

Some review of existing jobs: https://search.ci.openshift.org/?search=Some+cluster+operators+are+not+ready&maxAge=336h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

This is searching for the early e2e test that currently protects us from the error. It looks like we have a 1-4% failure rate on install where we exit "good" and in reality are "bad" - just a quick scan of those results included some pretty concerning failures: things that could be infrastructure (failed master starts), race conditions in our code that might lead to lockups (rollouts), potential kubelet issues. All of those imply failure modes that I would prefer we catch and signal to a customer vs continuing after install automation. They may heal, but may not. It's our responsibility to summarize the state of the install correctly, and in the current state I do not think we are doing that.

> Also, the 40m inertia for failed should start when we transition to reconciling...

In master, the only way to get the UpdateEffectFailAfterInterval effect is Degraded=True when the mode is not InitializingMode [1]. This continues a long tradition of ignoring Degraded (previously Failing) during install [2,3]. And the installer has only ever looked at Available since it started watching ClusterVersion [4]. So I think CVO inertia is completely orthogonal to the InitializingMode logic.
> ...and degraded may indicate not all user provided config has been rolled out.

This is a fairly tenuous connection. Users can push a whole bunch of config manifests via the installer. cluster-bootstrap will happily push those into the cluster, but I don't think it has any CVO-style "is the in-cluster resource sufficiently happy?" logic.

> Put another way, install bypasses the "wait 40m to go failed" inertia because there was never any "known good state" to have inertia from. If any component is break-the-install sad right off the bat, it should not go Available=True.

I expect a number of these cases are from operators which were initially Available=True Degraded=False, but then subsequently went Degraded=True before the install completed.

> It's that degraded shouldn't block progress through the payload.

Huzzah. LGTMs on [5] welcome ;).

> At the very end of the payload reconcile, a degraded operator should still result in a failing state.

In the context of [5], I'm not all that worried about this, because ClusterOperatorDegraded is a critical alert. The only time folks care about the non-alert ClusterVersion Failing condition is the installer and other folks for whom querying alerts is awkward.

> ...it's bad to go from initializing "good" to reconciling "broken" 40m after install...

It's also bad to say "probably rip this cluster down and try a fresh install" a minute before everything is happy, which is what happened in this compact 4.5 job [6]: install completed at 2021-04-11T08:36:48Z [7] and kube-scheduler recovered to Degraded=False at 2021-04-11T08:37:14Z [8].

> ...degraded has been tightened sufficiently in operators that a user should instantly know that they must investigate something...

I am not convinced. I will try to assemble more statistics to ground our divergent assessments of Degraded=True importance in reality.

> All of those imply failure modes that I would prefer we catch and signal to a customer vs continuing after install automation.

I am 100% in favor of failing CI on 'Managed cluster should start all core operators' to drive out both "something irrecoverably bad is happening" and "operator is overly jumpy or we have a quickly-self-recovering bug" issues before we ship releases. I am less clear on whether we want to fail installs in the wild solely on the grounds of an Available=True Degraded=True operator, because I expect the bulk of these to be recoverable noise. Not great. And I'm fine if the installer wants to wait a bit to see if operators settle themselves down. But again, it's probably good to try to ground our assumptions in statistics.

> It's our responsibility to summarize the state of the install correctly and in the current state I do not think we are doing that.

I have no problem with the installer mentioning "hey, these ClusterOperators are Degraded=True" in logs and/or machine-readable output. I'm just not yet sold on "enough of those are bad enough that we want to scrap the whole install".
[1]: https://github.com/openshift/cluster-version-operator/blob/6fdd1e0f313f9c67ddf93037a0d4e17ce62e89ab/pkg/cvo/internal/operatorstatus.go#L215-L221
[2]: https://github.com/openshift/cluster-version-operator/commit/94c4e576d1e10b568e0c44648904637e12776927#diff-a35ab582a8b2cae2bee4525f7863488327e93f398424d6fa1407b77561c4960cL182
[3]: https://github.com/openshift/cluster-version-operator/pull/136/commits/b0b4902fce3add235d1abff6c269b0e39b06f1e9
[4]: https://github.com/openshift/installer/pull/1132/commits/e17ba3c571ad80f27b39a6f53d44bcbd9401684a#diff-56276d5381d618d46ec8d35d93210662c8fdd4c9bcd90fd36afbc6a59227eb0bR316-R318
[5]: https://github.com/openshift/cluster-version-operator/pull/482
[6]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-compact-4.5/1381156351165599744
[7]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-compact-4.5/1381156351165599744/artifacts/e2e-gcp/clusterversion.json
[8]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-compact-4.5/1381156351165599744/artifacts/e2e-gcp/clusteroperators.json

David pointed out that in [1], the CVO doesn't set Failing=True until 2021-04-19T20:46:40Z, but kube-apiserver went Degraded=True at 2021-04-19T20:19:39Z. We should be propagating ClusterOperator's Degraded into ClusterVersion's Failing during install, like we do when in reconciliation mode, even as we continue to not block graph synchronization on that Degraded state. If folks want the installer to watch for ClusterVersion's Failing as a condition of declaring install complete, that would be a separate installer bug about touching the code near [2].

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/1073/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-serial/1384229570948894720
[2]: https://github.com/openshift/installer/blob/35b738bdb71fad9ecb01db320500a787dfc60d92/cmd/openshift-install/create.go#L455-L457

I think it's reasonable for the installer to pay attention to the CVO Failing condition; I guess that can be done now regardless of whether or not the CVO has started setting Failing per the semantics described in this bug. https://issues.redhat.com/browse/CORS-1721 tracks that.

Removing 4.8 from the target release as we do not have enough time to fix this in 4.8. Also, this is not a blocker for the 4.8 release.

I haven't had time to pick this up. Anyone who wants it can take it.

Does anyone know how to reproduce this issue from a QE perspective? Ideally we could make an operator go Available=True Degraded=True during the install. Perhaps by cordoning all but one compute node and killing off a router pod? Reviewing the CI-search query linked in comment 5, killing oauth-openshift.openshift-authentication pods might work too (OAuthServerDeployment_UnavailablePod). Or figuring out what this job does [1] to make kube-apiserver mad about FeatureGates_RestrictedFeatureGates_TechPreviewNoUpgrade.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-openstack-techpreview-parallel/1451130495969529856

We're still planning to work on this but it's not going to make 4.10.

Reproducing with 4.11.0
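While reproducing, a quick way to spot the mismatch this bug describes is to compare the operators that are Degraded=True yet still Available=True against ClusterVersion's Failing condition. A hedged sketch (assumes jq is available; not part of the original reproduction steps):
# oc get co -o json | jq -r '.items[] | select(any(.status.conditions[]; .type=="Degraded" and .status=="True") and any(.status.conditions[]; .type=="Available" and .status=="True")) | .metadata.name'
# oc get clusterversion version -o json | jq '.status.conditions[] | select(.type=="Failing")'
If the first command lists operators that the Failing condition never mentions while the CVO is still Initializing, that is the gap being discussed here.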
# oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.11.0 False False True 7m42s OAuthServerRouteEndpointAccessibleControllerAvailable: failed to retrieve route from cache: route.route.openshift.io "oauth-openshift" not found...
baremetal 4.11.0 True False False 6m34s
cloud-controller-manager 4.11.0 True False False 8m15s
cloud-credential True False False 8m39s
cluster-autoscaler 4.11.0 True False False 6m16s
config-operator 4.11.0 True False False 7m39s
console
csi-snapshot-controller 4.11.0 True False False 7m13s
dns 4.11.0 True False False 6m17s
etcd 4.11.0 True True False 5m30s NodeInstallerProgressing: 1 nodes are at revision 4; 1 nodes are at revision 6; 1 nodes are at revision 7
image-registry False True False 4s Available: The deployment does not have available replicas...
ingress False True True 38s The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
insights 4.11.0 True False False 39s
kube-apiserver 4.11.0 True True True 2m15s GuardControllerDegraded: Missing operand on node yanyang-0823b-gx79n-master-0.c.openshift-qe.internal
kube-controller-manager 4.11.0 True True False 4m27s NodeInstallerProgressing: 3 nodes are at revision 5; 0 nodes have achieved new revision 6
kube-scheduler 4.11.0 True True False 4m25s NodeInstallerProgressing: 3 nodes are at revision 5; 0 nodes have achieved new revision 6
kube-storage-version-migrator 4.11.0 True False False 7m29s
machine-api 4.11.0 True False False 3m8s
machine-approver 4.11.0 True False False 6m21s
machine-config 4.11.0 True False False 5m51s
marketplace 4.11.0 True False False 6m24s
monitoring Unknown True Unknown 6m38s Rolling out the stack.
network 4.11.0 True True False 8m41s Deployment "/openshift-network-diagnostics/network-check-source" is waiting for other operators to become ready
node-tuning 4.11.0 True False False 6m17s
openshift-apiserver 4.11.0 True True False 38s APIServerDeploymentProgressing: deployment/apiserver.openshift-apiserver: observed generation is 3, desired generation is 4.
openshift-controller-manager 4.11.0 True False False 3m31s
openshift-samples
operator-lifecycle-manager 4.11.0 True False False 7m1s
operator-lifecycle-manager-catalog 4.11.0 True False False 7m5s
operator-lifecycle-manager-packageserver 4.11.0 True False False 37s
service-ca 4.11.0 True False False 7m33s
storage 4.11.0 True False False 6m51s
# oc get clusterversion -oyaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
creationTimestamp: "2022-08-23T07:27:37Z"
generation: 2
name: version
resourceVersion: "16559"
uid: 70518ad5-914f-4d21-ac15-aa815e3f5f24
spec:
channel: stable-4.11
clusterID: 5fae025d-8adf-42ef-b0a5-7afa2be20752
status:
availableUpdates: null
capabilities:
enabledCapabilities:
- baremetal
- marketplace
- openshift-samples
knownCapabilities:
- baremetal
- marketplace
- openshift-samples
conditions:
- lastTransitionTime: "2022-08-23T07:28:00Z"
status: "True"
type: RetrievedUpdates
- lastTransitionTime: "2022-08-23T07:28:00Z"
message: Disabling ownership via cluster version overrides prevents upgrades.
Please remove overrides before continuing.
reason: ClusterVersionOverridesSet
status: "False"
type: Upgradeable
- lastTransitionTime: "2022-08-23T07:28:00Z"
message: Capabilities match configured spec
reason: AsExpected
status: "False"
type: ImplicitlyEnabledCapabilities
- lastTransitionTime: "2022-08-23T07:28:00Z"
message: Payload loaded version="4.11.0" image="quay.io/openshift-release-dev/ocp-release@sha256:300bce8246cf880e792e106607925de0a404484637627edf5f517375517d54a4"
reason: PayloadLoaded
status: "True"
type: ReleaseAccepted
- lastTransitionTime: "2022-08-23T07:28:00Z"
status: "False"
type: Available
- lastTransitionTime: "2022-08-23T07:34:45Z"
message: |-
Multiple errors are preventing progress:
* Could not update imagestream "openshift/driver-toolkit" (551 of 802): the server is down or not responding
* Could not update oauthclient "console" (498 of 802): the server does not recognize this resource, check extension API servers
* Could not update role "openshift-console-operator/prometheus-k8s" (722 of 802): resource may have been deleted
* Could not update role "openshift-console/prometheus-k8s" (725 of 802): resource may have been deleted
reason: MultipleErrors
status: "True"
type: Failing
- lastTransitionTime: "2022-08-23T07:28:00Z"
message: 'Unable to apply 4.11.0: an unknown error has occurred: MultipleErrors'
reason: MultipleErrors
status: "True"
type: Progressing
desired:
channels:
- candidate-4.11
- fast-4.11
- stable-4.11
image: quay.io/openshift-release-dev/ocp-release@sha256:300bce8246cf880e792e106607925de0a404484637627edf5f517375517d54a4
url: https://access.redhat.com/errata/RHSA-2022:5069
version: 4.11.0
history:
- completionTime: null
image: quay.io/openshift-release-dev/ocp-release@sha256:300bce8246cf880e792e106607925de0a404484637627edf5f517375517d54a4
startedTime: "2022-08-23T07:28:00Z"
state: Partial
verified: false
version: 4.11.0
observedGeneration: 1
versionHash: jtC771QuuKI=
kind: List
metadata:
resourceVersion: ""
# grep "transitioning from Initializing" cvo.log
I0823 07:48:29.077535 1 sync_worker.go:608] Sync succeeded, transitioning from Initializing to Reconciling
In the Initializing stage, the CVO Failing condition doesn't report the degraded-but-available COs (kube-controller-manager, kube-scheduler, network). Seems like the issue is reproduced. Will verify it with 4.12.
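The 4.12 verification below degrades the network operator with a proxy patch. Per the reproduction ideas in the earlier comment, other hedged ways to push an operator to Degraded=True mid-install could look like the following (the label selectors are assumptions and worth confirming with --show-labels before relying on them):
# oc -n openshift-authentication delete pods -l app=oauth-openshift
# oc adm cordon $(oc get nodes -l node-role.kubernetes.io/worker -o name | tail -n +2)
# oc -n openshift-ingress delete pods -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
Pod deletions tend to only nudge an operator Degraded briefly while it self-heals; the proxy misconfiguration used below persists until it is reverted.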
Verifying with 4.12.0-0.nightly-2022-08-22-143022
During cluster install, make network degraded by
# oc patch proxy cluster --type json -p '[{"op": "replace", "path": "/spec/trustedCA/name", "value": "osus-ca"}]'
# oc logs pod/cluster-version-operator-77f86bfd66-rjxbr -n openshift-cluster-version | grep "transitioning from Initializing to Reconciling"
The grep returns nothing, so the cluster is still in the Initializing stage.
# oc get co; oc get clusterversion/version -ojson | jq .status.conditions
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.12.0-0.nightly-2022-08-22-143022 True False False 51m
baremetal 4.12.0-0.nightly-2022-08-22-143022 True False False 73m
cloud-controller-manager 4.12.0-0.nightly-2022-08-22-143022 True False False 75m
cloud-credential 4.12.0-0.nightly-2022-08-22-143022 True False False 78m
cluster-autoscaler 4.12.0-0.nightly-2022-08-22-143022 True False False 73m
config-operator 4.12.0-0.nightly-2022-08-22-143022 True False False 74m
console 4.12.0-0.nightly-2022-08-22-143022 True False False 60m
csi-snapshot-controller 4.12.0-0.nightly-2022-08-22-143022 True False False 74m
dns 4.12.0-0.nightly-2022-08-22-143022 True False False 73m
etcd 4.12.0-0.nightly-2022-08-22-143022 True False False 72m
image-registry 4.12.0-0.nightly-2022-08-22-143022 True False False 67m
ingress 4.12.0-0.nightly-2022-08-22-143022 True False False 67m
insights 4.12.0-0.nightly-2022-08-22-143022 True False False 67m
kube-apiserver 4.12.0-0.nightly-2022-08-22-143022 True False False 70m
kube-controller-manager 4.12.0-0.nightly-2022-08-22-143022 True False False 71m
kube-scheduler 4.12.0-0.nightly-2022-08-22-143022 True False False 70m
kube-storage-version-migrator 4.12.0-0.nightly-2022-08-22-143022 True False False 74m
machine-api 4.12.0-0.nightly-2022-08-22-143022 True False False 68m
machine-approver 4.12.0-0.nightly-2022-08-22-143022 True False False 73m
machine-config 4.12.0-0.nightly-2022-08-22-143022 False False True 60m Cluster not available for [{operator 4.12.0-0.nightly-2022-08-22-143022}]
marketplace 4.12.0-0.nightly-2022-08-22-143022 True False False 73m
monitoring 4.12.0-0.nightly-2022-08-22-143022 True False False 62m
network 4.12.0-0.nightly-2022-08-22-143022 True False True 75m The configuration is invalid for proxy 'cluster' (failed to validate configmap reference for proxy trustedCA 'osus-ca': failed to get trustedCA configmap for proxy cluster: configmaps "osus-ca" not found). Use 'oc edit proxy.config.openshift.io cluster' to fix.
node-tuning 4.12.0-0.nightly-2022-08-22-143022 True False False 73m
openshift-apiserver 4.12.0-0.nightly-2022-08-22-143022 True False False 68m
openshift-controller-manager 4.12.0-0.nightly-2022-08-22-143022 True False False 70m
openshift-samples 4.12.0-0.nightly-2022-08-22-143022 True False False 68m
operator-lifecycle-manager 4.12.0-0.nightly-2022-08-22-143022 True False False 73m
operator-lifecycle-manager-catalog 4.12.0-0.nightly-2022-08-22-143022 True False False 73m
operator-lifecycle-manager-packageserver 4.12.0-0.nightly-2022-08-22-143022 True False False 68m
service-ca 4.12.0-0.nightly-2022-08-22-143022 True False False 74m
storage 4.12.0-0.nightly-2022-08-22-143022 True False False 73m
[
{
"lastTransitionTime": "2022-08-23T08:36:07Z",
"message": "Unable to retrieve available updates: currently reconciling cluster version 4.12.0-0.nightly-2022-08-22-143022 not found in the \"stable-4.11\" channel",
"reason": "VersionNotFound",
"status": "False",
"type": "RetrievedUpdates"
},
{
"lastTransitionTime": "2022-08-23T08:36:07Z",
"message": "Capabilities match configured spec",
"reason": "AsExpected",
"status": "False",
"type": "ImplicitlyEnabledCapabilities"
},
{
"lastTransitionTime": "2022-08-23T08:36:07Z",
"message": "Payload loaded version=\"4.12.0-0.nightly-2022-08-22-143022\" image=\"registry.ci.openshift.org/ocp/release@sha256:9e56a2ce8110d06bc1cbc212339834e3b12925cd2bfe4e9a0755e88e5619854d\" architecture=\"amd64\"",
"reason": "PayloadLoaded",
"status": "True",
"type": "ReleaseAccepted"
},
{
"lastTransitionTime": "2022-08-23T08:36:07Z",
"status": "False",
"type": "Available"
},
{
"lastTransitionTime": "2022-08-23T09:06:07Z",
"message": "Cluster operator machine-config is not available",
"reason": "ClusterOperatorNotAvailable",
"status": "True",
"type": "Failing"
},
{
"lastTransitionTime": "2022-08-23T08:36:07Z",
"message": "Unable to apply 4.12.0-0.nightly-2022-08-22-143022: the cluster operator machine-config is not available",
"reason": "ClusterOperatorNotAvailable",
"status": "True",
"type": "Progressing"
}
]
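A quick way to bucket the unhappy operators from the output above by severity (a jq sketch, not the CVO's actual prioritization logic):
# oc get co -o json | jq -r '.items[] | {name: .metadata.name, available: ([.status.conditions[] | select(.type=="Available").status] | first), degraded: ([.status.conditions[] | select(.type=="Degraded").status] | first)} | select(.available=="False" or .degraded=="True") | "\(.name) Available=\(.available) Degraded=\(.degraded)"'
For this run it should show machine-config (Available=False) and network (Available=True, Degraded=True), which is what the next comments are about.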
The Failing condition doesn't mention the degraded-but-available CO network. That doesn't look good.
Jack, could you please take a look^?
'Failing' is basically our only way to report both Degraded=True and Available=False ClusterOperators (and all of the other issues we can have at reconcile time). We attempt to prioritize by severity in [1]. Seems reasonable that Available=False would win, although I'm not immediately clear on why we don't also complain about the network operator's Degraded=True. CVO logs might clarify.

[1]: https://github.com/openshift/cluster-version-operator/pull/662/files#diff-2010b5bb18e3579c7c8a1c79ab439955a723894f85549689a4401790b0315f00R1181-R1203

To avoid the impact of Available=False operators, verify it by making authentication degraded.
During the install, apply an OAuth identity provider whose issuer is unreachable:
# cat oauth.yaml
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
name: cluster
spec:
identityProviders:
- name: oidcidp
mappingMethod: claim
type: OpenID
openID:
clientID: test
clientSecret:
name: test
claims:
preferredUsername:
- preferred_username
name:
- name
email:
- email
issuer: https://www.idp-issuer.example.com
# oc apply -f oauth.yaml
# oc get co authentication
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.12.0-0.nightly-2022-08-23-223922 True False True 49m OAuthServerConfigObservationDegraded: failed to apply IDP oidcidp config: dial tcp: lookup www.idp-issuer.example.com on 172.30.0.10:53: no such host
Authentication is degraded.
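While waiting for the CVO to react, the Failing condition can also be polled directly with jsonpath (a sketch, equivalent to the fuller jq query below):
# oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Failing")].status}{"\n"}'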
# oc get clusterversion/version -ojson | jq .status.conditions
[
{
"lastTransitionTime": "2022-08-24T11:23:16Z",
"message": "Unable to retrieve available updates: currently reconciling cluster version 4.12.0-0.nightly-2022-08-23-223922 not found in the \"stable-4.11\" channel",
"reason": "VersionNotFound",
"status": "False",
"type": "RetrievedUpdates"
},
{
"lastTransitionTime": "2022-08-24T11:23:16Z",
"message": "Capabilities match configured spec",
"reason": "AsExpected",
"status": "False",
"type": "ImplicitlyEnabledCapabilities"
},
{
"lastTransitionTime": "2022-08-24T11:23:16Z",
"message": "Payload loaded version=\"4.12.0-0.nightly-2022-08-23-223922\" image=\"registry.ci.openshift.org/ocp/release@sha256:e1dc2ab7a69de1d3f5ed6801cadc72dde7081c6c83ad4e6327678498cf1c5e52\" architecture=\"amd64\"",
"reason": "PayloadLoaded",
"status": "True",
"type": "ReleaseAccepted"
},
{
"lastTransitionTime": "2022-08-24T11:40:06Z",
"message": "Done applying 4.12.0-0.nightly-2022-08-23-223922",
"status": "True",
"type": "Available"
},
{
"lastTransitionTime": "2022-08-24T11:40:06Z",
"status": "False",
"type": "Failing"
},
{
"lastTransitionTime": "2022-08-24T12:29:36Z",
"message": "Cluster version is 4.12.0-0.nightly-2022-08-23-223922",
"status": "False",
"type": "Progressing"
}
]
# oc logs pod/cluster-version-operator-669594b4cd-f62fv -n openshift-cluster-version | egrep 'Initializing|clusteroperator "authentication"'
......
I0824 12:07:15.422187 1 sync_worker.go:870] apply: 4.12.0-0.nightly-2022-08-23-223922 on generation 2 in state Initializing at attempt 10
I0824 12:07:15.517695 1 sync_worker.go:971] Precreated resource clusteroperator "authentication" (269 of 810)
I0824 12:07:28.458253 1 sync_worker.go:981] Running sync for clusteroperator "authentication" (269 of 810)
E0824 12:07:28.458497 1 task.go:117] error running apply for clusteroperator "authentication" (269 of 810): Cluster operator authentication is degraded
I0824 12:07:28.458521 1 sync_worker.go:1001] Done syncing for clusteroperator "authentication" (269 of 810)
I0824 12:12:07.217593 1 sync_worker.go:870] apply: 4.12.0-0.nightly-2022-08-23-223922 on generation 2 in state Initializing at attempt 11
I0824 12:12:07.372287 1 sync_worker.go:971] Precreated resource clusteroperator "authentication" (269 of 810)
I0824 12:12:20.255877 1 sync_worker.go:981] Running sync for clusteroperator "authentication" (269 of 810)
E0824 12:12:20.256131 1 task.go:117] error running apply for clusteroperator "authentication" (269 of 810): Cluster operator authentication is degraded
I0824 12:12:20.256156 1 sync_worker.go:1001] Done syncing for clusteroperator "authentication" (269 of 810)
The Failing condition doesn't complain about authentication.
Based on comment#20, moving it back to assigned status.

Hrm. With the CVO logs from comment 21 (sorry external folks):

$ grep 'ClusterVersionOperator\|in state\|Result of work\|is degraded' cvo2.log
I0824 11:26:13.767695 1 start.go:23] ClusterVersionOperator 4.12.0-202208230336.p0.gf7f9b8d.assembly.stream-f7f9b8d
I0824 11:32:35.314306 1 cvo.go:358] Starting ClusterVersionOperator with minimum reconcile period 3m48.045459287s
I0824 11:32:36.158965 1 sync_worker.go:515] Propagating initial target version {4.12.0-0.nightly-2022-08-23-223922 registry.ci.openshift.org/ocp/release@sha256:e1dc2ab7a69de1d3f5ed6801cadc72dde7081c6c83ad4e6327678498cf1c5e52 false} to sync worker loop in state Initializing.
I0824 11:32:36.159485 1 sync_worker.go:870] apply: 4.12.0-0.nightly-2022-08-23-223922 on generation 2 in state Initializing at attempt 0
...
I0824 12:07:15.422187 1 sync_worker.go:870] apply: 4.12.0-0.nightly-2022-08-23-223922 on generation 2 in state Initializing at attempt 10
E0824 12:07:28.458497 1 task.go:117] error running apply for clusteroperator "authentication" (269 of 810): Cluster operator authentication is degraded
I0824 12:07:35.479573 1 task_graph.go:546] Result of work: []
E0824 12:07:35.479605 1 sync_worker.go:635] unable to synchronize image (waiting 3m48.045459287s): Cluster operator authentication is degraded
I0824 12:12:07.217593 1 sync_worker.go:870] apply: 4.12.0-0.nightly-2022-08-23-223922 on generation 2 in state Initializing at attempt 11
E0824 12:12:20.256131 1 task.go:117] error running apply for clusteroperator "authentication" (269 of 810): Cluster operator authentication is degraded
I0824 12:12:27.278818 1 task_graph.go:546] Result of work: []
E0824 12:12:27.278876 1 sync_worker.go:635] unable to synchronize image (waiting 3m48.045459287s): Cluster operator authentication is degraded

So it appears that we are failing to complete the install (and transition from Initializing to Reconciling), and that we are also failing to set Failing=True (based on the comment 20 ClusterVersion output).

Verifying with 4.12.0-0.nightly-2022-09-28-204419
1. Install a cluster
2. During installation, make authentication degraded
# cat oauth.yaml
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
name: cluster
spec:
identityProviders:
- name: oidcidp
mappingMethod: claim
type: OpenID
openID:
clientID: test
clientSecret:
name: test
claims:
preferredUsername:
- preferred_username
name:
- name
email:
- email
issuer: https://www.idp-issuer.example.com
# oc apply -f oauth.yaml
# oc get co authentication
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.12.0-0.nightly-2022-09-28-204419 True False True 49m OAuthServerConfigObservationDegraded: failed to apply IDP oidcidp config: dial tcp: lookup www.idp-issuer.example.com on 172.30.0.10:53: no such host
We can see the install exits with a non-zero status.
INFO Creating infrastructure resources...
INFO Waiting up to 20m0s (until 2:47AM) for the Kubernetes API at https://api.yangyang-0929c.qe.gcp.devcluster.openshift.com:6443...
INFO API v1.24.0+8c7c967 up
INFO Waiting up to 30m0s (until 3:01AM) for bootstrapping to complete...
INFO Destroying the bootstrap resources...
INFO Waiting up to 40m0s (until 3:26AM) for the cluster at https://api.yangyang-0929c.qe.gcp.devcluster.openshift.com:6443 to initialize...
ERROR Cluster operator authentication Degraded is True with OAuthServerConfigObservation_Error: OAuthServerConfigObservationDegraded: failed to apply IDP oidcidp config: dial tcp: lookup www.idp-issuer.example.com on 172.30.0.10:53: no such host
INFO Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
INFO Cluster operator insights ClusterTransferAvailable is False with NoClusterTransfer: no available cluster transfer
INFO Cluster operator insights Disabled is False with AsExpected:
INFO Cluster operator insights SCAAvailable is False with NotFound: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: OCM API https://api.openshift.com/api/accounts_mgmt/v1/certificates returned HTTP 404: {"code":"ACCT-MGMT-7","href":"/api/accounts_mgmt/v1/errors/7","id":"7","kind":"Error","operation_id":"c0f83ef0-a637-4ece-9c3f-7e7281a908b4","reason":"The organization (id= 1TbBDPjQPqtajYl6z5u5LwpiYMo) does not have any certificate of type sca. Enable SCA at https://access.redhat.com/management."}
INFO Cluster operator network ManagementStateDegraded is False with :
ERROR Cluster initialization failed because one or more operators are not functioning properly.
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
ERROR failed to initialize the cluster: Cluster operator authentication is degraded
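A quick way to confirm the non-zero exit noted above, assuming the installer was just run in the same shell (the exact exit code isn't asserted here):
# echo $?
The 'wait-for install-complete' subcommand mentioned in the output can also be re-run to see whether the failure persists (<install-dir> being the assets directory used for the install):
# openshift-install wait-for install-complete --dir <install-dir>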
# oc logs pod/cluster-version-operator-78dc9df974-jxz7k -n openshift-cluster-version | grep -i "transitioning from Initializing to Reconciling"
I0929 06:58:12.376061 1 sync_worker.go:636] Sync succeeded, transitioning from Initializing to Reconciling
# oc get clusterversion/version -ojson | jq .status.conditions
[
{
"lastTransitionTime": "2022-09-29T06:30:40Z",
"message": "Unable to retrieve available updates: currently reconciling cluster version 4.12.0-0.nightly-2022-09-28-204419 not found in the \"stable-4.12\" channel",
"reason": "VersionNotFound",
"status": "False",
"type": "RetrievedUpdates"
},
{
"lastTransitionTime": "2022-09-29T06:30:40Z",
"message": "Capabilities match configured spec",
"reason": "AsExpected",
"status": "False",
"type": "ImplicitlyEnabledCapabilities"
},
{
"lastTransitionTime": "2022-09-29T06:30:40Z",
"message": "Payload loaded version=\"4.12.0-0.nightly-2022-09-28-204419\" image=\"registry.ci.openshift.org/ocp/release@sha256:2c8e617830f84ac1ee1bfcc3581010dec4ae5d9cad7a54271574e8d91ef5ecbc\" architecture=\"amd64\"",
"reason": "PayloadLoaded",
"status": "True",
"type": "ReleaseAccepted"
},
{
"lastTransitionTime": "2022-09-29T06:30:40Z",
"status": "False",
"type": "Available"
},
{
"lastTransitionTime": "2022-09-29T06:48:32Z",
"message": "Cluster operator authentication is degraded",
"reason": "ClusterOperatorDegraded",
"status": "True",
"type": "Failing"
},
{
"lastTransitionTime": "2022-09-29T07:01:47Z",
"message": "Error while reconciling 4.12.0-0.nightly-2022-09-28-204419: the cluster operator authentication is degraded",
"reason": "ClusterOperatorDegraded",
"status": "False",
"type": "Progressing"
}
]
We can see the CVO Failing condition complained about the degraded authentication CO before the CVO transitioned to Reconciling.
Looks good to me. Moving it to verified state.
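For future regression runs, the same check could be scripted; a rough sketch, assuming oc wait's condition matching works against ClusterVersion as it does for other resources with status.conditions:
# oc wait clusterversion/version --for=condition=Failing=True --timeout=30m
# oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Failing")].message}{"\n"}'
If the wait succeeds while the CVO is still Initializing and the message names the degraded operator, the propagation verified here is working.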
Hi Jack, do we plan to backport this to earlier versions?

(In reply to Yang Yang from comment #27)
> Hi Jack, do we plan to backport this to earlier versions?

Per https://coreos.slack.com/archives/CEGKQ43CP/p1665438870880799, no backport is deemed necessary.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399