Bug 2093440
| Summary: | [sig-arch][Early] Managed cluster should start all core operators - NodeCADaemonControllerDegraded: failed to update object | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Forrest Babcock <fbabcock> |
| Component: | Image Registry | Assignee: | Oleg Bulatov <obulatov> |
| Status: | CLOSED ERRATA | QA Contact: | XiuJuan Wang <xiuwang> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.11 | CC: | dgoodwin, fmissi, mfojtik, obulatov, stevsmit |
| Target Milestone: | --- | | |
| Target Release: | 4.12.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Previously, the Image Registry Operator did not have a `progressing` condition for the `node-ca` daemon set and used `generation` from an incorrect object. Consequently, the `node-ca` daemon set could be marked as `degraded` while the Operator was still running. This update adds the `progressing` condition, which indicates that the installation is not complete. As a result, the Image Registry Operator successfully installs the `node-ca` daemon set and the installer waits until it is fully deployed. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2093440[*BZ#2093440*]) | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-01-17 19:49:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2110958 | | |
Description
Forrest Babcock
2022-06-03 17:32:16 UTC
Adding internal whiteboard: trt

Documenting the search.ci query that highlights these failures:
https://search.ci.openshift.org/?search=.*node-ca.*caches+not+synchronized.*&maxAge=336h&context=1&type=bug%2Bjunit&name=4.11&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Link to the Jira issue where we are collecting information as we research further:
https://issues.redhat.com/browse/TRT-285

What we are finding is that cluster-image-registry-operator appears to attempt to process an item before all of the operators/services are available, and it goes into a degraded state:

https://github.com/openshift/cluster-image-registry-operator/blob/2b82360000454e7e02b6f3b46d1354546dffccbc/pkg/operator/nodecadaemon.go#L84
https://github.com/openshift/cluster-image-registry-operator/blob/2b82360000454e7e02b6f3b46d1354546dffccbc/pkg/resource/nodecadaemon.go#L72
https://github.com/openshift/apiserver-library-go/blob/02e0e71ffa9ad96c2a4dc6d4b1ff6f850287e8ef/pkg/admission/quota/clusterresourcequota/admission.go#L105

processNextWorkItem -> sync -> .. -> Update. Presumably the Update leads to the apiserver-library quota.openshift.io/ClusterResourceQuota -> Validate call that fails.

```
2022-06-02T14:01:17Z Running multi-stage test e2e-aws-serial
2022-06-02 10:31:22 ... {"Available: The deployment does not have available replicas\nNodeCADaemonAvailable: The daemon set node-ca has available replicas\nImagePrunerAvailable: Pruner CronJob has been created" -> "Available: The registry is ready\nNodeCADaemonAvailable: The daemon set node-ca has available replicas\nImagePrunerAvailable: Pruner CronJob has been created"}
2022-06-02 10:33:53 ...
```
```
E0602 14:33:53.462338       1 nodecadaemon.go:86] NodeCADaemonController: unable to sync: failed to update object *v1.DaemonSet, Namespace=openshift-image-registry, Name=node-ca: daemonsets.apps "node-ca" is forbidden: caches not synchronized, requeuing
```

Then, later on, when the remaining operators have finished progressing, the tests kick off, but the image registry operator is still in a degraded state and the test fails:

```
Thu Jun  2 14:41:46 UTC 2022 - all clusteroperators are done progressing.
Jun  2 14:41:49.472: FAIL: Some cluster operators are not ready: image-registry (Degraded=True NodeCADaemonControllerError: NodeCADaemonControllerDegraded: failed to update object *v1.DaemonSet, Namespace=openshift-image-registry, Name=node-ca: daemonsets.apps "node-ca" is forbidden: caches not synchronized)
```

You can use https://search.ci.openshift.org/?search=.*node-ca.*caches+not+synchronized.*&maxAge=336h&context=1&type=bug%2Bjunit&name=4.11&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job to find the failures.

You can use the Loki query `{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1536250110118203392"} | unpack |~ "NodeCADaemonControllerDegraded"` to find the timing of the degradation (and its eventual clearing).

You can review the build log (sample: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1536250110118203392/build-log.txt) and match the timings of the test run with the events in Loki.

You can use https://sippy.dptools.openshift.org/sippy-ng/tests/4.11/analysis?test=%5Bsig-arch%5D%5BEarly%5D%20Managed%20cluster%20should%20start%20all%20core%20operators%20%5BSkipped%3ADisconnected%5D%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D to see the trend in recent failures.

It seems to be a timing issue around the inability to connect to a dependent system.
Not sure whether a check of dependent systems can be done before reporting ready, or whether a tighter degraded-state recovery loop could be considered.

The image-registry operator NodeCADaemonProgressing condition is added in 4.12.0-0.nightly-2022-07-22-010532:
```yaml
- lastTransitionTime: "2022-07-22T06:43:54Z"
  message: |-
    Available: The registry is ready
    NodeCADaemonAvailable: The daemon set node-ca has available replicas
    ImagePrunerAvailable: Pruner CronJob has been created
  reason: Ready
  status: "True"
  type: Available
- lastTransitionTime: "2022-07-22T06:44:57Z"
  message: |-
    Progressing: The registry is ready
    NodeCADaemonProgressing: The daemon set node-ca is deployed
  reason: Ready
  status: "False"
  type: Progressing
- lastTransitionTime: "2022-07-22T03:18:56Z"
  reason: AsExpected
  status: "False"
  type: Degraded
```
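The NodeCADaemonProgressing message above, together with the Doc Text note that the operator previously used `generation` from an incorrect object, suggests the fix derives a Progressing condition from the daemon set's own rollout state. A hedged sketch of what such a check typically looks like, with a simplified stand-in type rather than the real `appsv1.DaemonSet` (this is not the operator's actual implementation):

```go
package main

import "fmt"

// DaemonSetStatusView is a simplified stand-in for the fields of
// appsv1.DaemonSet that a controller inspects when deciding whether a
// rollout is still in progress. Illustrative only.
type DaemonSetStatusView struct {
	Generation             int64 // metadata.generation of the daemon set itself
	ObservedGeneration     int64 // status.observedGeneration
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberAvailable        int32
}

// nodeCAProgressing reports whether the node-ca daemon set should still be
// considered progressing. The key point of the fix is that the generation
// comparison uses the daemon set's own generation, not a generation taken
// from some other object.
func nodeCAProgressing(ds DaemonSetStatusView) bool {
	if ds.ObservedGeneration != ds.Generation {
		return true // controller has not observed the latest spec yet
	}
	if ds.UpdatedNumberScheduled < ds.DesiredNumberScheduled {
		return true // updated pods are still being rolled out
	}
	if ds.NumberAvailable < ds.DesiredNumberScheduled {
		return true // some pods are not yet available
	}
	return false
}

func main() {
	midRollout := DaemonSetStatusView{
		Generation: 2, ObservedGeneration: 1,
		DesiredNumberScheduled: 6, UpdatedNumberScheduled: 3, NumberAvailable: 3,
	}
	deployed := DaemonSetStatusView{
		Generation: 2, ObservedGeneration: 2,
		DesiredNumberScheduled: 6, UpdatedNumberScheduled: 6, NumberAvailable: 6,
	}
	fmt.Println("mid-rollout progressing:", nodeCAProgressing(midRollout)) // true
	fmt.Println("deployed progressing:", nodeCAProgressing(deployed))     // false
}
```

With a condition like this, an installer can distinguish "not done yet" (Progressing=True) from "broken" (Degraded=True) and simply wait until the daemon set is fully deployed, which matches the conditions shown above.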
Also checked https://search.ci.openshift.org/?search=.*node-ca.*caches+not+synchronized.*&maxAge=336h&context=1&type=bug%2Bjunit&name=4.11&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job; there are no failures for 4.12 so far. Marking this verified, and we will continue to monitor the 4.12 results.
In 4.12, https://search.ci.openshift.org/?search=.*node-ca.*caches+not+synchronized.*&maxAge=336h&context=1&type=bug%2Bjunit&name=4.12&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job still shows lots of errors, so reopening this bug:

```
Jul 17 22:39:08.757: FAIL: Some cluster operators are not ready: image-registry (Degraded=True NodeCADaemonControllerError: NodeCADaemonControllerDegraded: failed to update object *v1.DaemonSet, Namespace=openshift-image-registry, Name=node-ca: daemonsets.apps "node-ca" is forbidden: caches not synchronized)
```

Moving the bug to verified; the issue is no longer reported in 4.12.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399