Description of problem:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-serial/1532360939641245696

Had a test failure:

[sig-arch][Early] Managed cluster should start all core operators [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

Jun 2 14:41:49.472: Some cluster operators are not ready: image-registry (Degraded=True NodeCADaemonControllerError: NodeCADaemonControllerDegraded: failed to update object *v1.DaemonSet, Namespace=openshift-image-registry, Name=node-ca: daemonsets.apps "node-ca" is forbidden: caches not synchronized)

Querying Loki for

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-serial/1532360939641245696"} | unpack |~ "caches not synchronized"

during the 2022-06-02 10:00:00 - 10:36:00 timeframe indicated a spike in these messages.

Code scan: https://github.com/search?q=org%3Aopenshift+%22caches+not+synchronized%22&type=code shows that https://github.com/openshift/apiserver-library-go/blob/02e0e71ffa9ad96c2a4dc6d4b1ff6f850287e8ef/pkg/admission/quota/clusterresourcequota/admission.go emits this message without a plugin name prefix. For starters we are looking to add that prefix.

Version-Release number of selected component (if applicable):
4.11

How reproducible:
Intermittent (CI flake).

Steps to Reproduce:
1.
2.
3.

Actual results:
The image-registry cluster operator reports Degraded=True (NodeCADaemonControllerDegraded: ... caches not synchronized) and the [sig-arch][Early] core-operators test fails.

Expected results:
All cluster operators are ready when the conformance tests start, and the test passes.

Additional info:
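For context, a minimal sketch of the kind of prefix change being proposed. This is not the merged patch: the plugin struct, syncedCh field, timeToWaitForCacheSync value, and validateCachesSynced helper are simplified stand-ins, assuming (per the error text) that the plugin rejects requests via admission.NewForbidden while its informer caches are cold.

package clusterresourcequota

import (
	"fmt"
	"time"

	"k8s.io/apiserver/pkg/admission"
)

const (
	pluginName             = "quota.openshift.io/ClusterResourceQuota" // assumed plugin name
	timeToWaitForCacheSync = 10 * time.Second                          // assumed timeout
)

type clusterQuotaAdmission struct {
	// syncedCh is closed once the quota informer caches have synced;
	// a stand-in for the real plugin's readiness tracking.
	syncedCh <-chan struct{}
}

// waitForSyncedStore blocks until the caches sync or the timer fires.
func (q *clusterQuotaAdmission) waitForSyncedStore(timeout <-chan time.Time) bool {
	select {
	case <-q.syncedCh:
		return true
	case <-timeout:
		return false
	}
}

// validateCachesSynced rejects the request while the caches are cold. The only
// change sketched here is naming the plugin in the error, so that
// 'daemonsets.apps "node-ca" is forbidden: ...' messages can be attributed.
func (q *clusterQuotaAdmission) validateCachesSynced(a admission.Attributes) error {
	if !q.waitForSyncedStore(time.After(timeToWaitForCacheSync)) {
		// Before: errors.New("caches not synchronized")
		return admission.NewForbidden(a, fmt.Errorf("%s: caches not synchronized", pluginName))
	}
	return nil
}

With the prefix in place, the code-scan query above would locate the emitting plugin directly from the log line.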
Adding internal whiteboard: trt

Documenting the search.ci query that highlights these failures:
https://search.ci.openshift.org/?search=.*node-ca.*caches+not+synchronized.*&maxAge=336h&context=1&type=bug%2Bjunit&name=4.11&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Link to the Jira where we are collecting information as we research more:
https://issues.redhat.com/browse/TRT-285
What we are finding is that the cluster-image-registry-operator attempts to process an item before all of the operators / services are available, and goes into a degraded state.

https://github.com/openshift/cluster-image-registry-operator/blob/2b82360000454e7e02b6f3b46d1354546dffccbc/pkg/operator/nodecadaemon.go#L84
https://github.com/openshift/cluster-image-registry-operator/blob/2b82360000454e7e02b6f3b46d1354546dffccbc/pkg/resource/nodecadaemon.go#L72
https://github.com/openshift/apiserver-library-go/blob/02e0e71ffa9ad96c2a4dc6d4b1ff6f850287e8ef/pkg/admission/quota/clusterresourcequota/admission.go#L105

processNextWorkItem -> sync -> ... -> Update

Presumably the Update leads to the apiserver-library-go quota.openshift.io/ClusterResourceQuota plugin's Validate call, which fails. A sketch of this flow follows at the end of this comment.

2022-06-02T14:01:17Z Running multi-stage test e2e-aws-serial

2022-06-02 10:31:22 ... {"Available: The deployment does not have available replicas\nNodeCADaemonAvailable: The daemon set node-ca has available replicas\nImagePrunerAvailable: Pruner CronJob has been created" -> "Available: The registry is ready\nNodeCADaemonAvailable: The daemon set node-ca has available replicas\nImagePrunerAvailable: Pruner CronJob has been created"}

2022-06-02 10:33:53 ... E0602 14:33:53.462338 1 nodecadaemon.go:86] NodeCADaemonController: unable to sync: failed to update object *v1.DaemonSet, Namespace=openshift-image-registry, Name=node-ca: daemonsets.apps "node-ca" is forbidden: caches not synchronized, requeuing

Then, later on, when the remaining operators have finished progressing, the tests kick off, but the image-registry operator is still in a degraded state and the test fails.

Thu Jun 2 14:41:46 UTC 2022 - all clusteroperators are done progressing.

Jun 2 14:41:49.472: FAIL: Some cluster operators are not ready: image-registry (Degraded=True NodeCADaemonControllerError: NodeCADaemonControllerDegraded: failed to update object *v1.DaemonSet, Namespace=openshift-image-registry, Name=node-ca: daemonsets.apps "node-ca" is forbidden: caches not synchronized)

You can use
https://search.ci.openshift.org/?search=.*node-ca.*caches+not+synchronized.*&maxAge=336h&context=1&type=bug%2Bjunit&name=4.11&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
to find the failures.

You can use the Loki query
{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1536250110118203392"} | unpack |~ "NodeCADaemonControllerDegraded"
to find the timing of the degradation (and its eventual clearing).

You can review the build log (sample: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1536250110118203392/build-log.txt) and match the timings of the test run up with the events in Loki.

You can use
https://sippy.dptools.openshift.org/sippy-ng/tests/4.11/analysis?test=%5Bsig-arch%5D%5BEarly%5D%20Managed%20cluster%20should%20start%20all%20core%20operators%20%5BSkipped%3ADisconnected%5D%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D
to see the trend of recent failures.

It seems to be a timing issue around the inability to connect to a dependent system. Not sure whether the image-registry operator could check its dependent systems before reporting ready, or whether a tighter degraded-state recovery loop could be considered.
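To make the failure mode concrete, here is a simplified, hypothetical sketch of the processNextWorkItem -> sync path described above. The setDegraded helper and field names are stand-ins, not the operator's actual code; the point is that a single failed Update marks the controller Degraded and requeues with backoff, so the Degraded condition can outlive the underlying cache-sync hiccup and still be set when the test suite samples operator status.

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
	"k8s.io/klog/v2"
)

type nodeCADaemonController struct {
	queue workqueue.RateLimitingInterface
	// setDegraded is a stand-in for the operator's status-update helper.
	setDegraded func(reason, message string)
}

func (c *nodeCADaemonController) processNextWorkItem() bool {
	key, quit := c.queue.Get()
	if quit {
		return false
	}
	defer c.queue.Done(key)

	if err := c.sync(); err != nil {
		// This is the path the build log shows: the DaemonSet update is
		// forbidden by admission, the controller goes Degraded and requeues.
		c.setDegraded("NodeCADaemonControllerError",
			fmt.Sprintf("failed to update object *v1.DaemonSet, Namespace=openshift-image-registry, Name=node-ca: %v", err))
		klog.Errorf("NodeCADaemonController: unable to sync: %v, requeuing", err)
		c.queue.AddRateLimited(key)
		return true
	}

	// Only a later successful sync clears the Degraded condition.
	c.setDegraded("AsExpected", "")
	c.queue.Forget(key)
	return true
}

func (c *nodeCADaemonController) sync() error {
	// Stand-in for the ApplyDaemonSet / Update call against the apiserver.
	return nil
}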
NodeCADaemonProgressing is added in 4.12.0-0.nightly-2022-07-22-010532:

  - lastTransitionTime: "2022-07-22T06:43:54Z"
    message: |-
      Available: The registry is ready
      NodeCADaemonAvailable: The daemon set node-ca has available replicas
      ImagePrunerAvailable: Pruner CronJob has been created
    reason: Ready
    status: "True"
    type: Available
  - lastTransitionTime: "2022-07-22T06:44:57Z"
    message: |-
      Progressing: The registry is ready
      NodeCADaemonProgressing: The daemon set node-ca is deployed
    reason: Ready
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-07-22T03:18:56Z"
    reason: AsExpected
    status: "False"
    type: Degraded

Checking https://search.ci.openshift.org/?search=.*node-ca.*caches+not+synchronized.*&maxAge=336h&context=1&type=bug%2Bjunit&name=4.11&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job, there is no failure for 4.12 so far. Marking this verified; we will continue to monitor the 4.12 results.
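For reference, the conditions above can be pulled programmatically rather than from a CI artifact. A small hypothetical helper (not part of any fix) using openshift/client-go, assuming a kubeconfig at the default location:

package main

import (
	"context"
	"fmt"

	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from ~/.kube/config (assumption: local kubeconfig access).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := configclient.NewForConfigOrDie(cfg)

	// Fetch the image-registry ClusterOperator and dump its conditions.
	co, err := client.ConfigV1().ClusterOperators().Get(context.TODO(), "image-registry", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for _, cond := range co.Status.Conditions {
		fmt.Printf("%s=%s (%s) since %s\n%s\n\n",
			cond.Type, cond.Status, cond.Reason, cond.LastTransitionTime, cond.Message)
	}
}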
In 4.12,
https://search.ci.openshift.org/?search=.*node-ca.*caches+not+synchronized.*&maxAge=336h&context=1&type=bug%2Bjunit&name=4.12&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
still shows lots of errors, so reopening this bug:

Jul 17 22:39:08.757: FAIL: Some cluster operators are not ready: image-registry (Degraded=True NodeCADaemonControllerError: NodeCADaemonControllerDegraded: failed to update object *v1.DaemonSet, Namespace=openshift-image-registry, Name=node-ca: daemonsets.apps "node-ca" is forbidden: caches not synchronized)
Moving the bug to verified; the issue is no longer reported in 4.12.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399