Bug 2093440 - [sig-arch][Early] Managed cluster should start all core operators - NodeCADaemonControllerDegraded: failed to update object
Summary: [sig-arch][Early] Managed cluster should start all core operators - NodeCADa...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: medium
Target Milestone: ---
Target Release: 4.12.0
Assignee: Oleg Bulatov
QA Contact: XiuJuan Wang
URL:
Whiteboard:
Depends On:
Blocks: 2110958
 
Reported: 2022-06-03 17:32 UTC by Forrest Babcock
Modified: 2023-01-17 19:49 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, the Image Registry Operator did not have a `progressing` condition for the `node-ca` daemon set and used `generation` from an incorrect object. Consequently, the `node-ca` daemon set could be marked as `degraded` while the Operator was still running. This update adds the `progressing` condition, which indicates that the installation is not complete. As a result, the Image Registry Operator successfully installs the `node-ca` daemon set and the installer waits until it is fully deployed. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2093440[*BZ#2093440*])
Clone Of:
Environment:
Last Closed: 2023-01-17 19:49:30 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift apiserver-library-go pull 84 0 None Waiting on Customer Rados gateway replication very slow in multisite setup 2022-06-27 14:48:51 UTC
Github openshift cluster-image-registry-operator pull 789 0 None open Bug 2093440: Add progressing condition for node-ca daemon set 2022-07-21 09:07:36 UTC
Github openshift cluster-image-registry-operator pull 790 0 None open Bug 2093440: Use actualDaemonSet for SetDaemonSetGeneration 2022-08-01 12:08:47 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:49:50 UTC

Description Forrest Babcock 2022-06-03 17:32:16 UTC
Description of problem:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-serial/1532360939641245696

Had a test failure
[sig-arch][Early] Managed cluster should start all core operators [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

Jun 2 14:41:49.472: Some cluster operators are not ready: image-registry (Degraded=True NodeCADaemonControllerError: NodeCADaemonControllerDegraded: failed to update object *v1.DaemonSet, Namespace=openshift-image-registry, Name=node-ca: daemonsets.apps "node-ca" is forbidden: caches not synchronized)

Querying Loki for
{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-serial/1532360939641245696"}| unpack |~ "caches not synchronized"

during the 2022-06-02 10:00:00 - 10:36:00 timeframe indicated a spike in these messages

Code scan:
https://github.com/search?q=org%3Aopenshift+%22caches+not+synchronized%22&type=code

shows that https://github.com/openshift/apiserver-library-go/blob/02e0e71ffa9ad96c2a4dc6d4b1ff6f850287e8ef/pkg/admission/quota/clusterresourcequota/admission.go

emits this message without a plugin name prefix. As a first step, we are looking to add that prefix.


Version-Release number of selected component (if applicable):
4.11

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Forrest Babcock 2022-06-09 14:09:04 UTC
Adding internal whiteboard: trt
Documenting search.ci query that highlights these failures

https://search.ci.openshift.org/?search=.*node-ca.*caches+not+synchronized.*&maxAge=336h&context=1&type=bug%2Bjunit&name=4.11&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Link to jira where we are collecting information as we research more https://issues.redhat.com/browse/TRT-285

Comment 3 Forrest Babcock 2022-06-13 16:15:33 UTC
What we are finding is that the cluster-image-registry-operator appears to process an item before all of the operators/services are available, and goes into a degraded state.

https://github.com/openshift/cluster-image-registry-operator/blob/2b82360000454e7e02b6f3b46d1354546dffccbc/pkg/operator/nodecadaemon.go#L84
https://github.com/openshift/cluster-image-registry-operator/blob/2b82360000454e7e02b6f3b46d1354546dffccbc/pkg/resource/nodecadaemon.go#L72
https://github.com/openshift/apiserver-library-go/blob/02e0e71ffa9ad96c2a4dc6d4b1ff6f850287e8ef/pkg/admission/quota/clusterresourcequota/admission.go#L105

processNextWorkItem -> sync -> ... -> Update. We presume the update leads to the apiserver-library quota.openshift.io/ClusterResourceQuota -> Validate call, which fails.
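The processNextWorkItem -> sync -> Update chain can be sketched as a simplified controller loop. This is a toy model, not the operator's actual code: `sync`, `processNextWorkItem`, the counter, and the error text are illustrative stand-ins for the real workqueue-driven controller:

```go
package main

import (
	"errors"
	"fmt"
)

// errForbidden models the apiserver rejecting the DaemonSet update while
// the admission plugin's caches are cold.
var errForbidden = errors.New(`daemonsets.apps "node-ca" is forbidden: caches not synchronized`)

// sync stands in for the controller's sync: it attempts to update the
// node-ca DaemonSet and surfaces any apiserver error to the caller.
// updatesUntilReady simulates how many attempts fail before caches warm up.
func sync(updatesUntilReady *int) error {
	if *updatesUntilReady > 0 {
		*updatesUntilReady--
		return fmt.Errorf("failed to update object *v1.DaemonSet, Namespace=openshift-image-registry, Name=node-ca: %w", errForbidden)
	}
	return nil
}

// processNextWorkItem mirrors the loop above: on a sync error the item is
// requeued and the error is reported, which is what flips the operator's
// Degraded condition while the cluster is still coming up.
func processNextWorkItem(updatesUntilReady *int) (requeued bool, err error) {
	if err := sync(updatesUntilReady); err != nil {
		return true, err // the real controller requeues with backoff
	}
	return false, nil
}

func main() {
	remaining := 2 // first two attempts hit cold caches
	for attempt := 1; ; attempt++ {
		requeued, err := processNextWorkItem(&remaining)
		fmt.Printf("attempt %d: requeued=%v err=%v\n", attempt, requeued, err)
		if !requeued {
			break
		}
	}
}
```

The point of the sketch: each failed Update is reported as an error even though a later requeue will succeed, so the operator looks Degraded during a window that is really just startup ordering.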





2022-06-02T14:01:17Z Running multi-stage test e2e-aws-serial

2022-06-02 10:31:22   ...

{"Available: The deployment does not have available replicas\nNodeCADaemonAvailable: The daemon set node-ca has available replicas\nImagePrunerAvailable: Pruner CronJob has been created" -> "Available: The registry is ready\nNodeCADaemonAvailable: The daemon set node-ca has available replicas\nImagePrunerAvailable: Pruner CronJob has been created"} 

 

2022-06-02 10:33:53   ...

E0602 14:33:53.462338 1 nodecadaemon.go:86] NodeCADaemonController: unable to sync: failed to update object *v1.DaemonSet, Namespace=openshift-image-registry, Name=node-ca: daemonsets.apps "node-ca" is forbidden: caches not synchronized, requeuing

 




Then later, when the remaining operators have finished progressing, the tests kick off, but the image-registry operator is still in a degraded state and the test fails.



Thu Jun 2 14:41:46 UTC 2022 - all clusteroperators are done progressing.

 

Jun 2 14:41:49.472: FAIL: Some cluster operators are not ready: image-registry (Degraded=True NodeCADaemonControllerError: NodeCADaemonControllerDegraded: failed to update object *v1.DaemonSet, Namespace=openshift-image-registry, Name=node-ca: daemonsets.apps "node-ca" is forbidden: caches not synchronized)


You can use https://search.ci.openshift.org/?search=.*node-ca.*caches+not+synchronized.*&maxAge=336h&context=1&type=bug%2Bjunit&name=4.11&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job to find the failures
You can use Loki {invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1536250110118203392"} | unpack |~ "NodeCADaemonControllerDegraded" to find the timing for the degradation (and eventual clearing)
You can review the build log (sample: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade/1536250110118203392/build-log.txt) and match the timings of the test run with the events in Loki
You can use https://sippy.dptools.openshift.org/sippy-ng/tests/4.11/analysis?test=%5Bsig-arch%5D%5BEarly%5D%20Managed%20cluster%20should%20start%20all%20core%20operators%20%5BSkipped%3ADisconnected%5D%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D to see the trend with recent failures

It seems to be a timing issue around the inability to connect to a dependent system. Not sure whether the image-registry operator could check dependent systems before reporting ready, or whether a tighter degraded-state recovery loop could be considered.
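The fix that eventually landed (PRs 789 and 790 above) reports Progressing while the daemon set rolls out and reads generation from the actual DaemonSet object. A minimal sketch of that intent, with an assumed `daemonSetStatus` struct and `nodeCAProgressing` helper rather than the operator's real types:

```go
package main

import "fmt"

// daemonSetStatus captures just the fields the check consults; in the real
// operator these come from the DaemonSet returned by the update (the
// "actual" object, not a stale cached copy).
type daemonSetStatus struct {
	Generation             int64 // spec generation on the object
	ObservedGeneration     int64 // generation the DaemonSet controller has acted on
	DesiredNumberScheduled int32
	NumberAvailable        int32
}

// nodeCAProgressing reports whether the node-ca daemon set is still rolling
// out: either the controller has not observed the latest spec yet, or some
// desired pods are not available. While this is true the operator should
// report Progressing rather than Degraded.
func nodeCAProgressing(s daemonSetStatus) bool {
	return s.ObservedGeneration < s.Generation ||
		s.NumberAvailable < s.DesiredNumberScheduled
}

func main() {
	rollingOut := daemonSetStatus{Generation: 2, ObservedGeneration: 1, DesiredNumberScheduled: 6, NumberAvailable: 3}
	deployed := daemonSetStatus{Generation: 2, ObservedGeneration: 2, DesiredNumberScheduled: 6, NumberAvailable: 6}
	fmt.Println(nodeCAProgressing(rollingOut), nodeCAProgressing(deployed)) // true false
}
```

With a Progressing condition in place, the installer waits for the rollout to finish instead of the early test run seeing a transient Degraded state.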

Comment 7 XiuJuan Wang 2022-07-22 07:46:22 UTC
NodeCADaemonProgressing is added in 4.12.0-0.nightly-2022-07-22-010532

  - lastTransitionTime: "2022-07-22T06:43:54Z"
    message: |-
      Available: The registry is ready
      NodeCADaemonAvailable: The daemon set node-ca has available replicas
      ImagePrunerAvailable: Pruner CronJob has been created
    reason: Ready
    status: "True"
    type: Available
  - lastTransitionTime: "2022-07-22T06:44:57Z"
    message: |-
      Progressing: The registry is ready
      NodeCADaemonProgressing: The daemon set node-ca is deployed
    reason: Ready
    status: "False"
    type: Progressing
  - lastTransitionTime: "2022-07-22T03:18:56Z"
    reason: AsExpected
    status: "False"
    type: Degraded

Checking https://search.ci.openshift.org/?search=.*node-ca.*caches+not+synchronized.*&maxAge=336h&context=1&type=bug%2Bjunit&name=4.11&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job shows no failures for 4.12 so far. Marking this verified; will continue to monitor the 4.12 results.

Comment 8 XiuJuan Wang 2022-07-27 03:50:05 UTC
In 4.12 https://search.ci.openshift.org/?search=.*node-ca.*caches+not+synchronized.*&maxAge=336h&context=1&type=bug%2Bjunit&name=4.12&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
still shows lots of errors, so reopening this bug:
Jul 17 22:39:08.757: FAIL: Some cluster operators are not ready: image-registry (Degraded=True NodeCADaemonControllerError: NodeCADaemonControllerDegraded: failed to update object *v1.DaemonSet, Namespace=openshift-image-registry, Name=node-ca: daemonsets.apps "node-ca" is forbidden: caches not synchronized)

Comment 11 XiuJuan Wang 2022-08-02 10:21:12 UTC
Moving the bug to verified; the issue is no longer reported in 4.12.

Comment 14 errata-xmlrpc 2023-01-17 19:49:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

