Bug 1927168 - The deleted crd/servicemonitors.monitoring.coreos.com is not recreated quickly
Summary: The deleted crd/servicemonitors.monitoring.coreos.com is not recreated quickly
Keywords:
Status: MODIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
medium
low
Target Milestone: ---
: ---
Assignee: W. Trevor King
QA Contact: Yang Yang
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-02-10 09:25 UTC by Junqi Zhao
Modified: 2022-08-05 17:18 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The cluster-version operator waited up to five minutes for ClusterOperator resources to become happy before giving up on them and beginning another sync cycle. Consequence: Waiting on an unhappy ClusterOperator resource could delay the cluster-version operator from reconciling manifests which occurred earlier in the manifest graph. Fix: The cluster-version operator no longer waits for ClusterOperator resources. It immediately fails that sync cycle on them, and begins a new sync cycle. Result: The cluster-version operator will now take less time before reconciling earlier manifests.
Clone Of:
Environment:
Last Closed: 2021-04-08 03:31:53 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 560 0 None Merged Bug 1927168: pkg/cvo/internal/operatorstatus: Replace wait-for with single-shot "is it alive now?" 2022-05-12 23:17:19 UTC
Github openshift cluster-version-operator pull 569 0 None Merged Bug 1927168: pkg/cvo/sync_worker: Increment Attempt on failed reconciliation 2022-05-12 23:17:25 UTC

Description Junqi Zhao 2021-02-10 09:25:10 UTC
Description of problem:
4.7.0-0.nightly-2021-02-09-192846 cluster, removed crd/servicemonitors.monitoring.coreos.com which should be reconciled by CVO, it is recreated later, but try again with the steps, it is not recrated.
also tried in other 4.7 cluster, it is not recreated at all after it is removed.

1. delete for the first time
# oc get crd/servicemonitors.monitoring.coreos.com
NAME                                    CREATED AT
servicemonitors.monitoring.coreos.com   2021-02-09T23:40:33Z
# oc delete crd/servicemonitors.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" deleted
# while true; do date; oc get crd/servicemonitors.monitoring.coreos.com; sleep 20s; done
Wed Feb 10 02:22:55 EST 2021
NAME                                    CREATED AT
servicemonitors.monitoring.coreos.com   2021-02-10T08:55:22Z
Wed Feb 10 02:23:16 EST 2021
NAME                                    CREATED AT
servicemonitors.monitoring.coreos.com   2021-02-10T08:55:22Z

2. repeat the steps
# oc get crd/servicemonitors.monitoring.coreos.com
NAME                                    CREATED AT
servicemonitors.monitoring.coreos.com   2021-02-10T08:55:22Z
# oc delete crd/servicemonitors.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" deleted
# while true; do date; oc get crd/servicemonitors.monitoring.coreos.com; sleep 20s; done
Wed Feb 10 03:54:53 EST 2021
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" not found
Wed Feb 10 03:55:13 EST 2021
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" not found
Wed Feb 10 03:55:33 EST 2021
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" not found
Wed Feb 10 03:55:53 EST 2021
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" not found
...
Wed Feb 10 04:03:18 EST 2021
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" not found
...
Wed Feb 10 04:06:20 EST 2021
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" not found

Could not update servicemonitor errors in CVO logs
2021-02-10T09:20:27.701991313Z I0210 09:20:27.701944       1 sync_worker.go:937] Update error 635 of 668: UpdatePayloadResourceTypeMissing Could not update servicemonitor "openshift-kube-scheduler-operator/kube-scheduler-operator" (635 of 668): the server does not recognize this resource, check extension API servers (*errors.withStack: failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1")
2021-02-10T09:20:27.702001158Z I0210 09:20:27.701989       1 sync_worker.go:937] Update error 608 of 668: UpdatePayloadResourceTypeMissing Could not update servicemonitor "openshift-dns-operator/dns-operator" (608 of 668): the server does not recognize this resource, check extension API servers (*errors.withStack: failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1")
2021-02-10T09:20:27.702036790Z E0210 09:20:27.702021       1 sync_worker.go:353] unable to synchronize image (waiting 2m50.956499648s): Multiple errors are preventing progress:
2021-02-10T09:20:27.702036790Z * Could not update servicemonitor "openshift-dns-operator/dns-operator" (608 of 668): the server does not recognize this resource, check extension API servers
2021-02-10T09:20:27.702036790Z * Could not update servicemonitor "openshift-kube-scheduler-operator/kube-scheduler-operator" (635 of 668): the server does not recognize this resource, check extension API servers
...
Version-Release number of the following components:
4.7.0-0.nightly-2021-02-09-192846

How reproducible:
not sure

Steps to Reproduce:
1. see the steps
2.
3.

Actual results:
the removed crd/servicemonitors.monitoring.coreos.com is not recreated 

Expected results:
should be recreated

Additional info:

Comment 2 W. Trevor King 2021-03-22 18:06:42 UTC
Looking at comment 1's must-gather, here are the CVO logs:

  $ tar tvf dump.tar.gz | grep 'cluster-version.*current'
  -rwxr-xr-x root/root  26775831 2021-02-10 01:23 must-gather.local.2798470287921355997/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b007f593d45b7dea11ee9d7cd4ef9c1d791cf1309db39583aaa9946adba3596e/namespaces/openshift-cluster-version/pods/cluster-version-operator-dcd5d88f8-kzrwk/cluster-version-operator/cluster-version-operator/logs/current.log

We're in the reconciling state, so task-node ordering should be randomized:

  $ tar xOz must-gather.local.2798470287921355997/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b007f593d45b7dea11ee9d7cd4ef9c1d791cf1309db39583aaa9946adba3596e/namespaces/openshift-cluster-version/pods/cluster-version-operator-dcd5d88f8-kzrwk/cluster-version-operator/cluster-version-operator/logs/current.log <dump.tar.gz | grep -o 'Running sync.*in state.*attempt' | uniq -c
     47 Running sync 4.7.0-0.nightly-2021-02-09-192846 (force=false) on generation 1 in state Reconciling at attempt

Looking at the CRDs we attempt to sync:

  $ tar xOz must-gather.local.2798470287921355997/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b007f593d45b7dea11ee9d7cd4ef9c1d791cf1309db39583aaa9946adba3596e/namespaces/openshift-cluster-version/pods/cluster-version-operator-dcd5d88f8-kzrwk/cluster-version-operator/cluster-version-operator/logs/current.log <dump.tar.gz | grep -o 'Running sync for customresourcedefinition.*' | sort | uniq
  ...
  Running sync for customresourcedefinition "servicecas.operator.openshift.io" (519 of 668)
  Running sync for customresourcedefinition "servicemonitors.monitoring.coreos.com" (322 of 668)
  Running sync for customresourcedefinition "storages.operator.openshift.io" (397 of 668)
  ...

So it's in there.  Syncs seem mostly successful:

  $ tar xOz must-gather.local.2798470287921355997/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b007f593d45b7dea11ee9d7cd4ef9c1d791cf1309db39583aaa9946adba3596e/namespaces/openshift-cluster-version/pods/cluster-version-operator-dcd5d88f8-kzrwk/cluster-version-operator/cluster-version-operator/logs/current.log <dump.tar.gz | grep -c 'Done syncing for cust
omresourcedefinition.*servicemonitors.monitoring.coreos.com'
  41

Taking a closer look at the sync cycle history:

  $ tar xOz must-gather.local.2798470287921355997/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b007f593d45b7dea11ee9d7cd4ef9c1d791cf1309db39583aaa9946adba3596e/namespaces/openshift-cluster-version/pods/cluster-version-operator-dcd5d88f8-kzrwk/cluster-version-operator/cluster-version-operator/logs/current.log <dump.tar.gz | grep 'Running sync.*in state\|Result of work'
  2021-02-10T05:46:59.697454340Z I0210 05:46:59.697444       1 task_graph.go:555] Result of work: []
  2021-02-10T05:49:50.654192682Z I0210 05:49:50.654167       1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-02-09-192846 (force=false) on generation 1 in state Reconciling at attempt 90
  2021-02-10T05:50:40.161208061Z I0210 05:50:40.161188       1 task_graph.go:555] Result of work: []
  2021-02-10T05:53:31.117926328Z I0210 05:53:31.117912       1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-02-09-192846 (force=false) on generation 1 in state Reconciling at attempt 91
  ...more happiness...
  2021-02-10T07:47:16.274953570Z I0210 07:47:16.274940       1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-02-09-192846 (force=false) on generation 1 in state Reconciling at attempt 122
  2021-02-10T07:48:05.882705020Z I0210 07:48:05.882694       1 task_graph.go:555] Result of work: []
  2021-02-10T07:50:56.839494267Z I0210 07:50:56.839466       1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-02-09-192846 (force=false) on generation 1 in state Reconciling at attempt 123
  2021-02-10T07:56:42.433101045Z I0210 07:56:42.433087       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-marketplace/marketplace-operator" (514 of 668): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-dns-operator/dns-operator" (608 of 668): the server does not recognize this resource, check extension API servers]
  2021-02-10T07:58:15.262012761Z I0210 07:58:15.261996       1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-02-09-192846 (force=false) on generation 1 in state Reconciling at attempt 124
  2021-02-10T08:03:57.175215731Z I0210 08:03:57.175197       1 task_graph.go:555] Result of work: [Cluster operator network is degraded]
  2021-02-10T08:06:51.010030287Z I0210 08:06:51.009987       1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-02-09-192846 (force=false) on generation 1 in state Reconciling at attempt 125
  2021-02-10T08:07:40.720842813Z I0210 08:07:40.720793       1 task_graph.go:555] Result of work: []
  2021-02-10T08:10:31.677583725Z I0210 08:10:31.677559       1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-02-09-192846 (force=false) on generation 1 in state Reconciling at attempt 126
  ...

Comparing that with your not-founds from:

  $ date --iso=m --utc --date 'Wed Feb 10 04:06:20 EST 2021'
  2021-02-10T09:06+00:00

That looks like the recovery we see is from your first deletion.  Then maybe you just don't wait long enough to see the second recovery?  Picking on the DNS ServiceMonitor:

  $ tar xOz must-gather.local.2798470287921355997/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b007f593d45b7dea11ee9d7cd4ef9c1d791cf1309db39583aaa9946adba3596e/namespaces/openshift-cluster-version/pods/cluster-version-operator-dcd5d88f8-kzrwk/cluster-version-operator/cluster-version-operator/logs/current.log <dump.tar.gz | grep 'servicemonitor.*openshift-dns-operator/dns-operator\|customresourcedefinition.*servicemonitors.monitoring.coreos.com\|Result of work'
  ...
  2021-02-10T09:08:14.204527560Z I0210 09:08:14.204517       1 task_graph.go:555] Result of work: []
  2021-02-10T09:11:15.232193170Z I0210 09:11:15.232179       1 sync_worker.go:765] Running sync for servicemonitor "openshift-dns-operator/dns-operator" (608 of 668)
  2021-02-10T09:11:15.434221813Z I0210 09:11:15.434188       1 sync_worker.go:777] Done syncing for servicemonitor "openshift-dns-operator/dns-operator" (608 of 668)
  2021-02-10T09:11:19.482706009Z I0210 09:11:19.482697       1 sync_worker.go:765] Running sync for customresourcedefinition "servicemonitors.monitoring.coreos.com" (322 of 668)
  2021-02-10T09:11:19.577490007Z I0210 09:11:19.577461       1 request.go:591] Throttling request took 93.787518ms, request: GET:https://api-int.qe-ui47-0210.qe.devcluster.openshift.com:6443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/servicemonitors.monitoring.coreos.com
  2021-02-10T09:11:19.584637359Z I0210 09:11:19.584604       1 sync_worker.go:777] Done syncing for customresourcedefinition "servicemonitors.monitoring.coreos.com" (322 of 668)
  2021-02-10T09:11:54.832103115Z I0210 09:11:54.832092       1 task_graph.go:555] Result of work: []
  2021-02-10T09:14:57.696018248Z I0210 09:14:57.695988       1 sync_worker.go:765] Running sync for servicemonitor "openshift-dns-operator/dns-operator" (608 of 668)
  2021-02-10T09:15:01.944386561Z E0210 09:15:01.944339       1 task.go:112] error running apply for servicemonitor "openshift-dns-operator/dns-operator" (608 of 668): failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
  2021-02-10T09:15:27.881377183Z E0210 09:15:27.881338       1 task.go:112] error running apply for servicemonitor "openshift-dns-operator/dns-operator" (608 of 668): failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
  ...more 'no matches for kind'...
  2021-02-10T09:20:24.353691910Z E0210 09:20:24.353648       1 task.go:112] error running apply for servicemonitor "openshift-dns-operator/dns-operator" (608 of 668): failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
  2021-02-10T09:20:27.701963108Z I0210 09:20:27.701925       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-kube-scheduler-operator/kube-scheduler-operator" (635 of 668): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-dns-operator/dns-operator" (608 of 668): the server does not recognize this resource, check extension API servers]
  2021-02-10T09:20:27.702001158Z I0210 09:20:27.701989       1 sync_worker.go:937] Update error 608 of 668: UpdatePayloadResourceTypeMissing Could not update servicemonitor "openshift-dns-operator/dns-operator" (608 of 668): the server does not recognize this resource, check extension API servers (*errors.withStack: failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1")
  2021-02-10T09:20:27.702036790Z * Could not update servicemonitor "openshift-dns-operator/dns-operator" (608 of 668): the server does not recognize this resource, check extension API servers

So that particular round, we didn't happen to have any of our manifest workers pick up the task-node with the CRD manifest.  Next round we would have shuffled again, and eventually one of our workers will pick it up and recover the cluster.  So I think this is functioning as designed, and I would expect the cluster to eventually recover.  One thing that is slowing us down is the fact that we are waiting to see if the CRD shows up, instead of failing fast on "no matches for kind" and starting in on that next sync cycle.  If we did something like that, we'd probably want to check whether we were waiting for CRDs to be functional when processing CRD manifests, so we don't push a CRD manifest and then immediately fail a subsequent manifest for a CR of that new type.

Comment 3 W. Trevor King 2021-04-08 03:31:53 UTC
So based on comment 2, we are working as designed and will eventually give up on that sync cycle, come back in, push the CRD, and recover.  We're just slow.  I expect "deleted a CVO-managed CR" to be rare, so optimizing the speed of recovery seems more like an RFE than a bug, although feel free to push back here if you disagree.  I've opened [1] to track this on the RFE side, and am closing this bug DEFERRED for now.

[1]: https://issues.redhat.com/browse/OTA-388

Comment 4 W. Trevor King 2021-05-13 05:10:40 UTC
I'm going to re-open this, because I think I can mitigate it in a way that will help with bug 1951835.

Comment 6 W. Trevor King 2021-05-14 16:14:46 UTC
One possible test plan would be running:

  $ oc delete crd/servicemonitors.monitoring.coreos.com

on an otherwise stable cluster and measuring the time from that deletion until the CVO recreates the resource.  It will be somewhat random, because the CVO's reconciliation-mode workers walk a randomly-shuffled manifest graph, but on average we should be faster now.  The more degraded or otherwise-sad ClusterOperator there are, the more noticeable the difference should be, so deleting a control-plane machine, which will make a number of operators mad, might be helpful.

Comment 7 Yang Yang 2021-05-17 11:54:14 UTC
Attempting to verify with 4.8.0-0.nightly-2021-05-15-141455

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-15-141455   True        False         160m    Cluster version is 4.8.0-0.nightly-2021-05-15-141455


# oc get crd/servicemonitors.monitoring.coreos.com
NAME                                    CREATED AT
servicemonitors.monitoring.coreos.com   2021-05-17T07:02:36Z

# oc delete crd/servicemonitors.monitoring.coreos.com; while true; do date; oc get crd/servicemonitors.monitoring.coreos.com; sleep 2s; done

customresourcedefinition.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" deleted

Mon May 17 06:17:24 EDT 2021
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" not found
......

Mon May 17 06:44:00 EDT 2021
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" not found


The CVO does not recreate it after waiting for over 25min. Checking the CVO logs:

# grep 'servicemonitors.monitoring.coreos.com\|Result of work' /tmp/cvo.log
"servicemonitors.monitoring.coreos.com" (330 of 676)
I0517 10:10:16.958794       1 task_graph.go:555] Result of work: []
I0517 10:13:11.523149       1 sync_worker.go:760] Running sync for customresourcedefinition "servicemonitors.monitoring.coreos.com" (330 of 676)
I0517 10:13:11.618646       1 request.go:591] Throttling request took 94.231261ms, request: GET:https://api-int.yangyang0517.qe.devcluster.openshift.com:6443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/servicemonitors.monitoring.coreos.com
I0517 10:13:11.624494       1 sync_worker.go:772] Done syncing for customresourcedefinition "servicemonitors.monitoring.coreos.com" (330 of 676)
I0517 10:13:34.522289       1 task_graph.go:555] Result of work: []
I0517 10:16:29.039275       1 sync_worker.go:760] Running sync for customresourcedefinition "servicemonitors.monitoring.coreos.com" (330 of 676)
I0517 10:16:29.133714       1 request.go:591] Throttling request took 93.593144ms, request: GET:https://api-int.yangyang0517.qe.devcluster.openshift.com:6443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/servicemonitors.monitoring.coreos.com
I0517 10:16:29.141120       1 sync_worker.go:772] Done syncing for customresourcedefinition "servicemonitors.monitoring.coreos.com" (330 of 676)
I0517 10:16:52.186737       1 task_graph.go:555] Result of work: []
I0517 10:25:25.057045       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers]
I0517 10:34:04.077617       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers]

It seems like it ran into a separate issue.

Comment 8 Yang Yang 2021-05-18 07:22:32 UTC
Get back to comment#0, it's saying

>also tried in other 4.7 cluster, it is not recreated at all after it is removed

So the issue described in comment#7 is same as comment#0. It's not related to how much time CVO takes to recover it.

$ oc delete crd/servicemonitors.monitoring.coreos.com

The crd/servicemonitors.monitoring.coreos.com is not recreated from the deletion.

Comment 10 W. Trevor King 2021-05-19 04:37:42 UTC
Looking at comment 9's CVO logs:

$ grep 'Running sync.*in state\|Result of work' cvo.log | tail    
I0517 11:03:57.364011       1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
I0517 11:09:39.277157       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers]
I0517 11:13:01.898768       1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
I0517 11:18:43.812689       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers]
I0517 11:21:35.246762       1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
I0517 11:27:17.163778       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers]
I0517 11:30:37.342895       1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
I0517 11:36:19.258793       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers update context deadline exceeded at 28 of 676 update context deadline exceeded at 28 of 676]
I0517 11:39:38.268693       1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
I0517 11:45:20.182573       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers]

So that's pretty consistent.  Looking at our randomization [1], I see we are using Attempt as our random seed, and from the logs^ you can see that Attempt is actually stuck at zero.  So we need to fix that, and get Attempt incrementing whenever we fail, even if we are still in reconciliation mode.

[1]: https://github.com/openshift/cluster-version-operator/blob/18361c7d1f28eea362cb7f88de0cc30ba78bcfd2/pkg/cvo/sync_worker.go#L718-L720

Comment 11 W. Trevor King 2021-05-19 18:33:13 UTC
Left-shifted testing for the new PR looks good [1] :)

[1]: https://github.com/openshift/cluster-version-operator/pull/569#issuecomment-844350859

Comment 13 Yang Yang 2021-05-24 06:16:45 UTC
Attempting to verify with 4.8.0-0.nightly-2021-05-21-233425.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-21-233425   True        False         173m    Cluster version is 4.8.0-0.nightly-2021-05-21-233425

# oc get crd/servicemonitors.monitoring.coreos.com
NAME                                    CREATED AT
servicemonitors.monitoring.coreos.com   2021-05-24T02:02:05Z

# oc delete crd/servicemonitors.monitoring.coreos.com; while true; do date; oc get crd/servicemonitors.monitoring.coreos.com; sleep 20s; done
customresourcedefinition.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" deleted
Mon May 24 01:18:35 EDT 2021
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" not found

......

Mon May 24 02:04:07 EDT 2021
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" not found

Waiting for more than 30 minutes, the crd/servicemonitors.monitoring.coreos.com is not recreated.

# grep 'Running sync.*in state\|Result of work' cvo2.log | tail
I0524 05:27:24.968530       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 675): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (626 of 675): the server does not recognize this resource, check extension API servers]
I0524 05:30:19.954106       1 sync_worker.go:541] Running sync 4.8.0-0.nightly-2021-05-21-233425 (force=false) on generation 1 in state Reconciling at attempt 1
I0524 05:36:01.867852       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-cluster-version/cluster-version-operator" (8 of 675): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-cluster-samples-operator/cluster-samples-operator" (383 of 675): the server does not recognize this resource, check extension API servers]
I0524 05:39:11.491409       1 sync_worker.go:541] Running sync 4.8.0-0.nightly-2021-05-21-233425 (force=false) on generation 1 in state Reconciling at attempt 2
I0524 05:44:53.405151       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-console-operator/console-operator" (612 of 675): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-cloud-credential-operator/cloud-credential-operator" (571 of 675): the server does not recognize this resource, check extension API servers]

The attempt does not stuck at zero any more. But the crd/servicemonitors.monitoring.coreos.com still failed to recover from deletion.

Comment 15 Yang Yang 2021-05-24 07:53:12 UTC
I didn't wait long enough. The crd/servicemonitors.monitoring.coreos.com is recreated finally. But it takes too long to get it available. 

#  while true; do date; oc get crd/servicemonitors.monitoring.coreos.com; sleep 20s; done
Mon May 24 03:30:30 EDT 2021
NAME                                    CREATED AT
servicemonitors.monitoring.coreos.com   2021-05-24T07:25:03Z

# grep 'Running sync.*in state\|Result of work' cvo2.log | tail
I0524 06:58:22.306083       1 sync_worker.go:541] Running sync 4.8.0-0.nightly-2021-05-21-233425 (force=false) on generation 1 in state Reconciling at attempt 11
I0524 07:04:04.220081       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-apiserver-operator/openshift-apiserver-operator" (661 of 675): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-image-registry/image-registry" (586 of 675): the server does not recognize this resource, check extension API servers]
I0524 07:07:16.249233       1 sync_worker.go:541] Running sync 4.8.0-0.nightly-2021-05-21-233425 (force=false) on generation 1 in state Reconciling at attempt 12
I0524 07:12:58.163218       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-service-ca-operator/service-ca-operator" (675 of 675): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-config-operator/config-operator" (122 of 675): the server does not recognize this resource, check extension API servers update context deadline exceeded at 9 of 675]
I0524 07:16:19.131691       1 sync_worker.go:541] Running sync 4.8.0-0.nightly-2021-05-21-233425 (force=false) on generation 1 in state Reconciling at attempt 13
I0524 07:22:01.044977       1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-config-operator/machine-config-daemon" (652 of 675): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-cluster-storage-operator/cluster-storage-operator" (608 of 675): the server does not recognize this resource, check extension API servers]
I0524 07:25:03.678251       1 sync_worker.go:541] Running sync 4.8.0-0.nightly-2021-05-21-233425 (force=false) on generation 1 in state Reconciling at attempt 14
I0524 07:30:45.601092       1 task_graph.go:555] Result of work: [Cluster operator network is degraded]
I0524 07:34:01.168487       1 sync_worker.go:541] Running sync 4.8.0-0.nightly-2021-05-21-233425 (force=false) on generation 1 in state Reconciling at attempt 15
I0524 07:39:43.081668       1 task_graph.go:555] Result of work: [Cluster operator network is degraded]

Comment 17 W. Trevor King 2021-05-24 20:45:08 UTC
Looking at CI to confirm the previous original commit from cvo#560 is functioning as expected, [1] is 4.8.0-0.nightly-2021-05-21-101954 -> 4.8.0-0.nightly-2021-05-21-233425:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade/1395886263273263104/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7b847959b7-kfbsh_cluster-version-operator.log | grep 'Running sync.*in state\|Result of work' | head -n2
I0522 01:37:26.003244       1 sync_worker.go:541] Running sync registry.build02.ci.openshift.org/ci-op-w6fys29v/release@sha256:29a647edfd270eb4d0eaf6ec48c2389be051bb41d4915fa63be7fabbf7975c87 (force=true) on generation 2 in state Updating at attempt 0
I0522 01:43:07.916351       1 task_graph.go:555] Result of work: [Cluster operator openshift-apiserver is degraded]

Hmm, that's ~5.5m.  Digging in on that ClusterOperator:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade/1395886263273263104/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7b847959b7-kfbsh_cluster-version-operator.log | grep ' 01:[34].*clusteroperator.*openshift-apiserver'
I0522 01:37:35.834673       1 sync_worker.go:742] Precreated resource clusteroperator "openshift-apiserver" (177 of 675)
I0522 01:37:37.531175       1 sync_worker.go:752] Running sync for clusteroperator "openshift-apiserver" (177 of 675)
E0522 01:37:37.531366       1 task.go:112] error running apply for clusteroperator "openshift-apiserver" (177 of 675): Cluster operator openshift-apiserver is degraded
...
E0522 01:42:52.354397       1 task.go:112] error running apply for clusteroperator "openshift-apiserver" (177 of 675): Cluster operator openshift-apiserver is degraded
I0522 01:43:40.444312       1 sync_worker.go:742] Precreated resource clusteroperator "openshift-apiserver" (177 of 675)
I0522 01:43:42.134572       1 sync_worker.go:752] Running sync for clusteroperator "openshift-apiserver" (177 of 675)
E0522 01:43:42.134797       1 task.go:112] error running apply for clusteroperator "openshift-apiserver" (177 of 675): Cluster operator openshift-apiserver is degraded
....
E0522 01:47:29.235334       1 task.go:112] error running apply for clusteroperator "openshift-apiserver" (177 of 675): Cluster operator openshift-apiserver is degraded
I0522 01:47:55.053649       1 sync_worker.go:764] Done syncing for clusteroperator "openshift-apiserver" (177 of 675)

Certainly looks like we're still waiting on that ClusterOperator and not failing the sync cycle fast.  Ah, because we're getting caught up in the retry loop at [2].  I'll see if there's a way to intelligently distinguish between fail-fast and retry behavior at that point.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade/1395886263273263104
[2]: https://github.com/openshift/cluster-version-operator/blob/77a24f77b8f6e9975b5f41a787d6368ff609e391/pkg/payload/task.go#L103-L107

Comment 18 W. Trevor King 2021-07-23 23:34:59 UTC
Getting close, but still some PRs in flight.

Comment 22 W. Trevor King 2022-08-05 17:18:26 UTC
[1]: https://github.com/openshift/cluster-version-operator/pull/577 just landed, including the change which had originally been part of [2].  I'll move back to MODIFIED, so we can take another look and see if [3] was enough to fix this one.

[1]: https://github.com/openshift/cluster-version-operator/pull/577
[2]: https://github.com/openshift/cluster-version-operator/pull/573
[3]: https://github.com/openshift/cluster-version-operator/pull/577/commits/ce7c4e7ad071d7e2d9c53745372c581445c6b3ec


Note You need to log in before you can comment on or make changes to this bug.