Description of problem:

On a 4.7.0-0.nightly-2021-02-09-192846 cluster, I removed crd/servicemonitors.monitoring.coreos.com, which should be reconciled by the CVO. It was recreated later, but when I repeated the steps it was not recreated. I also tried on another 4.7 cluster, where it was not recreated at all after removal.

1. Delete for the first time:

# oc get crd/servicemonitors.monitoring.coreos.com
NAME                                    CREATED AT
servicemonitors.monitoring.coreos.com   2021-02-09T23:40:33Z
# oc delete crd/servicemonitors.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" deleted
# while true; do date; oc get crd/servicemonitors.monitoring.coreos.com; sleep 20s; done
Wed Feb 10 02:22:55 EST 2021
NAME                                    CREATED AT
servicemonitors.monitoring.coreos.com   2021-02-10T08:55:22Z
Wed Feb 10 02:23:16 EST 2021
NAME                                    CREATED AT
servicemonitors.monitoring.coreos.com   2021-02-10T08:55:22Z

2. Repeat the steps:

# oc get crd/servicemonitors.monitoring.coreos.com
NAME                                    CREATED AT
servicemonitors.monitoring.coreos.com   2021-02-10T08:55:22Z
# oc delete crd/servicemonitors.monitoring.coreos.com
customresourcedefinition.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" deleted
# while true; do date; oc get crd/servicemonitors.monitoring.coreos.com; sleep 20s; done
Wed Feb 10 03:54:53 EST 2021
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" not found
Wed Feb 10 03:55:13 EST 2021
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" not found
Wed Feb 10 03:55:33 EST 2021
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" not found
Wed Feb 10 03:55:53 EST 2021
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" not found
...
Wed Feb 10 04:03:18 EST 2021
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" not found
...
Wed Feb 10 04:06:20 EST 2021
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" not found

"Could not update servicemonitor" errors in the CVO logs:

2021-02-10T09:20:27.701991313Z I0210 09:20:27.701944 1 sync_worker.go:937] Update error 635 of 668: UpdatePayloadResourceTypeMissing Could not update servicemonitor "openshift-kube-scheduler-operator/kube-scheduler-operator" (635 of 668): the server does not recognize this resource, check extension API servers (*errors.withStack: failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1")
2021-02-10T09:20:27.702001158Z I0210 09:20:27.701989 1 sync_worker.go:937] Update error 608 of 668: UpdatePayloadResourceTypeMissing Could not update servicemonitor "openshift-dns-operator/dns-operator" (608 of 668): the server does not recognize this resource, check extension API servers (*errors.withStack: failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1")
2021-02-10T09:20:27.702036790Z E0210 09:20:27.702021 1 sync_worker.go:353] unable to synchronize image (waiting 2m50.956499648s): Multiple errors are preventing progress:
2021-02-10T09:20:27.702036790Z * Could not update servicemonitor "openshift-dns-operator/dns-operator" (608 of 668): the server does not recognize this resource, check extension API servers
2021-02-10T09:20:27.702036790Z * Could not update servicemonitor "openshift-kube-scheduler-operator/kube-scheduler-operator" (635 of 668): the server does not recognize this resource, check extension API servers
...

Version-Release number of the following components:
4.7.0-0.nightly-2021-02-09-192846

How reproducible:
not sure

Steps to Reproduce:
1. see the steps above

Actual results:
The removed crd/servicemonitors.monitoring.coreos.com is not recreated.

Expected results:
It should be recreated.

Additional info:
Looking at comment 1's must-gather, here are the CVO logs:

$ tar tvf dump.tar.gz | grep 'cluster-version.*current'
-rwxr-xr-x root/root 26775831 2021-02-10 01:23 must-gather.local.2798470287921355997/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b007f593d45b7dea11ee9d7cd4ef9c1d791cf1309db39583aaa9946adba3596e/namespaces/openshift-cluster-version/pods/cluster-version-operator-dcd5d88f8-kzrwk/cluster-version-operator/cluster-version-operator/logs/current.log

We're in the reconciling state, so task-node ordering should be randomized:

$ tar xOz must-gather.local.2798470287921355997/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b007f593d45b7dea11ee9d7cd4ef9c1d791cf1309db39583aaa9946adba3596e/namespaces/openshift-cluster-version/pods/cluster-version-operator-dcd5d88f8-kzrwk/cluster-version-operator/cluster-version-operator/logs/current.log <dump.tar.gz | grep -o 'Running sync.*in state.*attempt' | uniq -c
     47 Running sync 4.7.0-0.nightly-2021-02-09-192846 (force=false) on generation 1 in state Reconciling at attempt

Looking at the CRDs we attempt to sync:

$ tar xOz must-gather.local.2798470287921355997/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b007f593d45b7dea11ee9d7cd4ef9c1d791cf1309db39583aaa9946adba3596e/namespaces/openshift-cluster-version/pods/cluster-version-operator-dcd5d88f8-kzrwk/cluster-version-operator/cluster-version-operator/logs/current.log <dump.tar.gz | grep -o 'Running sync for customresourcedefinition.*' | sort | uniq
...
Running sync for customresourcedefinition "servicecas.operator.openshift.io" (519 of 668)
Running sync for customresourcedefinition "servicemonitors.monitoring.coreos.com" (322 of 668)
Running sync for customresourcedefinition "storages.operator.openshift.io" (397 of 668)
...

So it's in there. Syncs seem mostly successful:

$ tar xOz must-gather.local.2798470287921355997/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b007f593d45b7dea11ee9d7cd4ef9c1d791cf1309db39583aaa9946adba3596e/namespaces/openshift-cluster-version/pods/cluster-version-operator-dcd5d88f8-kzrwk/cluster-version-operator/cluster-version-operator/logs/current.log <dump.tar.gz | grep -c 'Done syncing for customresourcedefinition.*servicemonitors.monitoring.coreos.com'
41

Taking a closer look at the sync cycle history:

$ tar xOz must-gather.local.2798470287921355997/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b007f593d45b7dea11ee9d7cd4ef9c1d791cf1309db39583aaa9946adba3596e/namespaces/openshift-cluster-version/pods/cluster-version-operator-dcd5d88f8-kzrwk/cluster-version-operator/cluster-version-operator/logs/current.log <dump.tar.gz | grep 'Running sync.*in state\|Result of work'
2021-02-10T05:46:59.697454340Z I0210 05:46:59.697444 1 task_graph.go:555] Result of work: []
2021-02-10T05:49:50.654192682Z I0210 05:49:50.654167 1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-02-09-192846 (force=false) on generation 1 in state Reconciling at attempt 90
2021-02-10T05:50:40.161208061Z I0210 05:50:40.161188 1 task_graph.go:555] Result of work: []
2021-02-10T05:53:31.117926328Z I0210 05:53:31.117912 1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-02-09-192846 (force=false) on generation 1 in state Reconciling at attempt 91
...more happiness...
2021-02-10T07:47:16.274953570Z I0210 07:47:16.274940 1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-02-09-192846 (force=false) on generation 1 in state Reconciling at attempt 122
2021-02-10T07:48:05.882705020Z I0210 07:48:05.882694 1 task_graph.go:555] Result of work: []
2021-02-10T07:50:56.839494267Z I0210 07:50:56.839466 1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-02-09-192846 (force=false) on generation 1 in state Reconciling at attempt 123
2021-02-10T07:56:42.433101045Z I0210 07:56:42.433087 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-marketplace/marketplace-operator" (514 of 668): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-dns-operator/dns-operator" (608 of 668): the server does not recognize this resource, check extension API servers]
2021-02-10T07:58:15.262012761Z I0210 07:58:15.261996 1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-02-09-192846 (force=false) on generation 1 in state Reconciling at attempt 124
2021-02-10T08:03:57.175215731Z I0210 08:03:57.175197 1 task_graph.go:555] Result of work: [Cluster operator network is degraded]
2021-02-10T08:06:51.010030287Z I0210 08:06:51.009987 1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-02-09-192846 (force=false) on generation 1 in state Reconciling at attempt 125
2021-02-10T08:07:40.720842813Z I0210 08:07:40.720793 1 task_graph.go:555] Result of work: []
2021-02-10T08:10:31.677583725Z I0210 08:10:31.677559 1 sync_worker.go:549] Running sync 4.7.0-0.nightly-2021-02-09-192846 (force=false) on generation 1 in state Reconciling at attempt 126
...

Comparing that with your not-founds from:

$ date --iso=m --utc --date 'Wed Feb 10 04:06:20 EST 2021'
2021-02-10T09:06+00:00

It looks like the recovery we saw was from your first deletion. Then maybe you just didn't wait long enough to see the second recovery? Picking on the DNS ServiceMonitor:

$ tar xOz must-gather.local.2798470287921355997/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-b007f593d45b7dea11ee9d7cd4ef9c1d791cf1309db39583aaa9946adba3596e/namespaces/openshift-cluster-version/pods/cluster-version-operator-dcd5d88f8-kzrwk/cluster-version-operator/cluster-version-operator/logs/current.log <dump.tar.gz | grep 'servicemonitor.*openshift-dns-operator/dns-operator\|customresourcedefinition.*servicemonitors.monitoring.coreos.com\|Result of work'
...
2021-02-10T09:08:14.204527560Z I0210 09:08:14.204517 1 task_graph.go:555] Result of work: []
2021-02-10T09:11:15.232193170Z I0210 09:11:15.232179 1 sync_worker.go:765] Running sync for servicemonitor "openshift-dns-operator/dns-operator" (608 of 668)
2021-02-10T09:11:15.434221813Z I0210 09:11:15.434188 1 sync_worker.go:777] Done syncing for servicemonitor "openshift-dns-operator/dns-operator" (608 of 668)
2021-02-10T09:11:19.482706009Z I0210 09:11:19.482697 1 sync_worker.go:765] Running sync for customresourcedefinition "servicemonitors.monitoring.coreos.com" (322 of 668)
2021-02-10T09:11:19.577490007Z I0210 09:11:19.577461 1 request.go:591] Throttling request took 93.787518ms, request: GET:https://api-int.qe-ui47-0210.qe.devcluster.openshift.com:6443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/servicemonitors.monitoring.coreos.com
2021-02-10T09:11:19.584637359Z I0210 09:11:19.584604 1 sync_worker.go:777] Done syncing for customresourcedefinition "servicemonitors.monitoring.coreos.com" (322 of 668)
2021-02-10T09:11:54.832103115Z I0210 09:11:54.832092 1 task_graph.go:555] Result of work: []
2021-02-10T09:14:57.696018248Z I0210 09:14:57.695988 1 sync_worker.go:765] Running sync for servicemonitor "openshift-dns-operator/dns-operator" (608 of 668)
2021-02-10T09:15:01.944386561Z E0210 09:15:01.944339 1 task.go:112] error running apply for servicemonitor "openshift-dns-operator/dns-operator" (608 of 668): failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
2021-02-10T09:15:27.881377183Z E0210 09:15:27.881338 1 task.go:112] error running apply for servicemonitor "openshift-dns-operator/dns-operator" (608 of 668): failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
...more 'no matches for kind'...
2021-02-10T09:20:24.353691910Z E0210 09:20:24.353648 1 task.go:112] error running apply for servicemonitor "openshift-dns-operator/dns-operator" (608 of 668): failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
2021-02-10T09:20:27.701963108Z I0210 09:20:27.701925 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-kube-scheduler-operator/kube-scheduler-operator" (635 of 668): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-dns-operator/dns-operator" (608 of 668): the server does not recognize this resource, check extension API servers]
2021-02-10T09:20:27.702001158Z I0210 09:20:27.701989 1 sync_worker.go:937] Update error 608 of 668: UpdatePayloadResourceTypeMissing Could not update servicemonitor "openshift-dns-operator/dns-operator" (608 of 668): the server does not recognize this resource, check extension API servers (*errors.withStack: failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1")
2021-02-10T09:20:27.702036790Z * Could not update servicemonitor "openshift-dns-operator/dns-operator" (608 of 668): the server does not recognize this resource, check extension API servers

So in that particular round, none of our manifest workers happened to pick up the task-node with the CRD manifest. The next round we would have shuffled again, and eventually one of our workers will pick it up and recover the cluster. So I think this is functioning as designed, and I would expect the cluster to eventually recover.
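For illustration, here is a toy Go model of that behavior. It is not the CVO's actual task_graph code; the node names, the two-error "budget", and the shuffle are all made up to show why a single round can fail without ever reaching the CRD manifest, while a later round with a different ordering recreates it.

// Toy model: nodes are walked in a per-round shuffled order, and the round
// aborts once it has collected its (pretend) error budget, skipping whatever
// comes later in that permutation.
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

func runRound(order []string, crdPresent *bool) []string {
	var errs []string
	for _, node := range order {
		switch {
		case strings.HasPrefix(node, "crd/"):
			*crdPresent = true // applying the CRD manifest recreates it
		case strings.HasPrefix(node, "servicemonitor/") && !*crdPresent:
			errs = append(errs, "no matches for kind ServiceMonitor: "+node)
			if len(errs) >= 2 { // pretend the error budget is exhausted
				return errs
			}
		}
	}
	return errs
}

func main() {
	nodes := []string{
		"crd/servicemonitors.monitoring.coreos.com",
		"servicemonitor/openshift-dns-operator",
		"servicemonitor/openshift-kube-scheduler-operator",
		"servicemonitor/openshift-marketplace",
	}
	crdPresent := false
	for round := 0; !crdPresent && round < 10; round++ {
		order := append([]string(nil), nodes...)
		rand.Shuffle(len(order), func(i, j int) { order[i], order[j] = order[j], order[i] })
		errs := runRound(order, &crdPresent)
		fmt.Printf("round %d: %d errors, CRD present afterwards: %v\n", round, len(errs), crdPresent)
	}
}

Running it a few times shows some rounds ending in errors without the CRD node being applied, and recovery happening on a later, differently-ordered round, which is the "eventually recover" expectation above.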
One thing that is slowing us down is that we wait to see if the CRD shows up, instead of failing fast on "no matches for kind" and starting the next sync cycle. If we did something like that, we'd probably want to make sure we still wait for CRDs to become functional when processing CRD manifests, so we don't push a CRD manifest and then immediately fail a subsequent manifest for a CR of that new type.
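A rough sketch of that fail-fast idea, not the CVO's actual task.go logic: the apply function and retry loop here are stand-ins, and the only real API assumed is apimachinery's meta.IsNoMatchError helper for detecting the missing-kind case.

// Sketch: retry transient errors, but give up immediately when the server
// does not recognize the manifest's kind (e.g. its CRD was deleted), so the
// next sync cycle can start sooner.
package taskexample

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
)

// apply is a stand-in for whatever pushes a single manifest.
type apply func(ctx context.Context) error

// runWithFailFast retries up to attempts times, but surrenders the task as
// soon as the error is a missing-kind error rather than waiting it out.
func runWithFailFast(ctx context.Context, do apply, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		lastErr = do(ctx)
		if lastErr == nil {
			return nil
		}
		if meta.IsNoMatchError(lastErr) {
			// Waiting rarely helps here: the kind will not appear until some
			// other task recreates its CRD, so fail the cycle quickly.
			return fmt.Errorf("failing fast, kind not recognized: %w", lastErr)
		}
	}
	return lastErr
}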
So based on comment 2, we are working as designed and will eventually give up on that sync cycle, come back in, push the CRD, and recover. We're just slow. I expect "deleted a CVO-managed CR" to be rare, so optimizing the speed of recovery seems more like an RFE than a bug, although feel free to push back here if you disagree. I've opened [1] to track this on the RFE side, and am closing this bug DEFERRED for now.

[1]: https://issues.redhat.com/browse/OTA-388
I'm going to re-open this, because I think I can mitigate it in a way that will help with bug 1951835.
One possible test plan would be running:

$ oc delete crd/servicemonitors.monitoring.coreos.com

on an otherwise stable cluster and measuring the time from that deletion until the CVO recreates the resource. It will be somewhat random, because the CVO's reconciliation-mode workers walk a randomly-shuffled manifest graph, but on average we should be faster now. The more degraded or otherwise-sad ClusterOperators there are, the more noticeable the difference should be, so deleting a control-plane machine, which will make a number of operators mad, might be helpful.
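The shell loop used in the earlier comments works fine for this; as an alternative, here is a small Go helper that prints a single recovery duration at the end. It assumes standard client-go plus the apiextensions clientset and a kubeconfig at the default location; it is just a convenience for the test plan, not required by it.

// Delete the ServiceMonitor CRD and time how long the CVO takes to recreate it.
package main

import (
	"context"
	"fmt"
	"time"

	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	const name = "servicemonitors.monitoring.coreos.com"

	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := apiextensionsclient.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()
	crds := client.ApiextensionsV1().CustomResourceDefinitions()

	if err := crds.Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
	start := time.Now()
	for {
		_, err := crds.Get(ctx, name, metav1.GetOptions{})
		if err == nil {
			fmt.Printf("CRD recreated after %s\n", time.Since(start).Round(time.Second))
			return
		}
		if !apierrors.IsNotFound(err) {
			panic(err)
		}
		time.Sleep(2 * time.Second)
	}
}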
Attempting to verify with 4.8.0-0.nightly-2021-05-15-141455:

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-15-141455   True        False         160m    Cluster version is 4.8.0-0.nightly-2021-05-15-141455
# oc get crd/servicemonitors.monitoring.coreos.com
NAME                                    CREATED AT
servicemonitors.monitoring.coreos.com   2021-05-17T07:02:36Z
# oc delete crd/servicemonitors.monitoring.coreos.com; while true; do date; oc get crd/servicemonitors.monitoring.coreos.com; sleep 2s; done
customresourcedefinition.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" deleted
Mon May 17 06:17:24 EDT 2021
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" not found
......
Mon May 17 06:44:00 EDT 2021
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" not found

The CVO did not recreate it after waiting for over 25 minutes. Checking the CVO logs:

# grep 'servicemonitors.monitoring.coreos.com\|Result of work' /tmp/cvo.log
"servicemonitors.monitoring.coreos.com" (330 of 676)
I0517 10:10:16.958794 1 task_graph.go:555] Result of work: []
I0517 10:13:11.523149 1 sync_worker.go:760] Running sync for customresourcedefinition "servicemonitors.monitoring.coreos.com" (330 of 676)
I0517 10:13:11.618646 1 request.go:591] Throttling request took 94.231261ms, request: GET:https://api-int.yangyang0517.qe.devcluster.openshift.com:6443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/servicemonitors.monitoring.coreos.com
I0517 10:13:11.624494 1 sync_worker.go:772] Done syncing for customresourcedefinition "servicemonitors.monitoring.coreos.com" (330 of 676)
I0517 10:13:34.522289 1 task_graph.go:555] Result of work: []
I0517 10:16:29.039275 1 sync_worker.go:760] Running sync for customresourcedefinition "servicemonitors.monitoring.coreos.com" (330 of 676)
I0517 10:16:29.133714 1 request.go:591] Throttling request took 93.593144ms, request: GET:https://api-int.yangyang0517.qe.devcluster.openshift.com:6443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/servicemonitors.monitoring.coreos.com
I0517 10:16:29.141120 1 sync_worker.go:772] Done syncing for customresourcedefinition "servicemonitors.monitoring.coreos.com" (330 of 676)
I0517 10:16:52.186737 1 task_graph.go:555] Result of work: []
I0517 10:25:25.057045 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers]
I0517 10:34:04.077617 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers]

It seems like it ran into a separate issue.
Going back to comment#0, it says:

> also tried in other 4.7 cluster, it is not recreated at all after it is removed

So the issue described in comment#7 is the same as in comment#0. It's not about how much time the CVO takes to recover: after

$ oc delete crd/servicemonitors.monitoring.coreos.com

the crd/servicemonitors.monitoring.coreos.com is simply not recreated.
Looking at comment 9's CVO logs:

$ grep 'Running sync.*in state\|Result of work' cvo.log | tail
I0517 11:03:57.364011 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
I0517 11:09:39.277157 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers]
I0517 11:13:01.898768 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
I0517 11:18:43.812689 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers]
I0517 11:21:35.246762 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
I0517 11:27:17.163778 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers]
I0517 11:30:37.342895 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
I0517 11:36:19.258793 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers update context deadline exceeded at 28 of 676 update context deadline exceeded at 28 of 676]
I0517 11:39:38.268693 1 sync_worker.go:549] Running sync 4.8.0-0.nightly-2021-05-15-141455 (force=false) on generation 1 in state Reconciling at attempt 0
I0517 11:45:20.182573 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 676): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (627 of 676): the server does not recognize this resource, check extension API servers]

So that's pretty consistent. Looking at our randomization [1], I see we are using Attempt as our random seed, and from the logs above you can see that Attempt is actually stuck at zero. So we need to fix that, and get Attempt incrementing whenever we fail, even if we are still in reconciliation mode.

[1]: https://github.com/openshift/cluster-version-operator/blob/18361c7d1f28eea362cb7f88de0cc30ba78bcfd2/pkg/cvo/sync_worker.go#L718-L720
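A quick standalone demonstration of why a stuck Attempt matters: math/rand is deterministic per seed, so seeding from a value that never changes produces the exact same permutation of task nodes every cycle, and the same failure pattern repeats. The seed values and node count here are made up; the real seeding lives in the sync_worker.go randomization linked in [1] above.

package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const attempt = 0 // stuck at zero, as seen in the logs above
	for cycle := 0; cycle < 3; cycle++ {
		r := rand.New(rand.NewSource(int64(attempt)))
		fmt.Println(r.Perm(8)) // identical ordering every cycle
	}
	for cycle := 0; cycle < 3; cycle++ {
		r := rand.New(rand.NewSource(int64(cycle))) // incrementing seed
		fmt.Println(r.Perm(8))                      // different ordering each cycle
	}
}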
Left-shifted testing for the new PR looks good [1] :)

[1]: https://github.com/openshift/cluster-version-operator/pull/569#issuecomment-844350859
Attempting to verify with 4.8.0-0.nightly-2021-05-21-233425:

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-21-233425   True        False         173m    Cluster version is 4.8.0-0.nightly-2021-05-21-233425
# oc get crd/servicemonitors.monitoring.coreos.com
NAME                                    CREATED AT
servicemonitors.monitoring.coreos.com   2021-05-24T02:02:05Z
# oc delete crd/servicemonitors.monitoring.coreos.com; while true; do date; oc get crd/servicemonitors.monitoring.coreos.com; sleep 20s; done
customresourcedefinition.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" deleted
Mon May 24 01:18:35 EDT 2021
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" not found
......
Mon May 24 02:04:07 EDT 2021
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" not found

After waiting for more than 30 minutes, the crd/servicemonitors.monitoring.coreos.com is not recreated.

# grep 'Running sync.*in state\|Result of work' cvo2.log | tail
I0524 05:27:24.968530 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (255 of 675): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (626 of 675): the server does not recognize this resource, check extension API servers]
I0524 05:30:19.954106 1 sync_worker.go:541] Running sync 4.8.0-0.nightly-2021-05-21-233425 (force=false) on generation 1 in state Reconciling at attempt 1
I0524 05:36:01.867852 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-cluster-version/cluster-version-operator" (8 of 675): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-cluster-samples-operator/cluster-samples-operator" (383 of 675): the server does not recognize this resource, check extension API servers]
I0524 05:39:11.491409 1 sync_worker.go:541] Running sync 4.8.0-0.nightly-2021-05-21-233425 (force=false) on generation 1 in state Reconciling at attempt 2
I0524 05:44:53.405151 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-console-operator/console-operator" (612 of 675): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-cloud-credential-operator/cloud-credential-operator" (571 of 675): the server does not recognize this resource, check extension API servers]

The attempt counter is no longer stuck at zero, but the crd/servicemonitors.monitoring.coreos.com still failed to recover from the deletion.
I didn't wait long enough. The crd/servicemonitors.monitoring.coreos.com is eventually recreated, but it takes a long time to become available again.

# while true; do date; oc get crd/servicemonitors.monitoring.coreos.com; sleep 20s; done
Mon May 24 03:30:30 EDT 2021
NAME                                    CREATED AT
servicemonitors.monitoring.coreos.com   2021-05-24T07:25:03Z

# grep 'Running sync.*in state\|Result of work' cvo2.log | tail
I0524 06:58:22.306083 1 sync_worker.go:541] Running sync 4.8.0-0.nightly-2021-05-21-233425 (force=false) on generation 1 in state Reconciling at attempt 11
I0524 07:04:04.220081 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-apiserver-operator/openshift-apiserver-operator" (661 of 675): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-image-registry/image-registry" (586 of 675): the server does not recognize this resource, check extension API servers]
I0524 07:07:16.249233 1 sync_worker.go:541] Running sync 4.8.0-0.nightly-2021-05-21-233425 (force=false) on generation 1 in state Reconciling at attempt 12
I0524 07:12:58.163218 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-service-ca-operator/service-ca-operator" (675 of 675): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-config-operator/config-operator" (122 of 675): the server does not recognize this resource, check extension API servers update context deadline exceeded at 9 of 675]
I0524 07:16:19.131691 1 sync_worker.go:541] Running sync 4.8.0-0.nightly-2021-05-21-233425 (force=false) on generation 1 in state Reconciling at attempt 13
I0524 07:22:01.044977 1 task_graph.go:555] Result of work: [Could not update servicemonitor "openshift-machine-config-operator/machine-config-daemon" (652 of 675): the server does not recognize this resource, check extension API servers Could not update servicemonitor "openshift-cluster-storage-operator/cluster-storage-operator" (608 of 675): the server does not recognize this resource, check extension API servers]
I0524 07:25:03.678251 1 sync_worker.go:541] Running sync 4.8.0-0.nightly-2021-05-21-233425 (force=false) on generation 1 in state Reconciling at attempt 14
I0524 07:30:45.601092 1 task_graph.go:555] Result of work: [Cluster operator network is degraded]
I0524 07:34:01.168487 1 sync_worker.go:541] Running sync 4.8.0-0.nightly-2021-05-21-233425 (force=false) on generation 1 in state Reconciling at attempt 15
I0524 07:39:43.081668 1 task_graph.go:555] Result of work: [Cluster operator network is degraded]
Looking at CI to confirm the original commit from cvo#560 is functioning as expected, [1] is 4.8.0-0.nightly-2021-05-21-101954 -> 4.8.0-0.nightly-2021-05-21-233425:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade/1395886263273263104/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7b847959b7-kfbsh_cluster-version-operator.log | grep 'Running sync.*in state\|Result of work' | head -n2
I0522 01:37:26.003244 1 sync_worker.go:541] Running sync registry.build02.ci.openshift.org/ci-op-w6fys29v/release@sha256:29a647edfd270eb4d0eaf6ec48c2389be051bb41d4915fa63be7fabbf7975c87 (force=true) on generation 2 in state Updating at attempt 0
I0522 01:43:07.916351 1 task_graph.go:555] Result of work: [Cluster operator openshift-apiserver is degraded]

Hmm, that's ~5.5m. Digging in on that ClusterOperator:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade/1395886263273263104/artifacts/e2e-metal-ipi-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-7b847959b7-kfbsh_cluster-version-operator.log | grep ' 01:[34].*clusteroperator.*openshift-apiserver'
I0522 01:37:35.834673 1 sync_worker.go:742] Precreated resource clusteroperator "openshift-apiserver" (177 of 675)
I0522 01:37:37.531175 1 sync_worker.go:752] Running sync for clusteroperator "openshift-apiserver" (177 of 675)
E0522 01:37:37.531366 1 task.go:112] error running apply for clusteroperator "openshift-apiserver" (177 of 675): Cluster operator openshift-apiserver is degraded
...
E0522 01:42:52.354397 1 task.go:112] error running apply for clusteroperator "openshift-apiserver" (177 of 675): Cluster operator openshift-apiserver is degraded
I0522 01:43:40.444312 1 sync_worker.go:742] Precreated resource clusteroperator "openshift-apiserver" (177 of 675)
I0522 01:43:42.134572 1 sync_worker.go:752] Running sync for clusteroperator "openshift-apiserver" (177 of 675)
E0522 01:43:42.134797 1 task.go:112] error running apply for clusteroperator "openshift-apiserver" (177 of 675): Cluster operator openshift-apiserver is degraded
....
E0522 01:47:29.235334 1 task.go:112] error running apply for clusteroperator "openshift-apiserver" (177 of 675): Cluster operator openshift-apiserver is degraded
I0522 01:47:55.053649 1 sync_worker.go:764] Done syncing for clusteroperator "openshift-apiserver" (177 of 675)

It certainly looks like we're still waiting on that ClusterOperator and not failing the sync cycle fast. Ah, because we're getting caught up in the retry loop at [2]. I'll see if there's a way to intelligently distinguish between fail-fast and retry behavior at that point.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade/1395886263273263104
[2]: https://github.com/openshift/cluster-version-operator/blob/77a24f77b8f6e9975b5f41a787d6368ff609e391/pkg/payload/task.go#L103-L107
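One rough way to distinguish fail-fast from retry at that point, sketched here as an illustration rather than the actual task.go change: classify the error and only keep retrying reasons that tend to resolve within the same sync cycle, while reasons like a degraded ClusterOperator or a missing resource type fail the cycle quickly so the next shuffled cycle can start. The UpdateError type and reason strings below are simplified stand-ins, not the CVO's real payload types.

package retryexample

import (
	"context"
	"time"
)

// UpdateError is a simplified stand-in for a payload task error with a reason.
type UpdateError struct {
	Reason  string
	Message string
}

func (e *UpdateError) Error() string { return e.Message }

// retriableWithinCycle reports whether retrying the same task in this cycle
// is worthwhile, or whether we should fail fast and rely on the next cycle.
func retriableWithinCycle(err error) bool {
	ue, ok := err.(*UpdateError)
	if !ok {
		return true // unknown errors: keep the previous retry behavior
	}
	switch ue.Reason {
	case "ClusterOperatorDegraded", "UpdatePayloadResourceTypeMissing":
		return false // waiting here rarely helps; start the next cycle sooner
	default:
		return true
	}
}

// runTask applies a manifest with retries, aborting early for reasons that
// are better handled by starting the next sync cycle.
func runTask(ctx context.Context, do func(context.Context) error, backoff time.Duration, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = do(ctx); err == nil {
			return nil
		}
		if !retriableWithinCycle(err) {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
	}
	return err
}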
Getting close, but still some PRs in flight.
[1] just landed, including the change which had originally been part of [2]. I'll move back to MODIFIED, so we can take another look and see if [3] was enough to fix this one.

[1]: https://github.com/openshift/cluster-version-operator/pull/577
[2]: https://github.com/openshift/cluster-version-operator/pull/573
[3]: https://github.com/openshift/cluster-version-operator/pull/577/commits/ce7c4e7ad071d7e2d9c53745372c581445c6b3ec
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira. https://issues.redhat.com/browse/OCPBUGS-8843