Created attachment 1795538 [details]
Spreadsheet containing leaked resources.

+++ This bug was initially created as a clone of Bug #1975533 +++

This "stale cruft" is created by the following scenario: release A had manifest M that led the CVO to reconcile resource R. The component maintainers then decided they no longer needed R, so they dropped manifest M in release B. The new CVO no longer reconciles R, but clusters updating from A to B still have resource R in-cluster, as an unmaintained orphan.

Now that https://issues.redhat.com/browse/OTA-222 has been implemented, teams can go back through and create deletion manifests for these leaked resources. The attachment delete-candidates.csv contains a list of leaked resources as compared to a freshly installed 4.9 cluster. Use this list to find your component's resources, and use the manifest delete annotation (https://github.com/openshift/cluster-version-operator/pull/438) to remove them. Note also that a cluster-scoped resource may not need to be removed, but simply modified to drop its namespace.

--- Additional comment from Stefan Schimanski on 2021-06-24 09:03:13 UTC ---

I don't see openshift-apiserver among them. Closing.
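For illustration, a deletion manifest keeps the resource's identifying fields and adds the delete annotation from the PR linked above. A hypothetical minimal example for the PrometheusRule discussed later in this bug (field values taken from this thread; the exact file contents shipped by the operator may differ):

```yaml
# Hypothetical deletion manifest: the existing manifest's identifying
# fields are kept, and the release.openshift.io/delete annotation is
# added so the CVO deletes the resource instead of reconciling it.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: authentication-operator
  namespace: openshift-authentication-operator
  annotations:
    release.openshift.io/delete: "true"
```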
Is this one of your resources?

PrometheusRule authentication-operator (namespace openshift-authentication-operator), from 0000_90_cluster-authentication-operator_03_prometheusrule.yaml
Could I ask for your assistance with this, Jack Ottofaro? As you suggested in this Bugzilla, I implemented a mechanism for removing our PrometheusRule here [1]. For a bit of context, this PrometheusRule was added in 4.7 [2] and is not present in the master branch. Unfortunately, using the release.openshift.io/delete: "true" annotation doesn't seem to work in this case, for reasons unknown.

My testing procedure was the following:
1) Requested a new 4.7 cluster from Cluster Bot using the `launch 4.7.20` command.
2) Requested a new upgrade payload with the [1] patch using `build openshift/cluster-authentication-operator#461`. The result can be found here [3].
3) Logged into the cluster created in step 1, configured a global pull secret, and started an upgrade using `oc adm upgrade --to-image='registry.build01.ci.openshift.org/ci-ln-btyyx12/release:latest' --force --allow-explicit-upgrade`.

Once the upgrade went through the "authentication" Cluster Operator, I checked whether the PrometheusRule was still there. Unfortunately, it was, so it seems the annotation didn't work for some reason.

Additional things that I checked:

- There is no indication that the PrometheusRule has been synced by the CVO:
$ oc logs cluster-version-operator-6c59f7f5fb-2cjxx | grep -i authentication-operator | grep -i rule
# nothing was returned

- The CVO image is correct:
$ oc describe pod cluster-version-operator-6c59f7f5fb-2cjxx | grep Image
    Image: registry.build01.ci.openshift.org/ci-ln-btyyx12/release:latest

- The manifest is present in the Cluster Authentication Operator:
$ oc rsh -n openshift-authentication-operator authentication-operator-757cf6c7c6-9nd6j
$ ls /manifests | grep -i prometheusrule
0000_90_cluster-authentication-operator_03_prometheusrule.yaml
# Checked, and the file contains the proper release.openshift.io/delete: "true" annotation

I attached the full CVO logs to this Bugzilla.
[1] https://github.com/openshift/cluster-authentication-operator/pull/461/files#diff-76cc26a0810ee3aeb2c87e3099178392ea6d67df791427689f4ad5a7de8b2f86R10
[2] https://github.com/openshift/cluster-authentication-operator/commit/312ab66d1cbc5745fcee93dc518b546ef9a6602f
[3] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1414484652390879232
(In reply to Sebastian Łaskawiec from comment #4)
> Could I ask for your assistance with this, Jack Ottofaro?

From looking at the attached CVO log, I can see that the CVO never reconciles a manifest for PrometheusRule "openshift-authentication-operator/authentication-operator", which leads me to believe the image does not contain your added manifest. However, looking at the CI upgrade test log [1], I see the expected log entry:

W0709 08:49:01.190530 1 helper.go:97] PrometheusRule "openshift-authentication-operator/authentication-operator" has already been removed.

This is expected, since the release CI was upgrading from did not contain the PrometheusRule to begin with. But you can see in the log that the manifest you added is being handled by the CVO. Can you try your upgrade test using the image your PR is building, which I believe was last registry.build02.ci.openshift.org/ci-op-z4n315sh/release:latest from [2]?

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-authentication-operator/461/pull-ci-openshift-cluster-authentication-operator-master-e2e-agnostic-upgrade/1413391593544617984/artifacts/e2e-agnostic-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-85f8dd68-479kv_cluster-version-operator.log
[2] https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-authentication-operator/461/pull-ci-openshift-cluster-authentication-operator-master-images/1413391593653669888/build-log.txt
Together with Jack, we tested this using a few different methods. The easiest one is to inspect the CVO logs from the upgrade job [1]:

I0715 12:23:52.981471 1 sync_worker.go:753] Running sync for prometheusrule "openshift-authentication-operator/authentication-operator" (614 of 709)
...
W0715 12:23:53.179603 1 helper.go:97] PrometheusRule "openshift-authentication-operator/authentication-operator" has already been removed.
I0715 12:23:53.179630 1 sync_worker.go:765] Done syncing for prometheusrule "openshift-authentication-operator/authentication-operator" (614 of 709)
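The log lines above follow a simple pattern. A rough Python sketch of that flow, illustrative only (the real logic lives in the CVO's Go code in sync_worker.go/helper.go; the function name and `existing_resources` parameter are invented for this model):

```python
def sync_delete_manifest(name, existing_resources):
    """Rough model of how the CVO syncs a manifest annotated for deletion.

    If the resource still exists, a delete is requested; if it is
    already gone -- e.g. the cluster was installed from a release that
    never shipped the manifest -- the CVO just notes that it has
    already been removed. Either way, the sync completes normally.
    """
    logs = [f'Running sync for prometheusrule "{name}"']
    if name in existing_resources:
        existing_resources.discard(name)
        logs.append(f'Delete requested for PrometheusRule "{name}".')
    else:
        logs.append(f'PrometheusRule "{name}" has already been removed.')
    logs.append(f'Done syncing for prometheusrule "{name}"')
    return logs
```

In this model, syncing against a cluster that never had the resource produces the "has already been removed" line seen in the job log above, while syncing against a cluster upgraded from 4.7 produces the "Delete requested" line.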
Installation with 4.9.0-0.nightly-2021-07-28-181504 passed, and the authentication-operator PrometheusRule is not created:

# oc get PrometheusRule authentication-operator -n openshift-authentication-operator
Error from server (NotFound): prometheusrules.monitoring.coreos.com "authentication-operator" not found

In an upgrade test from 4.7 to 4.9.0-0.nightly-2021-07-28-181504, the authentication-operator PrometheusRule is removed:

# oc get PrometheusRule authentication-operator -n openshift-authentication-operator
Error from server (NotFound): prometheusrules.monitoring.coreos.com "authentication-operator" not found

[
  {
    "lastTransitionTime": "2021-07-29T01:06:58Z",
    "message": "Done applying 4.9.0-0.nightly-2021-07-28-181504",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2021-07-29T03:22:06Z",
    "status": "False",
    "type": "Failing"
  },
  {
    "lastTransitionTime": "2021-07-29T04:19:34Z",
    "message": "Cluster version is 4.9.0-0.nightly-2021-07-28-181504",
    "status": "False",
    "type": "Progressing"
  },
  {
    "lastTransitionTime": "2021-07-29T00:39:36Z",
    "message": "Unable to retrieve available updates: currently reconciling cluster version 4.9.0-0.nightly-2021-07-28-181504 not found in the \"stable-4.7\" channel",
    "reason": "VersionNotFound",
    "status": "False",
    "type": "RetrievedUpdates"
  },
  {
    "lastTransitionTime": "2021-07-29T04:21:57Z",
    "message": "Cluster minor level upgrades are not allowed while resource deletions are in progress; resources=PrometheusRule \"openshift-kube-apiserver/kube-apiserver\",PrometheusRule \"openshift-authentication-operator/authentication-operator\"",
    "reason": "ResourceDeletesInProgress",
    "status": "False",
    "type": "Upgradeable"
  }
]

# grep authentication-operator * | grep -v 'throt\|syn'
2021-07-29T02:59:51+0000-cluster-version-operator-748955994f-ms7hh.log:E0729 03:01:21.493060 1 task.go:112] error running apply for deployment "openshift-authentication-operator/authentication-operator" (233 of 676): context canceled
2021-07-29T02:59:51+0000-cluster-version-operator-748955994f-ms7hh.log:I0729 03:01:21.499494 1 task_graph.go:555] Result of work: [Could not update deployment "openshift-authentication-operator/authentication-operator" (233 of 676) Could not update customresourcedefinition "provisionings.metal3.io" (183 of 676)]
2021-07-29T02:59:51+0000-cluster-version-operator-748955994f-ms7hh.log:* Could not update deployment "openshift-authentication-operator/authentication-operator" (233 of 676)
2021-07-29T04:09:12+0000-cluster-version-operator-d5d5b8db9-lb6r5.log:I0729 04:19:30.840264 1 helper.go:65] Delete requested for PrometheusRule "openshift-authentication-operator/authentication-operator".
2021-07-29T04:09:12+0000-cluster-version-operator-d5d5b8db9-lb6r5.log:I0729 04:21:57.154378 1 upgradeable.go:229] Resource deletions in progress; resources=PrometheusRule "openshift-kube-apiserver/kube-apiserver",PrometheusRule "openshift-authentication-operator/authentication-operator"
2021-07-29T04:09:12+0000-cluster-version-operator-d5d5b8db9-lb6r5.log:I0729 04:22:53.111374 1 helper.go:76] Delete of PrometheusRule "openshift-authentication-operator/authentication-operator" completed.

The removal starts late and then results in Progressing=False, but Upgradeable=False with reason ResourceDeletesInProgress. The CVO also logs `error running apply for deployment`. Sebastian Łaskawiec, could you please help check whether there is something wrong with the manifest?
This question should probably be routed to Jack, as he's the expert in this field. My suspicion is that the resource deletion process should not block upgrades in any way. That's why the Progressing condition is set to False, and all steps are retried on error. So I think it looks OK, but Jack has the deciding call here.
(In reply to Yang Yang from comment #12)
> The removal is started late and then results in progressing=false but
> ResourceDeletesInProgress. CVO logs `error running apply for deployment`.
>
> Sebastian Łaskawiec, could you please help check if there is something wrong
> with the manifest?

Per our Slack discussion https://coreos.slack.com/archives/CEGKQ43CP/p1627462194394000, I don't think there's an issue here, but it would be helpful to see the entire CVO log if you still have it.
(In reply to Yang Yang from comment #12)

I'm not seeing any issues. More specifically, these are just "normal", i.e. not unexpected, logs that occur in the prior release, which doesn't even contain the delete manifest:

> # grep authentication-operator * | grep -v 'throt\|syn'
> 2021-07-29T02:59:51+0000-cluster-version-operator-748955994f-ms7hh.log:E0729 03:01:21.493060 1 task.go:112] error running apply for deployment "openshift-authentication-operator/authentication-operator" (233 of 676): context canceled
> 2021-07-29T02:59:51+0000-cluster-version-operator-748955994f-ms7hh.log:I0729 03:01:21.499494 1 task_graph.go:555] Result of work: [Could not update deployment "openshift-authentication-operator/authentication-operator" (233 of 676) Could not update customresourcedefinition "provisionings.metal3.io" (183 of 676)]
> 2021-07-29T02:59:51+0000-cluster-version-operator-748955994f-ms7hh.log:* Could not update deployment "openshift-authentication-operator/authentication-operator" (233 of 676)

Here we see the deletion started and completed:

> 2021-07-29T04:09:12+0000-cluster-version-operator-d5d5b8db9-lb6r5.log:I0729 04:19:30.840264 1 helper.go:65] Delete requested for PrometheusRule "openshift-authentication-operator/authentication-operator".
> 2021-07-29T04:09:12+0000-cluster-version-operator-d5d5b8db9-lb6r5.log:I0729 04:21:57.154378 1 upgradeable.go:229] Resource deletions in progress; resources=PrometheusRule "openshift-kube-apiserver/kube-apiserver",PrometheusRule "openshift-authentication-operator/authentication-operator"
> 2021-07-29T04:09:12+0000-cluster-version-operator-d5d5b8db9-lb6r5.log:I0729 04:22:53.111374 1 helper.go:76] Delete of PrometheusRule "openshift-authentication-operator/authentication-operator" completed.

I'm not sure why you think the removal started late. What determines when a manifest is processed is its name, regardless of its contents. The deletion processing has no direct effect on "Progressing"; that condition is set based on the state of the operators and on whether the upgrade has completed and we have returned to reconciling mode.
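The condition behaviour described here can be sketched as follows. This is an illustrative Python model, not the real upgradeable.go logic; the function name is invented, and the message text is modeled on the logs quoted in comment #12:

```python
def upgradeable_condition(deletions_in_progress):
    """Rough model of the CVO's Upgradeable gate for in-flight deletions.

    While annotated resources are still being deleted, minor-level
    upgrades are blocked via Upgradeable=False with reason
    ResourceDeletesInProgress. Progressing is not touched here -- it
    tracks operator state and upgrade completion instead.
    """
    if not deletions_in_progress:
        return {"type": "Upgradeable", "status": "True"}
    resources = ",".join(deletions_in_progress)
    return {
        "type": "Upgradeable",
        "status": "False",
        "reason": "ResourceDeletesInProgress",
        "message": (
            "Cluster minor level upgrades are not allowed while resource "
            f"deletions are in progress; resources={resources}"
        ),
    }
```

In this model, once the deletes complete the gate clears on the next evaluation, which matches the transient Upgradeable=False seen during the upgrade test.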
Jack, thanks for confirming that. Moving it to the verified state based on comment #12.
Thanks, will check. (Sorry I didn't verify this in a timely manner; I've been busy with the work of the subteams I lead. Will look ASAP.)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759