Bug 1977027 - [oauth-apiserver] Remove stale cruft installed by CVO in earlier releases
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: oauth-apiserver
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.9.0
Assignee: Sebastian Łaskawiec
QA Contact: Xingxing Xia
URL:
Whiteboard: tag-ci
Depends On:
Blocks:
 
Reported: 2021-06-28 19:00 UTC by Jack Ottofaro
Modified: 2021-10-18 17:37 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1975533
Environment:
Last Closed: 2021-10-18 17:36:53 UTC
Target Upstream Version:
Embargoed:


Attachments
Spreadsheet containing leaked resources. (12.36 KB, text/plain)
2021-06-28 19:00 UTC, Jack Ottofaro


Links
- GitHub openshift/cluster-authentication-operator pull 461 (open): Bug 1977027: Remove not needed Prometheus Rule (last updated 2021-07-16 08:57:40 UTC)
- Red Hat Product Errata RHSA-2021:3759 (last updated 2021-10-18 17:37:10 UTC)

Description Jack Ottofaro 2021-06-28 19:00:13 UTC
Created attachment 1795538 [details]
Spreadsheet containing leaked resources.

+++ This bug was initially created as a clone of Bug #1975533 +++

This "stale cruft" is created as a result of the following scenario. Release A had manifest M that lead the CVO to reconcile resource R. But then the component maintainers decided they didn't need R any longer, so they dropped manifest M in release B. The new CVO will no longer reconcile R, but clusters updating from A to B will still have resource R in-cluster, as an unmaintained orphan.

Now that https://issues.redhat.com/browse/OTA-222 has been implemented, teams can go back through and create deletion manifests for these leaked resources.

The attachment delete-candidates.csv contains a list of leaked resources as compared to a freshly installed 4.9 cluster. Use this list to find your component's resources and use the manifest delete annotation (https://github.com/openshift/cluster-version-operator/pull/438) to remove them.

Note also that in the case of a cluster-scoped resource, it may not need to be removed but simply modified to remove the namespace.
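
For illustration, a minimal sketch of such a deletion manifest for the PrometheusRule named in comment 1, keeping only the identifying fields plus the delete annotation (the exact layout in the actual PR may differ):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: authentication-operator
  namespace: openshift-authentication-operator
  annotations:
    release.openshift.io/delete: "true"

The CVO treats a manifest carrying that annotation as a request to delete the resource instead of reconciling it.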

--- Additional comment from Stefan Schimanski on 2021-06-24 09:03:13 UTC ---

I don't see openshift-apiserver among them. Closing.

Comment 1 Jack Ottofaro 2021-06-28 19:01:28 UTC
Is this one of your resources?

PrometheusRule	authentication-operator	openshift-authentication-operator from 0000_90_cluster-authentication-operator_03_prometheusrule.yaml

Comment 4 Sebastian Łaskawiec 2021-07-12 10:08:15 UTC
Could I ask for your assistance with this, Jack Ottofaro?

As you suggested in this Bugzilla, I implemented a mechanism for removing our Prometheus Rule here [1]. Just to add a bit of context, this Prometheus Rule was added in 4.7 [2] and is not present in the master branch. Unfortunately, it seems that using the "release.openshift.io/delete: "true"" annotation doesn't work in this case for an unknown reason.

My testing procedure is the following:
1) Requested a new 4.7 cluster from Cluster Bot using `launch 4.7.20` command
2) Requested a new upgrade payload with the [1] patch using `build openshift/cluster-authentication-operator#461`. The result can be found here [3].
3) Logged into the cluster created in step 1, configured a global pull secret and started an upgrade using `oc adm upgrade --to-image='registry.build01.ci.openshift.org/ci-ln-btyyx12/release:latest' --force --allow-explicit-upgrade`

Once the upgrade procedure went through the "authentication" Cluster Operator, I checked whether the Prometheus Rule was still there. Unfortunately, it was, so it seems the annotation didn't work for some reason.

Additional things that I checked:
- There is no indication that the Prometheus Rule has been synced by the CVO:
$ oc logs cluster-version-operator-6c59f7f5fb-2cjxx | grep -i authentication-operator | grep -i rule
# nothing was returned

- The CVO image is correct:
$ oc describe pod cluster-version-operator-6c59f7f5fb-2cjxx | grep Image
    Image:         registry.build01.ci.openshift.org/ci-ln-btyyx12/release:latest

- The manifest is present in the Cluster Authentication Operator:
$ oc rsh -n openshift-authentication-operator authentication-operator-757cf6c7c6-9nd6j
$ ls /manifests | grep -i prometheusrule
0000_90_cluster-authentication-operator_03_prometheusrule.yaml
# Also checked that the file contains the proper release.openshift.io/delete: "true" annotation
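
The annotation itself can be confirmed from the same shell with something along these lines (exact command assumed; the output line is what the manifest in the PR should contain):
$ grep 'release.openshift.io/delete' /manifests/0000_90_cluster-authentication-operator_03_prometheusrule.yaml
    release.openshift.io/delete: "true"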

I attached full CVO logs to this Bugzilla.

[1] https://github.com/openshift/cluster-authentication-operator/pull/461/files#diff-76cc26a0810ee3aeb2c87e3099178392ea6d67df791427689f4ad5a7de8b2f86R10
[2] https://github.com/openshift/cluster-authentication-operator/commit/312ab66d1cbc5745fcee93dc518b546ef9a6602f
[3] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1414484652390879232

Comment 6 Jack Ottofaro 2021-07-12 18:41:14 UTC
(In reply to Sebastian Łaskawiec from comment #4)
> Could I ask for your assistance with this, Jack Ottofaro?
> 
From looking at the attached CVO log, I can see that the CVO never reconciles a manifest for PrometheusRule "openshift-authentication-operator/authentication-operator", which leads me to believe the image does not contain your added manifest.

However looking at the CI upgrade test log [1] I see the expected log entry:

W0709 08:49:01.190530       1 helper.go:97] PrometheusRule "openshift-authentication-operator/authentication-operator" has already been removed.

This is expected since the release CI was upgrading from did not contain the PrometheusRule to begin with. But you can see in the log that the manifest you added is being handled by the CVO.

Can you try your upgrade test using the image your PR is building, which I believe was last registry.build02.ci.openshift.org/ci-op-z4n315sh/release:latest from [2]?

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-authentication-operator/461/pull-ci-openshift-cluster-authentication-operator-master-e2e-agnostic-upgrade/1413391593544617984/artifacts/e2e-agnostic-upgrade/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-85f8dd68-479kv_cluster-version-operator.log
[2] https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-authentication-operator/461/pull-ci-openshift-cluster-authentication-operator-master-images/1413391593653669888/build-log.txt
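
For reference, that would mean repeating the forced upgrade from comment 4 against the PR-built payload, along these lines (the image tag is the one quoted above and may have been rebuilt since):

$ oc adm upgrade --to-image='registry.build02.ci.openshift.org/ci-op-z4n315sh/release:latest' --force --allow-explicit-upgrade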

Comment 10 Sebastian Łaskawiec 2021-07-16 10:26:56 UTC
Jack and I tested this using a few different methods. The easiest one is to inspect the CVO logs from the upgrade job [1]:

I0715 12:23:52.981471       1 sync_worker.go:753] Running sync for prometheusrule "openshift-authentication-operator/authentication-operator" (614 of 709)
...
W0715 12:23:53.179603       1 helper.go:97] PrometheusRule "openshift-authentication-operator/authentication-operator" has already been removed.
I0715 12:23:53.179630       1 sync_worker.go:765] Done syncing for prometheusrule "openshift-authentication-operator/authentication-operator" (614 of 709)

Comment 12 Yang Yang 2021-07-29 06:31:26 UTC
Installation with 4.9.0-0.nightly-2021-07-28-181504 passed and the authentication-operator PrometheusRule is not created:

# oc get PrometheusRule authentication-operator -n openshift-authentication-operator
Error from server (NotFound): prometheusrules.monitoring.coreos.com "authentication-operator" not found

In an upgrade test from 4.7 to 4.9.0-0.nightly-2021-07-28-181504, the authentication-operator PrometheusRule is removed:

# oc get PrometheusRule authentication-operator -n openshift-authentication-operator
Error from server (NotFound): prometheusrules.monitoring.coreos.com "authentication-operator" not found
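
The ClusterVersion conditions after the upgrade are shown below; they can be retrieved with something like the following (the exact command used is not recorded here):

# oc get clusterversion version -o json | jq '.status.conditions'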

[
  {
    "lastTransitionTime": "2021-07-29T01:06:58Z",
    "message": "Done applying 4.9.0-0.nightly-2021-07-28-181504",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2021-07-29T03:22:06Z",
    "status": "False",
    "type": "Failing"
  },
  {
    "lastTransitionTime": "2021-07-29T04:19:34Z",
    "message": "Cluster version is 4.9.0-0.nightly-2021-07-28-181504",
    "status": "False",
    "type": "Progressing"
  },
  {
    "lastTransitionTime": "2021-07-29T00:39:36Z",
    "message": "Unable to retrieve available updates: currently reconciling cluster version 4.9.0-0.nightly-2021-07-28-181504 not found in the \"stable-4.7\" channel",
    "reason": "VersionNotFound",
    "status": "False",
    "type": "RetrievedUpdates"
  },
  {
    "lastTransitionTime": "2021-07-29T04:21:57Z",
    "message": "Cluster minor level upgrades are not allowed while resource deletions are in progress; resources=PrometheusRule \"openshift-kube-apiserver/kube-apiserver\",PrometheusRule \"openshift-authentication-operator/authentication-operator\"",
    "reason": "ResourceDeletesInProgress",
    "status": "False",
    "type": "Upgradeable"
  }
]

# grep authentication-operator * | grep -v 'throt\|syn'
2021-07-29T02:59:51+0000-cluster-version-operator-748955994f-ms7hh.log:E0729 03:01:21.493060       1 task.go:112] error running apply for deployment "openshift-authentication-operator/authentication-operator" (233 of 676): context canceled
2021-07-29T02:59:51+0000-cluster-version-operator-748955994f-ms7hh.log:I0729 03:01:21.499494       1 task_graph.go:555] Result of work: [Could not update deployment "openshift-authentication-operator/authentication-operator" (233 of 676) Could not update customresourcedefinition "provisionings.metal3.io" (183 of 676)]
2021-07-29T02:59:51+0000-cluster-version-operator-748955994f-ms7hh.log:* Could not update deployment "openshift-authentication-operator/authentication-operator" (233 of 676)
2021-07-29T04:09:12+0000-cluster-version-operator-d5d5b8db9-lb6r5.log:I0729 04:19:30.840264       1 helper.go:65] Delete requested for PrometheusRule "openshift-authentication-operator/authentication-operator".
2021-07-29T04:09:12+0000-cluster-version-operator-d5d5b8db9-lb6r5.log:I0729 04:21:57.154378       1 upgradeable.go:229] Resource deletions in progress; resources=PrometheusRule "openshift-kube-apiserver/kube-apiserver",PrometheusRule "openshift-authentication-operator/authentication-operator"
2021-07-29T04:09:12+0000-cluster-version-operator-d5d5b8db9-lb6r5.log:I0729 04:22:53.111374       1 helper.go:76] Delete of PrometheusRule "openshift-authentication-operator/authentication-operator" completed.

The removal is started late and then results in Progressing=False while Upgradeable=False with reason ResourceDeletesInProgress. The CVO also logs `error running apply for deployment`.
	
Sebastian Łaskawiec, could you please help check if there is something wrong with the manifest?

Comment 13 Sebastian Łaskawiec 2021-08-09 10:55:15 UTC
This question should probably be routed to Jack as he's an expert in this field. 

My suspicion is that the resource deletion process should not block upgrades in any way. That's why the `progressing` flag is set to false and all steps are retried on error. So I think it looks OK, but Jack has the deciding call here.

Comment 14 Jack Ottofaro 2021-08-09 13:03:23 UTC
(In reply to Yang Yang from comment #12)
> 
> The removal is started late and then results in progressing=false but
> ResourceDeletesInProgress. CVO logs `error running apply for deployment`.
> 	
> Sebastian Łaskawiec, could you please help check if there is something wrong
> with the manifest?

Per our slack discussion https://coreos.slack.com/archives/CEGKQ43CP/p1627462194394000 I don't think there's an issue here but it would be helpful to see the entire CVO log if you still have it.

Comment 16 Jack Ottofaro 2021-08-10 14:19:32 UTC
(In reply to Yang Yang from comment #12)

I'm not seeing any issues. More specifically:

These are just "normal", i.e. not unexpected, logs from the prior release, which doesn't even contain the delete manifest:
 
> # grep authentication-operator * | grep -v 'throt\|syn'
> 2021-07-29T02:59:51+0000-cluster-version-operator-748955994f-ms7hh.log:E0729
> 03:01:21.493060       1 task.go:112] error running apply for deployment
> "openshift-authentication-operator/authentication-operator" (233 of 676):
> context canceled
> 2021-07-29T02:59:51+0000-cluster-version-operator-748955994f-ms7hh.log:I0729
> 03:01:21.499494       1 task_graph.go:555] Result of work: [Could not update
> deployment "openshift-authentication-operator/authentication-operator" (233
> of 676) Could not update customresourcedefinition "provisionings.metal3.io"
> (183 of 676)]
> 2021-07-29T02:59:51+0000-cluster-version-operator-748955994f-ms7hh.log:*
> Could not update deployment
> "openshift-authentication-operator/authentication-operator" (233 of 676)

Here we see the deletion started and completed:
> 2021-07-29T04:09:12+0000-cluster-version-operator-d5d5b8db9-lb6r5.log:I0729
> 04:19:30.840264       1 helper.go:65] Delete requested for PrometheusRule
> "openshift-authentication-operator/authentication-operator".
> 2021-07-29T04:09:12+0000-cluster-version-operator-d5d5b8db9-lb6r5.log:I0729
> 04:21:57.154378       1 upgradeable.go:229] Resource deletions in progress;
> resources=PrometheusRule
> "openshift-kube-apiserver/kube-apiserver",PrometheusRule
> "openshift-authentication-operator/authentication-operator"
> 2021-07-29T04:09:12+0000-cluster-version-operator-d5d5b8db9-lb6r5.log:I0729
> 04:22:53.111374       1 helper.go:76] Delete of PrometheusRule
> "openshift-authentication-operator/authentication-operator" completed.

Not sure why you think the removal is started late. What determines this is the manifest name, regardless of its contents. The deletion processing has no direct effect on "progressing"; it is set based on the state of the operators and on when the upgrade completes and we return to reconciling mode.
> The removal is started late and then results in progressing=false but
> ResourceDeletesInProgress. CVO logs `error running apply for deployment`.
> 	
> Sebastian Łaskawiec, could you please help check if there is something wrong
> with the manifest?
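
To make the ordering point above concrete: the run position comes from the manifest filename, so 0000_90_cluster-authentication-operator_03_prometheusrule.yaml sorts into the 0000_90 group near the end of the payload (it shows up as "614 of 709" in the comment 10 logs), and that position applies whether the manifest creates or deletes the PrometheusRule.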

Comment 17 Yang Yang 2021-08-11 01:22:10 UTC
Jack, thanks for confirming that. Moving it to the verified state based on comment #12.

Comment 19 Xingxing Xia 2021-08-11 01:28:04 UTC
Thanks, will check (sorry I didn't verify it in a timely manner; I've been busy with the subteams I lead; will look ASAP).

Comment 22 errata-xmlrpc 2021-10-18 17:36:53 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

