Bug 1975537

Summary: [Monitoring] Remove stale cruft installed by CVO in earlier releases
Product: OpenShift Container Platform Reporter: Jack Ottofaro <jack.ottofaro>
Component: MonitoringAssignee: Brad Ison <brad.ison>
Status: CLOSED NOTABUG QA Contact: Junqi Zhao <juzhao>
Severity: low Docs Contact:
Priority: medium    
Version: 4.9CC: alegrand, anpicker, aos-bugs, erooth, jfajersk, kakkoyun, mfojtik, pkrupa, sttts, xxia, yanyang
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1975533 Environment:
Last Closed: 2021-07-06 13:23:00 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Spreadsheet containing leaked resources. none

Description Jack Ottofaro 2021-06-23 21:22:28 UTC
Created attachment 1793634 [details]
Spreadsheet containing leaked resources.

+++ This bug was initially created as a clone of Bug #1975533 +++

This "stale cruft" is created as a result of the following scenario. Release A had manifest M that lead the CVO to reconcile resource R. But then the component maintainers decided they didn't need R any longer, so they dropped manifest M in release B. The new CVO will no longer reconcile R, but clusters updating from A to B will still have resource R in-cluster, as an unmaintained orphan.

Now that https://issues.redhat.com/browse/OTA-222 has been implemented teams can go back through and create deletion manifests for these leaked resources.

The attachment delete-candidates.csv contains a list of leaked resources as compared to a freshly installed 4.9 cluster. Use this list to find your component's resources and use the manifest delete annotation (https://github.com/openshift/cluster-version-operator/pull/438) to remove them.

Note also that in the case of a cluster-scoped resource it may not need to be removed but simply be modified to remove namespace.

Comment 1 Jan Fajerski 2021-06-24 12:12:59 UTC
CMO seems to occur once in the linked spreadsheet:

GROUP	        KIND	        NAME	                        NAMESPACE	      Born In	Last In	YAML File
extensions	Deployment	cluster-monitoring-operator	openshift-monitoring	4.1	4.1	0000_50_cluster-monitoring-operator_04-deployment.yaml

The manifest in question seems to be this https://github.com/openshift/cluster-monitoring-operator/blob/release-4.1/manifests/04-deployment.yaml

Iiuc the manifest was just renamed and there shouldn't be any orphans left behind.
Leaving the final decision to Simon though.

Comment 2 Jack Ottofaro 2021-06-24 14:00:51 UTC
Changing the manifest in a given release does not change the resource if it was installed in a prior release. So if a customer had a 4.1 cluster this resource would be present and if they upgraded that cluster to 4.9.z the resource would still be present. The purpose of the manifest delete is to remove such resources during the upgrade.

Comment 3 Simon Pasquier 2021-06-28 15:32:59 UTC
IIUC the changes that happened between 4.1 and 4.2 are:
* manifests/04-deployment.yaml renamed to manifests/0000_50_cluster_monitoring_operator_04-deployment.yaml [1]
* switching from "apiVersion: extensions/v1beta1" to "apiVersion: apps/v1"

Having said that, the manifest maps in both releases to the deployment resource named "cluster-monitoring-operator" in the "openshift-monitoring" namespace so there's no leaked resource left behind and nothing to be removed. Did I miss something?

[1] https://github.com/openshift/cluster-monitoring-operator/commit/9fd35c2f598d903a6be299a940e975c1672096d7
[2] https://github.com/openshift/cluster-monitoring-operator/commit/63ea3bca39ab33fb1e2f8c9cbd0fd804b3de112f

Comment 4 Jack Ottofaro 2021-06-28 17:28:27 UTC
(In reply to Simon Pasquier from comment #3)
>
The issue is that Group 'extensions' != 'apps' so the Deployments are not the same. So there will be an orphaned 'extension openshift-monitoring/cluster-monitoring-operator' in a 4.9 that was upgraded from a 4.1. If it was just a version change, extensions/v1beta1 -> 
extensions/v1, that wouldn't be an issue.

Comment 5 Brad Ison 2021-07-05 13:27:18 UTC
I don't see how this can possibly be the case. Deployments were moved from `extensions/v1beta1` to `apps/v1` and the old representation in the `extensions` group was actually removed in Kubernetes v1.16 altogether -- OpenShift 4.9 doesn't know about Deployments with version `extensions/v1beta1`. The previous versions would have been converted to the preferred storage version by the API server, right?

If there was an orphaned resource, we would have a bunch of clusters with multiple cluster-monitoring-operator deployments.

If we create a deletion manifest for this in 4.9, I don't think it would work, and if it did I think it would delete our operator deployment. Am I misunderstanding something here?

Comment 6 Jack Ottofaro 2021-07-06 13:10:48 UTC
(In reply to Brad Ison from comment #5)
> I don't see how this can possibly be the case. Deployments were moved from
> `extensions/v1beta1` to `apps/v1` and the old representation in the
> `extensions` group was actually removed in Kubernetes v1.16 altogether --
> OpenShift 4.9 doesn't know about Deployments with version
> `extensions/v1beta1`. The previous versions would have been converted to the
> preferred storage version by the API server, right?
> 
> If there was an orphaned resource, we would have a bunch of clusters with
> multiple cluster-monitoring-operator deployments.
> 
> If we create a deletion manifest for this in 4.9, I don't think it would
> work, and if it did I think it would delete our operator deployment. Am I
> misunderstanding something here?

That makes sense to me as well. This was one I couldn't easily verify since it was born in 4.1. Feel free to close.

Comment 7 Brad Ison 2021-07-06 13:23:00 UTC
Sounds good. Thanks.