Bug 1977319 - [Hive] Remove stale cruft installed by CVO in earlier releases
Summary: [Hive] Remove stale cruft installed by CVO in earlier releases
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Credential Operator
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.10.0
Assignee: Nobody
QA Contact: wang lin
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-29 13:11 UTC by Jack Ottofaro
Modified: 2022-03-10 16:04 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: older versions of CCO created a 'controller-manager-service' Service resource that is no longer needed.
Consequence: the stale 'controller-manager-service' Service was still present in upgraded clusters even though it was no longer used.
Fix: the Service manifest was recreated with the delete annotation so that the CVO cleans it up.
Result: the stale 'controller-manager-service' Service created by CCO is no longer present.
Clone Of: 1975533
Environment:
Last Closed: 2022-03-10 16:04:21 UTC
Target Upstream Version:
Embargoed:


Attachments
Spreadsheet containing leaked resources. (14.95 KB, text/plain)
2021-06-29 13:11 UTC, Jack Ottofaro


Links
Github openshift cloud-credential-operator pull 388 (open): Bug 1977319: cleanup orphaned Service 'controller-manager-service' (last updated 2021-09-24 12:11:36 UTC)
Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-10 16:04:33 UTC)

Description Jack Ottofaro 2021-06-29 13:11:08 UTC
Created attachment 1795797 [details]
Spreadsheet containing leaked resources.

+++ This bug was initially created as a clone of Bug #1975533 +++

This "stale cruft" is created as a result of the following scenario. Release A had manifest M that lead the CVO to reconcile resource R. But then the component maintainers decided they didn't need R any longer, so they dropped manifest M in release B. The new CVO will no longer reconcile R, but clusters updating from A to B will still have resource R in-cluster, as an unmaintained orphan.

Now that https://issues.redhat.com/browse/OTA-222 has been implemented, teams can go back through and create deletion manifests for these leaked resources.

The attachment delete-candidates.csv contains a list of leaked resources as compared to a freshly installed 4.9 cluster. Use this list to find your component's resources and use the manifest delete annotation (https://github.com/openshift/cluster-version-operator/pull/438) to remove them.
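
For reference, a deletion manifest is just the resource's ordinary manifest carrying the delete annotation, so the CVO deletes the in-cluster object instead of reconciling it. A minimal sketch for the Service in question, assuming the annotation name from the cluster-version-operator PR linked above (the manifest a component actually ships may differ):

  apiVersion: v1
  kind: Service
  metadata:
    name: controller-manager-service
    namespace: openshift-cloud-credential-operator
    annotations:
      # Tells the CVO to delete this resource during reconciliation
      # rather than (re)create it.
      release.openshift.io/delete: "true"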

Note also that in the case of a cluster-scoped resource, it may not need to be removed, but simply modified to remove the namespace; see the sketch below.
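
To illustrate the cluster-scoped case with a hypothetical ClusterRole (not from the attached list): dropping the stray namespace field from the manifest is enough, since namespace is ignored on cluster-scoped kinds, and nothing in-cluster needs deleting.

  apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRole
  metadata:
    name: example-leaked-role  # hypothetical name, for illustration only
    # namespace: openshift-example  # removed: meaningless on a cluster-scoped resource
  rules: []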

The two lines thought to be owned by Hive are:

Service	controller-manager-service	openshift-cloud-credential-operator	4.1	4.5	0000_50_cloud-credential-operator_01_deployment.yaml
ConfigMap	cloud-credential-operator-config	openshift-cloud-credential-operator	4.3	4.4	0000_50_cloud-credential-operator_01_operator_configmap.yaml

Comment 3 Joel Diaz 2021-09-23 14:39:02 UTC
From 4.1 to present, the name of the file for the cloud-cred-operator deployment changed from 0000_50_cloud-credential-operator_01_deployment.yaml to 0000_50_cloud-credential-operator_03-deployment.yaml. The resource that CVO manages is still the same Deployment at openshift-cloud-credential-operator/cloud-credential-operator. So there is no orphaned resource.

From 4.3 to present, cloud-cred-operator has stopped deploying the "default" configmap which used to be deployed via 0000_50_cloud-credential-operator_01_operator_configmap.yaml. While the ConfigMap is deprecated, the cloud-cred-operator still supports that resource if it exists. Marking it for removal by CVO can have unintended effects if a cluster is relying on the old ConfigMap to enable/disable the cloud-cred-operator.
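
For context, the legacy enable/disable knob lived in that ConfigMap as a data key. A rough sketch of the shape documented for older releases (the exact key and values may have varied by version):

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: cloud-credential-operator-config
    namespace: openshift-cloud-credential-operator
  data:
    # Legacy switch: "true" told older CCO releases to stop reconciling
    # CredentialsRequests (the pre-CRD way to disable the operator).
    disabled: "true"

A CVO delete manifest for this ConfigMap would silently flip such a cluster back to default behavior, which is exactly the unintended effect described above.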

It doesn't appear that there is anything to do here, as there are no orphaned resources. When we finally drop support for the old ConfigMap, we will have to ensure that it gets cleaned up (but we have our own "cleanup" controller to self-manage objects that we orphan/retire). Closing...

Comment 4 W. Trevor King 2021-09-23 23:02:16 UTC
(In reply to Joel Diaz from comment #3)
> From 4.1 to present the name of the file for the cloud-cred-operator
> deployment changed from 0000_50_cloud-credential-operator_01_deployment.yaml
>  to 0000_50_cloud-credential-operator_03-deployment.yaml . The resource that
> CVO manages is still the same Deployment at
> openshift-cloud-credential-operator/cloud-credential-operator. So there is
> no orphaned resource.

Comment 0's:

  Service	controller-manager-service	openshift-cloud-credential-operator	4.1	4.5	0000_50_cloud-credential-operator_01_deployment.yaml

was talking about a Service in that file, not the Deployment.

  $ git log -p -G 'kind: Service$' manifests | grep '^commit \|kind: Service$'
  commit 04c400f160500202ea48468b10847f237bf2fcf4
  +kind: Service
  -kind: Service
  -kind: Service
  commit f8da01cd8b275a1ce766ee30bc84a13da2f1e09f
   kind: Service
  +kind: Service
  commit fd2cc043dbc7223ac4f361f92be06507e8be2eb5
  +kind: Service

So yeah, looks like 04c400f16050 dropped a Service.  Checking names:

  $ git show 04c400f16050 manifests | grep -A5 'kind: Service$'
  +kind: Service
  +metadata:
  +  name: cco-metrics
  +  namespace: openshift-cloud-credential-operator
  +spec:
  +  ports:
  --
  -kind: Service
  -metadata:
  -  name: cco-metrics
  -  namespace: openshift-cloud-credential-operator
  -spec:
  -  ports:
  --
  -kind: Service
  -metadata:
  -  labels:
  -    control-plane: controller-manager
  -    controller-tools.k8s.io: "1.0"
  -  name: controller-manager-service

So yup, seems like you dropped the controller-manager-service Service, and should grow a delete manifest to remove it from born-before-4.6 clusters.  Unless you're handling that in your own orphan/cleanup controller already?

> From 4.3 to present cloud-cred-operator has stopped deploying the "default"
> configmap which used to be deployed via
> 0000_50_cloud-credential-operator_01_operator_configmap.yaml. While the
> ConfigMap is deprecated, the cloud-cred-operator still supports that
> resource if it exists. Marking it for removal by CVO can have unintended
> effects if a cluster is relying on the old ConfigMap to enable/disable the
> cloud-cred-operator.
>
> It doesn't appear that there is anything to do here, as there are no
> orphaned resources. When we finally drop support for the old ConfigMap, we
> will have to ensure that it gets cleaned up...

Can you share more details on how this works? If there is a new config object that, when set, masks the config from the old ConfigMap, it would be safe to remove the ConfigMap in those clusters, right? You wouldn't want admins tweaking the (masked) ConfigMap under the impression that it was still driving operator config. But yeah, if the ConfigMap is a source of defaults for a new config object, and the new config object is unset, that would be one way that the old ConfigMap might still be having some effect on a modern cluster.

Comment 5 Joel Diaz 2021-09-24 12:04:59 UTC
Sorry, I missed that it was a Service resource that was reported as orphaned. I do see the Service named 'controller-manager-service' was embedded in the "deployment" file. I'll put up a PR to re-add it as an orphaned object. Thanks.

We wrote a controller to clean up anything we dropped from our manifests. Whether it pre-dates this CVO "delete" functionality, I'm not sure. But you can see in https://github.com/openshift/cloud-credential-operator/blob/master/pkg/operator/cleanup/cleanup_controller.go that we watch for a list of orphaned CredentialsRequest resources that need cleaning up. There is only a single one that we watch for and clean up (https://github.com/openshift/cloud-credential-operator/blob/master/pkg/operator/constants/constants.go#L150-L155), and our cleanup controller only really cleans up a single kind of resource (as presently written).

Given that CVO can delete resources for us, I think we can look into retiring our cleanup controller going forward.

Comment 6 Joel Diaz 2021-09-24 12:17:28 UTC
Opened https://issues.redhat.com/browse/CCO-146 to add removal of the CCO cleanup controller now that we know CVO can do this work for us.

Comment 11 wang lin 2021-10-13 04:53:31 UTC
Verified using nightly build

a. a fresh 4.10 cluster doesn't have this stale service

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-10-12-203611   True        False         144m    Cluster version is 4.10.0-0.nightly-2021-10-12-203611
$ oc get service -n openshift-cloud-credential-operator controller-manager-service
Error from server (NotFound): services "controller-manager-service" not found

b. install a 4.9 cluster and create service/controller-manager-service manually, then upgrade to 4.10.

####before upgrade:
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-10-12-084355   True        False         144m    Cluster version is 4.9.0-0.nightly-2021-10-12-084355
$ oc get service -n openshift-cloud-credential-operator
NAME                         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
cco-metrics                  ClusterIP   172.30.34.147    <none>        8443/TCP   165m
controller-manager-service   ClusterIP   172.30.242.209   <none>        443/TCP    12m
pod-identity-webhook         ClusterIP   172.30.87.47     <none>        443/TCP    156m

####after upgrade:
$ oc get service -n openshift-cloud-credential-operator controller-manager-service
Error from server (NotFound): services "controller-manager-service" not found

Comment 15 errata-xmlrpc 2022-03-10 16:04:21 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

