2102289 – CGU resources cleanup is not consistent

Summary: CGU resources cleanup is not consistent

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Telco Edge
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	low
Target Milestone:	---
Target Release:	4.10.0
Assignee:	Steven Skeard
QA Contact:	yliu1
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-06-29 15:17 UTC by Steven Skeard
Modified:	2022-07-12 11:39 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-07-11 15:28:45 UTC
Target Upstream Version:
Embargoed:
Flags:	sskeard: needinfo+

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift-kni cluster-group-upgrades-operator pull 238	0	None	Merged	Enhancement CNF-5013: Make CGU resource cleanup more consistent	2022-06-30 13:32:07 UTC
Red Hat Product Errata	RHBA-2022:5514	0	None	None	None	2022-07-11 15:28:49 UTC

Description Steven Skeard 2022-06-29 15:17:38 UTC

Description of problem:
The cleanup of resources related to CGUs is not consistent.
Managed policies are deleted on the next reconcile, whereas transient resources are deleted on the current reconcile. This inconsistency can cause problems when ACM is managing a large number of nodes, because these resources can contribute to memory issues.
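The expected behavior (all CGU resources removed at the same time) can be illustrated with a minimal Go sketch. It assumes a hypothetical in-memory store instead of a real Kubernetes client; `Resource`, `Store`, and `cleanupCGU` are illustrative names, not the operator's actual types or functions (the real change is in the linked pull request 238).

```go
package main

import "fmt"

// Resource is a hypothetical, simplified stand-in for an object the
// operator creates on behalf of a CGU; it is not a real operator type.
type Resource struct {
	Kind string // e.g. "Policy" or "ManagedClusterView"
	Name string
}

// Store is a hypothetical in-memory stand-in for the cluster's API server.
type Store map[Resource]bool

// cleanupCGU deletes every resource owned by a finished CGU in one pass,
// so policy copies and transient resources are removed in the same
// reconcile instead of being split across two reconciles.
func cleanupCGU(store Store, owned []Resource) []Resource {
	var deleted []Resource
	for _, r := range owned {
		if store[r] {
			delete(store, r)
			deleted = append(deleted, r)
		}
	}
	return deleted
}

func main() {
	store := Store{
		{Kind: "Policy", Name: "example-policy"}:                true,
		{Kind: "ManagedClusterView", Name: "view-precache-job"}: true,
	}
	owned := []Resource{
		{Kind: "Policy", Name: "example-policy"},
		{Kind: "ManagedClusterView", Name: "view-precache-job"},
	}
	deleted := cleanupCGU(store, owned)
	// Both resources are gone after a single call (one "reconcile").
	fmt.Println(len(deleted), len(store))
}
```

The sketch does not say whether the real fix moves everything to the current reconcile or to the next one; it only shows the "same time" expectation from this report.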

Version-Release number of selected component (if applicable):
4.10

How reproducible:
Every time

Steps to Reproduce:
1. Create a CGU with backup or precaching enabled.
2. Observe that some resources are deleted on the current reconcile while others are deleted on the next one.

Actual results:
The timing for deleting resources is not consistent.

Expected results:
All resources would be cleaned up at the same time.

Additional info:

Comment 2 yliu1 2022-07-05 18:46:00 UTC

@sskeard, could you please set the target release? Thanks.

Comment 4 yliu1 2022-07-07 17:37:14 UTC

This issue does not seem to be fully resolved.

CGU spec used:
  spec:
    clusters:
    - worker-2
    enable: true
    managedPolicies:
    - common-config-policy
    - common-subscriptions-policy
    preCaching: true
    remediationStrategy:
      maxConcurrency: 1
      timeout: 240

Shortly after precaching started, the following managedclusterviews were created:
[kni ~]$ oc get managedclusterviews.view.open-cluster-management.io -A
NAMESPACE   NAME                                 AGE
worker-2    view-precache-cluster-role-binding   105s
worker-2    view-precache-job                    30s
worker-2    view-precache-namespace              2m48s
worker-2    view-precache-service-acct           109s
worker-2    view-precache-spec-configmap         114s

Shortly afterwards (probably once precaching completed), only one view remained, and it was not cleaned up until after the upgrade completed:
$ oc get managedclusterviews.view.open-cluster-management.io -A
NAMESPACE   NAME                AGE
worker-2    view-precache-job   10m14s
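When re-verifying, leftover precache views can be detected mechanically from the command output above. This is a minimal Go sketch, assuming the default `NAMESPACE NAME AGE` column layout shown in this comment; `leftoverPrecacheViews` is a hypothetical helper, not part of `oc` or the operator, and it only parses captured text (it does not call the cluster).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// leftoverPrecacheViews scans the table printed by
// `oc get managedclusterviews.view.open-cluster-management.io -A`
// and returns the names of views whose NAME column starts with prefix.
func leftoverPrecacheViews(output, prefix string) []string {
	var names []string
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		// Rows look like NAMESPACE NAME AGE; skip the header and short lines.
		if len(fields) < 2 || fields[0] == "NAMESPACE" {
			continue
		}
		if strings.HasPrefix(fields[1], prefix) {
			names = append(names, fields[1])
		}
	}
	return names
}

func main() {
	out := "NAMESPACE   NAME                AGE\n" +
		"worker-2    view-precache-job   10m14s\n"
	fmt.Println(leftoverPrecacheViews(out, "view-precache-"))
}
```

With the output captured after precaching completed, the helper reports `view-precache-job` as the remaining view; after a consistent cleanup it would return an empty list.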

Comment 5 yliu1 2022-07-07 17:40:42 UTC

Changing the state to VERIFIED based on the comment here: https://issues.redhat.com/browse/CNF-5013?focusedCommentId=20140982&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel

Comment 7 errata-xmlrpc 2022-07-11 15:28:45 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10.22 extras update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5514
