Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1767547

Summary: Marketplace-operator produces unbounded number of package registry replicasets when CatalogSourceConfig targetNamespace is missing
Product: OpenShift Container Platform
Component: OLM
OLM sub component: OperatorHub
Reporter: Chance Zibolski <chancez>
Assignee: Kevin Rizza <krizza>
QA Contact: Fan Jia <jfan>
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
CC: cblecker, ecordell, jeder, rbastos
Version: 4.3.0
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-01-23 11:10:15 UTC
Clones: 1769841, 1769844 (view as bug list)
Bug Blocks: 1769841, 1769844
Attachments: screenshot of etcd object count for the cluster this occurred on

Description Chance Zibolski 2019-10-31 16:29:07 UTC
Description of problem: The marketplace-operator pod updates the package-registry Deployment for a CatalogSourceConfig indefinitely when the CatalogSourceConfig's targetNamespace is missing.

Each update triggers the Deployment to create a new ReplicaSet, but for some reason the old ones are left around, so the number of ReplicaSets grows without bound. Judging by the logs and the count after 24 hours, the reconcile runs fairly regularly; within a day the namespace had accumulated thousands of ReplicaSets.


Since the growth is unbounded, eventually there will be more objects than the control plane (controllers, apiserver, or etcd) can manage.

Version-Release number of selected component (if applicable): OCP 4.2.2, sometime after upgrading to 4.1.18, but may or may not be tied to an upgrade.


How reproducible: Unknown


Steps to Reproduce:

I don't have exact steps. I just know it was caused by the targetNamespace specified in a CatalogSourceConfig (created via the UI or similar) pointing at a namespace that had been deleted and didn't exist, and it was resolved by re-creating that namespace.
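Based on that, a reproduction sketch (untested; field names follow the operators.coreos.com/v1 CatalogSourceConfig schema, and the CSC name, package, and target namespace here are placeholders, not taken from the affected cluster):

```shell
# Untested reproduction sketch: create a CatalogSourceConfig whose
# targetNamespace does not exist on the cluster. Names are placeholders.
cat <<'EOF' | oc apply -f -
apiVersion: operators.coreos.com/v1
kind: CatalogSourceConfig
metadata:
  name: example-csc
  namespace: openshift-marketplace
spec:
  targetNamespace: does-not-exist    # namespace deliberately absent
  packages: elasticsearch-operator
EOF

# Then watch the ReplicaSet count climb on each reconcile:
watch -n 60 'oc get rs -n openshift-marketplace --no-headers | wc -l'
```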

Actual results:

I noticed it after roughly 24 hours, and here's what I found:

Here are the events in the openshift-marketplace namespace: https://gist.github.com/chancez/8680938c4e9fa6e1d591ffb90615f367. You can see the pods and ReplicaSets changing regularly for no apparent reason.

Here's the list of replicasets in the namespace: https://gist.github.com/chancez/b0d161f9ed11475516308fc2e6968fa2

A rough count of replicasets:
kubectl get rs -n openshift-marketplace  | wc -l
   38251
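A stopgap would be to prune the stale ReplicaSets by hand. A sketch, run here against a captured listing rather than the live cluster, assuming the leftovers are the ones scaled to 0 and that `kubectl get rs --no-headers` columns are NAME DESIRED CURRENT READY AGE:

```shell
# Hedged cleanup sketch over a captured listing; names below are made up.
# Stand-in for: kubectl get rs -n openshift-marketplace --no-headers > rs.txt
printf '%s\n' \
  'marketplace-old-abc  0 0 0 5d' \
  'marketplace-cur-def  1 1 1 2m' \
  'marketplace-old-ghi  0 0 0 1d' > rs.txt

# Select ReplicaSets whose DESIRED count (column 2) is 0.
awk '$2 == 0 {print $1}' rs.txt > stale.txt
cat stale.txt
# Actual deletion would then be something like:
#   xargs -r kubectl delete rs -n openshift-marketplace < stale.txt
```

This only treats the symptom; the operator would recreate the churn on the next reconcile until the targetNamespace exists again.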

When I inspect the Deployments, they do have a revisionHistoryLimit of 10, so I don't understand why that limit isn't being applied here.

And here is a snippet of the marketplace-operator's pod logs: https://gist.github.com/chancez/0795be851895ca7753bafb4504bc2338

From the logs you can see that for each CatalogSourceConfig whose targetNamespace doesn't exist, you get lines like the following:

time="2019-10-29T14:58:45Z" level=info msg="Reconciling CatalogSourceConfig openshift-marketplace/elasticsearch\n"
time="2019-10-29T14:58:45Z" level=info msg="Updated Deployment elasticsearch with registry command: [appregistry-server -r https://quay.io/cnr|redhat-operators -o elasticsearch-operator]" name=elasticsearch targetNamespace=openshift-operators-redhat type=CatalogSourceConfig
time="2019-10-29T14:58:45Z" level=info msg="Service elasticsearch is present" name=elasticsearch targetNamespace=openshift-operators-redhat type=CatalogSourceConfig
time="2019-10-29T14:58:45Z" level=info msg="Child resource openshift-marketplace/elasticsearch owned by a CatalogSourceConfig was deleted"
time="2019-10-29T14:58:45Z" level=info msg="Deleted Service elasticsearch" name=elasticsearch targetNamespace=openshift-operators-redhat type=CatalogSourceConfig
time="2019-10-29T14:58:45Z" level=info msg="Created Service elasticsearch" name=elasticsearch targetNamespace=openshift-operators-redhat type=CatalogSourceConfig
time="2019-10-29T14:58:45Z" level=info msg="Creating CatalogSource elasticsearch" name=elasticsearch targetNamespace=openshift-operators-redhat type=CatalogSourceConfig
time="2019-10-29T14:58:45Z" level=error msg="Failed to create CatalogSource : namespaces \"openshift-operators-redhat\" not found" name=elasticsearch targetNamespace=openshift-operators-redhat type=CatalogSourceConfig
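A quick way to find every CatalogSourceConfig in this state, rather than grepping the logs (a sketch, assuming the field lives at `spec.targetNamespace` as the log lines suggest):

```shell
# Diagnostic sketch: flag CatalogSourceConfigs whose targetNamespace
# does not exist on the cluster. Requires a live cluster and oc access.
for ns in $(oc get catalogsourceconfig -n openshift-marketplace \
    -o jsonpath='{.items[*].spec.targetNamespace}'); do
  oc get namespace "$ns" >/dev/null 2>&1 || echo "missing targetNamespace: $ns"
done
```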



Expected results: Old ReplicaSets are deleted, and the operator surfaces the missing-targetNamespace issue somewhere outside the pod logs, potentially as Kubernetes events.


Additional info:
ClusterID is af8bc55b-9ae3-4735-bf65-b6ef43aeced9. I was able to resolve it by creating the missing namespaces.
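For the record, the workaround amounted to (namespace name taken from the log lines above):

```shell
# Workaround applied: recreate the namespace the CatalogSourceConfig
# targets so reconciliation can complete again. Cluster-specific.
oc create namespace openshift-operators-redhat

# Then confirm the churn stops (the count should hold steady):
oc get rs -n openshift-marketplace --no-headers | wc -l
```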

Comment 2 Chance Zibolski 2019-10-31 16:33:57 UTC
Created attachment 1631159 [details]
screenshot of etcd object count for the cluster this occurred on

Comment 4 Fan Jia 2019-11-11 06:02:53 UTC
No nightly build includes this fix PR yet.

Comment 7 Rogerio Bastos 2019-11-13 15:01:01 UTC
*** Bug 1771747 has been marked as a duplicate of this bug. ***

Comment 8 Rogerio Bastos 2019-11-13 15:03:00 UTC
This issue was also identified in v4.2.2, with customer clusters in production. The duplicated BZ for additional info is: https://bugzilla.redhat.com/show_bug.cgi?id=1771747

Comment 10 errata-xmlrpc 2020-01-23 11:10:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062