Description of problem:

The marketplace-operator pod updates the package-registry Deployment for a CatalogSourceConfig forever when the CatalogSourceConfig's targetNamespace is missing. Each update triggers the Deployment to create a new replicaset, but for some reason the old replicasets are left around, so the number of replicasets grows without bound. Judging by the logs and the replicaset count after 24 hours, this reconcile fires regularly, producing thousands of replicasets per day. Since the growth is unbounded, eventually there will be too many objects for the master to manage in the controllers, apiserver, or etcd.

Version-Release number of selected component (if applicable):

OCP 4.2.2, sometime after upgrading to 4.1.18, but it may or may not be tied to an upgrade.

How reproducible:

Unknown

Steps to Reproduce:

I don't have exact steps. I just know it was caused by the targetNamespace specified in a CatalogSourceConfig (created via the UI or something) targeting a namespace that had been deleted and didn't exist, and it was resolved by re-creating that namespace. (A reconstructed manifest sketch follows after the expected results below.)

Actual results:

I noticed it after roughly 24 hours, and here's what I found.

Events in the openshift-marketplace namespace: https://gist.github.com/chancez/8680938c4e9fa6e1d591ffb90615f367 - you can see the pods and replicasets changing regularly for what seems to be no reason.

The list of replicasets in the namespace: https://gist.github.com/chancez/b0d161f9ed11475516308fc2e6968fa2

A rough count of replicasets:

kubectl get rs -n openshift-marketplace | wc -l
38251

When I inspect the deployment, it does have a revisionHistoryLimit of 10, so I don't understand why that isn't being applied here.

A snippet of the marketplace-operator pod's logs: https://gist.github.com/chancez/0795be851895ca7753bafb4504bc2338

From the logs, each CatalogSourceConfig whose targetNamespace doesn't exist produces entries like:

time="2019-10-29T14:58:45Z" level=info msg="Reconciling CatalogSourceConfig openshift-marketplace/elasticsearch\n"
time="2019-10-29T14:58:45Z" level=info msg="Updated Deployment elasticsearch with registry command: [appregistry-server -r https://quay.io/cnr|redhat-operators -o elasticsearch-operator]" name=elasticsearch targetNamespace=openshift-operators-redhat type=CatalogSourceConfig
time="2019-10-29T14:58:45Z" level=info msg="Service elasticsearch is present" name=elasticsearch targetNamespace=openshift-operators-redhat type=CatalogSourceConfig
time="2019-10-29T14:58:45Z" level=info msg="Child resource openshift-marketplace/elasticsearch owned by a CatalogSourceConfig was deleted"
time="2019-10-29T14:58:45Z" level=info msg="Deleted Service elasticsearch" name=elasticsearch targetNamespace=openshift-operators-redhat type=CatalogSourceConfig
time="2019-10-29T14:58:45Z" level=info msg="Created Service elasticsearch" name=elasticsearch targetNamespace=openshift-operators-redhat type=CatalogSourceConfig
time="2019-10-29T14:58:45Z" level=info msg="Creating CatalogSource elasticsearch" name=elasticsearch targetNamespace=openshift-operators-redhat type=CatalogSourceConfig
time="2019-10-29T14:58:45Z" level=error msg="Failed to create CatalogSource : namespaces \"openshift-operators-redhat\" not found" name=elasticsearch targetNamespace=openshift-operators-redhat type=CatalogSourceConfig

Expected results:

Old replicasets are deleted, and the operator surfaces the missing targetNamespace somewhere other than the pod's logs, potentially in Kubernetes events.
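For reference, here is a minimal sketch of a CatalogSourceConfig that should trigger the loop, reconstructed from the registry command and targetNamespace in the logs above; the exact field values are my assumptions, not a verified reproducer:

apiVersion: operators.coreos.com/v1
kind: CatalogSourceConfig
metadata:
  name: elasticsearch
  namespace: openshift-marketplace
spec:
  # targetNamespace points at a namespace that has since been deleted
  targetNamespace: openshift-operators-redhat
  source: redhat-operators
  packages: elasticsearch-operator

Deleting openshift-operators-redhat while this object exists should put the operator into the update/delete/create cycle shown in the logs.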
Additional info:

ClusterID is af8bc55b-9ae3-4735-bf65-b6ef43aeced9. I was able to resolve it by creating the missing namespaces.
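For anyone who hits this before a fixed build is available, a sketch of the workaround described above, plus cleanup of the leftover replicasets. It assumes that every replicaset in openshift-marketplace with 0 desired replicas is stale, which held here but is worth double-checking before deleting:

# re-create the missing namespace the CatalogSourceConfig targets
oc create namespace openshift-operators-redhat

# delete the scaled-down replicasets left behind
# (column 2 of 'kubectl get rs' output is DESIRED)
kubectl get rs -n openshift-marketplace --no-headers | \
  awk '$2 == 0 {print $1}' | \
  xargs -r kubectl delete rs -n openshift-marketplace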
Created attachment 1631159: screenshot of the etcd object count for the cluster this occurred on
No nightly build includes the fix PR yet.
*** Bug 1771747 has been marked as a duplicate of this bug. ***
This issue was also identified on v4.2.2 in customer production clusters. The duplicate BZ, which has additional info, is: https://bugzilla.redhat.com/show_bug.cgi?id=1771747
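For anyone checking whether a production cluster is affected, a quick sketch (assuming the default openshift-marketplace namespace and marketplace-operator deployment name): an abnormally high replicaset count plus the "Failed to create CatalogSource" error from the original report are the signatures.

# count replicasets in the marketplace namespace (normally a small number)
kubectl get rs -n openshift-marketplace --no-headers | wc -l

# look for the missing-namespace error in the operator logs
kubectl logs -n openshift-marketplace deploy/marketplace-operator | \
  grep 'Failed to create CatalogSource'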
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062