Bug 2100176

Summary:	File-based catalog index images with large catalogs fail to start on all versions of OpenShift
Product:	OpenShift Container Platform	Reporter:	Chris Johnson <cdjohnson>
Component:	OLM	Assignee:	jkeister
OLM sub component:	OLM	QA Contact:	Jian Zhang <jiazha>
Status:	CLOSED WONTFIX	Docs Contact:
Severity:	high
Priority:	high	CC:	jdockter, jkeister, skachhwa
Version:	4.6	Keywords:	Triaged
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	All
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-08-31 13:54:12 UTC	Type:	Bug

Description Chris Johnson 2022-06-22 16:22:10 UTC

Description of problem:
This problem was surfaced and described in: https://bugzilla.redhat.com/show_bug.cgi?id=2093288#

This bug requests an alternative fix in operator-registry / opm, within the catalog index image itself, rather than in OLM.

The fix in bug 2093288 changes OLM itself, so it has to be backported to older releases, and that does not appear to be happening.

We are publishing a single, large catalog that runs on all versions of OpenShift.  Customers install a CatalogSource pointing to our catalog:  icr.io/cp/ibm-operator-catalog.

We would like this fixed in the image itself, so that the catalog continues to work on all versions of OpenShift.  We are currently blocked from moving to file-based catalogs.

Some alternatives to the way 2093288 was resolved are described in that bug.

Comment 1 Per da Silva 2022-06-23 12:16:31 UTC

Hey Chris,

We have an upstream PR already: https://github.com/operator-framework/operator-registry/pull/974

I'd like to have the fix in opm, though this isn't backwards compatible, and registries shipped with an old version of OLM will break.
Waiting for Joe to figure out how we should deal with this.

Cheers,

Per

Comment 3 Chris Johnson 2022-06-23 13:03:58 UTC

Hi Per:  The referenced PR is an sqlite db fix.  

Not sure if you meant the OLM startupProbe PR:
https://github.com/operator-framework/operator-lifecycle-manager/pull/2791

If so, yes, I'm aware of it, and it would need to be backported to all OCP versions.  That is why I'm also looking for a fix that can be done in opm serve.

Comment 4 jkeister 2022-06-23 16:09:29 UTC

The original approach was intended as a short-term minimum-effort fix to enable release pipelines to continue. To reduce friction in the area of FBC adoption, we would need to provide something which is independent of the OLM version.

After discussion with the team, we feel that an approach which deploys an active gRPC endpoint immediately and performs loading in the background will resolve the issue independent of the OLM version.

We anticipate that requests arriving before the catalog is made ready will receive something like '202 Accepted' status.

WIP PR: https://github.com/operator-framework/operator-registry/pull/977

Comment 5 Chris Johnson 2022-06-29 19:32:24 UTC

I believe this PR is the prototype of a fix to lazily parse and load the catalog:
https://github.com/operator-framework/operator-registry/pull/977

Comment 6 Chris Johnson 2022-08-11 13:50:24 UTC

I believe the original PR 977 is being abandoned in favor of a PR that allows pre-caching the parsed results of the FBC objects:
https://github.com/operator-framework/operator-registry/pull/1005

Comment 7 jkeister 2022-08-11 15:14:58 UTC

Correct.  We will update this BZ to track downstream action when the upstream PR merges.  Right now 1005 looks like the best candidate architecturally speaking, but since discussions are ongoing I've avoided BZ updates until something lands.

Comment 9 jkeister 2022-08-31 13:54:12 UTC

We're transitioning from Bugzilla to Jira, so I will close this as WONTFIX. The Jira issue tracking the downstreaming of the fix for OCP 4.12 is https://issues.redhat.com/browse/OLM-2726.