Bug 2100176

Summary: File-based catalog index images with large catalogs fail to start on all versions of OpenShift
Product: OpenShift Container Platform
Reporter: Chris Johnson <cdjohnson>
Component: OLM
Assignee: jkeister
OLM sub component: OLM
QA Contact: Jian Zhang <jiazha>
Status: CLOSED WONTFIX
Severity: high
Priority: high
CC: jdockter, jkeister, skachhwa
Version: 4.6
Keywords: Triaged
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: All   
Last Closed: 2022-08-31 13:54:12 UTC
Type: Bug

Description Chris Johnson 2022-06-22 16:22:10 UTC
Description of problem:
This problem was surfaced and described in: https://bugzilla.redhat.com/show_bug.cgi?id=2093288#

This bug requests an alternative fix in operator-registry / opm, within the catalog index image itself, rather than in OLM.

The fix in bug 2093288 changes OLM, so it has to be backported, and that does not appear to be happening.

We are publishing a single, large catalog that runs on all versions of OpenShift. Customers install a CatalogSource pointing to our catalog: icr.io/cp/ibm-operator-catalog.

We would like this fixed in the image itself, so that it continues to work on all versions of OpenShift. We are currently blocked from moving to file-based catalogs.

Bug 2093288 describes some options as alternatives to the way it was resolved.

Comment 1 Per da Silva 2022-06-23 12:16:31 UTC
Hey Chris,

We have an upstream PR already: https://github.com/operator-framework/operator-registry/pull/974

I'd like to have the fix in opm, though this isn't backwards compatible, and registries shipped with an old version of OLM will break.
We're waiting for Joe to figure out how we should deal with this.

Cheers,

Per

Comment 3 Chris Johnson 2022-06-23 13:03:58 UTC
Hi Per:  The referenced PR is an sqlite db fix.  

Not sure if you meant the OLM startupProbe PR:
https://github.com/operator-framework/operator-lifecycle-manager/pull/2791

If so, yes, I'm aware of it, and it would need to be backported to all OCP versions. That is why I'm also looking for a fix that can be made in opm serve.

Comment 4 jkeister 2022-06-23 16:09:29 UTC
The original approach was intended as a short-term minimum-effort fix to enable release pipelines to continue. To reduce friction in the area of FBC adoption, we would need to provide something which is independent of the OLM version.

After discussion with the team, we feel that an approach which exposes an active gRPC endpoint immediately and performs loading in the background will resolve the issue independent of the OLM version.

We anticipate that requests arriving before the catalog is made ready will receive something like '202 Accepted' status.

WIP PR: https://github.com/operator-framework/operator-registry/pull/977

Comment 5 Chris Johnson 2022-06-29 19:32:24 UTC
I believe this PR is the prototype of a fix to lazily parse and load the catalog:
https://github.com/operator-framework/operator-registry/pull/977

Comment 6 Chris Johnson 2022-08-11 13:50:24 UTC
I believe the original PR 977 is being abandoned in favor of a PR that allows pre-caching the parsed results of the FBC objects:
https://github.com/operator-framework/operator-registry/pull/1005

Comment 7 jkeister 2022-08-11 15:14:58 UTC
Correct. We will update this BZ to track downstream action when the upstream PR merges. Right now 1005 looks like the best candidate architecturally speaking, but since discussions are ongoing, I've avoided BZ updates until something lands.

Comment 9 jkeister 2022-08-31 13:54:12 UTC
We're transitioning from BZ to Jira, so I will close this as WONTFIX. The Jira issue tracking the downstreaming of the fix for OCP 4.12 is https://issues.redhat.com/browse/OLM-2726