Bug 2100176 - File-based catalog index images with large catalogs fail to start on all versions of OpenShift
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.6
Hardware: All
OS: All
Target Milestone: ---
Assignee: jkeister
QA Contact: Jian Zhang
Depends On:
Reported: 2022-06-22 16:22 UTC by Chris Johnson
Modified: 2022-08-31 13:54 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2022-08-31 13:54:12 UTC
Target Upstream Version:


System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 2093288 | 0 | urgent | CLOSED | Default catalogs fails liveness/readiness probes | 2022-08-10 11:16:45 UTC

Description Chris Johnson 2022-06-22 16:22:10 UTC
Description of problem:
This problem was surfaced and described in: https://bugzilla.redhat.com/show_bug.cgi?id=2093288#

This bug requests an alternative fix in operator-registry / opm, inside the catalog index image itself, rather than in OLM.

The fix in bug 2093288 changes OLM, so it has to be backported to every OCP version, and that does not appear to be happening.

We publish a single, large catalog that runs on all versions of OpenShift. Customers install a CatalogSource pointing to our catalog: icr.io/cp/ibm-operator-catalog.

We would like this fixed in the image itself, so that the catalog continues to work on all versions of OpenShift. We are currently blocked from moving to file-based catalogs.

Bug 2093288 describes some options that are alternatives to how that bug was resolved.

Comment 1 Per da Silva 2022-06-23 12:16:31 UTC
Hey Chris,

We have an upstream PR already: https://github.com/operator-framework/operator-registry/pull/974

I'd like to have the fix in opm, though it isn't backwards compatible: registries shipped with an old version of OLM will break.
Waiting for Joe to figure out how we should deal with this.



Comment 3 Chris Johnson 2022-06-23 13:03:58 UTC
Hi Per: The referenced PR is a SQLite DB fix.

Not sure if you meant the OLM startupProbe PR:

If so, yes, I'm aware of it, but it would need to be backported to all OCP versions. That is why I'm also looking for a fix that can be made in opm serve.

Comment 4 jkeister 2022-06-23 16:09:29 UTC
The original approach was intended as a short-term minimum-effort fix to enable release pipelines to continue. To reduce friction in the area of FBC adoption, we would need to provide something which is independent of the OLM version.

After discussion with the team, we feel that an approach which brings up an active gRPC endpoint immediately and loads the catalog in the background will resolve the issue independently of the OLM version.

We anticipate that requests arriving before the catalog is made ready will receive something like '202 Accepted' status.

WIP PR: https://github.com/operator-framework/operator-registry/pull/977

Comment 5 Chris Johnson 2022-06-29 19:32:24 UTC
I believe this PR is the prototype of a fix to lazily parse and load the catalog:

Comment 6 Chris Johnson 2022-08-11 13:50:24 UTC
I believe the original PR 977 is being abandoned in favor of a PR that allows pre-caching the parsed results of the FBC objects:

Comment 7 jkeister 2022-08-11 15:14:58 UTC
Correct. We will update this BZ to track downstream action when the upstream PR merges. Right now 1005 looks like the best candidate, architecturally speaking, but since discussions are ongoing I've avoided BZ updates until something lands.

Comment 9 jkeister 2022-08-31 13:54:12 UTC
We're transitioning from BZ to Jira, so I will close this as WONTFIX. The Jira issue tracking the downstreaming of the fix for OCP 4.12 is https://issues.redhat.com/browse/OLM-2726.
