Description of problem:
All default catalogsources in 4.11 are built using file-based catalogs. Those catalogsources fail to deploy successfully on a 4.11 OCP cluster. Multiple CI runs on nightly builds have failed for this reason. The main culprit is the longer processing time for YAML/JSON unmarshalling in the registry pod. The proposal to address this issue is to add a startupProbe to the registry pod. The startupProbe will check gRPC health before the liveness/readiness probes are activated.

Version-Release number of selected component (if applicable):
4.11

How reproducible:

Steps to Reproduce:
1. Deploy a 4.11 OpenShift cluster
2. Check registry pods for default catalogsources such as redhat-operators
3.

Actual results:
The pods fail due to liveness/readiness probe failure:
  openshift-marketplace pod/redhat-operators-h22ms node/ci-op-s04xckx3-de73b-7fxs4-master-1 - reason/Unhealthy Readiness probe failed: timeout: failed to connect service ":50051" within 1s

Expected results:
The registry pods for default catalogsources should be up and running.

Additional info:
See Slack thread for more information: https://coreos.slack.com/archives/C01CQA76KMX/p1654190057669689
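To illustrate why a startupProbe helps with slow unmarshalling: the failing probes above time out after 1s per attempt, whereas a startup probe gives the container roughly initialDelaySeconds + periodSeconds x failureThreshold to come up before the kubelet restarts it, and the liveness/readiness probes only begin once it succeeds. The following is a minimal Go sketch of that arithmetic. The ProbeTiming type, the function names, and every numeric value are hypothetical and are not taken from the OLM fix.

```go
package main

import "fmt"

// ProbeTiming mirrors the timing fields of a Kubernetes probe.
// The type is hypothetical and exists only for this sketch.
type ProbeTiming struct {
	InitialDelaySeconds int
	PeriodSeconds       int
	FailureThreshold    int
}

// StartupBudgetSeconds returns roughly how long a container may take to
// start before the startup probe is exhausted: the initial delay plus
// one period per allowed failure.
func StartupBudgetSeconds(p ProbeTiming) int {
	return p.InitialDelaySeconds + p.PeriodSeconds*p.FailureThreshold
}

// FitsBudget reports whether a catalog that needs loadSeconds to finish
// unmarshalling would become healthy within the startup budget.
func FitsBudget(p ProbeTiming, loadSeconds int) bool {
	return loadSeconds <= StartupBudgetSeconds(p)
}

func main() {
	// Assumed values for illustration only.
	startup := ProbeTiming{PeriodSeconds: 10, FailureThreshold: 15}
	fmt.Println(StartupBudgetSeconds(startup)) // prints 150
	fmt.Println(FitsBudget(startup, 90))       // prints true
}
```

With these assumed numbers, a registry pod that needs 90s to load its catalog would survive startup, while the same pod guarded only by a 1s liveness/readiness timeout would be reported Unhealthy.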
verify

zhaoxia@xzha-mac bug-2093288 % oc exec catalog-operator-5b4c4fb995-7bzxn -- olm --version
OLM version: 0.19.0
git commit: a37156c1248d098260c3cdf229b95e4ffa85a261

1, create catalog resource

zhaoxia@xzha-mac bug-2093288 % cat catsrc.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-index-1
  namespace: openshift-marketplace
spec:
  displayName: Test
  publisher: OLM-QE
  sourceType: grpc
  image: registry.redhat.io/redhat/redhat-operator-index:v4.11
  updateStrategy:
    registryPoll:
      interval: 10m
---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-index-2
  namespace: openshift-marketplace
spec:
  displayName: Test
  publisher: OLM-QE
  sourceType: grpc
  image: registry.redhat.io/redhat/certified-operator-index:v4.11
  updateStrategy:
    registryPoll:
      interval: 10m
---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-index-3
  namespace: openshift-marketplace
spec:
  displayName: Test
  publisher: OLM-QE
  sourceType: grpc
  image: registry.redhat.io/redhat/community-operator-index:v4.11
  updateStrategy:
    registryPoll:
      interval: 10m
---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-index-4
  namespace: openshift-marketplace
spec:
  displayName: Test
  publisher: OLM-QE
  sourceType: grpc
  image: registry.redhat.io/redhat/redhat-marketplace-index:v4.11
  updateStrategy:
    registryPoll:
      interval: 10m

zhaoxia@xzha-mac bug-2093288 % oc apply -f catsrc.yaml
catalogsource.operators.coreos.com/test-index-1 created
catalogsource.operators.coreos.com/test-index-2 created
catalogsource.operators.coreos.com/test-index-3 created
catalogsource.operators.coreos.com/test-index-4 created

2, check pod status

zhaoxia@xzha-mac bug-2093288 % oc get catsrc -A
NAMESPACE               NAME                  DISPLAY               TYPE   PUBLISHER   AGE
openshift-marketplace   certified-operators   Certified Operators   grpc   Red Hat     29m
openshift-marketplace   community-operators   Community Operators   grpc   Red Hat     29m
openshift-marketplace   redhat-marketplace    Red Hat Marketplace   grpc   Red Hat     29m
openshift-marketplace   redhat-operators      Red Hat Operators     grpc   Red Hat     29m
openshift-marketplace   test-index            Test                  grpc   OLM-QE      7m59s
openshift-marketplace   test-index-1          Test                  grpc   OLM-QE      36s
openshift-marketplace   test-index-2          Test                  grpc   OLM-QE      35s
openshift-marketplace   test-index-3          Test                  grpc   OLM-QE      34s
openshift-marketplace   test-index-4          Test                  grpc   OLM-QE      33s

3, check events

zhaoxia@xzha-mac bug-2093288 % oc get events | grep timeout
6s   Warning   Unhealthy   pod/test-index-3-5t9cp   Startup probe failed: timeout: failed to connect service ":50051" within 1s
4s   Warning   Unhealthy   pod/test-index-4-vz7vg   Startup probe failed: timeout: failed to connect service ":50051" within 1s

LGTM, verified.
the pod status is:

zhaoxia@xzha-mac bug-2093288 % oc get pod
NAME                                    READY   STATUS    RESTARTS      AGE
certified-operators-4hkcf               1/1     Running   0             44m
community-operators-hgjtv               1/1     Running   0             44m
marketplace-operator-6f6c99685d-rwtkn   1/1     Running   1 (32m ago)   48m
redhat-marketplace-6fcvl                1/1     Running   0             44m
redhat-operators-4qzd8                  1/1     Running   0             44m
test-index-1-5gzct                      1/1     Running   0             14m
test-index-2-4sbhj                      1/1     Running   0             14m
test-index-2bhx4                        1/1     Running   0             22m
test-index-3-5t9cp                      1/1     Running   0             14m
test-index-4-vz7vg                      1/1     Running   0             14m
*** Bug 2093129 has been marked as a duplicate of this bug. ***
*** Bug 2084213 has been marked as a duplicate of this bug. ***
Since this fix is in OLM, we would need it to be backported to all OLM releases back to 4.6, as we use the same catalog images for all versions of OpenShift. When we switch from SQLite to FBC, this will prevent us from running on older versions of OpenShift.
Hey Chris,

We usually only backport a fix to the two previous versions at best, so I don’t think we will backport this to 4.6/4.7. Plus, we began supporting FBC starting with 4.10 (maybe 4.9 as well). From a technical standpoint, it is possible to make FBC work with older versions of OCP, but OLM doesn’t extend the support statement to those versions.

Vu
On the OLM dev call this afternoon, I described our problem:
1. We don't want to have two catalog implementations per OCP version.
2. We want FBC to run on 4.6+.
3. We want to just switch our latest catalog to point to FBC shortly after RH switches theirs over.

That said, a couple of other options were discussed that would enhance opm serve instead, which would allow us to build an FBC image that didn't need a startupProbe:
1. Make gRPC serving and FBC reading disjoint, so that the endpoint returns a "not ready" response (e.g. '202 -- Request Accepted') until the FBC load is complete.
2. As part of image preparation, pre-process the FBC into a binary blob, so that serving the FBC changes from a linear-time load to a constant-time one.

I would really like to see if either of these options would be feasible to avoid requiring a startupProbe.
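The first option (serving and FBC reading made disjoint) could look roughly like the sketch below. It uses only the Go standard library and does not reflect how opm serve is actually implemented. The Catalog type, its methods, and the package name "example-operator" are hypothetical, and the status strings are only loosely modeled on the gRPC health checking protocol.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Status strings loosely modeled on the gRPC health checking protocol.
const (
	NotServing = "NOT_SERVING"
	Serving    = "SERVING"
)

// Catalog is a hypothetical stand-in for a registry server that starts
// answering requests before its FBC content has been loaded.
type Catalog struct {
	mu       sync.RWMutex
	ready    bool
	packages []string
}

// Load runs the (potentially slow) FBC read, then marks the catalog ready.
func (c *Catalog) Load(read func() []string) {
	pkgs := read() // slow part, done without holding the lock
	c.mu.Lock()
	defer c.mu.Unlock()
	c.packages = pkgs
	c.ready = true
}

// Health answers immediately, whether or not loading has finished.
func (c *Catalog) Health() string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if c.ready {
		return Serving
	}
	return NotServing
}

// ListPackages returns an error until the load is complete.
func (c *Catalog) ListPackages() ([]string, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if !c.ready {
		return nil, errors.New("catalog not ready")
	}
	return c.packages, nil
}

func main() {
	c := &Catalog{}
	release := make(chan struct{}) // closed to let the simulated read finish
	done := make(chan struct{})    // closed once Load has returned

	go func() {
		c.Load(func() []string {
			<-release // simulate a long unmarshalling step
			return []string{"example-operator"}
		})
		close(done)
	}()

	fmt.Println(c.Health()) // prints NOT_SERVING: the load is still blocked
	close(release)
	<-done
	fmt.Println(c.Health()) // prints SERVING
}
```

The point of the design is that the health endpoint responds quickly from the moment the process starts, so probe timeouts no longer depend on how long the catalog takes to load.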
Chris - I recorded the design decisions we discussed in the CNCF WG meeting, in the upstream PR which delivered the startupProbe, including the rationale for the choice: https://github.com/operator-framework/operator-lifecycle-manager/pull/2791
I opened a separate bug to drive an alternative implementation described here. https://bugzilla.redhat.com/show_bug.cgi?id=2100176
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069