This bug was initially created as a copy of Bug #2093288

I am copying this bug because: Backporting process

Description of problem:
All default catalogsources in 4.11 are built using file-based catalogs. Those catalogsources fail to deploy successfully on a 4.11 OCP cluster. Multiple CI runs on nightly builds have failed for this reason. The main culprit is the longer processing time for YAML/JSON unmarshalling in the registry pod. The proposal to address this issue is to add a startupProbe to the registry pod. The startupProbe will check gRPC health before the liveness/readiness probes are activated.

Version-Release number of selected component (if applicable):
4.11

How reproducible:

Steps to Reproduce:
1. Deploy a 4.11 OpenShift cluster
2. Check registry pods for default catalogsources such as redhat-operators
3.

Actual results:
The pods fail due to liveness/readiness probe failure:
openshift-marketplace pod/redhat-operators-h22ms node/ci-op-s04xckx3-de73b-7fxs4-master-1 - reason/Unhealthy Readiness probe failed: timeout: failed to connect service ":50051" within 1s

Expected results:
The registry pods for default catalogsources should be up and running.

Additional info:
See Slack thread for more information: https://coreos.slack.com/archives/C01CQA76KMX/p1654190057669689
1, Create an OCP cluster with the fixed PR via cluster-bot

MacBook-Pro:~ jianzhang$ oc get clusterversion
NAME      VERSION                                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.ci.test-2022-08-08-063840-ci-ln-h9qsxqb-latest   True        False         9m17s   Cluster version is 4.10.0-0.ci.test-2022-08-08-063840-ci-ln-h9qsxqb-latest

2, Create a CatalogSource that uses the file-based index image.

MacBook-Pro:~ jianzhang$ oc create -f cs-redhat.yaml
catalogsource.operators.coreos.com/test-operators created
MacBook-Pro:~ jianzhang$ cat cs-redhat.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-operators
  namespace: openshift-marketplace
spec:
  displayName: Jian Operators
  image: registry.redhat.io/redhat/redhat-operator-index:v4.11
  priority: -100
  publisher: Jian
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 10m0s

3, Check if the startup probe works well

MacBook-Pro:~ jianzhang$ oc get pods
NAME                                   READY   STATUS    RESTARTS   AGE
certified-operators-jcdng              1/1     Running   0          26m
community-operators-8ghn6              1/1     Running   0          26m
marketplace-operator-958cf79bb-r2n59   1/1     Running   0          32m
redhat-marketplace-tk8t7               1/1     Running   0          26m
redhat-operators-xtj5l                 1/1     Running   0          26m
test-operators-52bpk                   1/1     Running   0          5m23s

MacBook-Pro:~ jianzhang$ oc get pods test-operators-52bpk -o=jsonpath='{.spec.containers[0].startupProbe}'
{"exec":{"command":["grpc_health_probe","-addr=:50051"]},"failureThreshold":15,"periodSeconds":10,"successThreshold":1,"timeoutSeconds":1}

MacBook-Pro:~ jianzhang$ oc get event|grep timeout
6m11s       Warning   Unhealthy   pod/test-operators-52bpk   Startup probe failed: timeout: failed to connect service ":50051" within 1s

After one timeout failure (at most 15 are allowed), the CatalogSource pod is running, so the startup probe works well. LGTM, verify it.
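
For reference, the startupProbe settings printed above put an upper bound on how long the registry pod may take to start serving gRPC. The following is a minimal sketch of that arithmetic, not code from the fix itself. The helper names startup_budget_seconds and fits_in_budget are hypothetical, and the fallback values are assumed to be the Kubernetes defaults for fields omitted from a probe spec.

```python
import json

# Probe JSON copied verbatim from the `oc get pods ... -o=jsonpath` output above.
PROBE_JSON = (
    '{"exec":{"command":["grpc_health_probe","-addr=:50051"]},'
    '"failureThreshold":15,"periodSeconds":10,"successThreshold":1,"timeoutSeconds":1}'
)


def startup_budget_seconds(probe: dict) -> int:
    """Approximate time the container has to pass its startup probe.

    Hypothetical helper: the kubelet restarts the container once the
    startup probe has failed failureThreshold times, with one attempt
    every periodSeconds after initialDelaySeconds.
    """
    initial_delay = probe.get("initialDelaySeconds", 0)  # assumed default: 0
    failure_threshold = probe.get("failureThreshold", 3)  # assumed default: 3
    period = probe.get("periodSeconds", 10)  # assumed default: 10
    return initial_delay + failure_threshold * period


def fits_in_budget(probe: dict, startup_seconds: float) -> bool:
    """Hypothetical check: would a server that needs startup_seconds survive?"""
    return startup_seconds <= startup_budget_seconds(probe)


probe = json.loads(PROBE_JSON)
print(startup_budget_seconds(probe))  # 150
```

With these values the registry pod gets roughly 150 seconds to finish unmarshalling the catalog, which is consistent with the single startup probe failure seen in the events above.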
Verified based on comment 1.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.10.30 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:6133