Bug 2115874 - Default catalogs fails liveness/readiness probes
Summary: Default catalogs fails liveness/readiness probes
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.10.z
Assignee: Per da Silva
QA Contact: Jian Zhang
Depends On: 2093288
Reported: 2022-08-05 15:21 UTC by Vu Dinh
Modified: 2022-08-31 12:34 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2022-08-31 12:34:13 UTC
Target Upstream Version:

System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift operator-framework-olm pull 348 | 0 | None | open | Bug 2115874: fix(grpc): Add startupProbe to check for grpc health readiness (#2791) | 2022-08-05 15:22:41 UTC
Red Hat Product Errata | RHSA-2022:6133 | 0 | None | None | None | 2022-08-31 12:34:43 UTC

Description Vu Dinh 2022-08-05 15:21:19 UTC
This bug was initially created as a copy of Bug #2093288

I am copying this bug because: 

Backporting process

Description of problem:
All default catalogsources in 4.11 are built as file-based catalogs. Those catalogsources fail to deploy successfully in a 4.11 OCP cluster. Multiple CI runs on nightly builds have failed for this reason.

The main culprit is the longer processing time for YAML/JSON unmarshalling in the registry pod. The proposal to address this issue is to add a startupProbe to the registry pod. The startupProbe will check for grpc health before the liveness/readiness probes are activated.
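
As a rough illustration of the proposal, a startupProbe on the registry container could look like the fragment below. This is a sketch under assumptions, not the content of the actual fix: the container name, the use of the grpc_health_probe binary, and the periodSeconds value are assumed. The port comes from the probe failure message in this report, and the failure threshold follows the "at most 15" mentioned in Comment 1.

```yaml
# Illustrative fragment of the registry pod's container spec (not copied from the fix).
containers:
- name: registry-server        # assumed container name
  startupProbe:
    exec:
      command:
      - grpc_health_probe      # assumed probe binary
      - -addr=:50051           # port taken from the probe failure message
    periodSeconds: 10          # assumed value
    failureThreshold: 15       # "at most 15" failures, per Comment 1
```

Kubernetes does not run liveness or readiness probes until the startupProbe has succeeded, so with these assumed values the registry would have roughly failureThreshold x periodSeconds = 150 seconds to finish unmarshalling before the pod is restarted.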

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Deploy a 4.11 OpenShift cluster
2. Check registry pods for default catalogsources such as redhat-operators

Actual results:
The pods fail due to liveness/readiness probe failure: openshift-marketplace pod/redhat-operators-h22ms node/ci-op-s04xckx3-de73b-7fxs4-master-1 - reason/Unhealthy Readiness probe failed: timeout: failed to connect service ":50051" within 1s

Expected results:
The registry pods for default catalogsources should be up and running.

Additional info:

See Slack thread for more information: https://coreos.slack.com/archives/C01CQA76KMX/p1654190057669689

Comment 1 Jian Zhang 2022-08-08 07:23:02 UTC
1, Create an OCP cluster with the fixed PR via cluster-bot
MacBook-Pro:~ jianzhang$ oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.ci.test-2022-08-08-063840-ci-ln-h9qsxqb-latest   True        False         9m17s   Cluster version is 4.10.0-0.ci.test-2022-08-08-063840-ci-ln-h9qsxqb-latest

2, Create a CatalogSource that uses the file-based index image.

MacBook-Pro:~ jianzhang$ oc create -f cs-redhat.yaml 
catalogsource.operators.coreos.com/test-operators created

MacBook-Pro:~ jianzhang$ cat cs-redhat.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-operators
  namespace: openshift-marketplace
spec:
  displayName: Jian Operators
  image: registry.redhat.io/redhat/redhat-operator-index:v4.11
  priority: -100
  publisher: Jian
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 10m0s

3, check if the startup probe works well

MacBook-Pro:~ jianzhang$ oc get pods
NAME                                   READY   STATUS    RESTARTS   AGE
certified-operators-jcdng              1/1     Running   0          26m
community-operators-8ghn6              1/1     Running   0          26m
marketplace-operator-958cf79bb-r2n59   1/1     Running   0          32m
redhat-marketplace-tk8t7               1/1     Running   0          26m
redhat-operators-xtj5l                 1/1     Running   0          26m
test-operators-52bpk                   1/1     Running   0          5m23s
MacBook-Pro:~ jianzhang$ oc get pods test-operators-52bpk  -o=jsonpath='{.spec.containers[0].startupProbe}'

MacBook-Pro:~ jianzhang$ oc get event|grep timeout
6m11s       Warning   Unhealthy           pod/test-operators-52bpk                    Startup probe failed: timeout: failed to connect service ":50051" within 1s

After one timeout failure (at most 15 are allowed), the CatalogSource pod is running, so the startup probe works well. LGTM, verifying it.

Comment 3 Jian Zhang 2022-08-19 00:49:26 UTC
Verified based on comment 1.

Comment 6 errata-xmlrpc 2022-08-31 12:34:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.10.30 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.