Bug 2093288 - Default catalogs fails liveness/readiness probes
Summary: Default catalogs fails liveness/readiness probes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.11.0
Assignee: Vu Dinh
QA Contact: xzha
URL:
Whiteboard:
Duplicates: 2084213 2093129
Depends On:
Blocks: 2115874
 
Reported: 2022-06-03 12:52 UTC by Vu Dinh
Modified: 2022-08-10 11:16 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:16:16 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift operator-framework-olm pull 314 | 0 | None | open | Bug 2093288: fix(grpc): Add startupProbe to check for grpc health readiness (#2791) | 2022-06-03 12:57:14 UTC
Red Hat Bugzilla | 2084213 | 1 | high | CLOSED | FBC catalog liveness and readiness probe failed to connect service :50051 | 2022-06-13 19:53:00 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 11:16:45 UTC

Internal Links: 2100176

Description Vu Dinh 2022-06-03 12:52:06 UTC
Description of problem:
All default catalogsources in 4.11 are built as file-based catalogs (FBC). These catalogsources fail to deploy successfully on 4.11 OCP clusters, and multiple CI runs on nightly builds have failed for this reason.

The main culprit is the longer processing time for YAML/JSON unmarshalling in the registry pod. The proposed fix is to add a startupProbe to the registry pod. The startupProbe checks gRPC health before the liveness/readiness probes are activated.
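
For illustration only, a startupProbe of this kind could look roughly like the pod spec fragment below. This is a sketch, not the spec that OLM generates: the container name, periodSeconds and failureThreshold are assumptions picked for the example. The grpc_health_probe command and port 50051 match the probe error quoted under "Actual results".

```yaml
# Illustrative fragment of a registry pod spec; not copied from the OLM source.
# The container name and all timing values are assumptions for this example.
containers:
  - name: registry-server
    ports:
      - name: grpc
        containerPort: 50051
    # Kubernetes holds back the liveness and readiness probes until the
    # startupProbe has succeeded once.
    startupProbe:
      exec:
        command: ["grpc_health_probe", "-addr=:50051"]
      periodSeconds: 10
      failureThreshold: 15   # up to 15 x 10s = 150s allowed for loading the catalog
    readinessProbe:
      exec:
        command: ["grpc_health_probe", "-addr=:50051"]
      periodSeconds: 10
    livenessProbe:
      exec:
        command: ["grpc_health_probe", "-addr=:50051"]
      periodSeconds: 10
```

With a fragment like this, failures while the catalog is still being unmarshalled are reported as "Startup probe failed" events and do not restart the container until the failure threshold is exhausted.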

Version-Release number of selected component (if applicable):
4.11

How reproducible:


Steps to Reproduce:
1. Deploy a 4.11 OpenShift cluster
2. Check registry pods for default catalogsources such as redhat-operators
3.

Actual results:
The pods fail due to liveness/readiness probe failure:
openshift-marketplace pod/redhat-operators-h22ms node/ci-op-s04xckx3-de73b-7fxs4-master-1 - reason/Unhealthy Readiness probe failed: timeout: failed to connect service ":50051" within 1s

Expected results:
The registry pods for default catalogsources should be up and running.

Additional info:

See Slack thread for more information: https://coreos.slack.com/archives/C01CQA76KMX/p1654190057669689

Comment 1 xzha 2022-06-06 06:55:27 UTC
verify

zhaoxia@xzha-mac bug-2093288 % oc exec catalog-operator-5b4c4fb995-7bzxn      -- olm --version
OLM version: 0.19.0
git commit: a37156c1248d098260c3cdf229b95e4ffa85a261

1, create catalog resource
zhaoxia@xzha-mac bug-2093288 % cat catsrc.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-index-1
  namespace: openshift-marketplace
spec:
  displayName: Test
  publisher: OLM-QE
  sourceType: grpc
  image: registry.redhat.io/redhat/redhat-operator-index:v4.11
  updateStrategy:
    registryPoll:
      interval: 10m
---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-index-2
  namespace: openshift-marketplace
spec:
  displayName: Test
  publisher: OLM-QE
  sourceType: grpc
  image: registry.redhat.io/redhat/certified-operator-index:v4.11
  updateStrategy:
    registryPoll:
      interval: 10m
---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-index-3
  namespace: openshift-marketplace
spec:
  displayName: Test
  publisher: OLM-QE
  sourceType: grpc
  image: registry.redhat.io/redhat/community-operator-index:v4.11
  updateStrategy:
    registryPoll:
      interval: 10m
---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-index-4
  namespace: openshift-marketplace
spec:
  displayName: Test
  publisher: OLM-QE
  sourceType: grpc
  image: registry.redhat.io/redhat/redhat-marketplace-index:v4.11
  updateStrategy:
    registryPoll:
      interval: 10m

zhaoxia@xzha-mac bug-2093288 % oc apply -f catsrc.yaml 
catalogsource.operators.coreos.com/test-index-1 created
catalogsource.operators.coreos.com/test-index-2 created
catalogsource.operators.coreos.com/test-index-3 created
catalogsource.operators.coreos.com/test-index-4 created

2, check pod status
zhaoxia@xzha-mac bug-2093288 % oc get catsrc -A
NAMESPACE               NAME                  DISPLAY               TYPE   PUBLISHER   AGE
openshift-marketplace   certified-operators   Certified Operators   grpc   Red Hat     29m
openshift-marketplace   community-operators   Community Operators   grpc   Red Hat     29m
openshift-marketplace   redhat-marketplace    Red Hat Marketplace   grpc   Red Hat     29m
openshift-marketplace   redhat-operators      Red Hat Operators     grpc   Red Hat     29m
openshift-marketplace   test-index            Test                  grpc   OLM-QE      7m59s
openshift-marketplace   test-index-1          Test                  grpc   OLM-QE      36s
openshift-marketplace   test-index-2          Test                  grpc   OLM-QE      35s
openshift-marketplace   test-index-3          Test                  grpc   OLM-QE      34s
openshift-marketplace   test-index-4          Test                  grpc   OLM-QE      33s

3, check events
zhaoxia@xzha-mac bug-2093288 % oc get events | grep timeout
6s          Warning   Unhealthy                      pod/test-index-3-5t9cp                       Startup probe failed: timeout: failed to connect service ":50051" within 1s
4s          Warning   Unhealthy                      pod/test-index-4-vz7vg                       Startup probe failed: timeout: failed to connect service ":50051" within 1s

LGTM, verified.

Comment 2 xzha 2022-06-06 06:57:12 UTC
the pod status is:
zhaoxia@xzha-mac bug-2093288 % oc get pod
NAME                                    READY   STATUS    RESTARTS      AGE
certified-operators-4hkcf               1/1     Running   0             44m
community-operators-hgjtv               1/1     Running   0             44m
marketplace-operator-6f6c99685d-rwtkn   1/1     Running   1 (32m ago)   48m
redhat-marketplace-6fcvl                1/1     Running   0             44m
redhat-operators-4qzd8                  1/1     Running   0             44m
test-index-1-5gzct                      1/1     Running   0             14m
test-index-2-4sbhj                      1/1     Running   0             14m
test-index-2bhx4                        1/1     Running   0             22m
test-index-3-5t9cp                      1/1     Running   0             14m
test-index-4-vz7vg                      1/1     Running   0             14m

Comment 4 tflannag 2022-06-07 14:22:41 UTC
*** Bug 2093129 has been marked as a duplicate of this bug. ***

Comment 7 Kevin Rizza 2022-06-13 19:53:01 UTC
*** Bug 2084213 has been marked as a duplicate of this bug. ***

Comment 8 Chris Johnson 2022-06-15 13:19:13 UTC
Since this fix is in OLM, we would need it backported to all OLM releases back to 4.6, as we use the same catalog images for all versions of OpenShift.

Without the backport, switching from SQLite to FBC will prevent us from running on older versions of OpenShift.

Comment 9 Vu Dinh 2022-06-16 18:49:23 UTC
Hey Chris,

We usually backport a fix to the two previous versions at most, so I don't think we will backport this to 4.6/4.7. Also, we only began supporting FBC with 4.10 (maybe 4.9 as well). From a technical standpoint, it is possible to make FBC work with older versions of OCP, but OLM doesn't extend the support statement to those versions.

Vu

Comment 10 Chris Johnson 2022-06-16 19:55:54 UTC
On the OLM dev call this afternoon, I described our problem:
1.  We don't want to have two catalog implementations per OCP version.
2.  We want FBC to run on 4.6+
3.  We want to just switch our latest catalog to point to FBC shortly after RH switches theirs over.

That said, a couple of other options were discussed that would enhance opm serve instead, which would allow us to build an FBC image that didn't need a startupProbe:

1. Make gRPC serving and FBC reading disjoint, so that the endpoint returns "not ready" until the FBC load is complete (e.g. '202 -- Request Accepted').
2. As part of image preparation, pre-process the FBC into a binary blob, so that FBC serving changes from a linear-time load to a constant-time one.

I would really like to see if either of these options would be feasible to avoid requiring a startupProbe.

Comment 11 jkeister 2022-06-17 14:40:42 UTC
Chris - 

I recorded the design decisions we discussed in the CNCF WG meeting, in the upstream PR which delivered the startupProbe, including the rationale for the choice:
https://github.com/operator-framework/operator-lifecycle-manager/pull/2791

Comment 14 Chris Johnson 2022-06-22 16:23:10 UTC
I opened a separate bug to drive an alternative implementation described here.
https://bugzilla.redhat.com/show_bug.cgi?id=2100176

Comment 18 errata-xmlrpc 2022-08-10 11:16:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

