2093288 – Default catalogs fails liveness/readiness probes

Summary: Default catalogs fails liveness/readiness probes

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	OLM
Sub Component:
Version:	4.11
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.11.0
Assignee:	Vu Dinh
QA Contact:	xzha
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	2084213 2093129
Depends On:
Blocks:	2115874

Reported:	2022-06-03 12:52 UTC by Vu Dinh
Modified:	2022-08-10 11:16 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-08-10 11:16:16 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift operator-framework-olm pull 314	0	None	open	Bug 2093288: fix(grpc): Add startupProbe to check for grpc health readiness (#2791)	2022-06-03 12:57:14 UTC
Red Hat Bugzilla	2084213	1	high	CLOSED	FBC catalog liveness and readiness probe failed to connect service :50051	2022-06-13 19:53:00 UTC
Red Hat Product Errata	RHSA-2022:5069	0	None	None	None	2022-08-10 11:16:45 UTC

Internal Links: 2100176

Description Vu Dinh 2022-06-03 12:52:06 UTC

Description of problem:
All default catalogsources in 4.11 are built using the file-based catalog (FBC) format. Those catalogsources fail to deploy successfully on a 4.11 OCP cluster. Multiple CI runs on nightly builds have failed for this reason.

The main culprit is the longer processing time for YAML/JSON unmarshalling in the registry pod. The proposal to address this issue is to add a startupProbe to the registry pod. The startupProbe will check gRPC health before the liveness/readiness probes are activated.
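
For illustration, a minimal sketch of what a registry pod with such a startupProbe could look like. This is not copied from the OLM change: the pod and container names, the use of grpc_health_probe as an exec probe, and all period/threshold values are assumptions chosen for the example. Only port 50051 and the index image come from this report.

```yaml
# Hypothetical registry pod; names and probe timings are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: example-registry            # assumed name
  namespace: openshift-marketplace
spec:
  containers:
  - name: registry-server           # assumed name
    image: registry.redhat.io/redhat/redhat-operator-index:v4.11
    ports:
    - name: grpc
      containerPort: 50051
    # Runs first. Liveness and readiness probes stay disabled until it succeeds,
    # so a slow FBC load is not reported as a liveness/readiness failure.
    startupProbe:
      exec:
        command: ["grpc_health_probe", "-addr=:50051"]
      periodSeconds: 10             # assumed value
      failureThreshold: 15          # assumed value; allows up to 15 x 10s = 150s to start
    readinessProbe:
      exec:
        command: ["grpc_health_probe", "-addr=:50051"]
      periodSeconds: 10             # assumed value
    livenessProbe:
      exec:
        command: ["grpc_health_probe", "-addr=:50051"]
      periodSeconds: 10             # assumed value
```

With a startupProbe in place, the startup budget is failureThreshold x periodSeconds. "Startup probe failed" events during that window are expected while the catalog is still loading, and the container is restarted only if the budget is exhausted.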

Version-Release number of selected component (if applicable):
4.11

How reproducible:


Steps to Reproduce:
1. Deploy a 4.11 OpenShift cluster
2. Check registry pods for default catalogsources such as redhat-operators
3.

Actual results:
The pods fail due to liveness/readiness probe failures:
openshift-marketplace pod/redhat-operators-h22ms node/ci-op-s04xckx3-de73b-7fxs4-master-1 - reason/Unhealthy Readiness probe failed: timeout: failed to connect service ":50051" within 1s

Expected results:
The registry pods for default catalogsources should be up and running.

Additional info:

See Slack thread for more information: https://coreos.slack.com/archives/C01CQA76KMX/p1654190057669689

Comment 1 xzha 2022-06-06 06:55:27 UTC

verify

zhaoxia@xzha-mac bug-2093288 % oc exec catalog-operator-5b4c4fb995-7bzxn      -- olm --version
OLM version: 0.19.0
git commit: a37156c1248d098260c3cdf229b95e4ffa85a261

1, create catalog resource
zhaoxia@xzha-mac bug-2093288 % cat catsrc.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-index-1
  namespace: openshift-marketplace
spec:
  displayName: Test
  publisher: OLM-QE
  sourceType: grpc
  image: registry.redhat.io/redhat/redhat-operator-index:v4.11
  updateStrategy:
    registryPoll:
      interval: 10m
---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-index-2
  namespace: openshift-marketplace
spec:
  displayName: Test
  publisher: OLM-QE
  sourceType: grpc
  image: registry.redhat.io/redhat/certified-operator-index:v4.11
  updateStrategy:
    registryPoll:
      interval: 10m
---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-index-3
  namespace: openshift-marketplace
spec:
  displayName: Test
  publisher: OLM-QE
  sourceType: grpc
  image: registry.redhat.io/redhat/community-operator-index:v4.11
  updateStrategy:
    registryPoll:
      interval: 10m
---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-index-4
  namespace: openshift-marketplace
spec:
  displayName: Test
  publisher: OLM-QE
  sourceType: grpc
  image: registry.redhat.io/redhat/redhat-marketplace-index:v4.11
  updateStrategy:
    registryPoll:
      interval: 10m

zhaoxia@xzha-mac bug-2093288 % oc apply -f catsrc.yaml 
catalogsource.operators.coreos.com/test-index-1 created
catalogsource.operators.coreos.com/test-index-2 created
catalogsource.operators.coreos.com/test-index-3 created
catalogsource.operators.coreos.com/test-index-4 created

2, check pod status
zhaoxia@xzha-mac bug-2093288 % oc get catsrc -A
NAMESPACE               NAME                  DISPLAY               TYPE   PUBLISHER   AGE
openshift-marketplace   certified-operators   Certified Operators   grpc   Red Hat     29m
openshift-marketplace   community-operators   Community Operators   grpc   Red Hat     29m
openshift-marketplace   redhat-marketplace    Red Hat Marketplace   grpc   Red Hat     29m
openshift-marketplace   redhat-operators      Red Hat Operators     grpc   Red Hat     29m
openshift-marketplace   test-index            Test                  grpc   OLM-QE      7m59s
openshift-marketplace   test-index-1          Test                  grpc   OLM-QE      36s
openshift-marketplace   test-index-2          Test                  grpc   OLM-QE      35s
openshift-marketplace   test-index-3          Test                  grpc   OLM-QE      34s
openshift-marketplace   test-index-4          Test                  grpc   OLM-QE      33s

3, check events
zhaoxia@xzha-mac bug-2093288 % oc get events | grep timeout
6s          Warning   Unhealthy                      pod/test-index-3-5t9cp                       Startup probe failed: timeout: failed to connect service ":50051" within 1s
4s          Warning   Unhealthy                      pod/test-index-4-vz7vg                       Startup probe failed: timeout: failed to connect service ":50051" within 1s

LGTM, verified.

Comment 2 xzha 2022-06-06 06:57:12 UTC

the pod status is:
zhaoxia@xzha-mac bug-2093288 % oc get pod
NAME                                    READY   STATUS    RESTARTS      AGE
certified-operators-4hkcf               1/1     Running   0             44m
community-operators-hgjtv               1/1     Running   0             44m
marketplace-operator-6f6c99685d-rwtkn   1/1     Running   1 (32m ago)   48m
redhat-marketplace-6fcvl                1/1     Running   0             44m
redhat-operators-4qzd8                  1/1     Running   0             44m
test-index-1-5gzct                      1/1     Running   0             14m
test-index-2-4sbhj                      1/1     Running   0             14m
test-index-2bhx4                        1/1     Running   0             22m
test-index-3-5t9cp                      1/1     Running   0             14m
test-index-4-vz7vg                      1/1     Running   0             14m

Comment 4 tflannag 2022-06-07 14:22:41 UTC

*** Bug 2093129 has been marked as a duplicate of this bug. ***

Comment 7 Kevin Rizza 2022-06-13 19:53:01 UTC

*** Bug 2084213 has been marked as a duplicate of this bug. ***

Comment 8 Chris Johnson 2022-06-15 13:19:13 UTC

Since this fix is in OLM, we would need it to be backported to all OLM releases back to 4.6, as we use the same catalog images for all versions of OpenShift.

Without the backport, once we switch from SQLite to FBC, our catalogs will not be able to run on older versions of OpenShift.

Comment 9 Vu Dinh 2022-06-16 18:49:23 UTC

Hey Chris,

We usually backport a fix to the two previous versions at most, so I don’t think we will backport this to 4.6/4.7. Also, FBC support begins with 4.10 (maybe 4.9 as well). From a technical standpoint, it is possible to make FBC work with older versions of OCP, but OLM doesn’t extend the support statement to those versions.

Vu

Comment 10 Chris Johnson 2022-06-16 19:55:54 UTC

On the OLM dev call this afternoon, I described our problem:
1.  We don't want to have two catalog implementations per OCP version.
2.  We want FBC to run on 4.6+
3.  We want to just switch our latest catalog to point to FBC shortly after RH switches theirs over.

That said, a couple of other options were discussed that would instead enhance opm serve, allowing us to build an FBC image that doesn't need a startupProbe:

1. Make gRPC serving and FBC reading disjoint, so that the endpoint returns "not ready" until the FBC load is complete (e.g. '202 -- Request Accepted').
2. As part of image preparation, pre-process the FBC into a binary blob, so that FBC serving changes from a linear-time load to a constant-time one.

I would really like to see if either of these options would be feasible to avoid requiring a startupProbe.

Comment 11 jkeister 2022-06-17 14:40:42 UTC

Chris - 

I recorded the design decisions we discussed in the CNCF WG meeting, in the upstream PR which delivered the startupProbe, including the rationale for the choice:
https://github.com/operator-framework/operator-lifecycle-manager/pull/2791

Comment 14 Chris Johnson 2022-06-22 16:23:10 UTC

I opened a separate bug to drive an alternative implementation described here.
https://bugzilla.redhat.com/show_bug.cgi?id=2100176

Comment 18 errata-xmlrpc 2022-08-10 11:16:16 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
