2115874 – Default catalogs fails liveness/readiness probes

Summary: Default catalogs fails liveness/readiness probes

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	OLM
Sub Component:
Version:	4.11
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.10.z
Assignee:	Per da Silva
QA Contact:	Jian Zhang
Docs Contact:
URL:
Whiteboard:
Depends On:	2093288
Blocks:

Reported:	2022-08-05 15:21 UTC by Vu Dinh
Modified:	2022-08-31 12:34 UTC
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-08-31 12:34:13 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift operator-framework-olm pull 348	0	None	open	Bug 2115874: fix(grpc): Add startupProbe to check for grpc health readiness (#2791)	2022-08-05 15:22:41 UTC
Red Hat Product Errata	RHSA-2022:6133	0	None	None	None	2022-08-31 12:34:43 UTC

Description Vu Dinh 2022-08-05 15:21:19 UTC

This bug was initially created as a copy of Bug #2093288

I am copying this bug because: 

Backporting process

Description of problem:
All default catalogsources in 4.11 are built as file-based catalogs. Those catalogsources fail to deploy successfully on a 4.11 OCP cluster. Multiple CI runs on nightly builds have failed for this reason.

The main culprit is the longer processing time for YAML/JSON unmarshalling in the registry pod. The proposal to address this issue is to add a startupProbe to the registry pod. The startupProbe will check for grpc health before the liveness/readiness probes are activated.
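
A minimal sketch of such a startupProbe on the registry container is shown below. The values are taken from the jsonpath output recorded in comment 1 of this bug; the rest of the container spec (name, image, and the liveness/readiness probe settings) is omitted because it is not shown in this report. With these values the registry gets roughly failureThreshold x periodSeconds = 15 x 10 = 150 seconds to start serving grpc before the kubelet restarts the container, and the liveness/readiness probes are held back until the startupProbe succeeds.

```yaml
# Sketch only: startup probe fragment for the catalog registry container.
# Values copied from the verification output in comment 1; all surrounding
# pod/container fields are intentionally left out.
startupProbe:
  exec:
    command:
      - grpc_health_probe
      - "-addr=:50051"     # grpc port served by the registry pod
  failureThreshold: 15     # tolerate up to 15 failed attempts
  periodSeconds: 10        # run one attempt every 10 seconds
  successThreshold: 1      # a single success marks the container as started
  timeoutSeconds: 1        # each attempt must connect within 1 second
```

The 1s timeout matches the "failed to connect service ":50051" within 1s" message seen in the probe failure events, so a single slow attempt during catalog loading is expected and does not by itself restart the pod.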

Version-Release number of selected component (if applicable):
4.11

How reproducible:


Steps to Reproduce:
1. Deploy a 4.11 OpenShift cluster
2. Check registry pods for default catalogsources such as redhat-operators
3.

Actual results:
The pods fail due to liveness/readiness probe failure:
openshift-marketplace pod/redhat-operators-h22ms node/ci-op-s04xckx3-de73b-7fxs4-master-1 - reason/Unhealthy Readiness probe failed: timeout: failed to connect service ":50051" within 1s

Expected results:
The registry pods for default catalogsources should be up and running.

Additional info:

See Slack thread for more information: https://coreos.slack.com/archives/C01CQA76KMX/p1654190057669689

Comment 1 Jian Zhang 2022-08-08 07:23:02 UTC

1. Create an OCP cluster that includes the fix PR via cluster-bot.
MacBook-Pro:~ jianzhang$ oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.ci.test-2022-08-08-063840-ci-ln-h9qsxqb-latest   True        False         9m17s   Cluster version is 4.10.0-0.ci.test-2022-08-08-063840-ci-ln-h9qsxqb-latest

2. Create a CatalogSource that uses the file-based index image.

MacBook-Pro:~ jianzhang$ oc create -f cs-redhat.yaml 
catalogsource.operators.coreos.com/test-operators created

MacBook-Pro:~ jianzhang$ cat cs-redhat.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-operators
  namespace: openshift-marketplace
spec:
  displayName: Jian Operators
  image: registry.redhat.io/redhat/redhat-operator-index:v4.11
  priority: -100
  publisher: Jian
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 10m0s

3. Check that the startup probe works.

MacBook-Pro:~ jianzhang$ oc get pods
NAME                                   READY   STATUS    RESTARTS   AGE
certified-operators-jcdng              1/1     Running   0          26m
community-operators-8ghn6              1/1     Running   0          26m
marketplace-operator-958cf79bb-r2n59   1/1     Running   0          32m
redhat-marketplace-tk8t7               1/1     Running   0          26m
redhat-operators-xtj5l                 1/1     Running   0          26m
test-operators-52bpk                   1/1     Running   0          5m23s
MacBook-Pro:~ jianzhang$ oc get pods test-operators-52bpk  -o=jsonpath='{.spec.containers[0].startupProbe}'
{"exec":{"command":["grpc_health_probe","-addr=:50051"]},"failureThreshold":15,"periodSeconds":10,"successThreshold":1,"timeoutSeconds":1}


MacBook-Pro:~ jianzhang$ oc get event|grep timeout
6m11s       Warning   Unhealthy           pod/test-operators-52bpk                    Startup probe failed: timeout: failed to connect service ":50051" within 1s

After one startup probe timeout (up to 15 failures are allowed by failureThreshold), the CatalogSource pod is running, so the startup probe works as expected. LGTM, verifying it.

Comment 3 Jian Zhang 2022-08-19 00:49:26 UTC

Verified based on comment 1.

Comment 6 errata-xmlrpc 2022-08-31 12:34:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.10.30 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6133
