Bug 2115874

Summary: Default catalogs fail liveness/readiness probes
Product: OpenShift Container Platform
Component: OLM
OLM sub component: OLM
Reporter: Vu Dinh <vdinh>
Assignee: Per da Silva <pegoncal>
QA Contact: Jian Zhang <jiazha>
CC: tyslaton
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 4.11
Target Release: 4.10.z
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Last Closed: 2022-08-31 12:34:13 UTC
Bug Depends On: 2093288

Description Vu Dinh 2022-08-05 15:21:19 UTC
This bug was initially created as a copy of Bug #2093288

I am copying this bug because: 

Backporting process

Description of problem:
All default catalogsources in 4.11 are built as file-based catalogsources. These catalogsources fail to deploy successfully on a 4.11 OCP cluster. Multiple CI runs on nightly builds have failed for this reason.

The main culprit is the longer processing time for YAML/JSON unmarshalling in the registry pod. The proposal to address this issue is to add a startupProbe to the registry pod. The startupProbe will check gRPC health before the liveness/readiness probes are activated.

Version-Release number of selected component (if applicable):
4.11

How reproducible:


Steps to Reproduce:
1. Deploy a 4.11 OpenShift cluster
2. Check registry pods for default catalogsources such as redhat-operators

Actual results:
The pods fail due to liveness/readiness probe failure:
openshift-marketplace pod/redhat-operators-h22ms node/ci-op-s04xckx3-de73b-7fxs4-master-1 - reason/Unhealthy Readiness probe failed: timeout: failed to connect service ":50051" within 1s

Expected results:
The registry pods for default catalogsources should be up and running.

Additional info:

See Slack thread for more information: https://coreos.slack.com/archives/C01CQA76KMX/p1654190057669689

Comment 1 Jian Zhang 2022-08-08 07:23:02 UTC
1, Create an OCP cluster that includes the fix PR via cluster-bot
MacBook-Pro:~ jianzhang$ oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.ci.test-2022-08-08-063840-ci-ln-h9qsxqb-latest   True        False         9m17s   Cluster version is 4.10.0-0.ci.test-2022-08-08-063840-ci-ln-h9qsxqb-latest

2, Create a CatalogSource that uses the file-based index image.

MacBook-Pro:~ jianzhang$ oc create -f cs-redhat.yaml 
catalogsource.operators.coreos.com/test-operators created

MacBook-Pro:~ jianzhang$ cat cs-redhat.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-operators
  namespace: openshift-marketplace
spec:
  displayName: Jian Operators
  image: registry.redhat.io/redhat/redhat-operator-index:v4.11
  priority: -100
  publisher: Jian
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 10m0s

3, Check that the startup probe works as expected

MacBook-Pro:~ jianzhang$ oc get pods
NAME                                   READY   STATUS    RESTARTS   AGE
certified-operators-jcdng              1/1     Running   0          26m
community-operators-8ghn6              1/1     Running   0          26m
marketplace-operator-958cf79bb-r2n59   1/1     Running   0          32m
redhat-marketplace-tk8t7               1/1     Running   0          26m
redhat-operators-xtj5l                 1/1     Running   0          26m
test-operators-52bpk                   1/1     Running   0          5m23s
MacBook-Pro:~ jianzhang$ oc get pods test-operators-52bpk  -o=jsonpath='{.spec.containers[0].startupProbe}'
{"exec":{"command":["grpc_health_probe","-addr=:50051"]},"failureThreshold":15,"periodSeconds":10,"successThreshold":1,"timeoutSeconds":1}


MacBook-Pro:~ jianzhang$ oc get event|grep timeout
6m11s       Warning   Unhealthy           pod/test-operators-52bpk                    Startup probe failed: timeout: failed to connect service ":50051" within 1s

After one startup probe timeout failure (up to 15 are allowed), the CatalogSource pod is running, so the startup probe works as expected. LGTM, marking it as verified.
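
As a rough check of the startup window this probe allows, the sketch below parses the probe JSON printed by the `oc get pods ... -o=jsonpath` command above and multiplies failureThreshold by periodSeconds, which is how Kubernetes bounds the time a container may spend starting before it is restarted. The helper name `startup_budget_seconds` is hypothetical, and initialDelaySeconds is assumed to be 0 because it does not appear in the output.

```python
import json

# Startup probe as reported for pod test-operators-52bpk in this comment.
PROBE_JSON = (
    '{"exec":{"command":["grpc_health_probe","-addr=:50051"]},'
    '"failureThreshold":15,"periodSeconds":10,'
    '"successThreshold":1,"timeoutSeconds":1}'
)


def startup_budget_seconds(probe: dict) -> int:
    """Approximate startup window: failureThreshold * periodSeconds.

    Hypothetical helper for illustration. initialDelaySeconds is treated
    as 0 when the field is missing (an assumption, since the output above
    does not include it).
    """
    initial_delay = probe.get("initialDelaySeconds", 0)
    return initial_delay + probe["failureThreshold"] * probe["periodSeconds"]


probe = json.loads(PROBE_JSON)
budget = startup_budget_seconds(probe)
print(f"probe command: {' '.join(probe['exec']['command'])}")
print(f"startup budget: about {budget}s")
```

With the values shown, the registry pod gets roughly 150 seconds to start serving gRPC before the kubelet restarts the container; the liveness and readiness probes stay disabled until the startup probe succeeds.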

Comment 3 Jian Zhang 2022-08-19 00:49:26 UTC
Verified based on comment 1.

Comment 6 errata-xmlrpc 2022-08-31 12:34:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.10.30 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6133