Bug 1798135 - Catalog Operator pod hangs indefinitely if it can't reach the api server when starting
Summary: Catalog Operator pod hangs indefinitely if it can't reach the api server when starting
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.4.0
Assignee: Ben Luddy
QA Contact: Bruno Andrade
URL:
Whiteboard:
Depends On:
Blocks: 1798665
 
Reported: 2020-02-04 16:11 UTC by Ben Browning
Modified: 2020-05-04 11:33 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1798665 1798666
Environment:
Last Closed: 2020-05-04 11:33:07 UTC
Target Upstream Version:
Embargoed:




Links:
- GitHub operator-framework/operator-lifecycle-manager pull 1277 (closed): "Bug 1798135: Fix cases where an operator's ready channel may never close." Last updated 2020-06-22 16:09:24 UTC
- Red Hat Product Errata RHBA-2020:0581. Last updated 2020-05-04 11:33:29 UTC

Description Ben Browning 2020-02-04 16:11:49 UTC
Description of problem:

In a long-running cluster, I noticed that newly deployed CatalogSource objects were not being reconciled at all. After some digging, I found the following in the catalog-operator pod logs:


$ oc logs catalog-operator-754858b4d7-gphgx -n openshift-operator-lifecycle-manager
time="2020-01-20T17:43:28Z" level=info msg="log level info"
time="2020-01-20T17:43:28Z" level=info msg="TLS keys set, using https for metrics"
time="2020-01-20T17:43:28Z" level=info msg="Using in-cluster kube client config"
time="2020-01-20T17:43:28Z" level=info msg="Using in-cluster kube client config"
time="2020-01-20T17:43:28Z" level=info msg="Using in-cluster kube client config"
time="2020-01-20T17:43:28Z" level=info msg="operator not ready: communicating with server failed: Get https://172.30.0.1:443/version?timeout=32s: dial tcp 172.30.0.1:443: connect: connection refused"



Version-Release number of selected component (if applicable):

Observed on a 4.2.14 cluster


How reproducible:

This only happens if there is an issue contacting the Kubernetes API server when the catalog-operator pod boots. In that case, the pod's error-handling logic has a gap: the pod still reports as healthy and continues to run, but it never retries contacting the Kubernetes API server and never starts reconciling OLM objects.

The logic that lets the operator get into this hung state appears to be around https://github.com/operator-framework/operator-lifecycle-manager/blob/7d6665d6585a733356c2fbc0919a047de244f59b/pkg/lib/queueinformer/queueinformer_operator.go#L193.
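The linked PR (operator-framework/operator-lifecycle-manager pull 1277) fixes cases where the operator's ready channel may never close. As a rough illustration only, and not the actual OLM code, a startup check that retries instead of giving up after a single failure could look something like the following; the package name, the waitForAPIServer helper, and its signature are all hypothetical:

package startup

import (
	"fmt"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// waitForAPIServer polls the API server's /version endpoint until it answers,
// instead of failing once and leaving the operator permanently not ready.
// Hypothetical helper for illustration; the real fix is in PR 1277.
func waitForAPIServer(config *rest.Config, interval time.Duration, stop <-chan struct{}) error {
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		return err
	}
	for {
		if _, err := client.Discovery().ServerVersion(); err == nil {
			// The API server is reachable; the caller can now close its ready channel.
			return nil
		} else {
			fmt.Printf("operator not ready: communicating with server failed: %v; retrying in %s\n", err, interval)
		}
		select {
		case <-stop:
			return fmt.Errorf("shut down before the API server became reachable")
		case <-time.After(interval):
		}
	}
}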


Steps to Reproduce:

I'm not sure how easy it will be to reproduce this on demand. An error contacting the Kubernetes API server needs to occur while the pod is booting. An integration or unit test in the catalog operator's code should be able to reproduce this, but that's not quite the same as reproducing it in a live cluster.
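As one possibility, a test could point the client at a local port where nothing is listening, which produces the same "connection refused" dial error shown in the logs above. This is an illustrative sketch built on the hypothetical waitForAPIServer helper from the sketch above, not a test that exists in the repository:

package startup

import (
	"testing"
	"time"

	"k8s.io/client-go/rest"
)

// TestStartupRetriesWhenAPIServerUnreachable simulates an unreachable API
// server at boot and checks that startup gives up when asked to, rather than
// hanging forever while still reporting healthy.
func TestStartupRetriesWhenAPIServerUnreachable(t *testing.T) {
	// Nothing listens on port 1, so every attempt fails with "connection refused".
	cfg := &rest.Config{Host: "https://127.0.0.1:1"}

	stop := make(chan struct{})
	go func() {
		// Allow a few retry attempts, then ask the helper to stop.
		time.Sleep(2 * time.Second)
		close(stop)
	}()

	if err := waitForAPIServer(cfg, 500*time.Millisecond, stop); err == nil {
		t.Fatal("expected an error while the API server is unreachable")
	}
}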

Actual results:

The catalog-operator pod stays running but does not do its job. Once it gets into this state, none of the OLM objects managed by this operator are reconciled, so newly created CatalogSources and similar resources do not work.

Expected results:

The catalog-operator pod should be able to recover from a failure contacting the Kubernetes API server.


Additional info:

Comment 3 Bruno Andrade 2020-02-11 15:47:42 UTC
After 25 hours, the catalog operator is still running and healthy. I could add and remove operators without any issue. Marking as VERIFIED.

OCP version: 4.4.0-0.nightly-2020-02-10-035806
OLM version: 0.14.1
git commit: b42c78dfcce511e49f8987e99d18725dd2ffe076


oc get pods -n openshift-operator-lifecycle-manager                                                                                                 
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-56b7f4d6c9-lkkbm   1/1     Running   0          25h
olm-operator-79bd65d66b-lt5l9       1/1     Running   0          25h
packageserver-585c6c9694-9n5h9      1/1     Running   0          66m
packageserver-585c6c9694-tb5tq      1/1     Running   0          67m

Comment 5 errata-xmlrpc 2020-05-04 11:33:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

