Bug 1807128

Summary: If OLM catalog operator cannot reach the API server, it does not seem to retry
Product: OpenShift Container Platform Reporter: Stephen Benjamin <stbenjam>
Component: OLM Assignee: Ben Luddy <bluddy>
OLM sub component: OLM QA Contact: Bruno Andrade <bandrade>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: unspecified CC: bluddy, pbalogh, scolange
Version: 4.4   
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-08-04 18:02:09 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1808418    

Description Stephen Benjamin 2020-02-25 16:34:19 UTC
I have an install stuck at: level=debug msg="Still waiting for the cluster to initialize: Cluster operator operator-lifecycle-manager-catalog has not yet reported success"

oc get clusteroperators does not show operator-lifecycle-manager-catalog... and the logs show:

$ oc logs $POD -n openshift-operator-lifecycle-manager
time="2020-02-25T14:58:32Z" level=info msg="log level info"
time="2020-02-25T14:58:32Z" level=info msg="TLS keys set, using https for metrics"
W0225 14:58:32.552916       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2020-02-25T14:58:32Z" level=info msg="Using in-cluster kube client config"
time="2020-02-25T14:58:32Z" level=info msg="Using in-cluster kube client config"
W0225 14:58:32.557542       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2020-02-25T14:58:32Z" level=info msg="Using in-cluster kube client config"
time="2020-02-25T14:58:32Z" level=info msg="operator not ready: communicating with server failed: Get https://172.30.0.1:443/version?timeout=32s: dial tcp 172.30.0.1:443: connect: connection refused"
time="2020-02-25T14:58:32Z" level=info msg="ClusterOperator api not present, skipping update (Get https://172.30.0.1:443/api?timeout=32s: dial tcp 172.30.0.1:443: connect: connection refused)"



However, the API is now reachable from inside the pod:

$ oc rsh -n openshift-operator-lifecycle-manager $POD                        
sh-4.2$ curl -k https://172.30.0.1:443/api?timeout=32s:
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {
  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/api\"",
  "reason": "Forbidden",
  "details": {
  },
  "code": 403
}sh-4.2$ 



But it appears the operator is not retrying.
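
For illustration, the retry behaviour I would expect looks roughly like the sketch below. This is not the operator's code and not the fix that was eventually shipped; it is a generic retry-with-exponential-backoff loop, and every name in it (wait_until_ready, make_flaky_check, the delay parameters) is hypothetical. The flaky check only simulates the "connection refused" error from the log above.

```python
import time
from typing import Callable, List


def wait_until_ready(
    check: Callable[[], None],
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call check() until it stops raising; return the number of attempts used.

    Hypothetical helper: connection errors are retried with a delay that
    doubles after each failure, capped at max_delay.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            check()
            return attempt
        except OSError as err:  # ConnectionRefusedError is a subclass of OSError
            if attempt == max_attempts:
                raise  # give up and surface the last error
            print(f"attempt {attempt} failed ({err}); retrying in {delay:g}s")
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise ValueError("max_attempts must be at least 1")


def make_flaky_check(failures: int) -> Callable[[], None]:
    """Build a stand-in for an API server probe that fails `failures` times."""
    remaining: List[int] = [failures]

    def check() -> None:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ConnectionRefusedError(
                "dial tcp 172.30.0.1:443: connect: connection refused"
            )

    return check


if __name__ == "__main__":
    # Simulate an API server that becomes reachable on the third attempt.
    attempts = wait_until_ready(make_flaky_check(2), sleep=lambda s: None)
    print(f"ready after {attempts} attempts")
```

The point of the sketch is only that a single failed probe at startup should not be final: once the API server comes up, the next attempt succeeds and the operator can report its status.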

Comment 1 Stephen Benjamin 2020-02-25 16:35:13 UTC
Similar BZ: BZ1798135

Comment 6 Bruno Andrade 2020-03-10 16:35:37 UTC
Installed a cluster and left it running for approximately one day; the OLM cluster operators are running as expected. Marking as VERIFIED.


OCP Cluster Version: 4.5.0-0.nightly-2020-03-06-190457

oc get clusteroperators | grep "operator-lifecycle-manager*"                                                           
operator-lifecycle-manager                 4.5.0-0.nightly-2020-03-06-190457   True        False         False      16h
operator-lifecycle-manager-catalog         4.5.0-0.nightly-2020-03-06-190457   True        False         False      16h
operator-lifecycle-manager-packageserver   4.5.0-0.nightly-2020-03-06-190457   True        False         False      4h11m
                                 
oc get pods -n openshift-operator-lifecycle-manager                                                                    
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-6d54448f87-qktbj   1/1     Running   0          16h
olm-operator-7c876bcb96-rxsxq       1/1     Running   0          16h
packageserver-6dcdd88944-88tjg      1/1     Running   0          4h11m
packageserver-6dcdd88944-cwqnx      1/1     Running   0          4h11m

Comment 7 Evan Cordell 2020-03-12 14:30:15 UTC
*** Bug 1810025 has been marked as a duplicate of this bug. ***

Comment 9 errata-xmlrpc 2020-08-04 18:02:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409