Bug 1760086 - Catalog-operator reporting problem health-checking custom catalog
Summary: Catalog-operator reporting problem health-checking custom catalog
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.3.0
Assignee: Alexander Greene
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-10-09 19:31 UTC by Rogerio Bastos
Modified: 2019-11-07 20:43 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-07 20:43:20 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
CatalogSource (232 bytes, text/plain)
2019-10-09 19:31 UTC, Rogerio Bastos
no flags Details
Catalog Source yaml file (341 bytes, text/plain)
2019-10-16 20:28 UTC, Rogerio Bastos
no flags Details

Description Rogerio Bastos 2019-10-09 19:31:52 UTC
Created attachment 1624012 [details]
CatalogSource

Description of problem:

The catalog-operator pod is constantly reporting failed healthcheck attempts against catalog registries

Version-Release number of selected component (if applicable):
v4.1.18

How reproducible:
Appears consistently in a v4.1.18 cluster with different CatalogSource objects

Steps to Reproduce:
1. Deploy attached CatalogSource into a cluster
2. Watch catalog-operator pod logs


Actual results:

The catalog appears functional and responds as expected to gRPC queries:

oc run grpcurl-query -n openshift-operators --rm=true  --restart=Never --attach=true --image=quay.io/rogbas/grpcurl -- -plaintext prometheus-catalog-registry:50051 api.Registry/ListPackages

{
  "name": "prometheus"
}


But the catalog-operator pod is constantly outputting messages like:

time="2019-10-09T19:25:15Z" level=info msg="client hasn't yet become healthy, attempt a health check" currentSource="{prometheus-catalog-registry openshift-operators}" id=hSMFu source=prometheus-catalog-registry
time="2019-10-09T19:25:23Z" level=info msg="building connection to registry" currentSource="{prometheus-catalog-registry openshift-operators}" id=31yeh source=prometheus-catalog-registry
time="2019-10-09T19:25:23Z" level=info msg="client hasn't yet become healthy, attempt a health check" currentSource="{prometheus-catalog-registry openshift-operators}" id=31yeh source=prometheus-catalog-registry
time="2019-10-09T19:25:29Z" level=info msg="building connection to registry" currentSource="{prometheus-catalog-registry openshift-operators}" id=El6De source=prometheus-catalog-registry

Comment 1 Alexander Greene 2019-10-14 17:33:49 UTC
I am looking into this now.

Comment 2 Alexander Greene 2019-10-14 19:20:41 UTC
(In reply to Rogerio Bastos from comment #0)

Hello Rogerio,

I was unable to reproduce the behavior you described after deploying a 4.1.18 cluster as shown below:

$ oc get clusteroperator
NAME                                 VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                       4.1.18    True        False         False      24m
...
...
...
marketplace                          4.1.18    True        False         False      29m
...
...
...
operator-lifecycle-manager           4.1.18    True        False         False      33m
operator-lifecycle-manager-catalog   4.1.18    True        False         False      33m

$ oc get subscriptions --all-namespaces
NAMESPACE                              NAME            PACKAGE         SOURCE                                      CHANNEL
openshift-marketplace                  prometheus      prometheus      installed-community-openshift-marketplace   beta
openshift-operator-lifecycle-manager   packageserver   packageserver   olm-operators                               alpha
openshift-operators                    amq-streams     amq-streams     installed-redhat-openshift-operators        stable

# The AMQ-Streams operator in the openshift-operators namespace...
$ oc get pod -n openshift-operators
NAME                                            READY   STATUS    RESTARTS   AGE
amq-streams-cluster-operator-7b6558fdc6-l895c   1/1     Running   0          3m32s

# The Prometheus operator in the openshift-marketplace namespace...
$ oc get pods -n openshift-marketplace
NAME                                                         READY   STATUS    RESTARTS   AGE
certified-operators-6db694488c-fdn9d                         1/1     Running   0          30m
community-operators-5494945db9-hftzp                         1/1     Running   0          30m
installed-community-openshift-marketplace-7f69d49697-5kh52   1/1     Running   0          92s
installed-community-openshift-operators-6b5ffd988f-qgbzk     1/1     Running   0          10m
installed-redhat-openshift-operators-77db67777b-9pt6q        1/1     Running   0          4m4s
marketplace-operator-8459dc96dd-w9zsj                        1/1     Running   0          31m
prometheus-operator-b74d786b4-pdwtr                          1/1     Running   0          66s
redhat-operators-789df5478c-p6qlv                            1/1     Running   0          30m


How are you building your CatalogSource object?

Comment 3 Dan Geoffroy 2019-10-14 20:47:23 UTC
Moving to 4.3 as this is not release blocking for 4.2.  We will continue to try to reproduce there and backport any applicable fixes to z-stream releases.

Comment 4 Rogerio Bastos 2019-10-15 18:00:04 UTC
As requested, the catalog image is being built with the following structure:

manifests
├── 0.32.0
│   ├── prometheus.alertmanager.crd.yaml
│   ├── prometheus.csv.yaml
│   ├── prometheus.podmonitors.crd.yaml
│   ├── prometheus.prometheus.crd.yaml
│   ├── prometheus.prometheusrule.crd.yaml
│   └── prometheus.servicemonitor.crd.yaml
└── prometheus.package.yaml


...and using the following Dockerfile:

FROM quay.io/openshift/origin-operator-registry:latest

ARG SRC_BUNDLES

COPY ${SRC_BUNDLES} manifests
RUN initializer

CMD ["registry-server", "-t", "/tmp/terminate.log"]

Comment 5 Alexander Greene 2019-10-16 18:06:47 UTC
Hello @rbastos,

The attachment you provided was a Subscription, not a CatalogSource. In an effort to reproduce your issue, I recreated the manifest directory and the image using the Dockerfile you provided and the following manifest files: https://github.com/operator-framework/community-operators/tree/master/community-operators/prometheus

I then created the following CatalogSource:
```
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: prometheus-catalog-registry
  namespace: olm
spec:
  displayName: Prometheus Catalog Source
  image: quay.io/agreene/catalog-operator:latest
  publisher: OperatorHub.io
  sourceType: grpc

```

After creating the CatalogSource, the Prometheus operator was deployed successfully:
```
$ oc get pods
NAME                                   READY   STATUS    RESTARTS   AGE
catalog-operator-74f5d45f65-89clr      1/1     Running   0          83m
olm-operator-6c9b6c5c9d-xbn5s          1/1     Running   0          83m
operatorhubio-catalog-b28vt            1/1     Running   0          83m
packageserver-5f4d84f757-q9srv         1/1     Running   0          83m
prometheus-catalog-registry-ldqgg      1/1     Running   0          3m15s
prometheus-operator-6df4755cb4-gxv7d   1/1     Running   0          2m40s
```

I was not able to reproduce your issue. Could you share your CatalogSource?

Comment 6 Alexander Greene 2019-10-16 18:08:40 UTC
(In reply to Alexander Greene from comment #5)
Note: This was on a 4.3 cluster

Comment 7 Rogerio Bastos 2019-10-16 20:28:35 UTC
Created attachment 1626604 [details]
Catalog Source yaml file

Comment 8 Rogerio Bastos 2019-10-16 20:30:28 UTC
I just added the CatalogSource yaml file as an attachment.

Could you please confirm whether you get the same error message in the output of catalog-operator?


Thanks a lot for testing

Comment 9 Alexander Greene 2019-11-07 20:43:20 UTC
@Rogerio Bastos

I apologize for the delay. This is not a bug.

For multitenancy purposes, Subscriptions can only pull from CatalogSources deployed in the same namespace, UNLESS the CatalogSource exists in a special "Global Catalog Source" namespace. You can see which namespace is configured as the global catalog namespace here: https://github.com/operator-framework/operator-lifecycle-manager/blob/master/manifests/0000_50_olm_08-catalog-operator.deployment.yaml#L28
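
(For context, the linked manifest designates the global catalog namespace through the catalog-operator's startup arguments. The excerpt below is a paraphrase from memory, not a verbatim copy of that file, so verify it against the link above.)

```
# Paraphrased excerpt from the catalog-operator Deployment (not verbatim):
# the -namespace argument selects the global catalog namespace, which is
# openshift-marketplace on OpenShift.
containers:
- name: catalog-operator
  command:
  - /bin/catalog-operator
  args:
  - '-namespace'
  - openshift-marketplace
```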

As such, when you created your CatalogSource in the `openshift-operators` namespace and the Subscription in the `prometheus` namespace, the logs you shared were generated.

Your Subscription and CatalogSource will work if you move the CatalogSource to the `openshift-marketplace` namespace OR move your Subscription into the same namespace as your CatalogSource (see the sketch below).
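
For illustration, here is a minimal sketch of the second option, with the CatalogSource and Subscription colocated in one namespace; the image reference and channel below are placeholders, not values taken from your cluster:

```
# Hypothetical sketch: CatalogSource and Subscription in the same namespace
# so the Subscription can resolve packages from this catalog.
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: prometheus-catalog-registry
  namespace: openshift-operators
spec:
  displayName: Prometheus Catalog Source
  image: quay.io/example/prometheus-catalog:latest  # placeholder image
  sourceType: grpc
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: prometheus
  namespace: openshift-operators
spec:
  channel: beta                       # placeholder; must exist in the package manifest
  name: prometheus
  source: prometheus-catalog-registry
  sourceNamespace: openshift-operators
```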

