Bug 1768819 - The configmap sourceType of the CatalogSource fails to connect to the ConfigMap
Summary: The configmap sourceType of the CatalogSource fails to connect to the ConfigMap
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: medium
Target Milestone: ---
Target Release: 4.4.0
Assignee: Evan Cordell
QA Contact: Salvatore Colangelo
URL:
Whiteboard:
Depends On:
Blocks: 1805268
 
Reported: 2019-11-05 10:12 UTC by Jian Zhang
Modified: 2020-05-04 11:15 UTC (History)
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1805268 (view as bug list)
Environment:
Last Closed: 2020-05-04 11:14:43 UTC
Target Upstream Version:


Attachments


Links
System ID Priority Status Summary Last Updated
Github operator-framework operator-lifecycle-manager pull 1268 None closed Bug 1768819: Fix configmap registry server readiness probe timeouts 2020-05-13 18:57:31 UTC
Red Hat Product Errata RHBA-2020:0581 None None None 2020-05-04 11:15:11 UTC

Description Jian Zhang 2019-11-05 10:12:59 UTC
Description of problem:
When I use the configmap sourceType of CatalogSource, it gets stuck in the `CONNECTING` status.

Version-Release number of selected component (if applicable):
Cluster version is 4.3.0-0.nightly-2019-11-02-092336 

mac:~ jianzhang$ oc -n openshift-operator-lifecycle-manager exec catalog-operator-7868f48c46-rhfnz -- olm --version
OLM version: 0.12.0
git commit: 1878972c541d52d22f87b12d14e4b76bbda1c0ea


How reproducible:
always

Steps to Reproduce:
1. Install OCP 4.3.
2. Create a namespace called "test-operators".
3. Install a ConfigMap in it.
# oc create -f https://raw.githubusercontent.com/bandrade/v3-testfiles/v4.1/olm/configmap/configmap_etcd.yaml
mac:~ jianzhang$ oc get cm
NAME                                   DATA   AGE
installed-community-global-operators   3      11m
4. Create a CatalogSource whose sourceType is "configmap", as below:
mac:~ jianzhang$ cat catalogsource.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: installed-community-global-operators
  namespace: test-operators
spec:
  configMap: installed-community-global-operators
  displayName: Community Operators
  icon:
    base64data: ""
    mediatype: ""
  publisher: Community
  sourceType: configmap

5. Check the status of this CatalogSource object.

Actual results:
It gets stuck in the `CONNECTING` status.
mac:~ jianzhang$ oc get catalogsource installed-community-global-operators -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: "2019-11-05T09:54:28Z"
  generation: 1
  name: installed-community-global-operators
  namespace: test-operators
  resourceVersion: "609506"
  selfLink: /apis/operators.coreos.com/v1alpha1/namespaces/test-operators/catalogsources/installed-community-global-operators
  uid: cd8ee412-3a10-41d6-9db6-0a7dd7d6ae33
spec:
  configMap: installed-community-global-operators
  displayName: Community Operators
  icon:
    base64data: ""
    mediatype: ""
  publisher: Community
  sourceType: configmap
status:
  configMapReference:
    lastUpdateTime: "2019-11-05T09:54:25Z"
    name: installed-community-global-operators
    namespace: test-operators
    resourceVersion: "609438"
    uid: aef65a5a-d66f-408b-b920-4166d0b43114
  connectionState:
    address: installed-community-global-operators.test-operators.svc:50051
    lastConnect: "2019-11-05T09:54:34Z"
    lastObservedState: CONNECTING
  registryService:
    createdAt: "2019-11-05T09:54:26Z"
    port: "50051"
    protocol: grpc
    serviceName: installed-community-global-operators
    serviceNamespace: test-operators


Expected results:
It should reach the READY status or report an error.

Additional info:
mac:~ jianzhang$ oc get svc
NAME                                   TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)     AGE
installed-community-global-operators   ClusterIP   172.30.86.79   <none>        50051/TCP   15m

mac:~ jianzhang$ oc get pods 
NAME                                         READY   STATUS    RESTARTS   AGE
installed-community-global-operators-bkkgv   1/1     Running   0          3m16s
mac:~ jianzhang$ oc get pods installed-community-global-operators-bkkgv -o yaml|grep image
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e7d4ba96eaf95f0d0b5da7490ca8e1d63fdf04f1978371aa0c70f305961f9946
    imagePullPolicy: IfNotPresent
  imagePullSecrets:
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e7d4ba96eaf95f0d0b5da7490ca8e1d63fdf04f1978371aa0c70f305961f9946
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e7d4ba96eaf95f0d0b5da7490ca8e1d63fdf04f1978371aa0c70f305961f9946
mac:~ jianzhang$ oc rsh installed-community-global-operators-bkkgv 
sh-4.2$ ps -elf|cat
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
4 S 1001           1       0  0  80   0 - 194242 -     09:54 ?        00:00:00 configmap-server -c installed-community-global-operators -n test-operators
4 S 1001         667       0  0  80   0 -  2959 -      09:58 pts/0    00:00:00 /bin/sh

mac:~ jianzhang$ oc logs installed-community-global-operators-bkkgv 
time="2019-11-05T09:54:40Z" level=info msg="Using in-cluster kube client config"
time="2019-11-05T09:54:40Z" level=info msg="loading CRDs" configmap=installed-community-global-operators ns=test-operators
time="2019-11-05T09:54:40Z" level=info msg="loading Bundles" configmap=installed-community-global-operators ns=test-operators
time="2019-11-05T09:54:40Z" level=info msg="loading Packages" configmap=installed-community-global-operators ns=test-operators
time="2019-11-05T09:54:40Z" level=info msg="serving registry" configMapName=installed-community-global-operators configMapNamespace=test-operators port=50051

Comment 1 Daniel Sover 2019-11-05 18:26:13 UTC
I'm seeing the configmap-type catalog source work as expected on minikube v1.16.2 with master checked out to the same commit 1878972c541d52d22f87b12d14e4b76bbda1c0ea.

When the catalog source is created from the provided configmap, the status of the catalogsource is READY and the catalog source serves traffic normally. This can be verified by port-forwarding and using the grpcurl tool.
(In one terminal session) k port-forward installed-community-global-operators-lqg2v 50051:50051
(In another terminal session) grpcurl -plaintext -d '{"name":"etcd-update"}' localhost:50051 api.Registry/GetPackage

This returns 

{
  "name": "etcd-update",
  "channels": [
    {
      "name": "alpha",
      "csvName": "etcdoperator.v0.9.2"
    }
  ],
  "defaultChannelName": "alpha"
}

which indicates the catalog source is serving traffic as expected.

This issue looks like a transient networking failure inside the CI cluster itself. However, if the reproducibility really is "always", that would suggest another problem with the cluster configuration.

Comment 2 Jian Zhang 2019-11-06 02:05:51 UTC
Daniel,

Thank you for your analysis! Yes, seems like it works well.
mac:~ jianzhang$ oc port-forward installed-community-global-operators-lxt96 50051:50051
Forwarding from 127.0.0.1:50051 -> 50051
Forwarding from [::1]:50051 -> 50051
Handling connection for 50051
...

mac:~ jianzhang$ grpcurl -plaintext -d '{"name":"etcd-update"}' localhost:50051 api.Registry/GetPackage
{
  "name": "etcd-update",
  "channels": [
    {
      "name": "alpha",
      "csvName": "etcdoperator.v0.9.2"
    }
  ],
  "defaultChannelName": "alpha"
}


I think there is potentially another problem with the cluster. Do you have any suggestions?

Comment 3 Daniel Sover 2019-11-06 19:26:47 UTC
Hi Jian,

I spun up the same CI cluster that you were using to test this out on a real cluster. 

~: k exec catalog-operator-7868f48c46-thcnn -- olm --version
OLM version: 0.12.0
git commit: 1878972c541d52d22f87b12d14e4b76bbda1c0ea

I did see the original bug: the Last Observed State: CONNECTING was present on the CatalogSource for a few minutes - it did change to READY after a minute or two.

Looking at the catalog operator pod logs, I saw the following after installing the new catalogsource:

time="2019-11-06T18:18:16Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=CONNECTING"
time="2019-11-06T18:18:19Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=TRANSIENT_FAILURE"
time="2019-11-06T18:18:20Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=CONNECTING"
time="2019-11-06T18:18:35Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=READY"

The next time I ran it I saw something similar:

time="2019-11-06T18:41:34Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=CONNECTING"
time="2019-11-06T18:41:38Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=TRANSIENT_FAILURE"
time="2019-11-06T18:41:39Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=CONNECTING"
time="2019-11-06T18:41:46Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=READY"

I think there are some initial connection failures due to the nature of the gRPC connection - as you can see, the connection does establish itself after some time.

The reason we are seeing this is that there is a delay between when the catalog source pod passes the gRPC health check we currently implement and when it has actually finished parsing the config map into its database and is ready to serve traffic. The health check passes almost instantly, so the pod reports ready before it can actually provide catalog source info (because of the additional work of parsing the config map data into the pod's DB). In catalog sources where the data is already in the image, this delay is non-existent and we basically don't run into this issue.

The issue is compounded by the fact that gRPC connections back off exponentially after a failure. There are potential ways to change this backoff behavior, however: https://godoc.org/google.golang.org/grpc#DialOption.

This also explains why we don't see this issue on minikube - since all the pods are on one node there is no real networking. In a real cluster where the gRPC connection needs to be established this issue pops up.

That being said, we are moving away from configmap-based catalog sources, and it's not entirely clear we can make this go away entirely. The recommended plan is to keep the status the same, update Daniel Messer regarding this bug, and see whether or not it's a priority.

Comment 4 Daniel Sover 2019-11-06 22:04:49 UTC
*** Bug 1749031 has been marked as a duplicate of this bug. ***

Comment 5 Jian Zhang 2019-11-07 09:33:33 UTC
Daniel,

Many thanks for your details! Yes, it becomes ready after a while. I couldn't reproduce it either. Maybe the network didn't work well at that time.

Comment 6 Daniel Sover 2019-11-07 14:52:58 UTC
After discussing with Daniel Messer and Evan we are going to look into a fix for this issue, I will keep this open and assign a medium priority for now.

Comment 7 Jeff Peeler 2019-11-20 16:30:11 UTC
Resetting this to NEW as I'm going to put this down for a while (others may feel free to steal this if bandwidth permits) to look at higher priority bugs. The transient failures resolve in every case I've seen.

I invested a bit of effort in testing this with the reproducer on Kubernetes using our dev deploy. I've also found that both GRPC and configmap-backed catalog sources exhibit the same TRANSIENT_FAILURE behavior when starting up the connection. I added code to specifically check that the catalog source pod status was reporting ContainersReady before doing the grpc.Dial, which fixed the GRPC catalog sources but not the configmap-based ones.

I suspect that the readiness probe is not 100% accurate, and that's why configmap pods still do not succeed immediately, but that's just a guess. Given that the gRPC server is not started until after the configmap is read and the database is processed, the pod status check should, in theory, have worked reliably.
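For illustration, the pre-dial gate described above boils down to inspecting the pod's ContainersReady condition before dialing. A minimal Go sketch of that check - using hypothetical stand-in structs rather than the real k8s.io/api/core/v1 types and client-go plumbing the operator would actually use:

```go
package main

import "fmt"

// Stand-ins for the corev1 types involved (assumption: the real code
// would use corev1.Pod and a client-go lister to fetch the status).
type PodCondition struct {
	Type   string
	Status string
}

type PodStatus struct {
	Conditions []PodCondition
}

// containersReady reports whether the pod's ContainersReady condition
// is True -- the gate checked before attempting grpc.Dial.
func containersReady(status PodStatus) bool {
	for _, c := range status.Conditions {
		if c.Type == "ContainersReady" {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	// A registry pod still loading its ConfigMap into the database:
	// the probe has not passed yet, so we hold off on dialing.
	registryPod := PodStatus{Conditions: []PodCondition{
		{Type: "Initialized", Status: "True"},
		{Type: "ContainersReady", Status: "False"},
	}}
	if !containersReady(registryPod) {
		fmt.Println("registry pod not ready; delaying grpc.Dial")
	}
}
```

As the comment notes, this gate helped the GRPC-backed sources but not the configmap-backed ones, which is consistent with the probe reporting ready before the registry can actually serve.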

Comment 12 Salvatore Colangelo 2020-03-03 12:18:49 UTC
Hi below the steps:

[scolange@scolange ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-03-052929   True        False         123m    Cluster version is 4.3.0-0.nightly-2020-03-03-052929

[scolange@scolange ~]$ oc -n openshift-operator-lifecycle-manager exec catalog-operator-7945dfd867-kjgdg -- olm --version
OLM version: 0.13.0
git commit: 93f346148ab797b3d68616d4f4bddf1286fe3fd8

Steps to Reproduce:
1. Install OCP 4.3.

2. Create a namespace called "test-operators".

[scolange@scolange ~]$ oc create ns test-operators
namespace/test-operators created

3. Install a ConfigMap in it.
# oc create -f https://raw.githubusercontent.com/bandrade/v3-testfiles/v4.1/olm/configmap/configmap_etcd.yaml

[scolange@scolange ~]$ oc create -f https://raw.githubusercontent.com/bandrade/v3-testfiles/v4.1/olm/configmap/configmap_etcd.yaml
configmap/installed-community-global-operators created

[scolange@scolange ~]$ oc get cm
NAME                                   DATA   AGE
installed-community-global-operators   3      109s

4. Create a CatalogSource whose sourceType is "configmap", as below:

[scolange@scolange 1768819]$ cat catalogSource.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: installed-community-global-operators
  namespace: test-operators
spec:
  configMap: installed-community-global-operators
  displayName: Community Operators
  icon:
    base64data: ""
    mediatype: ""
  publisher: Community
  sourceType: configmap

[scolange@scolange 1768819]$ oc create -f catalogSource.yaml
catalogsource.operators.coreos.com/installed-community-global-operators created

[scolange@scolange 1768819]$ oc get catalogsource installed-community-global-operators -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: "2020-03-03T12:11:43Z"
  generation: 1
  name: installed-community-global-operators
  namespace: test-operators
  resourceVersion: "59234"
  selfLink: /apis/operators.coreos.com/v1alpha1/namespaces/test-operators/catalogsources/installed-community-global-operators
  uid: 63a244b3-3c69-4fbf-b6c9-001930a51d6c
spec:
  configMap: installed-community-global-operators
  displayName: Community Operators
  icon:
    base64data: ""
    mediatype: ""
  publisher: Community
  sourceType: configmap
status:
  configMapReference:
    lastUpdateTime: "2020-03-03T12:11:43Z"
    name: installed-community-global-operators
    namespace: test-operators
    resourceVersion: "59129"
    uid: 97fdb749-b2ba-4f99-bc15-f96213ec35c5
  connectionState:
    address: installed-community-global-operators.test-operators.svc:50051
    lastConnect: "2020-03-03T12:11:55Z"
    lastObservedState: CONNECTING
  registryService:
    createdAt: "2020-03-03T12:11:44Z"
    port: "50051"
    protocol: grpc
    serviceName: installed-community-global-operators
    serviceNamespace: test-operators

After a few minutes, it became READY:

[scolange@scolange 1768819]$ oc get catalogsource installed-community-global-operators -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: "2020-03-03T12:11:43Z"
  generation: 1
  name: installed-community-global-operators
  namespace: test-operators
  resourceVersion: "60077"
  selfLink: /apis/operators.coreos.com/v1alpha1/namespaces/test-operators/catalogsources/installed-community-global-operators
  uid: 63a244b3-3c69-4fbf-b6c9-001930a51d6c
spec:
  configMap: installed-community-global-operators
  displayName: Community Operators
  icon:
    base64data: ""
    mediatype: ""
  publisher: Community
  sourceType: configmap
status:
  configMapReference:
    lastUpdateTime: "2020-03-03T12:11:43Z"
    name: installed-community-global-operators
    namespace: test-operators
    resourceVersion: "59129"
    uid: 97fdb749-b2ba-4f99-bc15-f96213ec35c5
  connectionState:
    address: installed-community-global-operators.test-operators.svc:50051
    lastConnect: "2020-03-03T12:15:08Z"
    lastObservedState: READY
  registryService:
    createdAt: "2020-03-03T12:11:44Z"
    port: "50051"
    protocol: grpc
    serviceName: installed-community-global-operators
    serviceNamespace: test-operators

LGTM.

Additional info:

[scolange@scolange 1768819]$ oc get pods
NAME                                         READY   STATUS    RESTARTS   AGE
installed-community-global-operators-98wlm   1/1     Running   0          92s

[scolange@scolange 1768819]$ oc log installed-community-global-operators-98wlm
log is DEPRECATED and will be removed in a future version. Use logs instead.
time="2020-03-03T12:11:55Z" level=info msg="Using in-cluster kube client config"
time="2020-03-03T12:11:55Z" level=info msg="loading CRDs" configmap=installed-community-global-operators ns=test-operators
time="2020-03-03T12:11:55Z" level=info msg="loading Bundles" configmap=installed-community-global-operators ns=test-operators
time="2020-03-03T12:11:55Z" level=info msg="loading Packages" configmap=installed-community-global-operators ns=test-operators
time="2020-03-03T12:11:55Z" level=info msg="serving registry" configMapName=installed-community-global-operators configMapNamespace=test-operators port=50051

[scolange@scolange 1768819]$ oc rsh installed-community-global-operators-98wlm
sh-4.2$ ps -elf|cat
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
4 S 1001           1       0  0  80   0 - 157339 -     12:11 ?        00:00:00 configmap-server -c installed-community-global-operators -n test-operators
4 S 1001         380       0  0  80   0 -  2959 -      12:13 pts/0    00:00:00 /bin/sh
4 R 1001         416     380  0  80   0 - 12941 -      12:14 pts/0    00:00:00 ps -elf
4 S 1001         417     380  0  80   0 -  1099 -      12:14 pts/0    00:00:00 cat
5 R root         418       0  0  80   0 - 140643 -     12:14 ?        00:00:00 /usr/bin/runc init

[scolange@scolange 1768819]$ oc get pods installed-community-global-operators-98wlm -o yaml|grep image
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:68763083b100b5d82739dedb08d9c154cd3c8ee0435ed4fad5674cb9b013aec9
    imagePullPolicy: IfNotPresent
  imagePullSecrets:
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:68763083b100b5d82739dedb08d9c154cd3c8ee0435ed4fad5674cb9b013aec9
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:68763083b100b5d82739dedb08d9c154cd3c8ee0435ed4fad5674cb9b013aec9

[scolange@scolange 1768819]$ oc get svc
NAME                                   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
installed-community-global-operators   ClusterIP   172.30.226.32   <none>        50051/TCP   3m23s

Comment 15 errata-xmlrpc 2020-05-04 11:14:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

