Description of problem:

When I use the configmap sourceType of CatalogSource, it gets stuck in the `CONNECTING` status.

Version-Release number of selected component (if applicable):

Cluster version is 4.3.0-0.nightly-2019-11-02-092336

mac:~ jianzhang$ oc -n openshift-operator-lifecycle-manager exec catalog-operator-7868f48c46-rhfnz -- olm --version
OLM version: 0.12.0
git commit: 1878972c541d52d22f87b12d14e4b76bbda1c0ea

How reproducible:

Always

Steps to Reproduce:

1. Install OCP 4.3.
2. Create a namespace called "test-operators".
3. Install a ConfigMap in it:

# oc create -f https://raw.githubusercontent.com/bandrade/v3-testfiles/v4.1/olm/configmap/configmap_etcd.yaml

mac:~ jianzhang$ oc get cm
NAME                                   DATA   AGE
installed-community-global-operators   3      11m

4. Create a CatalogSource whose sourceType is "configmap", as below:

mac:~ jianzhang$ cat catalogsource.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: installed-community-global-operators
  namespace: test-operators
spec:
  configMap: installed-community-global-operators
  displayName: Community Operators
  icon:
    base64data: ""
    mediatype: ""
  publisher: Community
  sourceType: configmap

5. Check the CatalogSource object's status.

Actual results:

It gets stuck in the `CONNECTING` status.
mac:~ jianzhang$ oc get catalogsource installed-community-global-operators -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: "2019-11-05T09:54:28Z"
  generation: 1
  name: installed-community-global-operators
  namespace: test-operators
  resourceVersion: "609506"
  selfLink: /apis/operators.coreos.com/v1alpha1/namespaces/test-operators/catalogsources/installed-community-global-operators
  uid: cd8ee412-3a10-41d6-9db6-0a7dd7d6ae33
spec:
  configMap: installed-community-global-operators
  displayName: Community Operators
  icon:
    base64data: ""
    mediatype: ""
  publisher: Community
  sourceType: configmap
status:
  configMapReference:
    lastUpdateTime: "2019-11-05T09:54:25Z"
    name: installed-community-global-operators
    namespace: test-operators
    resourceVersion: "609438"
    uid: aef65a5a-d66f-408b-b920-4166d0b43114
  connectionState:
    address: installed-community-global-operators.test-operators.svc:50051
    lastConnect: "2019-11-05T09:54:34Z"
    lastObservedState: CONNECTING
  registryService:
    createdAt: "2019-11-05T09:54:26Z"
    port: "50051"
    protocol: grpc
    serviceName: installed-community-global-operators
    serviceNamespace: test-operators

Expected results:

It should reach the READY status or report an error.
Additional info:

mac:~ jianzhang$ oc get svc
NAME                                   TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)     AGE
installed-community-global-operators   ClusterIP   172.30.86.79   <none>        50051/TCP   15m

mac:~ jianzhang$ oc get pods
NAME                                         READY   STATUS    RESTARTS   AGE
installed-community-global-operators-bkkgv   1/1     Running   0          3m16s

mac:~ jianzhang$ oc get pods installed-community-global-operators-bkkgv -o yaml|grep image
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e7d4ba96eaf95f0d0b5da7490ca8e1d63fdf04f1978371aa0c70f305961f9946
    imagePullPolicy: IfNotPresent
  imagePullSecrets:
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e7d4ba96eaf95f0d0b5da7490ca8e1d63fdf04f1978371aa0c70f305961f9946
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e7d4ba96eaf95f0d0b5da7490ca8e1d63fdf04f1978371aa0c70f305961f9946

mac:~ jianzhang$ oc rsh installed-community-global-operators-bkkgv
sh-4.2$ ps -elf|cat
F S UID   PID  PPID  C PRI  NI ADDR     SZ WCHAN  STIME TTY    TIME     CMD
4 S 1001    1     0  0  80   0 - 194242 -      09:54 ?      00:00:00 configmap-server -c installed-community-global-operators -n test-operators
4 S 1001  667     0  0  80   0 -   2959 -      09:58 pts/0  00:00:00 /bin/sh

mac:~ jianzhang$ oc logs installed-community-global-operators-bkkgv
time="2019-11-05T09:54:40Z" level=info msg="Using in-cluster kube client config"
time="2019-11-05T09:54:40Z" level=info msg="loading CRDs" configmap=installed-community-global-operators ns=test-operators
time="2019-11-05T09:54:40Z" level=info msg="loading Bundles" configmap=installed-community-global-operators ns=test-operators
time="2019-11-05T09:54:40Z" level=info msg="loading Packages" configmap=installed-community-global-operators ns=test-operators
time="2019-11-05T09:54:40Z" level=info msg="serving registry" configMapName=installed-community-global-operators configMapNamespace=test-operators port=50051
I'm seeing the configmap-type catalog source work as expected on minikube v1.16.2 with master checked out at the same commit (1878972c541d52d22f87b12d14e4b76bbda1c0ea). When the catalog source is created from the provided ConfigMap, the status of the CatalogSource is READY and the catalog source serves traffic normally. This can be verified by port-forwarding and using the grpcurl tool.

(In one terminal session)
k port-forward installed-community-global-operators-lqg2v 50051:50051

(In another terminal session)
grpcurl -plaintext -d '{"name":"etcd-update"}' localhost:50051 api.Registry/GetPackage

This returns:

{
  "name": "etcd-update",
  "channels": [
    {
      "name": "alpha",
      "csvName": "etcdoperator.v0.9.2"
    }
  ],
  "defaultChannelName": "alpha"
}

which indicates the catalog source is serving traffic as expected. This issue looks to be a transient networking failure inside the CI cluster itself. However, if the reproducibility is really "always", that would suggest another problem with the cluster configuration.
Daniel,

Thank you for your analysis! Yes, it seems to work well.

mac:~ jianzhang$ oc port-forward installed-community-global-operators-lxt96 50051:50051
Forwarding from 127.0.0.1:50051 -> 50051
Forwarding from [::1]:50051 -> 50051
Handling connection for 50051
...

mac:~ jianzhang$ grpcurl -plaintext -d '{"name":"etcd-update"}' localhost:50051 api.Registry/GetPackage
{
  "name": "etcd-update",
  "channels": [
    {
      "name": "alpha",
      "csvName": "etcdoperator.v0.9.2"
    }
  ],
  "defaultChannelName": "alpha"
}

I think there is potentially another problem with the cluster. Do you have any suggestions?
Hi Jian,

I spun up the same CI cluster that you were using, to test this out on a real cluster.

~: k exec catalog-operator-7868f48c46-thcnn -- olm --version
OLM version: 0.12.0
git commit: 1878972c541d52d22f87b12d14e4b76bbda1c0ea

I did see the original bug, where Last Observed State: CONNECTING was present on the CatalogSource for a few minutes, but it did change to READY after a minute or two. Looking at the catalog operator pod logs, I saw the following after installing the new CatalogSource:

time="2019-11-06T18:18:16Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=CONNECTING"
time="2019-11-06T18:18:19Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=TRANSIENT_FAILURE"
time="2019-11-06T18:18:20Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=CONNECTING"
time="2019-11-06T18:18:35Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=READY"

The next time I ran it I saw something similar:

time="2019-11-06T18:41:34Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=CONNECTING"
time="2019-11-06T18:41:38Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=TRANSIENT_FAILURE"
time="2019-11-06T18:41:39Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=CONNECTING"
time="2019-11-06T18:41:46Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=READY"

I think there are some initial connection failures due to the nature of the gRPC connection; as you can see, the connection does establish itself after some time.
The reason we are seeing this is a delay between when the catalog source pod has parsed the ConfigMap into its database and is ready to serve traffic, versus when the pod passes the gRPC health check we currently implement. The health check passes almost instantly, but there is a window in which the pod reports ready yet cannot actually provide catalog source info, because of the additional work of parsing the ConfigMap data into the pod's database. For catalog sources where the data is already baked into the image, this delay is non-existent, so we essentially never hit this issue there.

The issue is compounded by the fact that gRPC connections back off exponentially after a failure. There are ways to tune this backoff behavior, however: https://godoc.org/google.golang.org/grpc#DialOption.

This also explains why we don't see the issue on minikube: since all the pods are on one node, there is no real networking involved. In a real cluster, where the gRPC connection needs to be established across the network, the issue shows up.

That said, we are moving away from ConfigMap-based catalog sources, and it's not entirely clear we can make this go away entirely. The recommended plan is to keep the status reporting as-is, update Daniel Messer regarding this bug, and see whether or not it's a priority.
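For context on why the backoff matters here: gRPC's connection-backoff spec defaults to a 1-second initial delay multiplied by 1.6 per consecutive failure, capped at 120s (plus jitter). A minimal Go sketch of that schedule, assuming those documented defaults and ignoring jitter — this is illustrative arithmetic, not OLM code:

```go
package main

import "fmt"

// backoffSchedule returns the first n reconnect delays (in seconds)
// produced by gRPC's documented default connection backoff:
// 1s base, 1.6 multiplier, 120s cap; jitter is ignored for simplicity.
func backoffSchedule(n int) []float64 {
	const base, multiplier, maxDelay = 1.0, 1.6, 120.0
	delays := make([]float64, 0, n)
	delay := base
	for i := 0; i < n; i++ {
		delays = append(delays, delay)
		delay *= multiplier
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return delays
}

func main() {
	total := 0.0
	for i, d := range backoffSchedule(6) {
		total += d
		fmt.Printf("retry %d after %.2fs (cumulative %.2fs)\n", i+1, d, total)
	}
}
```

The point is that even two or three early dial failures, while the registry pod is still loading the ConfigMap, push the next attempt out by several seconds, which is consistent with CONNECTING lingering for tens of seconds in the operator logs above.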
*** Bug 1749031 has been marked as a duplicate of this bug. ***
Daniel, Many thanks for your details! Yes, it becomes ready after a while. I couldn't reproduce it either. Maybe the network didn't work well at that time.
After discussing with Daniel Messer and Evan, we are going to look into a fix for this issue. I will keep this open and assign a medium priority for now.
Resetting this to NEW, as I'm going to put this down for a while to look at higher-priority bugs (others may feel free to steal this if bandwidth permits). The transient failures resolve in every case I've seen.

I invested a bit of effort in testing this with the reproducer on Kubernetes using our dev deploy. I've found that both gRPC- and configmap-backed catalog sources exhibit the same TRANSIENT_FAILURE behavior when starting up the connection. I added code to specifically check that the catalog source pod status was reporting ContainersReady before doing the grpc.Dial, which fixed the gRPC catalog sources but not the configmap-based ones. I suspect that the readiness probe is not 100% accurate, and that's why configmap pods still do not succeed immediately, but that's just a guess. Given that the gRPC server is not started until after the ConfigMap is read and the database is processed, the pod status check should in theory have made this work reliably.
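The pod-status gate described here could look roughly like the following. This is a minimal sketch using hypothetical plain structs in place of the real k8s.io/api/core/v1 types; it is not the actual patch:

```go
package main

import "fmt"

// PodCondition mirrors the shape of corev1.PodCondition for illustration only.
type PodCondition struct {
	Type   string
	Status string
}

// containersReady reports whether the pod's ContainersReady condition is True.
// The idea is to run this gate before calling grpc.Dial on the catalog
// source's service address, so we don't dial a pod that cannot serve yet.
func containersReady(conds []PodCondition) bool {
	for _, c := range conds {
		if c.Type == "ContainersReady" {
			return c.Status == "True"
		}
	}
	// A pod with no ContainersReady condition is treated as not ready.
	return false
}

func main() {
	conds := []PodCondition{
		{Type: "Initialized", Status: "True"},
		{Type: "ContainersReady", Status: "False"},
	}
	fmt.Println(containersReady(conds))
}
```

In the real operator this check would run against the registry pod's status.conditions; per the comment above, it helped for gRPC-backed sources but not for configmap-backed ones, presumably because the readiness probe passes before the ConfigMap has been parsed into the database.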
Hi, below are the steps:

[scolange@scolange ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-03-052929   True        False         123m

Cluster version is 4.3.0-0.nightly-2020-03-03-052929

[scolange@scolange ~]$ oc -n openshift-operator-lifecycle-manager exec catalog-operator-7945dfd867-kjgdg -- olm --version
OLM version: 0.13.0
git commit: 93f346148ab797b3d68616d4f4bddf1286fe3fd8

Steps to Reproduce:

1. Install OCP 4.3.
2. Create a namespace called "test-operators".

[scolange@scolange ~]$ oc create ns test-operators
namespace/test-operators created

3. Install a ConfigMap in it:

[scolange@scolange ~]$ oc create -f https://raw.githubusercontent.com/bandrade/v3-testfiles/v4.1/olm/configmap/configmap_etcd.yaml
configmap/installed-community-global-operators created

[scolange@scolange ~]$ oc get cm
NAME                                   DATA   AGE
installed-community-global-operators   3      109s

4. Create a CatalogSource whose sourceType is "configmap", as below:

[scolange@scolange 1768819]$ cat catalogSource.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: installed-community-global-operators
  namespace: test-operators
spec:
  configMap: installed-community-global-operators
  displayName: Community Operators
  icon:
    base64data: ""
    mediatype: ""
  publisher: Community
  sourceType: configmap

[scolange@scolange 1768819]$ oc create -f catalogSource.yaml
catalogsource.operators.coreos.com/installed-community-global-operators created

[scolange@scolange 1768819]$ oc get catalogsource installed-community-global-operators -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: "2020-03-03T12:11:43Z"
  generation: 1
  name: installed-community-global-operators
  namespace: test-operators
  resourceVersion: "59234"
  selfLink: /apis/operators.coreos.com/v1alpha1/namespaces/test-operators/catalogsources/installed-community-global-operators
  uid: 63a244b3-3c69-4fbf-b6c9-001930a51d6c
spec:
  configMap: installed-community-global-operators
  displayName: Community Operators
  icon:
    base64data: ""
    mediatype: ""
  publisher: Community
  sourceType: configmap
status:
  configMapReference:
    lastUpdateTime: "2020-03-03T12:11:43Z"
    name: installed-community-global-operators
    namespace: test-operators
    resourceVersion: "59129"
    uid: 97fdb749-b2ba-4f99-bc15-f96213ec35c5
  connectionState:
    address: installed-community-global-operators.test-operators.svc:50051
    lastConnect: "2020-03-03T12:11:55Z"
    lastObservedState: CONNECTING
  registryService:
    createdAt: "2020-03-03T12:11:44Z"
    port: "50051"
    protocol: grpc
    serviceName: installed-community-global-operators
    serviceNamespace: test-operators

After 5 minutes it became READY:

[scolange@scolange 1768819]$ oc get catalogsource installed-community-global-operators -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: "2020-03-03T12:11:43Z"
  generation: 1
  name: installed-community-global-operators
  namespace: test-operators
  resourceVersion: "60077"
  selfLink: /apis/operators.coreos.com/v1alpha1/namespaces/test-operators/catalogsources/installed-community-global-operators
  uid: 63a244b3-3c69-4fbf-b6c9-001930a51d6c
spec:
  configMap: installed-community-global-operators
  displayName: Community Operators
  icon:
    base64data: ""
    mediatype: ""
  publisher: Community
  sourceType: configmap
status:
  configMapReference:
    lastUpdateTime: "2020-03-03T12:11:43Z"
    name: installed-community-global-operators
    namespace: test-operators
    resourceVersion: "59129"
    uid: 97fdb749-b2ba-4f99-bc15-f96213ec35c5
  connectionState:
    address: installed-community-global-operators.test-operators.svc:50051
    lastConnect: "2020-03-03T12:15:08Z"
    lastObservedState: READY
  registryService:
    createdAt: "2020-03-03T12:11:44Z"
    port: "50051"
    protocol: grpc
    serviceName: installed-community-global-operators
    serviceNamespace: test-operators

LGTM.

Additional info:

[scolange@scolange 1768819]$ oc get pods
NAME                                         READY   STATUS    RESTARTS   AGE
installed-community-global-operators-98wlm   1/1     Running   0          92s

[scolange@scolange 1768819]$ oc log installed-community-global-operators-98wlm
log is DEPRECATED and will be removed in a future version. Use logs instead.
time="2020-03-03T12:11:55Z" level=info msg="Using in-cluster kube client config"
time="2020-03-03T12:11:55Z" level=info msg="loading CRDs" configmap=installed-community-global-operators ns=test-operators
time="2020-03-03T12:11:55Z" level=info msg="loading Bundles" configmap=installed-community-global-operators ns=test-operators
time="2020-03-03T12:11:55Z" level=info msg="loading Packages" configmap=installed-community-global-operators ns=test-operators
time="2020-03-03T12:11:55Z" level=info msg="serving registry" configMapName=installed-community-global-operators configMapNamespace=test-operators port=50051

[scolange@scolange 1768819]$ oc rsh installed-community-global-operators-98wlm
sh-4.2$ ps -elf|cat
F S UID   PID  PPID  C PRI  NI ADDR     SZ WCHAN  STIME TTY    TIME     CMD
4 S 1001    1     0  0  80   0 - 157339 -      12:11 ?      00:00:00 configmap-server -c installed-community-global-operators -n test-operators
4 S 1001  380     0  0  80   0 -   2959 -      12:13 pts/0  00:00:00 /bin/sh
4 R 1001  416   380  0  80   0 -  12941 -      12:14 pts/0  00:00:00 ps -elf
4 S 1001  417   380  0  80   0 -   1099 -      12:14 pts/0  00:00:00 cat
5 R root  418     0  0  80   0 - 140643 -      12:14 ?      00:00:00 /usr/bin/runc init

[scolange@scolange 1768819]$ oc get pods installed-community-global-operators-98wlm -o yaml|grep image
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:68763083b100b5d82739dedb08d9c154cd3c8ee0435ed4fad5674cb9b013aec9
    imagePullPolicy: IfNotPresent
  imagePullSecrets:
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:68763083b100b5d82739dedb08d9c154cd3c8ee0435ed4fad5674cb9b013aec9
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:68763083b100b5d82739dedb08d9c154cd3c8ee0435ed4fad5674cb9b013aec9

[scolange@scolange 1768819]$ oc get svc
NAME                                   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
installed-community-global-operators   ClusterIP   172.30.226.32   <none>        50051/TCP   3m23s
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581