Bug 1768819
| Summary: | The configmap sourceType of the CatalogSource failed to connect the ConfigMap | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jian Zhang <jiazha> |
| Component: | OLM | Assignee: | Evan Cordell <ecordell> |
| OLM sub component: | OLM | QA Contact: | Salvatore Colangelo <scolange> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | high | CC: | bandrade, dsover, ecordell, jfan, nhale, scolange, tbuskey |
| Version: | 4.3.0 | Keywords: | Reopened |
| Target Milestone: | --- | | |
| Target Release: | 4.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1805268 (view as bug list) | Environment: | |
| Last Closed: | 2020-05-04 11:14:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1805268 | | |
Description
Jian Zhang
2019-11-05 10:12:59 UTC
I'm seeing the configmap-type catalog source work as expected on minikube v1.16.2 with master checked out to the same commit, 1878972c541d52d22f87b12d14e4b76bbda1c0ea.
When the CatalogSource is created from the provided ConfigMap, its status is READY and it serves traffic normally. This can be verified by port-forwarding and using the grpcurl tool:
```
# In one terminal session
k port-forward installed-community-global-operators-lqg2v 50051:50051

# In another terminal session
grpcurl -plaintext -d '{"name":"etcd-update"}' localhost:50051 api.Registry/GetPackage
```

This returns:

```json
{
  "name": "etcd-update",
  "channels": [
    {
      "name": "alpha",
      "csvName": "etcdoperator.v0.9.2"
    }
  ],
  "defaultChannelName": "alpha"
}
```
This indicates the catalog source is serving traffic as expected.
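For reference, the same query can be made programmatically. This is a minimal sketch, assuming the generated gRPC client in operator-registry's pkg/api package and the port-forward above still running on localhost:50051; it is not part of the original reproduction steps:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"

	"github.com/operator-framework/operator-registry/pkg/api"
)

func main() {
	// Assumes `k port-forward <registry-pod> 50051:50051` is already running.
	conn, err := grpc.Dial("localhost:50051", grpc.WithInsecure())
	if err != nil {
		log.Fatalf("dial registry: %v", err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Same call grpcurl makes: api.Registry/GetPackage for the "etcd-update" package.
	pkg, err := api.NewRegistryClient(conn).GetPackage(ctx, &api.GetPackageRequest{Name: "etcd-update"})
	if err != nil {
		log.Fatalf("GetPackage: %v", err)
	}
	fmt.Printf("package %s, default channel %s\n", pkg.GetName(), pkg.GetDefaultChannelName())
}
```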
This looks like a transient networking failure inside the CI cluster itself. However, if it reproduces every time, that would suggest a different problem with the cluster configuration.
Daniel,
Thank you for your analysis! Yes, it seems to work well.
```
mac:~ jianzhang$ oc port-forward installed-community-global-operators-lxt96 50051:50051
Forwarding from 127.0.0.1:50051 -> 50051
Forwarding from [::1]:50051 -> 50051
Handling connection for 50051
...

mac:~ jianzhang$ grpcurl -plaintext -d '{"name":"etcd-update"}' localhost:50051 api.Registry/GetPackage
{
  "name": "etcd-update",
  "channels": [
    {
      "name": "alpha",
      "csvName": "etcdoperator.v0.9.2"
    }
  ],
  "defaultChannelName": "alpha"
}
```
I think there is potentially another problem with the cluster. Do you have any suggestions?
Hi Jian, I spun up the same CI cluster that you were using and tested this on a real cluster.

```
~: k exec catalog-operator-7868f48c46-thcnn -- olm --version
OLM version: 0.12.0
git commit: 1878972c541d52d22f87b12d14e4b76bbda1c0ea
```

I did see the original bug: the Last Observed State: CONNECTING was present on the CatalogSource for a few minutes, but it did change to READY after a minute or two. Looking at the catalog operator pod logs, I saw the following after installing the new CatalogSource:

```
time="2019-11-06T18:18:16Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=CONNECTING"
time="2019-11-06T18:18:19Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=TRANSIENT_FAILURE"
time="2019-11-06T18:18:20Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=CONNECTING"
time="2019-11-06T18:18:35Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=READY"
```

The next time I ran it I saw something similar:

```
time="2019-11-06T18:41:34Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=CONNECTING"
time="2019-11-06T18:41:38Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=TRANSIENT_FAILURE"
time="2019-11-06T18:41:39Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=CONNECTING"
time="2019-11-06T18:41:46Z" level=info msg="state.Key.Namespace=test-operators state.Key.Name=installed-community-global-operators state.State=READY"
```

I think there are some initial connection failures due to the nature of the gRPC connection; as you can see, the connection does establish itself after some time. The reason we see this is that there is a delay between when the catalog source pod has parsed the ConfigMap into its database and is ready to serve traffic, versus when the pod passes the gRPC health check we currently implement. The health check passes almost immediately, so there is a window where the pod reports ready but cannot yet provide catalog source info (because of the additional work of parsing the ConfigMap data into the pod's database). For catalog sources whose data is already in the image, this delay does not exist, so we essentially never hit this issue there. The problem is compounded by the fact that gRPC connections back off exponentially after a failure. There are ways to change this backoff behavior, however: https://godoc.org/google.golang.org/grpc#DialOption (see the sketch after this comment).

This also explains why we don't see the issue on minikube: since all the pods run on one node, there is no real networking involved. In a real cluster, where the gRPC connection has to be established over the network, the issue shows up.

That said, we are moving away from ConfigMap-based catalog sources, and it's not entirely clear we can make this go away completely. The recommended plan is to keep the status the same, update Daniel Messer regarding this bug, and see whether or not it's a priority.

*** Bug 1749031 has been marked as a duplicate of this bug. ***

Daniel, many thanks for the details! Yes, it becomes READY after a while. I couldn't reproduce it either; maybe the network wasn't working well at that time.
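To make the DialOption reference above concrete, here is a minimal sketch of how the default gRPC exponential backoff could be capped when dialing a catalog source. The function name, address handling, and specific timings are illustrative assumptions, not the actual OLM settings:

```go
package olmdial

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
)

// dialCatalogSource dials the registry Service with a capped retry backoff,
// so that a pod that only becomes ready a few seconds after its Service is
// created gets retried more aggressively than the default (which can back
// off for up to 120s between attempts).
func dialCatalogSource(address string) (*grpc.ClientConn, error) {
	return grpc.Dial(address,
		grpc.WithInsecure(), // in-cluster registry traffic is plaintext gRPC
		grpc.WithConnectParams(grpc.ConnectParams{
			Backoff: backoff.Config{
				BaseDelay:  1 * time.Second,  // delay before the first retry
				Multiplier: 1.6,              // growth factor between retries
				Jitter:     0.2,              // randomization applied to each delay
				MaxDelay:   15 * time.Second, // cap the backoff well below the default
			},
			MinConnectTimeout: 5 * time.Second,
		}),
	)
}
```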
After discussing with Daniel Messer and Evan, we are going to look into a fix for this issue. I will keep this open and assign a medium priority for now.

Resetting this to NEW as I'm going to put this down for a while (others may feel free to pick it up if bandwidth permits) to look at higher priority bugs. The transient failures resolve in every case I've seen.

I invested a bit of effort in testing this with the reproducer on Kubernetes using our dev deploy. I found that both gRPC- and ConfigMap-backed catalog sources exhibit the same TRANSIENT_FAILURE behavior when the connection is starting up. I added code to specifically check that the catalog source pod status was reporting ContainersReady before doing the grpc.Dial, which fixed the gRPC catalog sources but not the ConfigMap-based ones. I suspect the readiness probe is not 100% accurate, and that is why ConfigMap pods still do not succeed immediately, but that is just a guess. Given that the gRPC server is not started until after the ConfigMap is read and the database is processed, the pod status check should, in theory, have made this work reliably.
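For illustration only, a minimal sketch of the kind of guard described above, using client-go types; this is not the actual OLM change, and the helper name is hypothetical:

```go
package olmcheck

import (
	corev1 "k8s.io/api/core/v1"
)

// podContainersReady reports whether the catalog source pod has the
// ContainersReady condition set to True, i.e. whether it is worth
// attempting the grpc.Dial yet. Hypothetical helper, not the OLM code.
func podContainersReady(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.ContainersReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}
```

As noted above, even with this kind of check in place the ConfigMap-backed pods can still fail the first dial, because the registry process only starts serving after it has loaded the ConfigMap into its database.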
Hi, below are the steps:

```
[scolange@scolange ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-03-052929   True        False         123m    Cluster version is 4.3.0-0.nightly-2020-03-03-052929

[scolange@scolange ~]$ oc -n openshift-operator-lifecycle-manager exec catalog-operator-7945dfd867-kjgdg -- olm --version
OLM version: 0.13.0
git commit: 93f346148ab797b3d68616d4f4bddf1286fe3fd8
```

Steps to Reproduce:

1. Install OCP 4.3.

2. Create a namespace called "test-operators".

```
[scolange@scolange ~]$ oc create ns test-operators
namespace/test-operators created
```

3. Install a ConfigMap in it.

```
[scolange@scolange ~]$ oc create -f https://raw.githubusercontent.com/bandrade/v3-testfiles/v4.1/olm/configmap/configmap_etcd.yaml
configmap/installed-community-global-operators created

[scolange@scolange ~]$ oc get cm
NAME                                   DATA   AGE
installed-community-global-operators   3      109s
```

4. Create a CatalogSource whose sourceType is "configmap", as below:

```
[scolange@scolange 1768819]$ cat catalogSource.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: installed-community-global-operators
  namespace: test-operators
spec:
  configMap: installed-community-global-operators
  displayName: Community Operators
  icon:
    base64data: ""
    mediatype: ""
  publisher: Community
  sourceType: configmap

[scolange@scolange 1768819]$ oc create -f catalogSource.yaml
catalogsource.operators.coreos.com/installed-community-global-operators created

[scolange@scolange 1768819]$ oc get catalogsource installed-community-global-operators -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: "2020-03-03T12:11:43Z"
  generation: 1
  name: installed-community-global-operators
  namespace: test-operators
  resourceVersion: "59234"
  selfLink: /apis/operators.coreos.com/v1alpha1/namespaces/test-operators/catalogsources/installed-community-global-operators
  uid: 63a244b3-3c69-4fbf-b6c9-001930a51d6c
spec:
  configMap: installed-community-global-operators
  displayName: Community Operators
  icon:
    base64data: ""
    mediatype: ""
  publisher: Community
  sourceType: configmap
status:
  configMapReference:
    lastUpdateTime: "2020-03-03T12:11:43Z"
    name: installed-community-global-operators
    namespace: test-operators
    resourceVersion: "59129"
    uid: 97fdb749-b2ba-4f99-bc15-f96213ec35c5
  connectionState:
    address: installed-community-global-operators.test-operators.svc:50051
    lastConnect: "2020-03-03T12:11:55Z"
    lastObservedState: CONNECTING
  registryService:
    createdAt: "2020-03-03T12:11:44Z"
    port: "50051"
    protocol: grpc
    serviceName: installed-community-global-operators
    serviceNamespace: test-operators
```

After 5 minutes it became READY:

```
[scolange@scolange 1768819]$ oc get catalogsource installed-community-global-operators -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: "2020-03-03T12:11:43Z"
  generation: 1
  name: installed-community-global-operators
  namespace: test-operators
  resourceVersion: "60077"
  selfLink: /apis/operators.coreos.com/v1alpha1/namespaces/test-operators/catalogsources/installed-community-global-operators
  uid: 63a244b3-3c69-4fbf-b6c9-001930a51d6c
spec:
  configMap: installed-community-global-operators
  displayName: Community Operators
  icon:
    base64data: ""
    mediatype: ""
  publisher: Community
  sourceType: configmap
status:
  configMapReference:
    lastUpdateTime: "2020-03-03T12:11:43Z"
    name: installed-community-global-operators
    namespace: test-operators
    resourceVersion: "59129"
    uid: 97fdb749-b2ba-4f99-bc15-f96213ec35c5
  connectionState:
    address: installed-community-global-operators.test-operators.svc:50051
    lastConnect: "2020-03-03T12:15:08Z"
    lastObservedState: READY
  registryService:
    createdAt: "2020-03-03T12:11:44Z"
    port: "50051"
    protocol: grpc
    serviceName: installed-community-global-operators
    serviceNamespace: test-operators
```

LGTM.
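Rather than watching `oc get catalogsource -o yaml` by hand, the same wait can be scripted. A minimal sketch using the Kubernetes dynamic client; the helper name, 10-minute timeout, and poll interval are arbitrary assumptions:

```go
package olmwait

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// waitForCatalogSourceReady polls status.connectionState.lastObservedState on
// the CatalogSource until it reports READY, mirroring the manual check above.
func waitForCatalogSourceReady(client dynamic.Interface, namespace, name string) error {
	gvr := schema.GroupVersionResource{
		Group:    "operators.coreos.com",
		Version:  "v1alpha1",
		Resource: "catalogsources",
	}
	deadline := time.Now().Add(10 * time.Minute)
	for time.Now().Before(deadline) {
		cs, err := client.Resource(gvr).Namespace(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		state, _, _ := unstructured.NestedString(cs.Object, "status", "connectionState", "lastObservedState")
		if state == "READY" {
			return nil
		}
		time.Sleep(10 * time.Second)
	}
	return fmt.Errorf("catalogsource %s/%s did not become READY in time", namespace, name)
}
```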
Additional info:

```
[scolange@scolange 1768819]$ oc get pods
NAME                                         READY   STATUS    RESTARTS   AGE
installed-community-global-operators-98wlm   1/1     Running   0          92s

[scolange@scolange 1768819]$ oc log installed-community-global-operators-98wlm
log is DEPRECATED and will be removed in a future version. Use logs instead.
time="2020-03-03T12:11:55Z" level=info msg="Using in-cluster kube client config"
time="2020-03-03T12:11:55Z" level=info msg="loading CRDs" configmap=installed-community-global-operators ns=test-operators
time="2020-03-03T12:11:55Z" level=info msg="loading Bundles" configmap=installed-community-global-operators ns=test-operators
time="2020-03-03T12:11:55Z" level=info msg="loading Packages" configmap=installed-community-global-operators ns=test-operators
time="2020-03-03T12:11:55Z" level=info msg="serving registry" configMapName=installed-community-global-operators configMapNamespace=test-operators port=50051

[scolange@scolange 1768819]$ oc rsh installed-community-global-operators-98wlm
sh-4.2$ ps -elf|cat
F S UID   PID  PPID  C PRI  NI ADDR     SZ WCHAN  STIME TTY    TIME     CMD
4 S 1001    1     0  0  80   0 - 157339 -      12:11 ?      00:00:00 configmap-server -c installed-community-global-operators -n test-operators
4 S 1001  380     0  0  80   0 -   2959 -      12:13 pts/0  00:00:00 /bin/sh
4 R 1001  416   380  0  80   0 -  12941 -      12:14 pts/0  00:00:00 ps -elf
4 S 1001  417   380  0  80   0 -   1099 -      12:14 pts/0  00:00:00 cat
5 R root  418     0  0  80   0 - 140643 -      12:14 ?      00:00:00 /usr/bin/runc init

[scolange@scolange 1768819]$ oc get pods installed-community-global-operators-98wlm -o yaml|grep image
image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:68763083b100b5d82739dedb08d9c154cd3c8ee0435ed4fad5674cb9b013aec9
imagePullPolicy: IfNotPresent
imagePullSecrets:
image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:68763083b100b5d82739dedb08d9c154cd3c8ee0435ed4fad5674cb9b013aec9
imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:68763083b100b5d82739dedb08d9c154cd3c8ee0435ed4fad5674cb9b013aec9

[scolange@scolange 1768819]$ oc get svc
NAME                                   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
installed-community-global-operators   ClusterIP   172.30.226.32   <none>        50051/TCP   3m23s
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581