1814457 – community operator catalog image crashloop when applying CatalogSource

Bug 1814457 - community operator catalog image crashloop when applying CatalogSource

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	OLM
Sub Component:
Version:	4.4
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	urgent
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Evan Cordell
QA Contact:	Jian Zhang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1814821

Reported:	2020-03-17 23:29 UTC by Chris Doan
Modified:	2023-10-06 19:26 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1814777 1814821
Environment:
Last Closed:	2020-07-13 17:22:24 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	operator-framework operator-lifecycle-manager pull 1389	0	None	closed	Bug 1814457: fix(catsrc): remove limits on catalogsource pods	2021-01-20 21:13:15 UTC
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-07-13 17:22:45 UTC

Description Chris Doan 2020-03-17 23:29:26 UTC

Description of problem:

Trying to enable the disconnected community catalog on an IPv6 restricted network.
The catalog image pod running in openshift-marketplace is constantly in CrashLoopBackOff or OOMKilled.
No operators are served, and we cannot deploy community operators in a disconnected IPv6 environment.

How reproducible:

1. Went through the documented disconnected OLM procedure to enable community operators.
2. Created CatalogSource.yaml.
3. Applied it using `oc apply -f CatalogSource.yaml`.

```
---

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: my-community-operator-catalog
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: registry.XXXX:5000/opcatalog/community-operators:v1
  displayName: My ACM Community Operator Catalog
  publisher: grpc
```


Steps to Reproduce:
1.
2.
3.

Actual results:

```
marketplace
NAME                                    READY   STATUS             RESTARTS   AGE
marketplace-operator-86c6c54c55-5cv6f   1/1     Running            0          84m
my-community-operator-catalog-rk6sq     0/1     CrashLoopBackOff   9          24m
catalogsource
NAME                            DISPLAY                             TYPE   PUBLISHER   AGE
my-community-operator-catalog   My ACM Community Operator Catalog   grpc   grpc        69m
packagemanifest
No resources found in openshift-marketplace namespace.

```

Expected results:

Running pod.

Additional info:

Comment 1 Chris Doan 2020-03-17 23:38:26 UTC

Steps:

1. oc adm catalog build for community catalog
2. oc adm catalog mirror
3. oc apply -f community-operators-manifests
4. oc patch OperatorHub cluster --type json ...
5. oc apply -f catalogsource.yml

Comment 2 Evan Cordell 2020-03-18 02:40:34 UTC

Please provide any logs from the crashlooping pod, any logs from the disconnected mirroring process, and, if possible, a copy of the image that is being referenced (registry.XXXX:5000/opcatalog/community-operators:v1).

Comment 4 Chris Doan 2020-03-18 12:30:06 UTC

pod log:

[kni@r640-u01 ~]$ cat catalogimage.pod.log
time="2020-03-18T12:26:32Z" level=info msg="serving registry" database=/bundles.db port=50051

pod describe:

Name:         my-community-operator-catalog-rk6sq
Namespace:    openshift-marketplace
Priority:     0
Node:         openshift-worker-0.qe1.kni.lab.eng.bos.redhat.com/2620:52:0:1386::37
Start Time:   Tue, 17 Mar 2020 19:03:39 -0400
Labels:       olm.catalogSource=my-community-operator-catalog
Annotations:  k8s.ovn.org/pod-networks:
                {"default":{"ip_address":"fd01::5:a006:cfff:fe00:3b/64","mac_address":"a2:06:cf:00:00:3b","gateway_ip":"fd01:0:0:5::1"}}
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "ovn-kubernetes",
                    "interface": "eth0",
                    "ips": [
                        "fd01::5:a006:cfff:fe00:3b"
                    ],
                    "mac": "a2:06:cf:00:00:3b",
                    "dns": {}
                }]
              openshift.io/scc: privileged
Status:       Running
IP:           fd01::5:a006:cfff:fe00:3b
IPs:
  IP:  fd01::5:a006:cfff:fe00:3b
Containers:
  registry-server:
    Container ID:   cri-o://0fc56da80f5115b283cd977640714daf768e02c5aab2d4f60761d04ad99928dd
    Image:          registry.qe1.kni.lab.eng.bos.redhat.com:5000/opcatalog/community-operators:v1
    Image ID:       registry.qe1.kni.lab.eng.bos.redhat.com:5000/opcatalog/community-operators@sha256:b017b2f8ddacc32d3d226449b0202d3bf5a85e9be2afa7b04af7148138bebd0f
    Port:           50051/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 18 Mar 2020 08:26:32 -0400
      Finished:     Wed, 18 Mar 2020 08:26:45 -0400
    Ready:          False
    Restart Count:  152
    Limits:
      cpu:     100m
      memory:  100Mi
    Requests:
      cpu:        10m
      memory:     50Mi
    Liveness:     exec [grpc_health_probe -addr=localhost:50051] delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:    exec [grpc_health_probe -addr=localhost:50051] delay=5s timeout=5s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-hzmcn (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-hzmcn:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-hzmcn
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:
Events:
  Type     Reason     Age                     From                                                        Message
  ----     ------     ----                    ----                                                        -------
  Warning  Unhealthy  138m (x12 over 13h)     kubelet, openshift-worker-0.qe1.kni.lab.eng.bos.redhat.com  Liveness probe failed:
  Warning  Unhealthy  83m (x14 over 13h)      kubelet, openshift-worker-0.qe1.kni.lab.eng.bos.redhat.com  Readiness probe failed: command timed out
  Warning  Unhealthy  33m (x19 over 13h)      kubelet, openshift-worker-0.qe1.kni.lab.eng.bos.redhat.com  Readiness probe failed:
  Normal   Pulled     28m (x148 over 13h)     kubelet, openshift-worker-0.qe1.kni.lab.eng.bos.redhat.com  Container image "registry.qe1.kni.lab.eng.bos.redhat.com:5000/opcatalog/community-operators:v1" already present on machine
  Warning  BackOff    3m14s (x3531 over 13h)  kubelet, openshift-worker-0.qe1.kni.lab.eng.bos.redhat.com  Back-off restarting failed container
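
The key detail in the describe output is the `Last State: Terminated / Reason: OOMKilled` entry. As a rough sketch, the same check can be scripted against the JSON form of the pod (for example, the output of `oc get pod <name> -o json`). The helper name `oom_killed_containers` and the trimmed sample below are illustrative assumptions; only the field names follow the standard Pod status layout.

```python
def oom_killed_containers(pod: dict) -> list:
    """Return names of containers whose last termination reason was OOMKilled."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # lastState.terminated is absent (or null) if the container never terminated.
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            names.append(status["name"])
    return names


# Trimmed, hand-written sample that mirrors the describe output above.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "registry-server",
                "restartCount": 152,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ]
    }
}

print(oom_killed_containers(sample_pod))  # prints ['registry-server']
```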

Comment 5 Chris Doan 2020-03-18 13:03:12 UTC

quay.io/cdoan/community-operators:v1

* Since the image is created by `oc adm catalog build...`, I am not sure how valuable it is.
* Is there a way to limit which images are included in the built catalog image?

Comment 8 Chris Doan 2020-03-18 15:51:12 UTC

Sorry about that. Repushed the image; pull it with: docker pull dhubchris/community-operators:v1

Comment 9 Constantin Vultur 2020-03-18 16:19:22 UTC

In my case, I am seeing this in a disconnected IPv6 environment on OCP 4.3.

Comment 10 Evan Cordell 2020-03-18 17:31:31 UTC

There is currently a memory limit of 100Mi on the pod that is created for a CatalogSource (see `Limits` in the pod description above). The community catalog exceeds this limit, so the container is being OOM-killed by Kubernetes. A fix is prepped and ready to merge; please see the linked PR. Once the PR merges, we will backport it to 4.3 so that no workarounds are required.
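
The limit mentioned above appears in the pod description as `memory: 100Mi`, which is a binary (mebibyte) quantity rather than decimal megabytes. The sketch below shows the conversion to bytes. The function `memory_bytes` is a hypothetical helper that handles only a small subset of Kubernetes quantity suffixes; it is not the parser Kubernetes itself uses.

```python
# Subset of Kubernetes quantity suffixes; others (Ti, Pi, m, ...) are not handled here.
_SUFFIXES = {
    "Ki": 1024,
    "Mi": 1024 ** 2,
    "Gi": 1024 ** 3,
    "k": 1000,
    "M": 1000 ** 2,
    "G": 1000 ** 3,
}


def memory_bytes(quantity: str) -> int:
    """Convert a quantity such as '100Mi' or '100M' to a number of bytes."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # a plain number is already in bytes


print(memory_bytes("100Mi"))  # prints 104857600
print(memory_bytes("50Mi"))   # prints 52428800
```

Any container whose working set grows past the limit value is terminated with reason `OOMKilled`, which matches the exit code 137 seen in the pod description.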

In the meantime, there is a somewhat straightforward workaround:


1. Create a Pod that points to the catalog image in the openshift-marketplace namespace. In this example I am using the image from Constantin's cluster; you could also do this with dhubchris/community-operators:v1. Note that the Pod spec sets resource requests but no limits:

kind: Pod
apiVersion: v1
metadata:
  name: disconnected-operator-catalog-community-fixed
  namespace: openshift-marketplace
  labels:
    olm.catalogSource: disconnected-operator-catalog-community
spec:
  nodeSelector:
    beta.kubernetes.io/os: linux
  restartPolicy: Always
  serviceAccountName: default
  imagePullSecrets:
    - name: default-dockercfg-nlhhd
  enableServiceLinks: true
  terminationGracePeriodSeconds: 30
  containers:
    - resources:
        requests:
          cpu: 10m
          memory: 50Mi
      readinessProbe:
        exec:
          command:
            - grpc_health_probe
            - '-addr=localhost:50051'
        initialDelaySeconds: 5
        timeoutSeconds: 5
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
      terminationMessagePath: /dev/termination-log
      name: registry-server
      livenessProbe:
        exec:
          command:
            - grpc_health_probe
            - '-addr=localhost:50051'
        initialDelaySeconds: 10
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
      ports:
        - name: grpc
          containerPort: 50051
          protocol: TCP
      imagePullPolicy: IfNotPresent
      image: registry.ocp-edge-cluster-cdv2.qe.lab.redhat.com:5000/restricted_olm/community-operators:v1
  serviceAccount: default
  tolerations:
    - operator: Exists



2. Create a CatalogSource that points to the address of the Pod you just created. This can be an IP address or a DNS name, but it must include the port (50051 by default):

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: disconnected-operator-catalog-community-fixed
  namespace: openshift-marketplace
spec:
  address: '[fd01::1:5088:b7ff:fe00:28]:50051'
  displayName: Community Operators
  sourceType: grpc

The Console UI does not let you create catalogs with the `address` field. You will either need to create the CatalogSource with kubectl, or create a placeholder CatalogSource in the console and then edit it to remove the `image` field and use the `address` field instead.

If done correctly, the status of the catalog source should indicate a successful connection to OLM and show that the connection is `READY`.
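
Because the `address` value must include the port, and an IPv6 literal has to be wrapped in square brackets (as in the example above), a small helper can build the string. This is only a sketch: `grpc_address` is a hypothetical name, and the DNS name in the usage lines is a made-up placeholder.

```python
import ipaddress


def grpc_address(host: str, port: int = 50051) -> str:
    """Build a host:port string, bracketing the host if it is an IPv6 literal."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so treat it as a DNS name.
        return f"{host}:{port}"
    if ip.version == 6:
        return f"[{host}]:{port}"
    return f"{host}:{port}"


print(grpc_address("fd01::1:5088:b7ff:fe00:28"))   # prints [fd01::1:5088:b7ff:fe00:28]:50051
print(grpc_address("10.129.2.8"))                  # prints 10.129.2.8:50051
print(grpc_address("catalog.example.test", 50051)) # prints catalog.example.test:50051
```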

Comment 13 Jian Zhang 2020-03-20 11:36:38 UTC

1, Create a 4.5 cluster that includes the merged fix PR.
mac:~ jianzhang$ oc adm release info registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-03-20-071514 --commits |grep lifecycle
  operator-lifecycle-manager                     https://github.com/operator-framework/operator-lifecycle-manager            a6162e46f31455d4f93b8215772b0dd8969652a0

mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-03-20-071514   True        False         34m     Cluster version is 4.5.0-0.nightly-2020-03-20-071514


2, Create a CatalogSource object with the "dhubchris/community-operators:v1" image. Its pod runs without crashing.

mac:~ jianzhang$ cat cs-1805410.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: bug-operator
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: dhubchris/community-operators:v1
  displayName: Bug Operators
  publisher: Red Hat

mac:~ jianzhang$ oc get pods -n openshift-marketplace
NAME                                    READY   STATUS    RESTARTS   AGE
bug-operator-47b4c                      1/1     Running   0          7m29s
certified-operators-75cbb755dd-wls6q    1/1     Running   0          60m
community-operators-5b6f745df-xbjt9     1/1     Running   0          60m
marketplace-operator-778449d4dd-sq7mz   1/1     Running   0          61m
redhat-marketplace-f98dbd4fb-twsqg      1/1     Running   0          60m
redhat-operators-dd9bcff79-vnwwg        1/1     Running   0          60m

mac:~ jianzhang$ oc get packagemanifest |grep -i bug
microcks                                     Bug Operators         7m58s
postgresql-operator-dev4devs-com             Bug Operators         7m58s
...


3, Check the CPU/memory requests of this pod; there are no limits set. LGTM, marking as verified.
mac:~ jianzhang$ oc get pods -n openshift-marketplace bug-operator-47b4c  -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
              "10.129.2.8"
          ],
          "dns": {},
          "default-route": [
              "10.129.2.1"
          ]
      }]
    openshift.io/scc: anyuid
  creationTimestamp: "2020-03-20T11:25:49Z"
  generateName: bug-operator-
  labels:
    olm.catalogSource: bug-operator
  name: bug-operator-47b4c
  namespace: openshift-marketplace
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: false
    kind: CatalogSource
    name: bug-operator
    uid: aa7a762d-1658-42a4-8280-d79d8639e916
  resourceVersion: "33285"
  selfLink: /api/v1/namespaces/openshift-marketplace/pods/bug-operator-47b4c
  uid: 57d7ebf1-6269-4c02-a5fc-495af8ed6af9
spec:
  containers:
  - image: dhubchris/community-operators:v1
    imagePullPolicy: IfNotPresent
    livenessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=localhost:50051
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: registry-server
    ports:
    - containerPort: 50051
      name: grpc
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=localhost:50051
      failureThreshold: 3
      initialDelaySeconds: 5
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    resources:
      requests:
        cpu: 10m
        memory: 50Mi
    securityContext:
      capabilities:
        drop:
        - MKNOD
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-hctfc
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: default-dockercfg-hfgrp
  nodeName: ip-10-0-143-45.us-west-2.compute.internal
  nodeSelector:
    beta.kubernetes.io/os: linux
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    seLinuxOptions:
      level: s0:c23,c7
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - operator: Exists
  volumes:
  - name: default-token-hctfc
    secret:
      defaultMode: 420
      secretName: default-token-hctfc
...
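
The verification above amounts to confirming that `resources` contains `requests` but no `limits`. A minimal sketch of that check over the pod JSON (for example, from `oc get pod <name> -o json`) follows. The helper `containers_with_limits` and both sample dictionaries are illustrative assumptions; the resource values are copied from the pod outputs in this report.

```python
def containers_with_limits(pod: dict) -> dict:
    """Map container name to its resource limits, for containers that set any."""
    found = {}
    for container in pod.get("spec", {}).get("containers", []):
        limits = container.get("resources", {}).get("limits")
        if limits:
            found[container["name"]] = limits
    return found


# Trimmed sample matching the fixed pod above: requests only.
fixed_pod = {
    "spec": {
        "containers": [
            {
                "name": "registry-server",
                "resources": {"requests": {"cpu": "10m", "memory": "50Mi"}},
            }
        ]
    }
}

# Trimmed sample matching the crashlooping pod from the original report.
limited_pod = {
    "spec": {
        "containers": [
            {
                "name": "registry-server",
                "resources": {
                    "limits": {"cpu": "100m", "memory": "100Mi"},
                    "requests": {"cpu": "10m", "memory": "50Mi"},
                },
            }
        ]
    }
}

print(containers_with_limits(fixed_pod))    # prints {}
print(containers_with_limits(limited_pod))  # prints {'registry-server': {'cpu': '100m', 'memory': '100Mi'}}
```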

Comment 15 errata-xmlrpc 2020-07-13 17:22:24 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409