Bug 1814457
| Summary: | community operator catalog image crashloop when applying CatalogSource | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Chris Doan <cdoan> | |
| Component: | OLM | Assignee: | Evan Cordell <ecordell> | |
| OLM sub component: | OLM | QA Contact: | Jian Zhang <jiazha> | |
| Status: | CLOSED ERRATA | Docs Contact: | ||
| Severity: | urgent | |||
| Priority: | high | CC: | acarter, augol, cvultur, jbasquil, jiazha, kuiwang, sasha, sburke | |
| Version: | 4.4 | Keywords: | TestBlocker | |
| Target Milestone: | --- | |||
| Target Release: | 4.5.0 | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1814777 1814821 (view as bug list) | Environment: | ||
| Last Closed: | 2020-07-13 17:22:24 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1814821 | |||
|
Description
Chris Doan
2020-03-17 23:29:26 UTC
Steps:
1. oc adm catalog build for community catalog
2. oc adm catalog mirror
3. oc apply -f community-operators-manifests
4. oc patch OperatorHub cluster --type json ...
5. oc apply -f catalogsource.yml

Please provide any logs from the crashlooping pod, any logs from the disconnected mirroring process, and, if possible, a copy of the image that is being referenced (registry.XXXX:5000/opcatalog/community-operators:v1)

pod log:
[kni@r640-u01 ~]$ cat catalogimage.pod.log
time="2020-03-18T12:26:32Z" level=info msg="serving registry" database=/bundles.db port=50051
pod describe:
Name: my-community-operator-catalog-rk6sq
Namespace: openshift-marketplace
Priority: 0
Node: openshift-worker-0.qe1.kni.lab.eng.bos.redhat.com/2620:52:0:1386::37
Start Time: Tue, 17 Mar 2020 19:03:39 -0400
Labels: olm.catalogSource=my-community-operator-catalog
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_address":"fd01::5:a006:cfff:fe00:3b/64","mac_address":"a2:06:cf:00:00:3b","gateway_ip":"fd01:0:0:5::1"}}
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"fd01::5:a006:cfff:fe00:3b"
],
"mac": "a2:06:cf:00:00:3b",
"dns": {}
}]
openshift.io/scc: privileged
Status: Running
IP: fd01::5:a006:cfff:fe00:3b
IPs:
IP: fd01::5:a006:cfff:fe00:3b
Containers:
registry-server:
Container ID: cri-o://0fc56da80f5115b283cd977640714daf768e02c5aab2d4f60761d04ad99928dd
Image: registry.qe1.kni.lab.eng.bos.redhat.com:5000/opcatalog/community-operators:v1
Image ID: registry.qe1.kni.lab.eng.bos.redhat.com:5000/opcatalog/community-operators@sha256:b017b2f8ddacc32d3d226449b0202d3bf5a85e9be2afa7b04af7148138bebd0f
Port: 50051/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Wed, 18 Mar 2020 08:26:32 -0400
Finished: Wed, 18 Mar 2020 08:26:45 -0400
Ready: False
Restart Count: 152
Limits:
cpu: 100m
memory: 100Mi
Requests:
cpu: 10m
memory: 50Mi
Liveness: exec [grpc_health_probe -addr=localhost:50051] delay=10s timeout=1s period=10s #success=1 #failure=3
Readiness: exec [grpc_health_probe -addr=localhost:50051] delay=5s timeout=5s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-hzmcn (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-hzmcn:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-hzmcn
Optional: false
QoS Class: Burstable
Node-Selectors: beta.kubernetes.io/os=linux
Tolerations:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 138m (x12 over 13h) kubelet, openshift-worker-0.qe1.kni.lab.eng.bos.redhat.com Liveness probe failed:
Warning Unhealthy 83m (x14 over 13h) kubelet, openshift-worker-0.qe1.kni.lab.eng.bos.redhat.com Readiness probe failed: command timed out
Warning Unhealthy 33m (x19 over 13h) kubelet, openshift-worker-0.qe1.kni.lab.eng.bos.redhat.com Readiness probe failed:
Normal Pulled 28m (x148 over 13h) kubelet, openshift-worker-0.qe1.kni.lab.eng.bos.redhat.com Container image "registry.qe1.kni.lab.eng.bos.redhat.com:5000/opcatalog/community-operators:v1" already present on machine
Warning BackOff 3m14s (x3531 over 13h) kubelet, openshift-worker-0.qe1.kni.lab.eng.bos.redhat.com Back-off restarting failed container
quay.io/cdoan/community-operators:v1
* Since the image is created by `oc adm catalog build...`, I am not sure how valuable it is.
* Is there a way to limit the images that are included in the built catalog image?

Sorry about that. Repushed the image to: docker pull dhubchris/community-operators:v1

In my case, I am seeing this in a disconnected IPv6 environment on OCP 4.3.

There is currently a memory limit of 100Mi on the pod that is created for a CatalogSource. The community catalog exceeds this limit, so the container is killed by Kubernetes (OOMKilled). There is currently a fix prepped and ready to merge; please see the linked PR. Once the PR merges, we will backport it to 4.3 so that no workarounds are required.
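To make the numbers in the pod description easier to read, here is a minimal Python sketch of the arithmetic. It assumes the common convention that an exit code above 128 means "terminated by signal (code - 128)" and that it runs on Linux. The helper names `describe_exit_code` and `mem_to_bytes` are hypothetical and are not part of `oc` or OLM.

```python
import signal

# Binary suffixes used by Kubernetes memory quantities such as "100Mi".
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if code > 128:
        sig = signal.Signals(code - 128)  # e.g. 137 - 128 = 9
        return f"killed by {sig.name}"
    return f"exited with status {code}"


def mem_to_bytes(quantity: str) -> int:
    """Convert a memory quantity like "100Mi" to bytes (binary suffixes only)."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count


print(describe_exit_code(137))  # Exit Code from the pod description
print(mem_to_bytes("100Mi"))    # memory limit from the pod description
```

Under these assumptions, exit code 137 maps to SIGKILL, which is consistent with the `OOMKilled` reason shown for the container.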
In the meantime, there is a somewhat straightforward workaround:
1. Create a Pod that points to the catalog image in the openshift-marketplace namespace. In this example I am using the image from Constantin's cluster; you could also do this with dhubchris/community-operators:v1
kind: Pod
apiVersion: v1
metadata:
  name: disconnected-operator-catalog-community-fixed
  namespace: openshift-marketplace
  labels:
    olm.catalogSource: disconnected-operator-catalog-community
spec:
  nodeSelector:
    beta.kubernetes.io/os: linux
  restartPolicy: Always
  serviceAccountName: default
  imagePullSecrets:
    - name: default-dockercfg-nlhhd
  enableServiceLinks: true
  terminationGracePeriodSeconds: 30
  containers:
    - resources:
        requests:
          cpu: 10m
          memory: 50Mi
      readinessProbe:
        exec:
          command:
            - grpc_health_probe
            - '-addr=localhost:50051'
        initialDelaySeconds: 5
        timeoutSeconds: 5
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
      terminationMessagePath: /dev/termination-log
      name: registry-server
      livenessProbe:
        exec:
          command:
            - grpc_health_probe
            - '-addr=localhost:50051'
        initialDelaySeconds: 10
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
      ports:
        - name: grpc
          containerPort: 50051
          protocol: TCP
      imagePullPolicy: IfNotPresent
      image: registry.ocp-edge-cluster-cdv2.qe.lab.redhat.com:5000/restricted_olm/community-operators:v1
  serviceAccount: default
  tolerations:
    - operator: Exists
2. Create a CatalogSource that points to the address of the Pod you just created. This can be an IP address or a DNS name, but it must include the port (50051 by default):
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: disconnected-operator-catalog-community-fixed
  namespace: openshift-marketplace
spec:
  address: '[fd01::1:5088:b7ff:fe00:28]:50051'
  displayName: Community Operators
  sourceType: grpc
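The `address` value above wraps the IPv6 address in square brackets before the port. As a rough illustration of that formatting rule, the following Python sketch builds a `host:port` string and adds brackets only for IPv6 literals. The function name `grpc_address` is hypothetical; only the default port 50051 and the sample address come from this report.

```python
import ipaddress


def grpc_address(host: str, port: int = 50051) -> str:
    """Format host and port, bracketing IPv6 literals as in '[addr]:port'."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so treat it as a DNS name.
        return f"{host}:{port}"
    if ip.version == 6:
        return f"[{host}]:{port}"
    return f"{host}:{port}"


print(grpc_address("fd01::1:5088:b7ff:fe00:28"))
```

Without the brackets, the colons inside an IPv6 address cannot be distinguished from the colon that separates the port.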
The Console UI does not let you create catalogs with the "address" field. You will either need to create the CatalogSource with kubectl, or create a placeholder CatalogSource in the console and then edit it to remove the `image` field and use the `address` field instead.
If done correctly, the status of the catalog source should indicate a successful connection to OLM and show that the connection is `READY`.
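One way to check this status programmatically is sketched below in Python. It assumes the CatalogSource has been fetched as JSON (for example with `oc get catalogsource <name> -n openshift-marketplace -o json`) and that the state is reported under `status.connectionState.lastObservedState`. That field path is an assumption based on the OLM v1alpha1 API and should be verified against your cluster. The sample document and the helper name `connection_state` are hypothetical.

```python
import json

# Hypothetical, trimmed CatalogSource document for illustration only.
SAMPLE = """
{
  "kind": "CatalogSource",
  "metadata": {"name": "disconnected-operator-catalog-community-fixed"},
  "status": {"connectionState": {"lastObservedState": "READY"}}
}
"""


def connection_state(catalog_source: dict) -> str:
    """Return the last observed gRPC connection state, or '' if it is absent."""
    status = catalog_source.get("status", {})
    return status.get("connectionState", {}).get("lastObservedState", "")


state = connection_state(json.loads(SAMPLE))
print("ready" if state == "READY" else f"not ready (state={state!r})")
```

A missing `status` block is treated as not ready here, which covers a CatalogSource that has not yet been reconciled.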
1. Create a 4.5 cluster that has the fix PR merged in.
mac:~ jianzhang$ oc adm release info registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-03-20-071514 --commits |grep lifecycle
operator-lifecycle-manager https://github.com/operator-framework/operator-lifecycle-manager a6162e46f31455d4f93b8215772b0dd8969652a0
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-03-20-071514   True        False         34m     Cluster version is 4.5.0-0.nightly-2020-03-20-071514

2. Create a CatalogSource object with this "dhubchris/community-operators:v1" image. Its pod works well.
mac:~ jianzhang$ cat cs-1805410.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: bug-operator
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: dhubchris/community-operators:v1
  displayName: Bug Operators
  publisher: Red Hat
mac:~ jianzhang$ oc get pods -n openshift-marketplace
NAME                                    READY   STATUS    RESTARTS   AGE
bug-operator-47b4c                      1/1     Running   0          7m29s
certified-operators-75cbb755dd-wls6q    1/1     Running   0          60m
community-operators-5b6f745df-xbjt9     1/1     Running   0          60m
marketplace-operator-778449d4dd-sq7mz   1/1     Running   0          61m
redhat-marketplace-f98dbd4fb-twsqg      1/1     Running   0          60m
redhat-operators-dd9bcff79-vnwwg        1/1     Running   0          60m
mac:~ jianzhang$ oc get packagemanifest |grep -i bug
microcks                           Bug Operators   7m58s
postgresql-operator-dev4devs-com   Bug Operators   7m58s
...

3. Check the CPU/memory Requests of this pod; there are no Limits. LGTM, verify it.
mac:~ jianzhang$ oc get pods -n openshift-marketplace bug-operator-47b4c -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
              "10.129.2.8"
          ],
          "dns": {},
          "default-route": [
              "10.129.2.1"
          ]
      }]
    openshift.io/scc: anyuid
  creationTimestamp: "2020-03-20T11:25:49Z"
  generateName: bug-operator-
  labels:
    olm.catalogSource: bug-operator
  name: bug-operator-47b4c
  namespace: openshift-marketplace
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: false
    kind: CatalogSource
    name: bug-operator
    uid: aa7a762d-1658-42a4-8280-d79d8639e916
  resourceVersion: "33285"
  selfLink: /api/v1/namespaces/openshift-marketplace/pods/bug-operator-47b4c
  uid: 57d7ebf1-6269-4c02-a5fc-495af8ed6af9
spec:
  containers:
  - image: dhubchris/community-operators:v1
    imagePullPolicy: IfNotPresent
    livenessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=localhost:50051
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: registry-server
    ports:
    - containerPort: 50051
      name: grpc
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=localhost:50051
      failureThreshold: 3
      initialDelaySeconds: 5
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    resources:
      requests:
        cpu: 10m
        memory: 50Mi
    securityContext:
      capabilities:
        drop:
        - MKNOD
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-hctfc
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: default-dockercfg-hfgrp
  nodeName: ip-10-0-143-45.us-west-2.compute.internal
  nodeSelector:
    beta.kubernetes.io/os: linux
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    seLinuxOptions:
      level: s0:c23,c7
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - operator: Exists
  volumes:
  - name: default-token-hctfc
    secret:
      defaultMode: 420
      secretName: default-token-hctfc
...

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:2409 |