Bug 1760608 - OLM pods have no resource limits set
Summary: OLM pods have no resource limits set
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Evan Cordell
QA Contact: Salvatore Colangelo
URL:
Whiteboard:
Duplicates: 1780755 (view as bug list)
Depends On:
Blocks: 1780755
 
Reported: 2019-10-10 22:21 UTC by Christoph Blecker
Modified: 2020-01-30 14:06 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-30 14:06:19 UTC
Target Upstream Version:
Embargoed:




Links
System ID | Priority | Status | Summary | Last Updated
Github operator-framework operator-lifecycle-manager pull 1075 | None | closed | Bug 1760608: Add resource limits to all pods | 2020-03-31 08:28:50 UTC
Github operator-framework operator-lifecycle-manager pull 1142 | None | closed | Bug 1760608: add resource limits to all OLM pods and the 0.13.0 release for OCP | 2020-03-31 08:28:51 UTC
Github operator-framework operator-lifecycle-manager pull 1149 | None | closed | Bug 1760608: remove resource limits from packageserver | 2020-03-31 08:28:50 UTC
Red Hat Bugzilla 1740857 | high | CLOSED | catalog-operator consumes 11GB RSS | 2021-02-22 00:41:40 UTC

Description Christoph Blecker 2019-10-10 22:21:28 UTC
Description of problem:
OLM pods lack a resource limit, which allows them to potentially use excessive amounts of memory.


How reproducible:
Memory leaks such as https://bugzilla.redhat.com/show_bug.cgi?id=1740857 can go unchecked because there is no memory limit that would cause the pod to be restarted.


Additional info:
Requests were added in https://github.com/operator-framework/operator-lifecycle-manager/pull/955. The requests are not available in 4.1.z yet, and there are still no limits at all on the master branch.

Comment 1 Dan Geoffroy 2019-10-14 17:05:23 UTC
Moving to 4.3. We will consider a backport for both 4.1 and 4.2 after the change is delivered to master.

Comment 3 Daniel Sover 2019-11-11 15:45:54 UTC
I am going to gather some data from a running CI cluster on the OLM operator pod's usage of memory and CPU as more operators and CRs are introduced into the cluster. 

Once that data is available I'm going to share it with members of the OLM team and come up with some kind of summary metric that we can apply as limits when deploying OLM.

Comment 6 Alexander Greene 2019-11-19 18:30:19 UTC
PR hasn't merged yet.

Comment 7 Jeff Peeler 2019-12-06 19:47:55 UTC
https://github.com/operator-framework/operator-lifecycle-manager/pull/1142 has merged, but that happened around the time the release branches were cut (I think). So I am moving this to 4.4 and will clone it to 4.3 (and update the PR link).

Comment 9 Salvatore Colangelo 2019-12-24 17:10:37 UTC
Hi,

No limits show in the pods, as noted in https://github.com/operator-framework/operator-lifecycle-manager/pull/1179:


[scolange@scolange ~]$ oc project openshift-operator-lifecycle-manager
Now using project "openshift-operator-lifecycle-manager" on server "https://api.juzhao-44.qe.devcluster.openshift.com:6443".

[scolange@scolange ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2019-12-20-210709   True        False         15h     Cluster version is 4.4.0-0.nightly-2019-12-20-210709


[scolange@scolange ~]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-6b4898f6f5-95pzx   1/1     Running   0          16h
olm-operator-6fb5fffb9-fvvgk        1/1     Running   0          16h
packageserver-6d5658d454-f659k      1/1     Running   0          16h
packageserver-6d5658d454-qkw5h      1/1     Running   0          16h
[scolange@scolange ~]$ oc get catalog-operator-6b4898f6f5-95pzx -o yaml
error: the server doesn't have a resource type "catalog-operator-6b4898f6f5-95pzx"
[scolange@scolange ~]$ oc get pod catalog-operator-6b4898f6f5-95pzx -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
              "10.129.0.17"
          ],
          "dns": {},
          "default-route": [
              "10.129.0.1"
          ]
      }]
  creationTimestamp: "2019-12-24T00:53:54Z"
  generateName: catalog-operator-6b4898f6f5-
  labels:
    app: catalog-operator
    pod-template-hash: 6b4898f6f5
  name: catalog-operator-6b4898f6f5-95pzx
  namespace: openshift-operator-lifecycle-manager
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: catalog-operator-6b4898f6f5
    uid: abf073f4-9cea-4d8a-bf89-345bf2fc75a9
  resourceVersion: "4720"
  selfLink: /api/v1/namespaces/openshift-operator-lifecycle-manager/pods/catalog-operator-6b4898f6f5-95pzx
  uid: f2e34439-dfb6-4af4-8025-a7288b9a881b
spec:
  containers:
  - args:
    - -namespace
    - openshift-marketplace
    - -configmapServerImage=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cecac84c3cdc369e130ceb8107de07cd7c00399d28cfb681a3968e09d9094be0
    - -writeStatusName
    - operator-lifecycle-manager-catalog
    - -tls-cert
    - /var/run/secrets/serving-cert/tls.crt
    - -tls-key
    - /var/run/secrets/serving-cert/tls.key
    command:
    - /bin/catalog
    env:
    - name: RELEASE_VERSION
      value: 4.4.0-0.nightly-2019-12-20-210709
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2ba931b3266d8bef0b61dfd64383ef4b5d36a50ec3091f2ca59fd2e17609aa60
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8080
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: catalog-operator
    ports:
    - containerPort: 8080
      protocol: TCP
    - containerPort: 8081
      name: metrics
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8080
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      requests:
        cpu: 10m
        memory: 80Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /var/run/secrets/serving-cert
      name: serving-cert
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: olm-operator-serviceaccount-token-4h84j
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-10-0-163-60.us-east-2.compute.internal
  nodeSelector:
    beta.kubernetes.io/os: linux
    node-role.kubernetes.io/master: ""
  priority: 2000000000
  priorityClassName: system-cluster-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: olm-operator-serviceaccount
  serviceAccountName: olm-operator-serviceaccount
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 120
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 120
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: serving-cert
    secret:
      defaultMode: 420
      secretName: catalog-operator-serving-cert
  - name: olm-operator-serviceaccount-token-4h84j
    secret:
      defaultMode: 420
      secretName: olm-operator-serviceaccount-token-4h84j
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2019-12-24T00:55:36Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2019-12-24T00:57:23Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2019-12-24T00:57:23Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2019-12-24T00:55:36Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://c2850d92cecbdb66a2aae1b010c0eb1305fdb7e4f489fbe5701d33a8917df082
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2ba931b3266d8bef0b61dfd64383ef4b5d36a50ec3091f2ca59fd2e17609aa60
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2ba931b3266d8bef0b61dfd64383ef4b5d36a50ec3091f2ca59fd2e17609aa60
    lastState: {}
    name: catalog-operator
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2019-12-24T00:57:16Z"
  hostIP: 10.0.163.60
  phase: Running
  podIP: 10.129.0.17
  podIPs:
  - ip: 10.129.0.17
  qosClass: Burstable
  startTime: "2019-12-24T00:55:36Z"
[scolange@scolange ~]$ oc get pod olm-operator-6fb5fffb9-fvvgk -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
              "10.129.0.16"
          ],
          "dns": {},
          "default-route": [
              "10.129.0.1"
          ]
      }]
  creationTimestamp: "2019-12-24T00:53:54Z"
  generateName: olm-operator-6fb5fffb9-
  labels:
    app: olm-operator
    pod-template-hash: 6fb5fffb9
  name: olm-operator-6fb5fffb9-fvvgk
  namespace: openshift-operator-lifecycle-manager
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: olm-operator-6fb5fffb9
    uid: 7b19acc6-c772-40ec-9105-b426901260b2
  resourceVersion: "4707"
  selfLink: /api/v1/namespaces/openshift-operator-lifecycle-manager/pods/olm-operator-6fb5fffb9-fvvgk
  uid: c85e49b9-8a9a-4685-a5a4-3d54f769f58c
spec:
  containers:
  - args:
    - -namespace
    - $(OPERATOR_NAMESPACE)
    - -writeStatusName
    - operator-lifecycle-manager
    - -writePackageServerStatusName
    - operator-lifecycle-manager-packageserver
    - -tls-cert
    - /var/run/secrets/serving-cert/tls.crt
    - -tls-key
    - /var/run/secrets/serving-cert/tls.key
    command:
    - /bin/olm
    env:
    - name: RELEASE_VERSION
      value: 4.4.0-0.nightly-2019-12-20-210709
    - name: OPERATOR_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: OPERATOR_NAME
      value: olm-operator
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2ba931b3266d8bef0b61dfd64383ef4b5d36a50ec3091f2ca59fd2e17609aa60
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8080
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: olm-operator
    ports:
    - containerPort: 8080
      protocol: TCP
    - containerPort: 8081
      name: metrics
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8080
        scheme: HTTP
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      requests:
        cpu: 10m
        memory: 160Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /var/run/secrets/serving-cert
      name: serving-cert
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: olm-operator-serviceaccount-token-4h84j
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-10-0-163-60.us-east-2.compute.internal
  nodeSelector:
    beta.kubernetes.io/os: linux
    node-role.kubernetes.io/master: ""
  priority: 2000000000
  priorityClassName: system-cluster-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: olm-operator-serviceaccount
  serviceAccountName: olm-operator-serviceaccount
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 120
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 120
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: serving-cert
    secret:
      defaultMode: 420
      secretName: olm-operator-serving-cert
  - name: olm-operator-serviceaccount-token-4h84j
    secret:
      defaultMode: 420
      secretName: olm-operator-serviceaccount-token-4h84j
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2019-12-24T00:55:36Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2019-12-24T00:57:21Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2019-12-24T00:57:21Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2019-12-24T00:55:36Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://bb157d1858a7a5098c5742347e9f2912a648637791cd2aa18d7732154c81624d
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2ba931b3266d8bef0b61dfd64383ef4b5d36a50ec3091f2ca59fd2e17609aa60
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2ba931b3266d8bef0b61dfd64383ef4b5d36a50ec3091f2ca59fd2e17609aa60
    lastState: {}
    name: olm-operator
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2019-12-24T00:57:16Z"
  hostIP: 10.0.163.60
  phase: Running
  podIP: 10.129.0.16
  podIPs:
  - ip: 10.129.0.16
  qosClass: Burstable
  startTime: "2019-12-24T00:55:36Z"

Expected result: limits set on the OLM pods, for example:

   resources:
     limits:
       cpu: 200m
       memory: 200Mi

or:

   resources:
     limits:
       cpu: 400m
       memory: 400Mi

Actual result:

   resources:
     requests:
       cpu: 10m
       memory: 160Mi

Comment 10 Jian Zhang 2020-01-08 07:12:26 UTC
*** Bug 1780755 has been marked as a duplicate of this bug. ***

Comment 12 Evan Cordell 2020-01-30 14:06:19 UTC
We attempted to set some reasonable limits here, but during scale testing we found that even those could be hit and cause problems.

We reverted this change to align with other OpenShift cluster operators, which do not set limits.

