Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1939979

Summary: Elasticsearch-operator can not start due to resource limitation
Product: OpenShift Container Platform
Reporter: Giriyamma <gkarager>
Component: Logging
Assignee: ewolinet
Status: CLOSED CURRENTRELEASE
QA Contact: Anping Li <anli>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.8
CC: aos-bugs, brejones, cbaus, danili, satwsing
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Cloned to: 1944048, 1944049 (view as bug list)
Environment:
Last Closed: 2021-03-26 19:28:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1944048, 1944049

Attachments:
Elasticsearch memory leak (flags: none)

Description Giriyamma 2021-03-17 11:52:49 UTC
Created attachment 1763979 [details]
Elasticsearch memory leak

Description of problem:
Noticing a memory leak in the elasticsearch-operator: the operator pod is OOMKilled and restarts shortly after installation.

Version-Release number of selected component (if applicable):
Cluster version 4.8.0-0.nightly-2021-03-14-134919
elasticsearch-operator.5.0.1-22 

How reproducible:
Always

Steps to Reproduce:
1. Install 4.8 cluster
2. Install ES operator

Actual results:
$ oc get pods
NAME                                      READY   STATUS      RESTARTS   AGE
elasticsearch-operator-86dc488ff7-5lqf7   0/1     OOMKilled   0          25s
$ oc get pods
NAME                                      READY   STATUS    RESTARTS   AGE
elasticsearch-operator-86dc488ff7-5lqf7   1/1     Running   2          79s


$ oc adm top pod elasticsearch-operator-5fc5cd598b-7vhj7
NAME                                      CPU(cores)   MEMORY(bytes)   
elasticsearch-operator-5fc5cd598b-7vhj7   1m           143Mi
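
For reference, the kill reason and the limits the pod was started with can be read straight from the pod status and spec (a minimal sketch, assuming the operator runs in the openshift-operators-redhat namespace; substitute the actual pod name):

$ oc -n openshift-operators-redhat get pod elasticsearch-operator-86dc488ff7-5lqf7 \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
$ oc -n openshift-operators-redhat get pod elasticsearch-operator-86dc488ff7-5lqf7 \
    -o jsonpath='{.spec.containers[0].resources}{"\n"}'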

Comment 1 Anping Li 2021-03-17 12:17:38 UTC
I can see the memory fall from 143Mi to 89Mi, so there may not be a memory leak. Does the operator consume more memory during startup? Could we give the EO more memory (256Mi)?
NAME                                      CPU(cores)   MEMORY(bytes)   
elasticsearch-operator-5fc5cd598b-7vhj7   1m           143Mi         
NAME                                      CPU(cores)   MEMORY(bytes)   
elasticsearch-operator-5fc5cd598b-7vhj7   5m           89Mi

Comment 2 Brett Jones 2021-03-17 18:13:35 UTC
What indicator are you using to determine that this is a memory leak? Spikes in memory usage are normal as the application loads and processes during its lifecycle. A leak would be indicated by a ramp or growth that never dropped. This looks like standard operations to me.
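
One way to tell the two apart is to sample the operator's memory over time and see whether it keeps ramping or settles after startup, e.g. (a sketch reusing the namespace and pod name from the output above):

$ while true; do oc -n openshift-operators-redhat adm top pod elasticsearch-operator-5fc5cd598b-7vhj7; sleep 60; done

A MEMORY(bytes) value that only keeps growing would point to a leak; a startup spike that drops back down would not.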

Comment 3 Anping Li 2021-03-22 11:09:15 UTC
The elasticsearch-operator can't be started in OCP 4.8. If that isn't a memory leak, we should add more memory to the elasticsearch-operator.

Comment 4 Anping Li 2021-03-22 11:11:08 UTC
$ oc get pods
NAME                                      READY   STATUS      RESTARTS   AGE
elasticsearch-operator-86dc488ff7-5lqf7   0/1     OOMKilled   0          25s
$ oc get pods
NAME                                      READY   STATUS    RESTARTS   AGE
elasticsearch-operator-86dc488ff7-5lqf7   1/1     Running   2          79s

Comment 5 Anping Li 2021-03-22 11:12:13 UTC
@brejones, you can see in comment 4 that the EO was restarted twice within 79s.

Comment 6 Satwinder Singh 2021-03-23 14:41:03 UTC
Faced the same issue on ppc64le arch

# oc get pods -A | grep elas
openshift-operators-redhat                         elasticsearch-operator-666df89fc7-drlzz                           0/1     OOMKilled   3          2m22s

      elasticsearch-operator
    State:          Running
      Started:      Tue, 23 Mar 2021 10:28:24 -0400
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137

Comment 7 Dan Li 2021-03-23 17:54:52 UTC
Hi Jeff & logging team, this bug blocks the multi-arch Power/ppc64le team from installing Elasticsearch and Logging. Should it be given the "Blocker+" flag, as it is blocking the completion of regression testing?

Comment 9 Carvel Baus 2021-03-24 12:34:57 UTC
How much memory is available on each node in the cluster where this is running?

Comment 10 Satwinder Singh 2021-03-24 13:43:19 UTC
We have 2 worker nodes, each with 64G memory
$ free -g
              total        used        free      shared  buff/cache   available
Mem:             63          12          43           0           8          51
Swap:             0           0           0
(In reply to Carvel Baus from comment #9)
> How much memory is available on each node in the cluster where this is
> running?
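
From the cluster side, the per-node allocatable memory can also be listed with a custom-columns query, e.g.:

$ oc get nodes -o custom-columns='NAME:.metadata.name,ALLOCATABLE_MEMORY:.status.allocatable.memory'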

Comment 11 Carvel Baus 2021-03-24 16:55:53 UTC
(In reply to Satwinder Singh from comment #10)
> We have 2 worker nodes, each with 64G memory

A couple of assumptions I want to verify:

- masters are not schedulable
- masters have the same 64G of RAM as the workers


thanks

Comment 12 Satwinder Singh 2021-03-25 08:57:47 UTC
(In reply to Carvel Baus from comment #11)
> (In reply to Satwinder Singh from comment #10)
> > We have 2 worker nodes, each with 64G memory
> 
> A couple of assumptions I want to verify:
> 
> - masters are not schedulable
> - masters have same 64G ram as workers
> 
> 
> thanks

We have 3 masters with 32G each.
The Elasticsearch operator is scheduled on one of the workers:

# oc get pods -A -o wide | grep elastic
openshift-operators-redhat                         elasticsearch-operator-666df89fc7-drlzz                           0/1     CrashLoopBackOff   479        42h     10.128.2.76    worker-1   <none>           <none>

Comment 13 ewolinet 2021-03-25 19:11:02 UTC
I don't see any resources set for my locally deployed EO (built from master) [1].

Also, I have two containers as part of the elasticsearch operator and neither runs into issues.

Given we don't set a limit, I'm not sure how we can run out of memory unless the node itself has run out of memory.

Can you confirm what your Elasticsearch Operator deployment looks like?

[1]
      - command:
        - elasticsearch-operator
        env:
        - name: WATCH_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.annotations['olm.targetNamespaces']
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: OPERATOR_NAME
          value: elasticsearch-operator
        - name: PROXY_IMAGE
          value: registry.ci.openshift.org/ocp/4.7:oauth-proxy
        - name: ELASTICSEARCH_PROXY
          value: registry.ci.openshift.org/logging/5.1:elasticsearch-proxy
        - name: ELASTICSEARCH_IMAGE
          value: registry.ci.openshift.org/logging/5.1:logging-elasticsearch6
        - name: KIBANA_IMAGE
          value: registry.ci.openshift.org/logging/5.1:logging-kibana6
        image: image-registry.openshift-image-registry.svc:5000/openshift/origin-elasticsearch-operator:latest
        imagePullPolicy: IfNotPresent
        name: elasticsearch-operator
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        resources: {}
...
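
For comparison with the locally built EO above, the resources stanza of the deployed operator can be dumped directly (a sketch; the openshift-operators-redhat namespace and the deployment name are assumed from the pod names earlier in this bug):

$ oc -n openshift-operators-redhat get deployment elasticsearch-operator \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="elasticsearch-operator")].resources}{"\n"}'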

Comment 14 Giriyamma 2021-03-26 10:12:59 UTC
@ewolinet Here is my Elasticsearch Operator deployment:

    spec:
      containers:
      - command:
        - elasticsearch-operator
        env:
        - name: WATCH_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.annotations['olm.targetNamespaces']
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: OPERATOR_NAME
          value: elasticsearch-operator
        - name: PROXY_IMAGE
          value: registry.redhat.io/openshift4/ose-oauth-proxy@sha256:4d76057c03bebf8d9ca5a936dc3e98da19fcf31a3b37652118088cc2a85a8831
        - name: ELASTICSEARCH_PROXY
          value: registry.redhat.io/openshift-logging/elasticsearch-proxy-rhel8@sha256:7dca103ae5da02d9e7ee5595a1247f263202bc0060599c134a90be4fb4ab508b
        - name: ELASTICSEARCH_IMAGE
          value: registry.redhat.io/openshift-logging/elasticsearch6-rhel8@sha256:a976929ef6f5a16c174376cb6163a7262a7bcfd64391e02fa20b8a251df94d14
        - name: KIBANA_IMAGE
          value: registry.redhat.io/openshift-logging/kibana6-rhel8@sha256:5e2584172c1c5d83171b5f029c46e8a6038803ae48a2ba47a9d4716daf93b0d1
        - name: OPERATOR_CONDITION_NAME
          value: elasticsearch-operator.5.0.1-23
        image: registry.redhat.io/openshift-logging/elasticsearch-rhel8-operator@sha256:22e1402e0e27d4706efbf88cbc26c88344616216bd1e01bb52b15ac560732efc
        imagePullPolicy: IfNotPresent
        name: elasticsearch-operator
        ports:
        - containerPort: 60000
          name: metrics
          protocol: TCP
        resources:
          limits:
            cpu: 200m
            memory: 128Mi
          requests:
            cpu: 100m
            memory: 64Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: elasticsearch-operator
      serviceAccountName: elasticsearch-operator
      terminationGracePeriodSeconds: 30
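
The 128Mi limit above is what explains the OOM kills. Since this Deployment is managed by OLM from the CSV, editing it in place would just be reverted. One workaround is to override the resources through the Subscription's config stanza (a sketch, assuming the Subscription is named elasticsearch-operator in openshift-operators-redhat and that the installed OLM honors spec.config.resources; the CPU values simply mirror the CSV defaults, with the memory limit raised to the 256Mi suggested in comment 1):

$ oc -n openshift-operators-redhat patch subscription elasticsearch-operator --type merge \
    -p '{"spec":{"config":{"resources":{"limits":{"cpu":"200m","memory":"256Mi"},"requests":{"cpu":"100m","memory":"64Mi"}}}}}'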

Comment 15 ewolinet 2021-03-26 19:28:19 UTC
Given this affects post-4.6 EO releases, I'm going to close this out.

Giriyamma,

can you please open up Jira bugs to track this fix on 5.0 and 5.1?


5.1 pr: https://github.com/openshift/elasticsearch-operator/pull/683
5.0 pr: https://github.com/openshift/elasticsearch-operator/pull/684