Bug 1939979
| Summary: | Elasticsearch-operator can not start due to resource limitation | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Giriyamma <gkarager> |
| Component: | Logging | Assignee: | ewolinet |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Anping Li <anli> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.8 | CC: | aos-bugs, brejones, cbaus, danili, satwsing |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Clones: | 1944048, 1944049 (view as bug list) | | |
| Last Closed: | 2021-03-26 19:28:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1944048, 1944049 | | |
| Attachments: | 1763979: Elasticsearch memory leak | | |
I can see the memory fall from 143Mi to 89Mi, so there may not be a memory leak. Does the operator consume more memory during start? Could we give the EO more memory (256Mi)?

NAME                                      CPU(cores)   MEMORY(bytes)
elasticsearch-operator-5fc5cd598b-7vhj7   1m           143Mi

NAME                                      CPU(cores)   MEMORY(bytes)
elasticsearch-operator-5fc5cd598b-7vhj7   5m           89Mi

What indicator are you using to determine that this is a memory leak? Spikes in memory usage are normal as the application loads and processes during its lifecycle. A leak would be indicated by a ramp or growth that never dropped. This looks like standard operations to me.

The elasticsearch-operator can't be started in OCP 4.8. If that isn't a memory leak, we should add more memory to the elasticsearch-operator.

$ oc get pods
NAME                                      READY   STATUS      RESTARTS   AGE
elasticsearch-operator-86dc488ff7-5lqf7   0/1     OOMKilled   0          25s

$ oc get pods
NAME                                      READY   STATUS    RESTARTS   AGE
elasticsearch-operator-86dc488ff7-5lqf7   1/1     Running   2          79s

@brejones, as shown in comment 4, the EO had already been restarted twice within 79 seconds.

Faced the same issue on ppc64le arch:
# oc get pods -A | grep elas
openshift-operators-redhat elasticsearch-operator-666df89fc7-drlzz 0/1 OOMKilled 3 2m22s
elasticsearch-operator
State: Running
Started: Tue, 23 Mar 2021 10:28:24 -0400
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Hi Jeff & logging team, this bug blocks the multi-arch Power/ppc64le team from installing Elasticsearch and Logging. Should the "Blocker+" flag be set, since it is blocking the completion of regression testing?

How much memory is available on each node in the cluster where this is running?

(In reply to Carvel Baus from comment #9)
> How much memory is available on each node in the cluster where this is
> running?

We have 2 worker nodes, each with 64G memory:
$ free -g
total used free shared buff/cache available
Mem: 63 12 43 0 8 51
Swap: 0 0 0
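
For completeness, the same numbers can also be read from the cluster side; a small sketch (the worker node name comes from the pod listing later in this bug and is only illustrative):

```bash
# Allocatable memory per node, as seen by the scheduler.
oc get nodes -o custom-columns=NAME:.metadata.name,ALLOCATABLE_MEMORY:.status.allocatable.memory

# Requests/limits already allocated to pods scheduled on a given worker.
oc describe node worker-1 | grep -A 10 "Allocated resources"
```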
(In reply to Satwinder Singh from comment #10)
> We have 2 worker nodes, each with 64G memory

A couple of assumptions I want to verify:

- masters are not schedulable
- masters have the same 64G RAM as workers

thanks

(In reply to Carvel Baus from comment #11)
> A couple of assumptions I want to verify:
>
> - masters are not schedulable
> - masters have the same 64G RAM as workers

We have 3 masters, with 32G each. The Elasticsearch operator is scheduled on one of the workers:

# oc get pods -A -o wide | grep elastic
openshift-operators-redhat   elasticsearch-operator-666df89fc7-drlzz   0/1   CrashLoopBackOff   479   42h   10.128.2.76   worker-1   <none>   <none>

I don't see any resources set for my locally deployed EO (built from master) [1].
Also, I have two containers as part of the elasticsearch operator and neither runs into issues.
Given we don't set a limit, I'm not sure how we can run out of memory unless the node itself has run out of memory.
Can you confirm what your Elasticsearch Operator deployment looks like?
[1]
- command:
- elasticsearch-operator
env:
- name: WATCH_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.annotations['olm.targetNamespaces']
- name: POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: OPERATOR_NAME
value: elasticsearch-operator
- name: PROXY_IMAGE
value: registry.ci.openshift.org/ocp/4.7:oauth-proxy
- name: ELASTICSEARCH_PROXY
value: registry.ci.openshift.org/logging/5.1:elasticsearch-proxy
- name: ELASTICSEARCH_IMAGE
value: registry.ci.openshift.org/logging/5.1:logging-elasticsearch6
- name: KIBANA_IMAGE
value: registry.ci.openshift.org/logging/5.1:logging-kibana6
image: image-registry.openshift-image-registry.svc:5000/openshift/origin-elasticsearch-operator:latest
imagePullPolicy: IfNotPresent
name: elasticsearch-operator
ports:
- containerPort: 8080
name: http
protocol: TCP
resources: {}
...
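
One quick way to tell which case applies (no limits on a locally built EO versus the limits set by the released CSV) is to read the resources stanza straight from the installed Deployment; a small sketch, assuming the default openshift-operators-redhat install namespace:

```bash
# Print only the resources stanza of the elasticsearch-operator container.
oc get deployment elasticsearch-operator -n openshift-operators-redhat \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="elasticsearch-operator")].resources}'
```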
@ewolinet Here is my Elasticsearch Operator deployment:
spec:
containers:
- command:
- elasticsearch-operator
env:
- name: WATCH_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.annotations['olm.targetNamespaces']
- name: POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: OPERATOR_NAME
value: elasticsearch-operator
- name: PROXY_IMAGE
value: registry.redhat.io/openshift4/ose-oauth-proxy@sha256:4d76057c03bebf8d9ca5a936dc3e98da19fcf31a3b37652118088cc2a85a8831
- name: ELASTICSEARCH_PROXY
value: registry.redhat.io/openshift-logging/elasticsearch-proxy-rhel8@sha256:7dca103ae5da02d9e7ee5595a1247f263202bc0060599c134a90be4fb4ab508b
- name: ELASTICSEARCH_IMAGE
value: registry.redhat.io/openshift-logging/elasticsearch6-rhel8@sha256:a976929ef6f5a16c174376cb6163a7262a7bcfd64391e02fa20b8a251df94d14
- name: KIBANA_IMAGE
value: registry.redhat.io/openshift-logging/kibana6-rhel8@sha256:5e2584172c1c5d83171b5f029c46e8a6038803ae48a2ba47a9d4716daf93b0d1
- name: OPERATOR_CONDITION_NAME
value: elasticsearch-operator.5.0.1-23
image: registry.redhat.io/openshift-logging/elasticsearch-rhel8-operator@sha256:22e1402e0e27d4706efbf88cbc26c88344616216bd1e01bb52b15ac560732efc
imagePullPolicy: IfNotPresent
name: elasticsearch-operator
ports:
- containerPort: 60000
name: metrics
protocol: TCP
resources:
limits:
cpu: 200m
memory: 128Mi
requests:
cpu: 100m
memory: 64Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
dnsPolicy: ClusterFirst
nodeSelector:
kubernetes.io/os: linux
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: elasticsearch-operator
serviceAccountName: elasticsearch-operator
terminationGracePeriodSeconds: 30
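
For clusters hit by this before a fixed build is available, one possible interim workaround (not part of this bug's fix, and assuming the operator was installed through OLM with the default subscription name) is to raise the limit through the Subscription, since edits made directly to the Deployment are reverted by the CSV; the 256Mi figure follows the suggestion earlier in the thread:

```bash
# Hypothetical workaround: override the operator resources via the OLM
# Subscription so the change survives CSV reconciliation.
oc patch subscriptions.operators.coreos.com elasticsearch-operator \
  -n openshift-operators-redhat --type merge \
  -p '{"spec":{"config":{"resources":{"requests":{"cpu":"100m","memory":"64Mi"},"limits":{"cpu":"200m","memory":"256Mi"}}}}}'
```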
Given this is for post EO 4.6, I'm going to close this out.

Giriyamma, can you please open up Jira bugs to track this fix on 5.0 and 5.1?

5.1 PR: https://github.com/openshift/elasticsearch-operator/pull/683
5.0 PR: https://github.com/openshift/elasticsearch-operator/pull/684
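
Once a build containing the linked PRs is installed, the failure can be re-checked the same way it was originally observed; a short verification sketch (the label selector is an assumption about how the operator pod is labeled):

```bash
# The operator pod should stay Ready with no further OOMKilled restarts.
oc get pods -n openshift-operators-redhat -l name=elasticsearch-operator

# Memory usage should settle below the configured limit.
oc adm top pod -n openshift-operators-redhat -l name=elasticsearch-operator
```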
Created attachment 1763979 [details]
Elasticsearch memory leak

Description of problem:
Noticing memory leak in elasticsearch operator

Version-Release number of selected component (if applicable):
Cluster version 4.8.0-0.nightly-2021-03-14-134919
elasticsearch-operator.5.0.1-22

How reproducible:
Always

Steps to Reproduce:
1. Install 4.8 cluster
2. Install ES operator

Actual results:
$ oc get pods
NAME                                      READY   STATUS      RESTARTS   AGE
elasticsearch-operator-86dc488ff7-5lqf7   0/1     OOMKilled   0          25s

$ oc get pods
NAME                                      READY   STATUS    RESTARTS   AGE
elasticsearch-operator-86dc488ff7-5lqf7   1/1     Running   2          79s

oc adm top pod elasticsearch-operator-5fc5cd598b-7vhj7
NAME                                      CPU(cores)   MEMORY(bytes)
elasticsearch-operator-5fc5cd598b-7vhj7   1m           143Mi