Description of problem:

The proxy container restarts frequently:

$ oc get pod
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-logging-operator-69c9d6f5bc-njb8q       1/1     Running   0          135m
elasticsearch-cdm-soj6vq70-1-688ddf8c9f-87vt9   2/2     Running   12         135m
elasticsearch-cdm-soj6vq70-2-64876b6bd-sld68    2/2     Running   12         140m
elasticsearch-cdm-soj6vq70-3-5d54fb5d65-vbkww   2/2     Running   12         146m

$ oc describe pod elasticsearch-cdm-soj6vq70-3-5d54fb5d65-vbkww
  proxy:
    Container ID:  cri-o://9da3f8434af0659f409ba6690dfa1d6b3753cf10c8499e6a6b462381380a22b5
    Image:         registry.redhat.io/openshift4/ose-elasticsearch-proxy@sha256:ae8c365911b5f2e24b9fd07129ce18594834997ec7c73bd2652dfd695e75746f
    Image ID:      registry.redhat.io/openshift4/ose-elasticsearch-proxy@sha256:1d0e41e6c7ffe968149d6b9b90b56081af15a3370fe3de607a84b3c20677dbc9
    Ports:         60000/TCP, 60001/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --listening-address=:60000
      --tls-cert=/etc/proxy/elasticsearch/logging-es.crt
      --tls-key=/etc/proxy/elasticsearch/logging-es.key
      --tls-client-ca=/etc/proxy/elasticsearch/admin-ca
      --metrics-listening-address=:60001
      --metrics-tls-cert=/etc/proxy/secrets/tls.crt
      --metrics-tls-key=/etc/proxy/secrets/tls.key
      --upstream-ca=/etc/proxy/elasticsearch/admin-ca
      --cache-expiry=60s
      --auth-backend-role=admin_reader={"namespace": "default", "verb": "get", "resource": "pods/log"}
      --auth-backend-role=prometheus={"verb": "get", "resource": "/metrics"}
      --auth-backend-role=jaeger={"verb": "get", "resource": "/jaeger", "resourceAPIGroup": "elasticsearch.jaegertracing.io"}
      --auth-backend-role=elasticsearch-operator={"namespace": "*", "verb": "*", "resource": "*", "resourceAPIGroup": "logging.openshift.io"}
      --auth-backend-role=index-management={"namespace":"openshift-logging", "verb": "*", "resource": "indices", "resourceAPIGroup": "elasticsearch.openshift.io"}
      --auth-admin-role=admin_reader
      --auth-default-role=project_user
    State:          Running
      Started:      Mon, 21 Sep 2020 05:26:57 -0400
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 21 Sep 2020 05:15:06 -0400
      Finished:     Mon, 21 Sep 2020 05:26:56 -0400
    Ready:          True
    Restart Count:  12
    Limits:
      memory:  64Mi
    Requests:
      cpu:     100m
      memory:  64Mi
    Environment:
      LOG_LEVEL:  info
    Mounts:
      /etc/proxy/elasticsearch from certificates (rw)
      /etc/proxy/secrets from elasticsearch-metrics (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from elasticsearch-token-99brw (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  elasticsearch-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      elasticsearch
    Optional:  false
  elasticsearch-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  elasticsearch-elasticsearch-cdm-soj6vq70-3
    ReadOnly:   false
  certificates:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  elasticsearch
    Optional:    false
  elasticsearch-metrics:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  elasticsearch-metrics
    Optional:    false
  elasticsearch-token-99brw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  elasticsearch-token-99brw
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                   From                                                            Message
  ----     ------                  ----                  ----                                                            -------
  Warning  FailedScheduling        144m                                                                                  0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
  Warning  FailedScheduling        144m                                                                                  0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
  Normal   Scheduled               140m                                                                                  Successfully assigned openshift-logging/elasticsearch-cdm-soj6vq70-3-5d54fb5d65-vbkww to qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal
  Normal   SuccessfulAttachVolume  140m                  attachdetach-controller                                         AttachVolume.Attach succeeded for volume "pvc-73b52ca6-bf96-4a3a-a12c-15275b34267c"
  Normal   AddedInterface          140m                  multus                                                          Add eth0 [10.129.2.5/23]
  Normal   Pulled                  140m                  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal   Container image "registry.redhat.io/openshift4/ose-logging-elasticsearch6@sha256:d083829ae9a4777f4f070acdd64298e1514e8b7895019186af22f8893656e475" already present on machine
  Normal   Created                 140m                  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal   Created container elasticsearch
  Normal   Started                 140m                  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal   Started container elasticsearch
  Warning  Unhealthy               140m (x3 over 140m)   kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal   Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]
  Warning  Unhealthy               139m                  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal   Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 503]
  Warning  BackOff                 24m (x4 over 116m)    kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal   Back-off restarting failed container
  Normal   Pulled                  115s (x13 over 140m)  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal   Container image "registry.redhat.io/openshift4/ose-elasticsearch-proxy@sha256:ae8c365911b5f2e24b9fd07129ce18594834997ec7c73bd2652dfd695e75746f" already present on machine
  Normal   Created                 114s (x13 over 140m)  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal   Created container proxy
  Normal   Started                 114s (x13 over 140m)  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal   Started container proxy

No error message in the proxy container:

$ oc logs -c proxy elasticsearch-cdm-soj6vq70-3-5d54fb5d65-vbkww
time="2020-09-21T07:32:21Z" level=info msg="mapping path \"/\" => upstream \"https://localhost:9200/\""
time="2020-09-21T07:32:21Z" level=info msg="HTTPS: listening on [::]:60001"
time="2020-09-21T07:32:21Z" level=info msg="HTTPS: listening on [::]:60000"

The worker node has enough resources:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests       Limits
  --------                   --------       ------
  cpu                        1936m (55%)    1 (28%)
  memory                     10123Mi (72%)  8480Mi (61%)
  ephemeral-storage          0 (0%)         0 (0%)
  hugepages-1Gi              0 (0%)         0 (0%)
  hugepages-2Mi              0 (0%)         0 (0%)
  attachable-volumes-gce-pd  0              0

Log store configuration:

  logStore:
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
      resources:
        requests:
          memory: 2Gi
      storage:
        size: 20Gi
        storageClassName: standard
    retentionPolicy:
      application:
        maxAge: 1d
      audit:
        maxAge: 2w
      infra:
        maxAge: 12h
    type: elasticsearch

Version-Release number of selected component (if applicable):
elasticsearch-operator.4.6.0-202009192030.p0

How reproducible:
In some clusters it's 100% reproducible; in other clusters there is no such issue.

Steps to Reproduce:
1. Deploy logging 4.6
2. Check the ES pods
3.

Actual results:

Expected results:

Additional info:
Created attachment 1715525 [details]
Data sample every 1 seconds

After the ES pod was restarted, the proxy container's memory usage increased from 4M to 5xM; after a while the pod was OOMKilled again. Could you check whether there is a memory leak in the proxy container?
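For reference, a rough sketch of one way such a per-container memory sample could be collected (not necessarily how the attached data was gathered). It assumes working cluster metrics so that `oc adm top` returns data, and uses one pod name from above as an example; note that metrics-server values refresh far less often than once per second, so this only approximates a true 1s sample:

# sample the proxy container's reported memory in a loop
$ while true; do
    oc -n openshift-logging adm top pod elasticsearch-cdm-soj6vq70-3-5d54fb5d65-vbkww --containers --no-headers | grep ' proxy '
    sleep 1
  done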
(In reply to Anping Li from comment #1)
> Created attachment 1715525 [details]
> Data sample every 1 seconds
>
> After the ES pod was restarted, the proxy container's memory usage increased
> from 4M to 5xM; after a while the pod was OOMKilled again.

* What is happening as far as querying or data ingestion when you see this happen? We don't see issues like this in testing otherwise, which is perplexing.
* Is it possible to oversize the memory to, say, 256Mi to see whether there is a lower steady-state bound?
(In reply to Jeff Cantrill from comment #2)
> (In reply to Anping Li from comment #1)
> > Created attachment 1715525 [details]
> > Data sample every 1 seconds
> >
> > After the ES pod was restarted, the proxy container's memory usage increased
> > from 4M to 5xM; after a while the pod was OOMKilled again.
>
> * What is happening as far as querying or data ingestion when you see this
>   happen? We don't see issues like this in testing otherwise, which is
>   perplexing.

I didn't do anything; I just kept the EFK pods running.

> * Is it possible to oversize the memory to, say, 256Mi to see whether there
>   is a lower steady-state bound?

I updated the configuration to:

logStore:
  elasticsearch:
    nodeCount: 3
    proxy:
      resources:
        limits:
          memory: 256Mi
        requests:
          memory: 256Mi
    redundancyPolicy: SingleRedundancy
    resources:
      requests:
        memory: 2Gi
    storage:
      size: 20Gi
      storageClassName: standard
  retentionPolicy:
    application:
      maxAge: 1d
    audit:
      maxAge: 2w
    infra:
      maxAge: 3h
  type: elasticsearch

I waited for 30 minutes and didn't see the proxy container restart:

$ oc get pod
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-logging-operator-78b94df66d-wblpb       1/1     Running   0          73m
elasticsearch-cdm-bi4udg5u-1-7fbdf69798-ddr9h   2/2     Running   0          34m
elasticsearch-cdm-bi4udg5u-2-d788897df-d2ncv    2/2     Running   0          33m
elasticsearch-cdm-bi4udg5u-3-785844ccfb-gdbmp   2/2     Running   0          32m

Before I updated the configuration, the proxy container had restarted twice in 36 minutes:

$ oc get pod
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-logging-operator-78b94df66d-wblpb       1/1     Running   0          39m
elasticsearch-cdm-bi4udg5u-1-64c8ff5df8-wvzvs   2/2     Running   2          38m
elasticsearch-cdm-bi4udg5u-2-7dc94d485-tdgjs    2/2     Running   2          37m
elasticsearch-cdm-bi4udg5u-3-6cc68f7b68-j46h9   2/2     Running   2          36m
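An equivalent way to apply just the proxy override from the command line, assuming the ClusterLogging CR uses the conventional name "instance" in openshift-logging, would be a JSON merge patch along these lines:

$ oc -n openshift-logging patch clusterlogging instance --type merge \
    -p '{"spec":{"logStore":{"elasticsearch":{"proxy":{"resources":{"limits":{"memory":"256Mi"},"requests":{"memory":"256Mi"}}}}}}}'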
AFAICS we have a pprof endpoint enabled for the proxy. I will take a turn and try to collect some profiles before providing a PR to bump the default to 256Mi.
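For anyone who wants to try the same thing, a minimal sketch of how such a profile could be collected. The port and the /debug/pprof path are assumptions (the standard net/http/pprof routes exposed on the metrics listener), not something confirmed for this image, so adjust as needed:

# forward the metrics port of one ES pod locally (pod name is an example)
$ oc -n openshift-logging port-forward pod/elasticsearch-cdm-soj6vq70-3-5d54fb5d65-vbkww 60001:60001 &

# grab a heap profile, skipping TLS verification (assumed pprof path)
$ curl -k https://localhost:60001/debug/pprof/heap -o heap.pprof

# inspect the top allocators locally
$ go tool pprof -top heap.pprof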
Tested with elasticsearch-operator.4.6.0-202010081538.p0; hit this issue again.

$ oc get pod
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-logging-operator-8677fdc66b-ltqj8       1/1     Running   0          32m
elasticsearch-cdm-a6mb9grm-1-57d8b9fb7b-jfh6d   2/2     Running   3          30m
elasticsearch-cdm-a6mb9grm-2-7d89674bc5-rzq2s   2/2     Running   1          30m
elasticsearch-cdm-a6mb9grm-3-675fdddc4b-spjbl   2/2     Running   1          30m

$ oc describe pod elasticsearch-cdm-a6mb9grm-1-57d8b9fb7b-jfh6d
  proxy:
    Container ID:  cri-o://6ec4f53a8db0a6dafb30a5ccdc552f739d862b933d996c557d1efea68de1c5e8
    Image:         registry.redhat.io/openshift4/ose-elasticsearch-proxy@sha256:2c15d2bdc92d39f7919d478b01f243f7eaf08aa33d80e9e39c5777f5d90abd62
    Image ID:      registry.redhat.io/openshift4/ose-elasticsearch-proxy@sha256:2c15d2bdc92d39f7919d478b01f243f7eaf08aa33d80e9e39c5777f5d90abd62
    Ports:         60000/TCP, 60001/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --listening-address=:60000
      --tls-cert=/etc/proxy/elasticsearch/logging-es.crt
      --tls-key=/etc/proxy/elasticsearch/logging-es.key
      --tls-client-ca=/etc/proxy/elasticsearch/admin-ca
      --metrics-listening-address=:60001
      --metrics-tls-cert=/etc/proxy/secrets/tls.crt
      --metrics-tls-key=/etc/proxy/secrets/tls.key
      --upstream-ca=/etc/proxy/elasticsearch/admin-ca
      --cache-expiry=60s
      --auth-backend-role=admin_reader={"namespace": "default", "verb": "get", "resource": "pods/log"}
      --auth-backend-role=prometheus={"verb": "get", "resource": "/metrics"}
      --auth-backend-role=jaeger={"verb": "get", "resource": "/jaeger", "resourceAPIGroup": "elasticsearch.jaegertracing.io"}
      --auth-backend-role=elasticsearch-operator={"namespace": "*", "verb": "*", "resource": "*", "resourceAPIGroup": "logging.openshift.io"}
      --auth-backend-role=index-management={"namespace":"openshift-logging", "verb": "*", "resource": "indices", "resourceAPIGroup": "elasticsearch.openshift.io"}
      --auth-admin-role=admin_reader
      --auth-default-role=project_user
    State:          Running
      Started:      Fri, 09 Oct 2020 02:02:25 -0400
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Fri, 09 Oct 2020 01:57:59 -0400
      Finished:     Fri, 09 Oct 2020 02:02:02 -0400
    Ready:          True
    Restart Count:  3
    Limits:
      memory:  64Mi
    Requests:
      cpu:     100m
      memory:  64Mi
    Environment:
      LOG_LEVEL:  info
    Mounts:
      /etc/proxy/elasticsearch from certificates (rw)
      /etc/proxy/secrets from elasticsearch-metrics (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from elasticsearch-token-z858l (ro)
@Qiaolin Could you please provide more information on what is special about the clusters where you hit this issue? I could not reproduce it so far.
When launching the cluster, I enabled etcd encryption and FIPS; the cluster HTTP proxy is enabled too. The cluster is deployed on GCP.
@Periklis Please see comment 14.
Created attachment 1720294 [details]
Memory Consumption Metrics ES-Proxy with 64Mi Limits

@qitang

My investigation on the above QE cluster tells me that we have a stable solution with the fix [1]. The remaining issue here is that 64Mi is not a good default. Tuning the es-proxy memory resources to 256Mi by amending the ClusterLogging CR like this:

> logStore:
>   elasticsearch:
>     nodeCount: 3
>     proxy:
>       resources:
>         limits:
>           memory: 256Mi
>         requests:
>           memory: 256Mi

gives me a stable setup:

> ❯ oc get pod -l cluster-name=elasticsearch -w
> NAME                                           READY   STATUS    RESTARTS   AGE
> elasticsearch-cdm-jegayr09-1-94d46856-snkn6    2/2     Running   0          64m
> elasticsearch-cdm-jegayr09-2-95b496dc9-dmnmf   2/2     Running   0          62m
> elasticsearch-cdm-jegayr09-3-76c8d9d66-sshnf   2/2     Running   0          65m

[1] https://github.com/openshift/elasticsearch-proxy/pull/52
I tested in 2 different clusters:

cluster 1: only FIPS enabled -- hit this issue
cluster 2: only etcd encryption enabled -- didn't hit this issue

To conclude, this issue happens when FIPS is enabled in the cluster.
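For reference, whether FIPS mode is actually active on a node can be confirmed with something like the following (node name taken from the events above as an example; /proc/sys/crypto/fips_enabled is the standard kernel flag):

$ oc debug node/qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal -- chroot /host cat /proc/sys/crypto/fips_enabled
# prints 1 when FIPS mode is enabled, 0 otherwise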
*** Bug 1891569 has been marked as a duplicate of this bug. ***
*** Bug 1892362 has been marked as a duplicate of this bug. ***
Hi, I'm looking into exactly this problem right now, and even after adjusting the memory limit to 256Mi I can see OOMKilled messages, although at a much lower rate (6-8 restarts overnight). Do I understand correctly that the change is still not merged and will fix the problem only once we get it in our image? This is on OCP 4.5.15, by the way.

[kni@provisionhost-0-0 logging-templates]$ oc get pods -n openshift-logging
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-545fb8c5fb-szgl4       1/1     Running     0          17h
curator-1605097800-cnp62                        0/1     Completed   0          36m
elasticsearch-cdm-uaolnl4d-1-6dcb988cbc-6l2nh   2/2     Running     8          17h
elasticsearch-cdm-uaolnl4d-2-c567cb486-xn74p    2/2     Running     6          17h
elasticsearch-cdm-uaolnl4d-3-5f7678d99c-8bglh   2/2     Running     6          17h

Cluster Logging: clusterlogging.4.5.0-202010311518.p0
ES: elasticsearch-operator.4.5.0-202010301114.p0
Created attachment 1728325 [details]
configuration

Ah, I see the problem: even though I'm trying to set up new limits with this config, the memory limit stays at 64Mi. Did I miss something in the configuration?
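For what it's worth, one way to check which limit actually landed on a running pod, independent of what is in the CR (pod name is an example from the listing above):

$ oc -n openshift-logging get pod elasticsearch-cdm-uaolnl4d-1-6dcb988cbc-6l2nh \
    -o jsonpath='{.spec.containers[?(@.name=="proxy")].resources}{"\n"}'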
(In reply to Gurenko Alex from comment #25)
> Hi, I'm looking into exactly this problem right now, and even after
> adjusting the memory limit to 256Mi I can see OOMKilled messages, although
> at a much lower rate (6-8 restarts overnight). Do I understand correctly
> that the change is still not merged and will fix the problem only once we
> get it in our image? This is on OCP 4.5.15, by the way.
>
> [kni@provisionhost-0-0 logging-templates]$ oc get pods -n openshift-logging
> NAME                                            READY   STATUS      RESTARTS   AGE
> cluster-logging-operator-545fb8c5fb-szgl4       1/1     Running     0          17h
> curator-1605097800-cnp62                        0/1     Completed   0          36m
> elasticsearch-cdm-uaolnl4d-1-6dcb988cbc-6l2nh   2/2     Running     8          17h
> elasticsearch-cdm-uaolnl4d-2-c567cb486-xn74p    2/2     Running     6          17h
> elasticsearch-cdm-uaolnl4d-3-5f7678d99c-8bglh   2/2     Running     6          17h
>
> Cluster Logging: clusterlogging.4.5.0-202010311518.p0
> ES: elasticsearch-operator.4.5.0-202010301114.p0

Yes, the fix awaits approval by the patch manager to land in 4.6.z for now.
(In reply to Periklis Tsirakidis from comment #4)
> AFAICS we have a pprof endpoint enabled for the proxy. I will take a turn
> and try to collect some profiles before providing a PR to bump the default
> to 256Mi.

Configuring the proxy requests/limits for 4.5.z will be part of:
https://bugzilla.redhat.com/show_bug.cgi?id=1894632
(In reply to Periklis Tsirakidis from comment #28)
> (In reply to Periklis Tsirakidis from comment #4)
> > AFAICS we have a pprof endpoint enabled for the proxy. I will take a turn
> > and try to collect some profiles before providing a PR to bump the default
> > to 256Mi.
>
> Configuring the proxy requests/limits for 4.5.z will be part of:
> https://bugzilla.redhat.com/show_bug.cgi?id=1894632

Thanks!
I've tested the 256Mi limit on an OCP 4.6 cluster and still got 1 container restart within a 26h window:

[kni@provisionhost-0-0 ~]$ oc get pods
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-6c7d78ff74-nl7fj       1/1     Running     0          28h
curator-1605623400-gg466                        0/1     Completed   0          60m
curator-1605627000-k9qvq                        0/1     Completed   0          10s
elasticsearch-cdm-4rnx5na5-1-686847cf6c-s5cmm   2/2     Running     1          26h
elasticsearch-cdm-4rnx5na5-2-74678c4c56-7jb6x   2/2     Running     0          26h
elasticsearch-cdm-4rnx5na5-3-5d7ff6594f-4h6hc   2/2     Running     0          26h

I know it's much better, but I would expect the container not to get OOMKilled at all. I see this particular BZ is still on POST, so I assume the upstream patch didn't make it into the OLM yet? And just to be clear, with the patch applied, are we still expected to run the CL instance with 256Mi limits?
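For completeness, the reason for that single restart can be read straight from the pod status, to confirm whether it was another OOM kill or something unrelated (pod name as in the listing above):

$ oc -n openshift-logging get pod elasticsearch-cdm-4rnx5na5-1-686847cf6c-s5cmm \
    -o jsonpath='{.status.containerStatuses[?(@.name=="proxy")].lastState.terminated.reason}{"\n"}'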
(In reply to Gurenko Alex from comment #29)
> (In reply to Periklis Tsirakidis from comment #28)
> > (In reply to Periklis Tsirakidis from comment #4)
> > > AFAICS we have a pprof endpoint enabled for the proxy. I will take a
> > > turn and try to collect some profiles before providing a PR to bump the
> > > default to 256Mi.
> >
> > Configuring the proxy requests/limits for 4.5.z will be part of:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1894632
>
> Thanks!

@agurenko FYI

For 4.5 I am opening the following BZ [1] to backport these fixes. The 4.6.z fix is slowly getting merged in the meantime; I hope we can make it into the next merge window next week.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1899419
*** Bug 1899335 has been marked as a duplicate of this bug. ***
Verified with clusterlogging.4.6.0-202011221454.p0 and elasticsearch-operator.4.6.0-202011221454.p0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.6 extras update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5117