Description of problem:

The proxy container restarts frequently:

$ oc get pod
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-logging-operator-69c9d6f5bc-njb8q       1/1     Running   0          135m
elasticsearch-cdm-soj6vq70-1-688ddf8c9f-87vt9   2/2     Running   12         135m
elasticsearch-cdm-soj6vq70-2-64876b6bd-sld68    2/2     Running   12         140m
elasticsearch-cdm-soj6vq70-3-5d54fb5d65-vbkww   2/2     Running   12         146m

$ oc describe pod elasticsearch-cdm-soj6vq70-3-5d54fb5d65-vbkww
  proxy:
    Container ID:  cri-o://9da3f8434af0659f409ba6690dfa1d6b3753cf10c8499e6a6b462381380a22b5
    Image:         registry.redhat.io/openshift4/ose-elasticsearch-proxy@sha256:ae8c365911b5f2e24b9fd07129ce18594834997ec7c73bd2652dfd695e75746f
    Image ID:      registry.redhat.io/openshift4/ose-elasticsearch-proxy@sha256:1d0e41e6c7ffe968149d6b9b90b56081af15a3370fe3de607a84b3c20677dbc9
    Ports:         60000/TCP, 60001/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --listening-address=:60000
      --tls-cert=/etc/proxy/elasticsearch/logging-es.crt
      --tls-key=/etc/proxy/elasticsearch/logging-es.key
      --tls-client-ca=/etc/proxy/elasticsearch/admin-ca
      --metrics-listening-address=:60001
      --metrics-tls-cert=/etc/proxy/secrets/tls.crt
      --metrics-tls-key=/etc/proxy/secrets/tls.key
      --upstream-ca=/etc/proxy/elasticsearch/admin-ca
      --cache-expiry=60s
      --auth-backend-role=admin_reader={"namespace": "default", "verb": "get", "resource": "pods/log"}
      --auth-backend-role=prometheus={"verb": "get", "resource": "/metrics"}
      --auth-backend-role=jaeger={"verb": "get", "resource": "/jaeger", "resourceAPIGroup": "elasticsearch.jaegertracing.io"}
      --auth-backend-role=elasticsearch-operator={"namespace": "*", "verb": "*", "resource": "*", "resourceAPIGroup": "logging.openshift.io"}
      --auth-backend-role=index-management={"namespace":"openshift-logging", "verb": "*", "resource": "indices", "resourceAPIGroup": "elasticsearch.openshift.io"}
      --auth-admin-role=admin_reader
      --auth-default-role=project_user
    State:          Running
      Started:      Mon, 21 Sep 2020 05:26:57 -0400
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 21 Sep 2020 05:15:06 -0400
      Finished:     Mon, 21 Sep 2020 05:26:56 -0400
    Ready:          True
    Restart Count:  12
    Limits:
      memory:  64Mi
    Requests:
      cpu:     100m
      memory:  64Mi
    Environment:
      LOG_LEVEL:  info
    Mounts:
      /etc/proxy/elasticsearch from certificates (rw)
      /etc/proxy/secrets from elasticsearch-metrics (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from elasticsearch-token-99brw (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  elasticsearch-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      elasticsearch
    Optional:  false
  elasticsearch-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  elasticsearch-elasticsearch-cdm-soj6vq70-3
    ReadOnly:   false
  certificates:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  elasticsearch
    Optional:    false
  elasticsearch-metrics:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  elasticsearch-metrics
    Optional:    false
  elasticsearch-token-99brw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  elasticsearch-token-99brw
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                   From                                                            Message
  ----     ------                  ----                  ----                                                            -------
  Warning  FailedScheduling        144m                                                                                  0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
  Warning  FailedScheduling        144m                                                                                  0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
  Normal   Scheduled               140m                                                                                  Successfully assigned openshift-logging/elasticsearch-cdm-soj6vq70-3-5d54fb5d65-vbkww to qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal
  Normal   SuccessfulAttachVolume  140m                  attachdetach-controller                                         AttachVolume.Attach succeeded for volume "pvc-73b52ca6-bf96-4a3a-a12c-15275b34267c"
  Normal   AddedInterface          140m                  multus                                                          Add eth0 [10.129.2.5/23]
  Normal   Pulled                  140m                  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal   Container image "registry.redhat.io/openshift4/ose-logging-elasticsearch6@sha256:d083829ae9a4777f4f070acdd64298e1514e8b7895019186af22f8893656e475" already present on machine
  Normal   Created                 140m                  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal   Created container elasticsearch
  Normal   Started                 140m                  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal   Started container elasticsearch
  Warning  Unhealthy               140m (x3 over 140m)   kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal   Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]
  Warning  Unhealthy               139m                  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal   Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 503]
  Warning  BackOff                 24m (x4 over 116m)    kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal   Back-off restarting failed container
  Normal   Pulled                  115s (x13 over 140m)  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal   Container image "registry.redhat.io/openshift4/ose-elasticsearch-proxy@sha256:ae8c365911b5f2e24b9fd07129ce18594834997ec7c73bd2652dfd695e75746f" already present on machine
  Normal   Created                 114s (x13 over 140m)  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal   Created container proxy
  Normal   Started                 114s (x13 over 140m)  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal   Started container proxy

No error message in the proxy container:

$ oc logs -c proxy elasticsearch-cdm-soj6vq70-3-5d54fb5d65-vbkww
time="2020-09-21T07:32:21Z" level=info msg="mapping path \"/\" => upstream \"https://localhost:9200/\""
time="2020-09-21T07:32:21Z" level=info msg="HTTPS: listening on [::]:60001"
time="2020-09-21T07:32:21Z" level=info msg="HTTPS: listening on [::]:60000"

The worker node has enough resources:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests       Limits
  --------                   --------       ------
  cpu                        1936m (55%)    1 (28%)
  memory                     10123Mi (72%)  8480Mi (61%)
  ephemeral-storage          0 (0%)         0 (0%)
  hugepages-1Gi              0 (0%)         0 (0%)
  hugepages-2Mi              0 (0%)         0 (0%)
  attachable-volumes-gce-pd  0              0

Log store configuration:

  logStore:
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
      resources:
        requests:
          memory: 2Gi
      storage:
        size: 20Gi
        storageClassName: standard
    retentionPolicy:
      application:
        maxAge: 1d
      audit:
        maxAge: 2w
      infra:
        maxAge: 12h
    type: elasticsearch

Version-Release number of selected component (if applicable):
elasticsearch-operator.4.6.0-202009192030.p0

How reproducible:
In some clusters it's 100% reproducible; in other clusters there is no such issue.

Steps to Reproduce:
1. Deploy logging 4.6
2. Check the ES pods
3.

Actual results:

Expected results:

Additional info:
Created attachment 1715525 [details]
Data sample every 1 seconds

After the ES pod was restarted, the proxy container's memory usage increased from 4M to 5xM; after a while the pod was OOMKilled again. Could you check whether there is a memory leak in the proxy container?
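For reference, a rough sketch of one way such a per-container memory sample could be collected (not necessarily how the attached data was gathered). It assumes working cluster metrics so that `oc adm top` returns data, and uses one pod name from above as an example; note that metrics-server values refresh far less often than once per second, so this only approximates a true 1s sample:

# sample the proxy container's reported memory in a loop
$ while true; do
    oc -n openshift-logging adm top pod elasticsearch-cdm-soj6vq70-3-5d54fb5d65-vbkww --containers --no-headers | grep ' proxy '
    sleep 1
  done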
(In reply to Anping Li from comment #1)
> Created attachment 1715525 [details]
> Data sample every 1 seconds
>
> After the ES pod was restarted, the proxy container's memory usage increased
> from 4M to 5xM; after a while the pod was OOMKilled again.

* What is happening as far as querying or data ingestion when you see this happen? We don't see issues like this in testing otherwise, which is perplexing.
* Is it possible to oversize the memory to, say, 256Mi to see whether there is a lower steady-state bound?
(In reply to Jeff Cantrill from comment #2)
> (In reply to Anping Li from comment #1)
> > Created attachment 1715525 [details]
> > Data sample every 1 seconds
> >
> > After the ES pod was restarted, the proxy container's memory usage increased
> > from 4M to 5xM; after a while the pod was OOMKilled again.
>
> * What is happening as far as querying or data ingestion when you see this
>   happen? We don't see issues like this in testing otherwise, which is
>   perplexing.

I didn't do anything; I just kept the EFK pods running.

> * Is it possible to oversize the memory to, say, 256Mi to see whether there
>   is a lower steady-state bound?

I updated the configuration to:

logStore:
  elasticsearch:
    nodeCount: 3
    proxy:
      resources:
        limits:
          memory: 256Mi
        requests:
          memory: 256Mi
    redundancyPolicy: SingleRedundancy
    resources:
      requests:
        memory: 2Gi
    storage:
      size: 20Gi
      storageClassName: standard
  retentionPolicy:
    application:
      maxAge: 1d
    audit:
      maxAge: 2w
    infra:
      maxAge: 3h
  type: elasticsearch

I waited for 30 minutes and didn't see the proxy container restart:

$ oc get pod
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-logging-operator-78b94df66d-wblpb       1/1     Running   0          73m
elasticsearch-cdm-bi4udg5u-1-7fbdf69798-ddr9h   2/2     Running   0          34m
elasticsearch-cdm-bi4udg5u-2-d788897df-d2ncv    2/2     Running   0          33m
elasticsearch-cdm-bi4udg5u-3-785844ccfb-gdbmp   2/2     Running   0          32m

Before I updated the configuration, the proxy container had restarted twice in 36 minutes:

$ oc get pod
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-logging-operator-78b94df66d-wblpb       1/1     Running   0          39m
elasticsearch-cdm-bi4udg5u-1-64c8ff5df8-wvzvs   2/2     Running   2          38m
elasticsearch-cdm-bi4udg5u-2-7dc94d485-tdgjs    2/2     Running   2          37m
elasticsearch-cdm-bi4udg5u-3-6cc68f7b68-j46h9   2/2     Running   2          36m
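An equivalent way to apply just the proxy override from the command line, assuming the ClusterLogging CR uses the conventional name "instance" in openshift-logging, would be a JSON merge patch along these lines:

$ oc -n openshift-logging patch clusterlogging instance --type merge \
    -p '{"spec":{"logStore":{"elasticsearch":{"proxy":{"resources":{"limits":{"memory":"256Mi"},"requests":{"memory":"256Mi"}}}}}}}'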
AFAICS we have a pprof endpoint enabled for the proxy. I will take a turn and try to collect some profiles before providing a PR to bump the default to 256Mi.
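For anyone who wants to try the same thing, a minimal sketch of how such a profile could be collected. The port and the /debug/pprof path are assumptions (the standard net/http/pprof routes exposed on the metrics listener), not something confirmed for this image, so adjust as needed:

# forward the metrics port of one ES pod locally (pod name is an example)
$ oc -n openshift-logging port-forward pod/elasticsearch-cdm-soj6vq70-3-5d54fb5d65-vbkww 60001:60001 &

# grab a heap profile, skipping TLS verification (assumed pprof path)
$ curl -k https://localhost:60001/debug/pprof/heap -o heap.pprof

# inspect the top allocators locally
$ go tool pprof -top heap.pprof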
Tested with elasticsearch-operator.4.6.0-202010081538.p0; hit this issue again.

$ oc get pod
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-logging-operator-8677fdc66b-ltqj8       1/1     Running   0          32m
elasticsearch-cdm-a6mb9grm-1-57d8b9fb7b-jfh6d   2/2     Running   3          30m
elasticsearch-cdm-a6mb9grm-2-7d89674bc5-rzq2s   2/2     Running   1          30m
elasticsearch-cdm-a6mb9grm-3-675fdddc4b-spjbl   2/2     Running   1          30m

$ oc describe pod elasticsearch-cdm-a6mb9grm-1-57d8b9fb7b-jfh6d
  proxy:
    Container ID:  cri-o://6ec4f53a8db0a6dafb30a5ccdc552f739d862b933d996c557d1efea68de1c5e8
    Image:         registry.redhat.io/openshift4/ose-elasticsearch-proxy@sha256:2c15d2bdc92d39f7919d478b01f243f7eaf08aa33d80e9e39c5777f5d90abd62
    Image ID:      registry.redhat.io/openshift4/ose-elasticsearch-proxy@sha256:2c15d2bdc92d39f7919d478b01f243f7eaf08aa33d80e9e39c5777f5d90abd62
    Ports:         60000/TCP, 60001/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --listening-address=:60000
      --tls-cert=/etc/proxy/elasticsearch/logging-es.crt
      --tls-key=/etc/proxy/elasticsearch/logging-es.key
      --tls-client-ca=/etc/proxy/elasticsearch/admin-ca
      --metrics-listening-address=:60001
      --metrics-tls-cert=/etc/proxy/secrets/tls.crt
      --metrics-tls-key=/etc/proxy/secrets/tls.key
      --upstream-ca=/etc/proxy/elasticsearch/admin-ca
      --cache-expiry=60s
      --auth-backend-role=admin_reader={"namespace": "default", "verb": "get", "resource": "pods/log"}
      --auth-backend-role=prometheus={"verb": "get", "resource": "/metrics"}
      --auth-backend-role=jaeger={"verb": "get", "resource": "/jaeger", "resourceAPIGroup": "elasticsearch.jaegertracing.io"}
      --auth-backend-role=elasticsearch-operator={"namespace": "*", "verb": "*", "resource": "*", "resourceAPIGroup": "logging.openshift.io"}
      --auth-backend-role=index-management={"namespace":"openshift-logging", "verb": "*", "resource": "indices", "resourceAPIGroup": "elasticsearch.openshift.io"}
      --auth-admin-role=admin_reader
      --auth-default-role=project_user
    State:          Running
      Started:      Fri, 09 Oct 2020 02:02:25 -0400
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Fri, 09 Oct 2020 01:57:59 -0400
      Finished:     Fri, 09 Oct 2020 02:02:02 -0400
    Ready:          True
    Restart Count:  3
    Limits:
      memory:  64Mi
    Requests:
      cpu:     100m
      memory:  64Mi
    Environment:
      LOG_LEVEL:  info
    Mounts:
      /etc/proxy/elasticsearch from certificates (rw)
      /etc/proxy/secrets from elasticsearch-metrics (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from elasticsearch-token-z858l (ro)
@Qiaolin Could you please provide more information on what is special about the clusters where you hit this issue? I could not reproduce it so far.
When launching the cluster, I enabled etcd encryption and FIPS; the cluster HTTP proxy is enabled too. The cluster is deployed on GCP.
@Periklis Please see comment 14.
Created attachment 1720294 [details]
Memory Consumption Metrics ES-Proxy with 64Mi Limits

@qitang

My investigation on the above QE cluster tells me that we have a stable solution with the fix [1]. The remaining issue here is that 64Mi is not a good default. Tuning the es-proxy memory resources to 256Mi by amending the ClusterLogging CR like this:

> logStore:
>   elasticsearch:
>     nodeCount: 3
>     proxy:
>       resources:
>         limits:
>           memory: 256Mi
>         requests:
>           memory: 256Mi

gives me a stable setup:

> ❯ oc get pod -l cluster-name=elasticsearch -w
> NAME                                           READY   STATUS    RESTARTS   AGE
> elasticsearch-cdm-jegayr09-1-94d46856-snkn6    2/2     Running   0          64m
> elasticsearch-cdm-jegayr09-2-95b496dc9-dmnmf   2/2     Running   0          62m
> elasticsearch-cdm-jegayr09-3-76c8d9d66-sshnf   2/2     Running   0          65m

[1] https://github.com/openshift/elasticsearch-proxy/pull/52
I tested in 2 different clusters:

cluster 1: only FIPS enabled -- hit this issue
cluster 2: only etcd encryption enabled -- didn't hit this issue

To conclude, this issue happens when FIPS is enabled in the cluster.
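For reference, whether FIPS mode is actually active on a node can be confirmed with something like the following (node name taken from the events above as an example; /proc/sys/crypto/fips_enabled is the standard kernel flag):

$ oc debug node/qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal -- chroot /host cat /proc/sys/crypto/fips_enabled
# prints 1 when FIPS mode is enabled, 0 otherwise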
*** Bug 1891569 has been marked as a duplicate of this bug. ***
*** Bug 1892362 has been marked as a duplicate of this bug. ***
Hi, I'm looking into exactly this problem right now, and even after adjusting the memory limit to 256Mi I can see OOMKilled messages, although at a much lower rate (6-8 restarts overnight). Do I understand correctly that the change is still not merged and will fix the problem only once we get it in our image? This is on OCP 4.5.15, by the way.

[kni@provisionhost-0-0 logging-templates]$ oc get pods -n openshift-logging
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-545fb8c5fb-szgl4       1/1     Running     0          17h
curator-1605097800-cnp62                        0/1     Completed   0          36m
elasticsearch-cdm-uaolnl4d-1-6dcb988cbc-6l2nh   2/2     Running     8          17h
elasticsearch-cdm-uaolnl4d-2-c567cb486-xn74p    2/2     Running     6          17h
elasticsearch-cdm-uaolnl4d-3-5f7678d99c-8bglh   2/2     Running     6          17h

Cluster Logging: clusterlogging.4.5.0-202010311518.p0
ES: elasticsearch-operator.4.5.0-202010301114.p0
Created attachment 1728325 [details]
configuration

Ah, I see the problem: even though I'm trying to set up new limits with this config, the memory limit stays at 64Mi. Did I miss something in the configuration?
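For what it's worth, one way to check which limit actually landed on a running pod, independent of what is in the CR (pod name is an example from the listing above):

$ oc -n openshift-logging get pod elasticsearch-cdm-uaolnl4d-1-6dcb988cbc-6l2nh \
    -o jsonpath='{.spec.containers[?(@.name=="proxy")].resources}{"\n"}'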
(In reply to Gurenko Alex from comment #25)
> Hi, I'm looking into exactly this problem right now, and even after
> adjusting the memory limit to 256Mi I can see OOMKilled messages, although
> at a much lower rate (6-8 restarts overnight). Do I understand correctly
> that the change is still not merged and will fix the problem only once we
> get it in our image? This is on OCP 4.5.15, by the way.
>
> [kni@provisionhost-0-0 logging-templates]$ oc get pods -n openshift-logging
> NAME                                            READY   STATUS      RESTARTS   AGE
> cluster-logging-operator-545fb8c5fb-szgl4       1/1     Running     0          17h
> curator-1605097800-cnp62                        0/1     Completed   0          36m
> elasticsearch-cdm-uaolnl4d-1-6dcb988cbc-6l2nh   2/2     Running     8          17h
> elasticsearch-cdm-uaolnl4d-2-c567cb486-xn74p    2/2     Running     6          17h
> elasticsearch-cdm-uaolnl4d-3-5f7678d99c-8bglh   2/2     Running     6          17h
>
> Cluster Logging: clusterlogging.4.5.0-202010311518.p0
> ES: elasticsearch-operator.4.5.0-202010301114.p0

Yes, the fix awaits approval by the patch manager to land in 4.6.z for now.
(In reply to Periklis Tsirakidis from comment #4)
> AFAICS we have a pprof endpoint enabled for the proxy. I will take a turn
> and try to collect some profiles before providing a PR to bump the default
> to 256Mi.

Configuring the proxy requests/limits for 4.5.z will be part of:
https://bugzilla.redhat.com/show_bug.cgi?id=1894632
(In reply to Periklis Tsirakidis from comment #28)
> (In reply to Periklis Tsirakidis from comment #4)
> > AFAICS we have a pprof endpoint enabled for the proxy. I will take a turn
> > and try to collect some profiles before providing a PR to bump the default
> > to 256Mi.
>
> Configuring the proxy requests/limits for 4.5.z will be part of:
> https://bugzilla.redhat.com/show_bug.cgi?id=1894632

Thanks!
I've tested the 256Mi limit on an OCP 4.6 cluster and still got 1 container restart within a 26h window:

[kni@provisionhost-0-0 ~]$ oc get pods
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-6c7d78ff74-nl7fj       1/1     Running     0          28h
curator-1605623400-gg466                        0/1     Completed   0          60m
curator-1605627000-k9qvq                        0/1     Completed   0          10s
elasticsearch-cdm-4rnx5na5-1-686847cf6c-s5cmm   2/2     Running     1          26h
elasticsearch-cdm-4rnx5na5-2-74678c4c56-7jb6x   2/2     Running     0          26h
elasticsearch-cdm-4rnx5na5-3-5d7ff6594f-4h6hc   2/2     Running     0          26h

I know it's much better, but I would expect the container not to get OOMKilled at all. I see this particular BZ is still on POST, so I assume the upstream patch didn't make it into the OLM yet? And just to be clear, with the patch applied, are we still expected to run the CL instance with 256Mi limits?
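For completeness, the reason for that single restart can be read straight from the pod status, to confirm whether it was another OOM kill or something unrelated (pod name as in the listing above):

$ oc -n openshift-logging get pod elasticsearch-cdm-4rnx5na5-1-686847cf6c-s5cmm \
    -o jsonpath='{.status.containerStatuses[?(@.name=="proxy")].lastState.terminated.reason}{"\n"}'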
(In reply to Gurenko Alex from comment #29)
> (In reply to Periklis Tsirakidis from comment #28)
> > (In reply to Periklis Tsirakidis from comment #4)
> > > AFAICS we have a pprof endpoint enabled for the proxy. I will take a
> > > turn and try to collect some profiles before providing a PR to bump the
> > > default to 256Mi.
> >
> > Configuring the proxy requests/limits for 4.5.z will be part of:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1894632
>
> Thanks!

@agurenko FYI

For 4.5 I am opening the following BZ [1] to backport these fixes. The 4.6.z fix is slowly getting merged in the meantime; I hope we can make it into the next merge window next week.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1899419
*** Bug 1899335 has been marked as a duplicate of this bug. ***
Verified with clusterlogging.4.6.0-202011221454.p0 and elasticsearch-operator.4.6.0-202011221454.p0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.6 extras update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5117