This bug was initially created as a copy of Bug #1880960.

I am copying this bug because:

Description of problem:
The proxy container restarts frequently:

$ oc get pod
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-logging-operator-69c9d6f5bc-njb8q       1/1     Running   0          135m
elasticsearch-cdm-soj6vq70-1-688ddf8c9f-87vt9   2/2     Running   12         135m
elasticsearch-cdm-soj6vq70-2-64876b6bd-sld68    2/2     Running   12         140m
elasticsearch-cdm-soj6vq70-3-5d54fb5d65-vbkww   2/2     Running   12         146m

$ oc describe pod elasticsearch-cdm-soj6vq70-3-5d54fb5d65-vbkww
  proxy:
    Container ID:  cri-o://9da3f8434af0659f409ba6690dfa1d6b3753cf10c8499e6a6b462381380a22b5
    Image:         registry.redhat.io/openshift4/ose-elasticsearch-proxy@sha256:ae8c365911b5f2e24b9fd07129ce18594834997ec7c73bd2652dfd695e75746f
    Image ID:      registry.redhat.io/openshift4/ose-elasticsearch-proxy@sha256:1d0e41e6c7ffe968149d6b9b90b56081af15a3370fe3de607a84b3c20677dbc9
    Ports:         60000/TCP, 60001/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --listening-address=:60000
      --tls-cert=/etc/proxy/elasticsearch/logging-es.crt
      --tls-key=/etc/proxy/elasticsearch/logging-es.key
      --tls-client-ca=/etc/proxy/elasticsearch/admin-ca
      --metrics-listening-address=:60001
      --metrics-tls-cert=/etc/proxy/secrets/tls.crt
      --metrics-tls-key=/etc/proxy/secrets/tls.key
      --upstream-ca=/etc/proxy/elasticsearch/admin-ca
      --cache-expiry=60s
      --auth-backend-role=admin_reader={"namespace": "default", "verb": "get", "resource": "pods/log"}
      --auth-backend-role=prometheus={"verb": "get", "resource": "/metrics"}
      --auth-backend-role=jaeger={"verb": "get", "resource": "/jaeger", "resourceAPIGroup": "elasticsearch.jaegertracing.io"}
      --auth-backend-role=elasticsearch-operator={"namespace": "*", "verb": "*", "resource": "*", "resourceAPIGroup": "logging.openshift.io"}
      --auth-backend-role=index-management={"namespace":"openshift-logging", "verb": "*", "resource": "indices", "resourceAPIGroup": "elasticsearch.openshift.io"}
      --auth-admin-role=admin_reader
      --auth-default-role=project_user
    State:          Running
      Started:      Mon, 21 Sep 2020 05:26:57 -0400
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 21 Sep 2020 05:15:06 -0400
      Finished:     Mon, 21 Sep 2020 05:26:56 -0400
    Ready:          True
    Restart Count:  12
    Limits:
      memory:  64Mi
    Requests:
      cpu:     100m
      memory:  64Mi
    Environment:
      LOG_LEVEL:  info
    Mounts:
      /etc/proxy/elasticsearch from certificates (rw)
      /etc/proxy/secrets from elasticsearch-metrics (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from elasticsearch-token-99brw (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  elasticsearch-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      elasticsearch
    Optional:  false
  elasticsearch-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  elasticsearch-elasticsearch-cdm-soj6vq70-3
    ReadOnly:   false
  certificates:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  elasticsearch
    Optional:    false
  elasticsearch-metrics:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  elasticsearch-metrics
    Optional:    false
  elasticsearch-token-99brw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  elasticsearch-token-99brw
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                   From                     Message
  ----     ------                  ----                  ----                     -------
  Warning  FailedScheduling        144m                                           0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
  Warning  FailedScheduling        144m                                           0/6 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had volume node affinity conflict, 2 node(s) were unschedulable.
  Normal   Scheduled               140m                                           Successfully assigned openshift-logging/elasticsearch-cdm-soj6vq70-3-5d54fb5d65-vbkww to qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal
  Normal   SuccessfulAttachVolume  140m                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-73b52ca6-bf96-4a3a-a12c-15275b34267c"
  Normal   AddedInterface          140m                  multus                   Add eth0 [10.129.2.5/23]
  Normal   Pulled                  140m                  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal  Container image "registry.redhat.io/openshift4/ose-logging-elasticsearch6@sha256:d083829ae9a4777f4f070acdd64298e1514e8b7895019186af22f8893656e475" already present on machine
  Normal   Created                 140m                  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal  Created container elasticsearch
  Normal   Started                 140m                  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal  Started container elasticsearch
  Warning  Unhealthy               140m (x3 over 140m)   kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 000]
  Warning  Unhealthy               139m                  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal  Readiness probe failed: Elasticsearch node is not ready to accept HTTP requests yet [response code: 503]
  Warning  BackOff                 24m (x4 over 116m)    kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal  Back-off restarting failed container
  Normal   Pulled                  115s (x13 over 140m)  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal  Container image "registry.redhat.io/openshift4/ose-elasticsearch-proxy@sha256:ae8c365911b5f2e24b9fd07129ce18594834997ec7c73bd2652dfd695e75746f" already present on machine
  Normal   Created                 114s (x13 over 140m)  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal  Created container proxy
  Normal   Started                 114s (x13 over 140m)  kubelet, qitang2-d7vhm-worker-a-kgwwq.c.openshift-qe.internal  Started container proxy

No error message in the proxy container:

$ oc logs -c proxy elasticsearch-cdm-soj6vq70-3-5d54fb5d65-vbkww
time="2020-09-21T07:32:21Z" level=info msg="mapping path \"/\" => upstream \"https://localhost:9200/\""
time="2020-09-21T07:32:21Z" level=info msg="HTTPS: listening on [::]:60001"
time="2020-09-21T07:32:21Z" level=info msg="HTTPS: listening on [::]:60000"

The worker node has enough resources:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests       Limits
  --------                   --------       ------
  cpu                        1936m (55%)    1 (28%)
  memory                     10123Mi (72%)  8480Mi (61%)
  ephemeral-storage          0 (0%)         0 (0%)
  hugepages-1Gi              0 (0%)         0 (0%)
  hugepages-2Mi              0 (0%)         0 (0%)
  attachable-volumes-gce-pd  0              0

The log store configuration:

  logStore:
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
      resources:
        requests:
          memory: 2Gi
      storage:
        size: 20Gi
        storageClassName: standard
    retentionPolicy:
      application:
        maxAge: 1d
      audit:
        maxAge: 2w
      infra:
        maxAge: 12h
    type: elasticsearch

Version-Release number of selected component (if applicable):
elasticsearch-operator.4.6.0-202009192030.p0

How reproducible:
In some clusters it is 100% reproducible; in other clusters there is no such issue.

Steps to Reproduce:
1. Deploy logging 4.6
2. Check the ES pods

Actual results:
The proxy container in the ES pods is repeatedly OOMKilled (exit code 137) and restarted.

Expected results:
The proxy container runs without hitting its memory limit and restarting.

Additional info:
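(Note for anyone reproducing this: a quick way to confirm it is the proxy container, and not elasticsearch itself, that is being OOMKilled is to read the last termination state of each ES pod. This is only a minimal sketch, assuming the pods follow the elasticsearch-cdm-* naming shown above and live in openshift-logging:

$ oc -n openshift-logging get pods -o name | grep elasticsearch-cdm | while read p; do
    # print the last termination reason of the "proxy" container, e.g. OOMKilled
    echo "$p $(oc -n openshift-logging get $p -o jsonpath='{.status.containerStatuses[?(@.name=="proxy")].lastState.terminated.reason}')"
  done

On an affected cluster each line should end with OOMKilled.)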
The fix doesn't work; the default memory request of the proxy container is still `64Mi`.

$ oc get cl instance -oyaml
  logStore:
    elasticsearch:
      nodeCount: 3
      redundancyPolicy: SingleRedundancy
      resources:
        requests:
          memory: 2Gi
      storage:
        size: 20Gi
        storageClassName: standard
    retentionPolicy:
      application:
        maxAge: 1d
      audit:
        maxAge: 2w
      infra:
        maxAge: 3h
    type: elasticsearch

$ oc get es elasticsearch -oyaml
  nodeSpec:
    proxyResources:
      limits:
        memory: 64Mi
      requests:
        cpu: 100m
        memory: 64Mi
    resources:
      requests:
        memory: 2Gi

Image: quay.io/openshift/origin-elasticsearch-operator@sha256:1641e7698155161b3fadc3079c0d4abdbdd7107cc89a4642bb8e58309b8d9700
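(For reference, the manual override mentioned in the next comment can be applied to the same nodeSpec.proxyResources field shown above. A minimal sketch, assuming the field sits under .spec of the Elasticsearch CR and that the operator does not reconcile the value back to its built-in default:

$ oc -n openshift-logging patch es elasticsearch --type merge \
    -p '{"spec":{"nodeSpec":{"proxyResources":{"limits":{"memory":"256Mi"},"requests":{"cpu":"100m","memory":"256Mi"}}}}}'

This is only a workaround sketch for testing; the real fix is the operator shipping a larger default.)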
I manually set the memory to 256Mi, but the proxy container restarted (OOMKilled) after running for about 3 hours.

  proxy:
    Container ID:  cri-o://90af77f73f240cfe3a1be7e3a840671a9cd8da932909aade141681574a9dfe89
    Image:         quay.io/openshift/origin-elasticsearch-proxy:4.7.0
    Image ID:      quay.io/openshift/origin-elasticsearch-proxy@sha256:f8aeb4d918dc4a6d77641af58586fc39ec339c1f95b167d1f62aec5164e5da1f
    Ports:         60000/TCP, 60001/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --listening-address=:60000
      --tls-cert=/etc/proxy/elasticsearch/logging-es.crt
      --tls-key=/etc/proxy/elasticsearch/logging-es.key
      --tls-client-ca=/etc/proxy/elasticsearch/admin-ca
      --metrics-listening-address=:60001
      --metrics-tls-cert=/etc/proxy/secrets/tls.crt
      --metrics-tls-key=/etc/proxy/secrets/tls.key
      --upstream-ca=/etc/proxy/elasticsearch/admin-ca
      --cache-expiry=60s
      --auth-backend-role=admin_reader={"namespace": "default", "verb": "get", "resource": "pods/log"}
      --auth-backend-role=prometheus={"verb": "get", "resource": "/metrics"}
      --auth-backend-role=jaeger={"verb": "get", "resource": "/jaeger", "resourceAPIGroup": "elasticsearch.jaegertracing.io"}
      --auth-backend-role=elasticsearch-operator={"namespace": "*", "verb": "*", "resource": "*", "resourceAPIGroup": "logging.openshift.io"}
      --auth-backend-role=index-management={"namespace":"openshift-logging", "verb": "*", "resource": "indices", "resourceAPIGroup": "elasticsearch.openshift.io"}
      --auth-admin-role=admin_reader
      --auth-default-role=project_user
    State:          Running
      Started:      Mon, 12 Oct 2020 01:05:56 -0400
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Sun, 11 Oct 2020 21:52:49 -0400
      Finished:     Mon, 12 Oct 2020 01:05:55 -0400
    Ready:          True
    Restart Count:  1
    Limits:
      memory:  256Mi
    Requests:
      cpu:     100m
      memory:  256Mi
    Environment:
      LOG_LEVEL:  info
    Mounts:
      /etc/proxy/elasticsearch from certificates (rw)
      /etc/proxy/secrets from elasticsearch-metrics (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from elasticsearch-token-rfc2v (ro)

$ oc get pod
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-logging-operator-6b87bdfcd-2nzfc        1/1     Running   0          4h7m
elasticsearch-cdm-pwwtd9g6-1-564d87bdfd-9m7zh   2/2     Running   1          3h39m
elasticsearch-cdm-pwwtd9g6-2-98c975c59-2ptpp    2/2     Running   1          3h38m
elasticsearch-cdm-pwwtd9g6-3-599cdf5df6-9wcf9   2/2     Running   1          3h36m
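(To see whether the proxy actually creeps toward the 256Mi limit between restarts, per-container memory usage can be sampled over time. A minimal sketch, assuming a working metrics stack and that `oc adm top pod` supports the --containers flag on this cluster; the container name "proxy" matches the describe output above:

$ watch -n 60 'oc -n openshift-logging adm top pod --containers | grep -E "POD|proxy"'

Capturing a few hours of these samples around a restart would show whether this is a slow leak or a usage spike.)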
The ES pods have been running for 13 hours, and the proxy containers didn't restart.

$ oc get pod
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-54d478856f-h2fnv       1/1     Running     0          13h
elasticsearch-cdm-djppeosj-1-5c9f6ff64-t7r8z    2/2     Running     0          13h
elasticsearch-cdm-djppeosj-2-7d75b56bff-xkbsg   2/2     Running     0          13h
elasticsearch-cdm-djppeosj-3-7f4f86bd86-s5t8r   2/2     Running     0          13h
elasticsearch-delete-app-1604448000-9wxpr       0/1     Completed   0          2m26s
elasticsearch-delete-audit-1604448000-crt2l     0/1     Completed   0          2m26s
elasticsearch-delete-infra-1604448000-xp2v2     0/1     Completed   0          2m26s

$ oc get csv
NAME                                           DISPLAY                  VERSION                 REPLACES   PHASE
clusterlogging.4.7.0-202011021919.p0           Cluster Logging          4.7.0-202011021919.p0              Succeeded
elasticsearch-operator.4.7.0-202011030448.p0   Elasticsearch Operator   4.7.0-202011030448.p0              Succeeded

The default value of the proxy container's requested memory is 256Mi now:

  managementState: Managed
  nodeSpec:
    proxyResources:
      limits:
        memory: 256Mi
      requests:
        cpu: 100m
        memory: 256Mi

So verified this bz.
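(The same check can be scripted instead of reading the full YAML. A minimal sketch, assuming the default proxyResources lands under .spec.nodeSpec of the Elasticsearch CR as in the snippet above:

$ oc -n openshift-logging get es elasticsearch \
    -o jsonpath='{.spec.nodeSpec.proxyResources.requests.memory}{"\n"}'

On a fixed build this should print 256Mi.)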
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Errata Advisory for OpenShift Logging 5.0.0), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652